This implements Anthropic's Constitutional AI training method, the two-phase approach that powers Claude's safety system. You generate responses, have the model critique itself against a constitution (set of principles), revise based on critiques, then fine-tune. The RL phase uses RLAIF where AI evaluators replace human labelers for preference data. The real win is scalable safety alignment without needing humans to label harmful outputs, though you'll need to iterate on your constitution principles to avoid overly evasive refusals. Requires meaningful compute for the RL phase (2x A100s for 7B models) since you're training both policy and reward models. Good fit if you want explainable safety decisions and have clear principles to encode.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill constitutional-ai