This is a mechanistic interpretability toolkit that surgically removes refusal behavior from language models by identifying "refusal directions" in the model's hidden states and projecting them out. It offers three methods: basic SVD extraction, an advanced pipeline with whitened SVD and iterative refinement, and an informed mode that auto-configures itself from a geometry analysis of the model. It ships with a Gradio UI, CLI commands for sweeps and benchmarks, and Python APIs for custom probe datasets. The analysis suite covers concept cone geometry (how many distinct refusal mechanisms exist) and alignment imprint detection (whether the model was DPO- or RLHF-trained). If you need to understand or modify safety tuning at the weight level rather than just the prompt level, this gives you the full stack.
npx skills add https://github.com/aradotso/trending-skills --skill obliteratus-abliteration
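The core operation, projecting a refusal direction out of hidden states, can be sketched in a few lines. This is a minimal NumPy illustration of the general technique, not the toolkit's actual API; the function name `ablate_direction` and the random data are made up for the example.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along `direction`.

    hidden:    (n, d) array of hidden states
    direction: (d,) refusal direction (need not be normalized)
    """
    r_hat = direction / np.linalg.norm(direction)
    # Subtract each row's projection onto the unit refusal direction.
    return hidden - np.outer(hidden @ r_hat, r_hat)

# Toy check: after ablation, activations have zero component
# along the refusal direction, and everything else is untouched.
h = np.random.randn(8, 16)
r = np.random.randn(16)
h_abl = ablate_direction(h, r)
print(np.allclose(h_abl @ (r / np.linalg.norm(r)), 0.0))  # True
```

In practice the toolkit applies this kind of projection inside the model (to weights or activations) rather than to a standalone array, but the linear algebra is the same.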