This is a mechanistic interpretability toolkit that surgically removes refusal behavior from language models by identifying "refusal directions" in the model's hidden states and projecting them out. It offers three methods: basic SVD extraction, an advanced pipeline with whitened SVD and iterative refinement, and an informed mode that auto-configures itself from a geometry analysis of the model. It ships with a Gradio UI, CLI commands for sweeps and benchmarks, and Python APIs for custom probe datasets. The analysis suite covers concept cone geometry (how many distinct refusal mechanisms exist) and alignment imprint detection (whether the model was DPO- or RLHF-trained). If you need to understand or modify safety tuning at the weight level rather than just the prompt level, this gives you the full stack.
npx skills add https://github.com/aradotso/trending-skills --skill obliteratus-abliteration
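The core operation, projecting a refusal direction out of hidden states, can be sketched in a few lines. This is a minimal NumPy illustration of the general technique, not the toolkit's actual API; the function name `ablate_direction` and the random data are made up for the example.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along `direction`.

    hidden:    (n, d) array of hidden states
    direction: (d,) refusal direction (need not be normalized)
    """
    r_hat = direction / np.linalg.norm(direction)
    # Subtract each row's projection onto the unit refusal direction.
    return hidden - np.outer(hidden @ r_hat, r_hat)

# Toy check: after ablation, activations have zero component
# along the refusal direction, and everything else is untouched.
h = np.random.randn(8, 16)
r = np.random.randn(16)
h_abl = ablate_direction(h, r)
print(np.allclose(h_abl @ (r / np.linalg.norm(r)), 0.0))  # True
```

In practice the toolkit applies this kind of projection inside the model (to weights or activations) rather than to a standalone array, but the linear algebra is the same.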