Token Compressor

8STDIOregistry active

Summary

Compresses prompts by 40–60% using a local two-stage pipeline: llama3.2:1b rewrites text to semantic minimum, then nomic-embed-text validates via cosine similarity (default 0.85 threshold). If validation fails, original text passes through unchanged. Exposes a single compress_prompt tool that takes text and returns compressed output plus token stats. Requires Ollama running locally with both models pulled. Built for reducing token costs in long or repetitive workflows without sacrificing conditionals or negations. Skips compression automatically below 80 tokens. Works well as a pre-processing layer before expensive API calls or when operating under strict context budgets.

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

AI notepad for back-to-back meetings

Notes, actions and memory. Without a meeting bot. First month 100% off.

Download for free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Email for Agents: Free tier available

Give your AI agent a complete email layer—sending, inbound inboxes, and sandbox testing.

Get 4K emails/month free →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

CodeScene MCP Server

Your agent targets a perfect 10 Code Health score. Deterministic. Every commit.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

AI notepad for back-to-back meetings

Notes, actions and memory. Without a meeting bot. First month 100% off.

Download for free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Email for Agents: Free tier available

Give your AI agent a complete email layer—sending, inbound inboxes, and sandbox testing.

Get 4K emails/month free →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

CodeScene MCP Server

Your agent targets a perfect 10 Code Health score. Deterministic. Every commit.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

token-compressor

Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.

mcp-name: io.github.base76-research-lab/token-compressor

Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.

Built by Base76 Research Lab — research into epistemic AI architecture.

Live demo

Intent Compiler MVP is now live and uses this project as part of the idea -> spec -> compressed output flow:

Live: https://intent-compiler-mvp.pages.dev
Product repo: https://github.com/base76-research-lab/token-compressor

What it does

token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:

LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.85) — if not, the original is sent unchanged

The result: shorter prompts, lower costs, same intent.

Input prompt (300 tokens)
        ↓
  LLM compresses
        ↓
  Embedding validates (cosine ≥ 0.85?)
        ↓
  Pass → compressed (120 tokens)   Fail → original (300 tokens)

Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.

Requirements

Python 3.10+
Ollama running locally
Two models pulled:

ollama pull llama3.2:1b
ollama pull nomic-embed-text

Python dependencies:

pip install ollama numpy

Quick start

from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")

print(result.output_text)   # compressed (or original if validation failed)
print(result.report())      # MODE / COVERAGE / TOKENS saved

Result object:

Field	Description
`output_text`	Text to send to your LLM
`mode`	`compressed` / `raw_fallback` / `skipped`
`coverage`	Cosine similarity (0.0–1.0)
`tokens_in`	Estimated input tokens
`tokens_out`	Estimated output tokens
`tokens_saved`	Difference

CLI usage

echo "Your long prompt here..." | python3 cli.py

Output: compressed text on stdout, stats on stderr.

Claude Code hook (recommended setup)

Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:

{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}

This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.

MCP server

The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.

Install:

pip install token-compressor-mcp

Tool: compress_prompt

Input: text (string)
Output: compressed text + stats footer

Claude Code MCP config (~/.claude/settings.json):

{
  "mcpServers": {
    "token-compressor": {
      "command": "uvx",
      "args": ["token-compressor-mcp"]
    }
  }
}

Or from source:

{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["-m", "token_compressor_mcp"],
      "cwd": "/path/to/token-compressor"
    }
  }
}

Configuration

pipeline = LLMCompressEmbedValidate(
    threshold=0.85,          # cosine similarity floor (lower = more aggressive)
    min_tokens=80,           # skip pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)

How it works

Stage 1 — LLM compression

The compression prompt instructs the model to:

Preserve all conditionals (if, only if, unless, when, but only)
Preserve all negations
Remove filler, hedging, redundancy
Target 40–60% of original length

Stage 2 — Embedding validation

Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below threshold, the original is returned unchanged. This prevents silent meaning loss.

Results

Tested across Swedish and English prompts, technical and natural language:

Input	Tokens in	Tokens out	Saved
Research abstract (EN)	89	38	57%
Session intent (SV)	32	18	44%
Technical instruction	47	22	53%
Short command (<80t)	—	—	skipped

Research background

This tool implements the architecture from:

Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535

Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.

License

MIT — Base76 Research Lab, Sweden

Featured

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

AI notepad for back-to-back meetings

Notes, actions and memory. Without a meeting bot. First month 100% off.

Download for free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Email for Agents: Free tier available

Give your AI agent a complete email layer—sending, inbound inboxes, and sandbox testing.

Get 4K emails/month free →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

CodeScene MCP Server

Your agent targets a perfect 10 Code Health score. Deterministic. Every commit.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

token-compressor

Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.

mcp-name: io.github.base76-research-lab/token-compressor

Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.

Built by Base76 Research Lab — research into epistemic AI architecture.

Live demo

Intent Compiler MVP is now live and uses this project as part of the idea -> spec -> compressed output flow:

Live: https://intent-compiler-mvp.pages.dev
Product repo: https://github.com/base76-research-lab/token-compressor

What it does

token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:

LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.85) — if not, the original is sent unchanged

The result: shorter prompts, lower costs, same intent.

Input prompt (300 tokens)
        ↓
  LLM compresses
        ↓
  Embedding validates (cosine ≥ 0.85?)
        ↓
  Pass → compressed (120 tokens)   Fail → original (300 tokens)

Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.

Requirements

Python 3.10+
Ollama running locally
Two models pulled:

ollama pull llama3.2:1b
ollama pull nomic-embed-text

Python dependencies:

pip install ollama numpy

Quick start

from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")

print(result.output_text)   # compressed (or original if validation failed)
print(result.report())      # MODE / COVERAGE / TOKENS saved

Result object:

Field	Description
`output_text`	Text to send to your LLM
`mode`	`compressed` / `raw_fallback` / `skipped`
`coverage`	Cosine similarity (0.0–1.0)
`tokens_in`	Estimated input tokens
`tokens_out`	Estimated output tokens
`tokens_saved`	Difference

CLI usage

echo "Your long prompt here..." | python3 cli.py

Output: compressed text on stdout, stats on stderr.

Claude Code hook (recommended setup)

Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:

{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}

This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.

MCP server

The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.

Install:

pip install token-compressor-mcp

Tool: compress_prompt

Input: text (string)
Output: compressed text + stats footer

Claude Code MCP config (~/.claude/settings.json):

{
  "mcpServers": {
    "token-compressor": {
      "command": "uvx",
      "args": ["token-compressor-mcp"]
    }
  }
}

Or from source:

{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["-m", "token_compressor_mcp"],
      "cwd": "/path/to/token-compressor"
    }
  }
}

Configuration

pipeline = LLMCompressEmbedValidate(
    threshold=0.85,          # cosine similarity floor (lower = more aggressive)
    min_tokens=80,           # skip pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)

How it works

Stage 1 — LLM compression

The compression prompt instructs the model to:

Preserve all conditionals (if, only if, unless, when, but only)
Preserve all negations
Remove filler, hedging, redundancy
Target 40–60% of original length

Stage 2 — Embedding validation

Results

Tested across Swedish and English prompts, technical and natural language:

Input	Tokens in	Tokens out	Saved
Research abstract (EN)	89	38	57%
Session intent (SV)	32	18	44%
Technical instruction	47	22	53%
Short command (<80t)	—	—	skipped

Research background

This tool implements the architecture from:

Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535

Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.

License

MIT — Base76 Research Lab, Sweden

Token Compressor

token-compressor

Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.

Live demo

What it does

Requirements

Quick start

CLI usage

Claude Code hook (recommended setup)

MCP server

Configuration

How it works

Results

Research background

License

Token Compressor

token-compressor

Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.

Live demo

What it does

Requirements

Quick start

CLI usage

Claude Code hook (recommended setup)

MCP server

Configuration

How it works

Results

Research background

License

Related AI & LLM Tools MCP Servers

Related AI & LLM Tools MCP Servers