Internet Archive MCP Server

STDIO, HTTP • registry active

Summary

Connects Claude to the Internet Archive's Wayback Machine and 40M+ item library through five tools covering snapshot discovery, content retrieval, metadata queries, and OCR text extraction. You can search the CDX API for capture histories with date and MIME filters, fetch archived page content stripped of banner injections, run filtered searches across books and media, pull complete file manifests with download URLs, and page through long OCR documents. Built on the public Availability, CDX, Solr, and Metadata APIs with no authentication required. Useful when you need to verify historical content, trace how pages changed over time, or retrieve public domain texts and documents programmatically.

@cyanheads/internet-archive-mcp-server

Search the Wayback Machine and IA library (40M+ items), fetch archived snapshots, retrieve item metadata and full text via MCP. STDIO or Streamable HTTP.

5 Tools • 1 Resource

Tools

Five tools covering two Internet Archive pillars — Wayback Machine snapshot discovery and retrieval, and IA library search and content access:

| Tool | Description |
| --- | --- |
| `ia_find_snapshots` | Find Wayback Machine snapshots of a URL. Mode `closest` returns the nearest capture to a given timestamp. Mode `history` returns the full capture list via CDX with date range, status, and MIME filters, collapsed by default to one capture per day. Supports resume-key pagination for large histories. |
| `ia_get_snapshot` | Fetch the archived content of a URL at a specific Wayback timestamp. Strips HTML to readable text and returns the canonical replay URL. |
| `ia_search_items` | Search the IA library (40M+ items). Filter by media type, collection, creator, date range, and language. Sort by relevance, date, or downloads. Returns identifiers, titles, types, and pagination context (`total_found`, `page`, `rows`). |
| `ia_get_item` | Retrieve full metadata and the file manifest for an Archive item by identifier: title, creator, description, subjects, collections, license, and every file with its format, size, and direct download URL. |
| `ia_get_text` | Retrieve readable OCR text (DjVuTXT or plain-text) from a text item. Length-aware truncation with continuation pointer (`char_offset`) for paging through large documents. |

`ia_find_snapshots`

Discover what the Wayback Machine has captured for any URL.

closest mode: single fast lookup via the Availability API — returns the nearest capture to a given timestamp
history mode: full capture list via the CDX API, filterable by date range (from/to), HTTP status (status_filter), and MIME type
Default collapse of timestamp:8 (one capture per day) keeps responses tractable for popular URLs; adjust with the collapse parameter (timestamp:N, N=1–14)
Resume-key pagination (resume_key) for stepping through large CDX histories without re-scanning
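
As a rough illustration of what a history query looks like at the CDX level, the sketch below builds a request URL and mimics the collapse behaviour locally. The parameter names (`from`, `to`, `filter`, `collapse`, `limit`, `showResumeKey`, `resumeKey`) are those of the public Wayback CDX server; the function names and defaults are assumptions made for this example, not the server's own code.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, date_from=None, date_to=None, status=None,
                  mime=None, collapse_digits=8, limit=100, resume_key=None):
    """Build a CDX history query URL. Nothing is fetched here."""
    if not 1 <= collapse_digits <= 14:
        raise ValueError("collapse_digits must be between 1 and 14")
    params = [
        ("url", url),
        ("output", "json"),
        ("limit", str(limit)),
        ("showResumeKey", "true"),
        # timestamp:8 compares only YYYYMMDD, i.e. one capture per day
        ("collapse", f"timestamp:{collapse_digits}"),
    ]
    if date_from:
        params.append(("from", date_from))
    if date_to:
        params.append(("to", date_to))
    if status:
        params.append(("filter", f"statuscode:{status}"))
    if mime:
        params.append(("filter", f"mimetype:{mime}"))
    if resume_key:
        params.append(("resumeKey", resume_key))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def collapse_by_prefix(timestamps, digits=8):
    """Local illustration of collapse: keep the first capture per prefix."""
    seen, kept = set(), []
    for ts in sorted(timestamps):
        if ts[:digits] not in seen:
            seen.add(ts[:digits])
            kept.append(ts)
    return kept
```

A smaller `digits` value groups more coarsely (4 keeps one capture per year), while 14 compares the full timestamp and so only merges exact duplicates.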

`ia_get_snapshot`

Retrieve what a page actually said at a point in time.

Resolves to the nearest available capture when the exact timestamp has no snapshot
Strips Wayback banner injections and extracts readable text — returns clean content alongside the canonical replay URL for browser access
Useful for fact-checking, citation verification, and tracing how content changed over time
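
The README does not say how the server removes the Wayback banner or reduces HTML to text. The sketch below shows one plausible approach under stated assumptions: Wayback's `id_` URL modifier requests the capture without replay rewriting, and a small standard-library parser drops script and style content. `replay_url` and `html_to_text` are hypothetical helper names.

```python
from html.parser import HTMLParser


def replay_url(timestamp, url, raw=False):
    """Wayback replay URL; raw=True adds the id_ modifier (unrewritten capture)."""
    flag = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flag}/{url}"


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse runs of whitespace inside each text node
            self.parts.append(" ".join(data.split()))


def html_to_text(html):
    """Return the readable text of an HTML document, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

The plain replay URL (without `id_`) is the one suited to opening in a browser, since it keeps the Wayback navigation toolbar.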

`ia_search_items`

Search across 40M+ Archive items by keyword and metadata filters.

Full-text Solr query syntax plus structured filters: mediatype (texts, audio, video, software, image), collection, creator, language, and date range
Sort by relevance, date added, or download count
Pagination via page and rows; output includes total_found and current page/rows so agents can paginate correctly without guessing
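
To make the filter and pagination behaviour concrete, here is a minimal sketch of how structured filters could be folded into a Solr query for the public `advancedsearch.php` endpoint, and how `total_found`, `page`, and `rows` tell a caller whether more pages remain. The helper names and the chosen `fl[]` fields are assumptions for illustration; the server's actual query construction may differ.

```python
import math
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_query(keywords, mediatype=None, collection=None, creator=None,
                language=None, date_from=None, date_to=None):
    """Combine free-text keywords with structured filters using AND."""
    clauses = [f"({keywords})"]
    if mediatype:
        clauses.append(f"mediatype:{mediatype}")
    if collection:
        clauses.append(f"collection:{collection}")
    if creator:
        clauses.append(f'creator:"{creator}"')
    if language:
        clauses.append(f"language:{language}")
    if date_from or date_to:
        # An open end of the range is written as * in Solr range syntax
        clauses.append(f"date:[{date_from or '*'} TO {date_to or '*'}]")
    return " AND ".join(clauses)


def build_search_url(query, page=1, rows=50, sort=None):
    """Build the search URL. Nothing is fetched here."""
    params = [("q", query), ("fl[]", "identifier"), ("fl[]", "title"),
              ("fl[]", "mediatype"), ("rows", str(rows)),
              ("page", str(page)), ("output", "json")]
    if sort:
        params.append(("sort[]", sort))  # e.g. "downloads desc"
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def has_more(total_found, page, rows):
    """True when at least one result lies beyond the current page."""
    return page * rows < total_found


def total_pages(total_found, rows):
    return math.ceil(total_found / rows)
```

With 120 matches and `rows=50`, `total_pages` gives 3, and `has_more` stays true until page 3 has been read.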

`ia_get_item`

Fetch the complete metadata and file manifest for any Archive item.

Returns structured fields: title, creator, description, subjects, collections, date, license, and more
files[] includes every file in the item with its format, size, and direct download URL — the primary way to act on a search result
An unknown identifier, for which the Metadata API returns an empty {} response, is surfaced as a typed item_not_found error
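
The mapping from a raw Metadata API payload to a file manifest can be sketched as follows. The `https://archive.org/download/{identifier}/{filename}` pattern is the Archive's public download URL form; the `ItemNotFound` class, the `parse_item` helper, and the sample payload in the usage note are illustrative assumptions rather than the server's real types or data.

```python
from urllib.parse import quote


class ItemNotFound(Exception):
    """Raised when the Metadata API returns an empty object."""

    reason = "item_not_found"


def parse_item(identifier, payload):
    """Normalize a Metadata API payload into a small manifest dict."""
    if not payload:
        # archive.org/metadata/<unknown-id> responds with {}
        raise ItemNotFound(identifier)
    meta = payload.get("metadata", {})
    files = [
        {
            "name": f["name"],
            "format": f.get("format"),
            # Sizes typically arrive as strings and may be absent
            "size": int(f["size"]) if "size" in f else None,
            "url": f"https://archive.org/download/{identifier}/{quote(f['name'])}",
        }
        for f in payload.get("files", [])
    ]
    return {"identifier": identifier, "title": meta.get("title"), "files": files}
```

For a made-up payload such as `{"metadata": {"title": "Example"}, "files": [{"name": "my book.pdf", "format": "Text PDF", "size": "1024"}]}`, the single manifest entry gets the URL `https://archive.org/download/example-item/my%20book.pdf` when the identifier is `example-item`.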

`ia_get_text`

Read the OCR text of public-domain books, documents, and transcripts.

Locates the best available text file in the item's manifest (DjVuTXT preferred, falls back to plain text)
max_chars and char_offset enable efficient paging through long documents without re-fetching
Surfaces download_forbidden (HTTP 403) as a typed error for restricted collections rather than failing silently
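
Paging with `max_chars` and `char_offset` amounts to slicing a long string and handing back the offset where the next call should resume. The sketch below shows that idea in isolation; the `next_char_offset` and `total_chars` field names are hypothetical and may not match the tool's actual output.

```python
def page_text(text, char_offset=0, max_chars=50000):
    """Return one page of text plus the offset where the next page starts."""
    if char_offset < 0 or max_chars <= 0:
        raise ValueError("char_offset must be >= 0 and max_chars > 0")
    chunk = text[char_offset:char_offset + max_chars]
    end = char_offset + len(chunk)
    return {
        "text": chunk,
        "char_offset": char_offset,
        # None signals that the document has been read to the end
        "next_char_offset": end if end < len(text) else None,
        "total_chars": len(text),
    }


def read_all(text, max_chars):
    """Walk a document page by page, following the continuation pointer."""
    offset, parts = 0, []
    while offset is not None:
        page = page_text(text, offset, max_chars)
        parts.append(page["text"])
        offset = page["next_char_offset"]
    return parts
```

A caller repeats the request with the returned offset until the pointer comes back empty, so no page is read twice and none is skipped.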

Resource

| Type | Name | Description |
| --- | --- | --- |
| Resource | `ia://item/{identifier}` | Metadata snapshot for an Archive item: title, creator, mediatype, description, subjects, collections, date, license, and file count. Stable URIs for injectable context. |

All resource data is also reachable via ia_get_item. The resource provides a stable, injectable URI for referencing a specific item across workflows.
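
For clients that build or validate these URIs themselves, a small sketch of the `ia://item/{identifier}` template might look like this. Both helper names are hypothetical, and the percent-encoding of identifiers is an assumption of the example, not something the README specifies.

```python
from urllib.parse import quote, unquote, urlparse


def item_uri(identifier):
    """Build the resource URI for an Archive item identifier."""
    return f"ia://item/{quote(identifier, safe='')}"


def parse_item_uri(uri):
    """Extract the identifier from an ia://item/{identifier} URI."""
    parts = urlparse(uri)
    identifier = unquote(parts.path.lstrip("/"))
    if (parts.scheme != "ia" or parts.netloc != "item"
            or not identifier or "/" in identifier):
        raise ValueError(f"not an ia://item/{{identifier}} URI: {uri!r}")
    return identifier
```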

Features

Built on @cyanheads/mcp-ts-core:

Declarative tool, resource, and prompt definitions — single file per primitive, framework handles registration and validation
Unified error handling — handlers throw, framework catches, classifies, and formats
Pluggable auth: none, jwt, oauth
Swappable storage backends: in-memory, filesystem, Supabase, Cloudflare KV/R2/D1
Structured logging with optional OpenTelemetry tracing
STDIO and Streamable HTTP transports

Internet Archive-specific:

No credentials required — all four APIs are public
Three service layers: WaybackService (Availability + CDX), ArchiveSearchService (Solr), ArchiveMetadataService (Metadata + downloads)
CDX collapse-by-day default and configurable limit keep responses tractable for high-capture URLs
Sends an identifying User-Agent header on every request, as required by IA's terms; configurable via IA_USER_AGENT

Agent-friendly output:

Pagination context on every list response — total_found, page, rows (search) and resume_key (CDX history) so agents never have to guess whether results are complete
Typed error reasons (no_snapshots, no_snapshot_available, item_not_found, no_text_file, download_forbidden) with recovery hints so callers can retry or explain to users without parsing text
Structured file manifests — every ia_get_item response includes file-level metadata (format, size, URL) enabling agents to select the right file without a follow-up call

Getting started

No API key required — the Internet Archive's APIs are fully public.

Add the following to your MCP client configuration file:

{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "bunx",
      "args": ["@cyanheads/internet-archive-mcp-server@latest"],
      "env": {
        "MCP_TRANSPORT_TYPE": "stdio",
        "MCP_LOG_LEVEL": "info"
      }
    }
  }
}

Or with npx (no Bun required):

{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@cyanheads/internet-archive-mcp-server@latest"],
      "env": {
        "MCP_TRANSPORT_TYPE": "stdio",
        "MCP_LOG_LEVEL": "info"
      }
    }
  }
}

Or with Docker:

{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "MCP_TRANSPORT_TYPE=stdio",
        "ghcr.io/cyanheads/internet-archive-mcp-server:latest"
      ]
    }
  }
}

For Streamable HTTP, set the transport and start the server:

MCP_TRANSPORT_TYPE=http MCP_HTTP_PORT=3010 bun run start:http
# Server listens at http://localhost:3010/mcp
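
Once the HTTP server is up, a client opens a session by POSTing a JSON-RPC `initialize` message to the endpoint. The sketch below only constructs that request so its shape is visible; it sends nothing. The protocol version string and client name are assumptions, and the versions this server negotiates are not stated in the README.

```python
import json
import urllib.request


def build_initialize_request(endpoint="http://localhost:3010/mcp"):
    """Construct (but do not send) an MCP initialize request."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumed spec revision; the server may negotiate a different one
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.0.1"},
        },
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP clients accept both plain JSON and SSE replies
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the call against a running server.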

Prerequisites

Bun v1.3.2 or higher (or Node.js v24+).
No external accounts or API keys required.

Installation

Clone the repository:

git clone https://github.com/cyanheads/internet-archive-mcp-server.git

Navigate into the directory:

cd internet-archive-mcp-server

Install dependencies:

bun install

Configure environment:

cp .env.example .env
# Optional: edit .env for custom User-Agent, timeouts, etc.

Configuration

All configuration is validated at startup via Zod schemas in src/config/server-config.ts.

| Variable | Description | Default |
| --- | --- | --- |
| `MCP_TRANSPORT_TYPE` | Transport: `stdio` or `http` | `stdio` |
| `MCP_HTTP_PORT` | HTTP server port | `3010` |
| `MCP_AUTH_MODE` | Auth mode: `none`, `jwt`, or `oauth` | `none` |
| `MCP_LOG_LEVEL` | Log level (`debug`, `info`, `notice`, `warning`, `error`) | `info` |
| `LOGS_DIR` | Directory for log files (Node.js only) | `<project-root>/logs` |
| `STORAGE_PROVIDER_TYPE` | Storage backend | `in-memory` |
| `OTEL_ENABLED` | Enable OpenTelemetry instrumentation | `false` |
| `IA_USER_AGENT` | Custom User-Agent for IA API requests | `internet-archive-mcp-server/{version} (github.com/cyanheads/internet-archive-mcp-server)` |
| `IA_REQUEST_TIMEOUT_MS` | HTTP request timeout in milliseconds | `30000` |
| `IA_MAX_SNAPSHOT_CHARS` | Default character cap for `ia_get_text` responses | `50000` |

See .env.example for the full list of optional overrides.
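
The server itself validates configuration with Zod in TypeScript. As a language-neutral illustration of the same fail-fast idea, the sketch below applies the documented defaults and rejects invalid values at load time. It covers only a subset of the variables, and `load_config` is a hypothetical name.

```python
import os

# Defaults taken from the configuration table
DEFAULTS = {
    "MCP_TRANSPORT_TYPE": "stdio",
    "MCP_HTTP_PORT": "3010",
    "MCP_AUTH_MODE": "none",
    "MCP_LOG_LEVEL": "info",
    "IA_REQUEST_TIMEOUT_MS": "30000",
    "IA_MAX_SNAPSHOT_CHARS": "50000",
}
ALLOWED = {
    "MCP_TRANSPORT_TYPE": {"stdio", "http"},
    "MCP_AUTH_MODE": {"none", "jwt", "oauth"},
    "MCP_LOG_LEVEL": {"debug", "info", "notice", "warning", "error"},
}
INTEGERS = ("MCP_HTTP_PORT", "IA_REQUEST_TIMEOUT_MS", "IA_MAX_SNAPSHOT_CHARS")


def load_config(env=None):
    """Read settings from env, apply defaults, and fail fast on bad values."""
    env = os.environ if env is None else env
    config = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    for key, allowed in ALLOWED.items():
        if config[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    for key in INTEGERS:
        value = int(config[key])  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"{key} must be a positive integer")
        config[key] = value
    return config
```

Validating once at startup means a mistyped transport or port is reported before any request is served, rather than surfacing later as an obscure runtime failure.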

Running the server

Local development

Build and run:

# One-time build
bun run rebuild

# Run the built server
bun run start:stdio
# or
bun run start:http

Run checks and tests:

bun run devcheck   # Lint, format, typecheck, security
bun run test       # Vitest test suite
bun run lint:mcp   # Validate MCP definitions against spec

Docker

docker build -t internet-archive-mcp-server .
docker run --rm -p 3010:3010 internet-archive-mcp-server

The Dockerfile defaults to HTTP transport, stateless session mode, and logs to /var/log/internet-archive-mcp-server. OpenTelemetry peer dependencies are installed by default — build with --build-arg OTEL_ENABLED=false to omit them.

Project structure

| Path | Purpose |
| --- | --- |
| `src/index.ts` | `createApp()` entry point: registers tools and the resource, and initializes services. |
| `src/config` | Server-specific environment variable parsing and validation with Zod. |
| `src/mcp-server/tools` | Tool definitions (`*.tool.ts`). Five tools across Wayback and IA library. |
| `src/mcp-server/resources` | Resource definitions. `ia://item/{identifier}` item metadata resource. |
| `src/services/wayback` | `WaybackService`: Availability API + CDX API client. |
| `src/services/archive-search` | `ArchiveSearchService`: Solr Advanced Search client. |
| `src/services/archive-metadata` | `ArchiveMetadataService`: Metadata API + file download client. |
| `tests/` | Unit and integration tests mirroring `src/`. |

Development guide

See CLAUDE.md for development guidelines and architectural rules. The short version:

Handlers throw, framework catches — no try/catch in tool logic
Use ctx.log for request-scoped logging, ctx.state for tenant-scoped storage
Register new tools and resources via the barrels in src/mcp-server/*/definitions/index.ts
Wrap external API calls: validate raw → normalize to domain type → return output schema; never fabricate missing fields

Contributing

Issues and pull requests are welcome. Run checks and tests before submitting:

bun run devcheck
bun run test

License

Apache-2.0 — see LICENSE for details.


Configuration

MCP_LOG_LEVEL (default: info)

Sets the minimum log level for output (e.g., 'debug', 'info', 'warning').

MCP_HTTP_HOST (default: 127.0.0.1)

The hostname for the HTTP server.

MCP_HTTP_PORT (default: 3010)

The port to run the HTTP server on.

MCP_HTTP_ENDPOINT_PATH (default: /mcp)

The endpoint path for the MCP server.

MCP_AUTH_MODE (default: none)

Authentication mode to use: 'none', 'jwt', or 'oauth'.

