Connects Claude to the Internet Archive's Wayback Machine and 40M+ item library through five tools covering snapshot discovery, content retrieval, metadata queries, and OCR text extraction. You can search the CDX API for capture histories with date and MIME filters, fetch archived page content stripped of banner injections, run filtered searches across books and media, pull complete file manifests with download URLs, and page through long OCR documents. Built on the public Availability, CDX, Solr, and Metadata APIs with no authentication required. Useful when you need to verify historical content, trace how pages changed over time, or retrieve public domain texts and documents programmatically.

| Variable | Description | Default |
|---|---|---|
| `MCP_LOG_LEVEL` | Sets the minimum log level for output (e.g., 'debug', 'info', 'warn'). | `info` |
| `MCP_HTTP_HOST` | The hostname for the HTTP server. | `127.0.0.1` |
| `MCP_HTTP_PORT` | The port to run the HTTP server on. | `3010` |
| `MCP_HTTP_ENDPOINT_PATH` | The endpoint path for the MCP server. | `/mcp` |
| `MCP_AUTH_MODE` | Authentication mode to use: 'none', 'jwt', or 'oauth'. | `none` |

Search the Wayback Machine and IA library (40M+ items), fetch archived snapshots, retrieve item metadata and full text via MCP. STDIO or Streamable HTTP.
Five tools covering two Internet Archive pillars — Wayback Machine snapshot discovery and retrieval, and IA library search and content access:
| Tool | Description |
|---|---|
| `ia_find_snapshots` | Find Wayback Machine snapshots of a URL. Mode `closest` returns the nearest capture to a given timestamp. Mode `history` returns the full capture list via CDX with date range, status, and MIME filters, collapsed by default to one capture per day. Supports resume-key pagination for large histories. |
| `ia_get_snapshot` | Fetch the archived content of a URL at a specific Wayback timestamp. Strips HTML to readable text and returns the canonical replay URL. |
| `ia_search_items` | Search the IA library (40M+ items). Filter by media type, collection, creator, date range, and language. Sort by relevance, date, or downloads. Returns identifiers, titles, types, and pagination context (`total_found`, `page`, `rows`). |
| `ia_get_item` | Retrieve full metadata and the file manifest for an Archive item by identifier: title, creator, description, subjects, collections, license, and every file with its format, size, and direct download URL. |
| `ia_get_text` | Retrieve readable OCR text (DjVuTXT or plain-text) from a text item. Length-aware truncation with continuation pointer (`char_offset`) for paging through large documents. |

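
As a rough illustration of how a client might invoke one of these tools, the sketch below builds an MCP `tools/call` request for `ia_find_snapshots`. The JSON-RPC envelope follows the MCP specification; the argument names `url`, `mode`, and `timestamp` are assumptions made for illustration and should be checked against the tool's published input schema.

```typescript
// JSON-RPC 2.0 envelope that MCP uses for tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build a tools/call request for one of the five tools in the table.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Assumed argument names (not confirmed by this README): url, mode, timestamp.
const closestLookup = buildToolCall(1, "ia_find_snapshots", {
  url: "example.com",
  mode: "closest",
  timestamp: "20200101",
});

console.log(JSON.stringify(closestLookup, null, 2));
```

In practice an MCP client library builds this envelope for you; the sketch only shows what travels over the wire.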
`ia_find_snapshots`

Discover what the Wayback Machine has captured for any URL.

- `closest` mode: single fast lookup via the Availability API; returns the nearest capture to a given timestamp
- `history` mode: full capture list via the CDX API, filterable by date range (`from`/`to`), HTTP status (`status_filter`), and MIME type
- Default collapse of `timestamp:8` (one capture per day) keeps responses tractable for popular URLs; adjust with the `collapse` parameter (`timestamp:N`, N=1–14)
- Resume-key pagination (`resume_key`) for stepping through large CDX histories without re-scanning

`ia_get_snapshot`

Retrieve what a page actually said at a point in time.
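
The history-mode filters and the replay URL map naturally onto the public Wayback endpoints. The following is a minimal sketch of that mapping, not the server's actual implementation: `HistoryQuery`, `buildCdxUrl`, and `replayUrl` are hypothetical names, and which parameters the server really sends is an assumption.

```typescript
// Hypothetical shape mirroring the history-mode options described above.
interface HistoryQuery {
  url: string;
  from?: string; // e.g. "2015" or "20150101"
  to?: string;
  statusFilter?: string; // e.g. "200"
  mimeType?: string; // e.g. "text/html"
  collapse?: string; // defaults to one capture per day
  limit?: number;
  resumeKey?: string;
}

// Translate the options into a CDX server query string.
function buildCdxUrl(q: HistoryQuery): string {
  const p = new URLSearchParams({
    url: q.url,
    output: "json",
    showResumeKey: "true",
  });
  if (q.from) p.set("from", q.from);
  if (q.to) p.set("to", q.to);
  // The CDX server accepts repeated filter=field:regex parameters.
  if (q.statusFilter) p.append("filter", `statuscode:${q.statusFilter}`);
  if (q.mimeType) p.append("filter", `mimetype:${q.mimeType}`);
  p.set("collapse", q.collapse ?? "timestamp:8");
  if (q.limit !== undefined) p.set("limit", String(q.limit));
  if (q.resumeKey) p.set("resumeKey", q.resumeKey);
  return `https://web.archive.org/cdx/search/cdx?${p.toString()}`;
}

// Replay URL for a capture at a given 14-digit Wayback timestamp.
function replayUrl(timestamp: string, url: string): string {
  return `https://web.archive.org/web/${timestamp}/${url}`;
}

const historyUrl = buildCdxUrl({
  url: "example.com",
  from: "2015",
  to: "2020",
  statusFilter: "200",
  mimeType: "text/html",
});
console.log(historyUrl);
console.log(replayUrl("20200101000000", "https://example.com/"));
```

Collapsing on the first eight timestamp digits (`YYYYMMDD`) is what yields one capture per day; a shorter prefix such as `timestamp:6` would collapse to one per month.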
`ia_search_items`

Search across 40M+ Archive items by keyword and metadata filters.

- Filter by `mediatype` (texts, audio, video, software, image), collection, creator, language, and date range
- Paginate with `page` and `rows`; output includes `total_found` and the current `page`/`rows` so agents can paginate correctly without guessing

`ia_get_item`

Fetch the complete metadata and file manifest for any Archive item.

- Returns title, creator, description, subjects, collections, date, license, and more
- `files[]` includes every file in the item with its format, size, and direct download URL, which is the primary way to act on a search result
- An empty metadata response (`{}`) for an unknown identifier becomes a typed `item_not_found` error

`ia_get_text`

Read the OCR text of public-domain books, documents, and transcripts.

- `max_chars` and `char_offset` enable efficient paging through long documents without re-fetching
- Surfaces `download_forbidden` (HTTP 403) as a typed error for restricted collections rather than failing silently

| Type | Name | Description |
|---|---|---|
| Resource | ia://item/{identifier} | Metadata snapshot for an Archive item — title, creator, mediatype, description, subjects, collections, date, license, and file count. Stable URIs for injectable context. |
All resource data is also reachable via ia_get_item. The resource provides a stable, injectable URI for referencing a specific item across workflows.
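
To make the URI template concrete, here is a small sketch that builds and parses `ia://item/{identifier}` URIs. The helper names are hypothetical, and the assumption that identifiers contain no slashes, whitespace, `?`, or `#` is mine rather than something this README states.

```typescript
// Matches ia://item/{identifier}; the identifier character set is an assumption.
const ITEM_URI = /^ia:\/\/item\/([^\/?#\s]+)$/;

// Build the resource URI for an Archive item identifier.
function itemUri(identifier: string): string {
  return `ia://item/${identifier}`;
}

// Extract the identifier, or return null if the URI does not match the template.
function parseItemUri(uri: string): string | null {
  const match = ITEM_URI.exec(uri);
  return match?.[1] ?? null;
}

// "example_item_id" is a placeholder, not a real Archive identifier.
const uri = itemUri("example_item_id");
console.log(uri, parseItemUri(uri));
```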
Built on @cyanheads/mcp-ts-core:
- Auth modes: `none`, `jwt`, `oauth`
- Storage backends: in-memory, filesystem, Supabase, Cloudflare KV/R2/D1

Internet Archive-specific:

- `WaybackService` (Availability + CDX), `ArchiveSearchService` (Solr), `ArchiveMetadataService` (Metadata + downloads)
- CDX collapse and `limit` keep responses tractable for high-capture URLs
- Custom User-Agent via `IA_USER_AGENT`

Agent-friendly output:

- Pagination context: `total_found`, `page`, `rows` (search) and `resume_key` (CDX history), so agents never have to guess whether results are complete
- Typed errors (`no_snapshots`, `no_snapshot_available`, `item_not_found`, `no_text_file`, `download_forbidden`) with recovery hints so callers can retry or explain to users without parsing text
- The `ia_get_item` response includes file-level metadata (format, size, URL), enabling agents to select the right file without a follow-up call

No API key required. The Internet Archive's APIs are fully public.
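
The pagination context listed under agent-friendly output can be consumed along these lines. This is a local sketch only: `hasNextPage`, `sliceText`, `readAll`, and the `next_char_offset` field are hypothetical, pages are assumed to be 1-indexed, and the slicing function merely imitates how `max_chars` and `char_offset` might behave.

```typescript
interface SearchPageContext {
  total_found: number;
  page: number;
  rows: number;
}

// True when results remain beyond the current page (assumes 1-indexed pages).
function hasNextPage(ctx: SearchPageContext): boolean {
  return ctx.page * ctx.rows < ctx.total_found;
}

interface TextChunk {
  text: string;
  next_char_offset: number | null; // hypothetical continuation pointer
}

// Local stand-in for a truncating text tool: return one window of the document.
function sliceText(full: string, maxChars: number, charOffset = 0): TextChunk {
  if (maxChars < 1) throw new Error("maxChars must be at least 1");
  const end = Math.min(full.length, charOffset + maxChars);
  return {
    text: full.slice(charOffset, end),
    next_char_offset: end < full.length ? end : null,
  };
}

// Follow the continuation pointer until the document is exhausted.
function readAll(full: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let offset: number | null = 0;
  while (offset !== null) {
    const chunk = sliceText(full, maxChars, offset);
    chunks.push(chunk.text);
    offset = chunk.next_char_offset;
  }
  return chunks;
}

console.log(hasNextPage({ total_found: 120, page: 2, rows: 50 })); // true
console.log(readAll("abcdefghij", 4)); // [ 'abcd', 'efgh', 'ij' ]
```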
Add the following to your MCP client configuration file:
```json
{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "bunx",
      "args": ["@cyanheads/internet-archive-mcp-server@latest"],
      "env": {
        "MCP_TRANSPORT_TYPE": "stdio",
        "MCP_LOG_LEVEL": "info"
      }
    }
  }
}
```
Or with npx (no Bun required):
```json
{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@cyanheads/internet-archive-mcp-server@latest"],
      "env": {
        "MCP_TRANSPORT_TYPE": "stdio",
        "MCP_LOG_LEVEL": "info"
      }
    }
  }
}
```
Or with Docker:
```json
{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "MCP_TRANSPORT_TYPE=stdio",
        "ghcr.io/cyanheads/internet-archive-mcp-server:latest"
      ]
    }
  }
}
```
For Streamable HTTP, set the transport and start the server:
```sh
MCP_TRANSPORT_TYPE=http MCP_HTTP_PORT=3010 bun run start:http
# Server listens at http://localhost:3010/mcp
```
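
For clients that need to derive the endpoint from the same environment variables, a sketch like the following may help. `mcpEndpoint` is a hypothetical helper, and it assumes the server combines `MCP_HTTP_HOST`, `MCP_HTTP_PORT`, and `MCP_HTTP_ENDPOINT_PATH` in the obvious way with the documented defaults.

```typescript
type Env = Record<string, string | undefined>;

// Combine host, port, and path, falling back to the documented defaults.
function mcpEndpoint(env: Env): string {
  const host = env["MCP_HTTP_HOST"] ?? "127.0.0.1";
  const port = env["MCP_HTTP_PORT"] ?? "3010";
  const path = env["MCP_HTTP_ENDPOINT_PATH"] ?? "/mcp";
  return new URL(path, `http://${host}:${port}`).toString();
}

console.log(mcpEndpoint({})); // http://127.0.0.1:3010/mcp
console.log(mcpEndpoint({ MCP_HTTP_PORT: "8080", MCP_HTTP_ENDPOINT_PATH: "/ia" }));
```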
To work from source, clone the repository and install dependencies:

```sh
git clone https://github.com/cyanheads/internet-archive-mcp-server.git
cd internet-archive-mcp-server
bun install
cp .env.example .env
# Optional: edit .env for custom User-Agent, timeouts, etc.
```
All configuration is validated at startup via Zod schemas in src/config/server-config.ts.
| Variable | Description | Default |
|---|---|---|
| `MCP_TRANSPORT_TYPE` | Transport: `stdio` or `http` | `stdio` |
| `MCP_HTTP_PORT` | HTTP server port | `3010` |
| `MCP_AUTH_MODE` | Auth mode: `none`, `jwt`, or `oauth` | `none` |
| `MCP_LOG_LEVEL` | Log level (`debug`, `info`, `notice`, `warning`, `error`) | `info` |
| `LOGS_DIR` | Directory for log files (Node.js only) | `<project-root>/logs` |
| `STORAGE_PROVIDER_TYPE` | Storage backend | `in-memory` |
| `OTEL_ENABLED` | Enable OpenTelemetry instrumentation | `false` |
| `IA_USER_AGENT` | Custom User-Agent for IA API requests | `internet-archive-mcp-server/{version} (github.com/cyanheads/internet-archive-mcp-server)` |
| `IA_REQUEST_TIMEOUT_MS` | HTTP request timeout in milliseconds | `30000` |
| `IA_MAX_SNAPSHOT_CHARS` | Default character cap for `ia_get_text` responses | `50000` |

See .env.example for the full list of optional overrides.
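
To illustrate the kind of startup validation described above, the sketch below checks a few of the variables from the table. It deliberately uses plain TypeScript instead of Zod so that it runs without dependencies, and it is not the contents of `src/config/server-config.ts`; `IaConfig`, `positiveInt`, and `loadConfig` are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

interface IaConfig {
  transport: "stdio" | "http";
  requestTimeoutMs: number;
  maxSnapshotChars: number;
}

// Parse a positive integer, using the fallback when the variable is unset.
function positiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

// Validate the environment once, failing fast on bad values.
function loadConfig(env: Env): IaConfig {
  const transport = env["MCP_TRANSPORT_TYPE"] ?? "stdio";
  if (transport === "stdio" || transport === "http") {
    return {
      transport,
      requestTimeoutMs: positiveInt("IA_REQUEST_TIMEOUT_MS", env["IA_REQUEST_TIMEOUT_MS"], 30000),
      maxSnapshotChars: positiveInt("IA_MAX_SNAPSHOT_CHARS", env["IA_MAX_SNAPSHOT_CHARS"], 50000),
    };
  }
  throw new Error(`MCP_TRANSPORT_TYPE must be "stdio" or "http", got "${transport}"`);
}

console.log(loadConfig({ MCP_TRANSPORT_TYPE: "http", IA_REQUEST_TIMEOUT_MS: "5000" }));
```

Failing at startup rather than on first use means a mistyped timeout or transport surfaces before any client connects.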
Build and run:
```sh
# One-time build
bun run rebuild

# Run the built server
bun run start:stdio
# or
bun run start:http
```
Run checks and tests:
```sh
bun run devcheck   # Lint, format, typecheck, security
bun run test       # Vitest test suite
bun run lint:mcp   # Validate MCP definitions against spec
```
Build and run the Docker image:

```sh
docker build -t internet-archive-mcp-server .
docker run --rm -p 3010:3010 internet-archive-mcp-server
```
The Dockerfile defaults to HTTP transport, stateless session mode, and logs to /var/log/internet-archive-mcp-server. OpenTelemetry peer dependencies are installed by default — build with --build-arg OTEL_ENABLED=false to omit them.
| Directory | Purpose |
|---|---|
| `src/index.ts` | `createApp()` entry point: registers tools and the resource, and initializes services. |
| `src/config` | Server-specific environment variable parsing and validation with Zod. |
| `src/mcp-server/tools` | Tool definitions (`*.tool.ts`). Five tools across Wayback and IA library. |
| `src/mcp-server/resources` | Resource definitions. `ia://item/{identifier}` item metadata resource. |
| `src/services/wayback` | `WaybackService`: Availability API + CDX API client. |
| `src/services/archive-search` | `ArchiveSearchService`: Solr Advanced Search client. |
| `src/services/archive-metadata` | `ArchiveMetadataService`: Metadata API + file download client. |
| `tests/` | Unit and integration tests mirroring `src/`. |

See CLAUDE.md for development guidelines and architectural rules. The short version:
- No `try/catch` in tool logic
- Use `ctx.log` for request-scoped logging, `ctx.state` for tenant-scoped storage
- Register new tools and resources in `src/mcp-server/*/definitions/index.ts`

Issues and pull requests are welcome. Run checks and tests before submitting:
```sh
bun run devcheck
bun run test
```
Apache-2.0 — see LICENSE for details.