Paper Search (arXiv + Semantic Scholar + OpenAlex)

Summary

Three academic search engines behind one MCP interface. You get arXiv with full-text retrieval in markdown, HTML, or raw LaTeX source; Semantic Scholar's citation graph, author lookup, and recommendations; and OpenAlex's 316M-work cross-discipline catalog with institution and topic filters. The unified search_all tool fans out to all three, deduplicates by DOI or title, and re-ranks with Reciprocal Rank Fusion. It also includes image-to-LaTeX OCR for turning formula or table screenshots back into source code, backed by the PaddleOCR-VL, DeepSeek-OCR, and texify models. Reach for this when you need programmatic paper discovery plus the actual manuscript text, not just metadata. Hosted at latex-tools.online, or run it locally with FastMCP.

paper-mcp

Remotely-callable MCP server for academic paper search, full-text retrieval & image→LaTeX, served at https://latex-tools.online/mcp.

Three corpora behind one normalized interface:

arxiv (default) — search, metadata, and full-text (HTML / markdown / LaTeX source)
semanticscholar (alias s2) — the full S2 API surface: citation graph, authors, recommendations, full-text snippets, bulk datasets
openalex (alias oa) — 316M all-field works: citation graph, authors with h-index, institutions, topics, influence metrics

Plus a unified search_all that fuses all three corpora, image→LaTeX OCR, and LaTeX lint + PDF→text tooling.

Tools (41)

Generic / source-agnostic (8)

Tool	Purpose
`search_all(query, max_results=10, sources='arxiv,semanticscholar,openalex')`	Unified search. Fans out to all three corpora concurrently, de-duplicates the same work (by DOI/title) and re-ranks with Reciprocal Rank Fusion. Each hit carries `sources` (who found it) + an `ids` map for follow-up calls. Prefer this for broad lookups.
`search_papers(query, source='arxiv', max_results=10, sort_by='relevance')`	Single-corpus search. arXiv `query` accepts plain text or field syntax (`ti:` `au:` `cat:cs.CL` `abs:` + AND/OR).
`get_paper(paper_id, source='arxiv')`	One paper's full record. S2 id accepts S2 id / `DOI:` / `ARXIV:` / `CorpusId:`.
`search_by_author(author, source='arxiv')`	Papers by author, newest first.
`list_recent(category, source='arxiv')`	Latest in a category (arXiv code or S2 field of study).
`list_categories(source='arxiv')`	Common category codes.
`read_paper(paper_id, format='markdown')`	FULL text (arXiv). `markdown` = body with formulas as $LaTeX$ ; `html` = raw LaTeXML page; `latex` = original manuscript `.tex` source.
`list_paper_sources()`	Available corpora.
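
The fusion step behind search_all can be pictured with a minimal sketch. This is not the code in aggregate.py: the hit fields (`id`, `title`, `doi`), the helper names, and the constant `RRF_K = 60` are assumptions chosen for illustration. It only shows how de-duplication by DOI/title and Reciprocal Rank Fusion fit together.

```python
from collections import defaultdict

RRF_K = 60  # assumed smoothing constant; the server's actual value is not documented here


def dedup_key(hit: dict) -> str:
    """Prefer a lower-cased DOI; fall back to a case- and whitespace-folded title."""
    doi = (hit.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    return "title:" + " ".join((hit.get("title") or "").lower().split())


def fuse(ranked_lists: dict, max_results: int = 10) -> list:
    """Merge per-source ranked hit lists; each appearance adds 1 / (RRF_K + rank)."""
    scores = defaultdict(float)
    merged = {}
    for source, hits in ranked_lists.items():
        for rank, hit in enumerate(hits, start=1):
            key = dedup_key(hit)
            scores[key] += 1.0 / (RRF_K + rank)
            entry = merged.setdefault(
                key,
                {"title": hit.get("title"), "doi": hit.get("doi"), "sources": [], "ids": {}},
            )
            if source not in entry["sources"]:
                entry["sources"].append(source)
            entry["ids"][source] = hit.get("id")
    ordered = sorted(merged, key=lambda k: scores[k], reverse=True)
    return [dict(merged[k], score=scores[k]) for k in ordered[:max_results]]
```

A work found by several corpora accumulates one term per corpus, so agreement between sources pushes it up even when no single source ranks it first. In this sketch a record with a DOI and a record of the same work without one get different keys; a real implementation would need a second pass to reconcile them.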

read_paper fetch chain: arxiv.org/html/{id} → ar5iv fallback (markdown/html), or arxiv.org/e-print/{id} tarball main .tex (latex). Formulas are recovered from the LaTeXML alttext invariant.
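
As a rough illustration of the fetch chain above, the sketch below lists the URLs that would be tried, in order, for each format. The function name is hypothetical, and the ar5iv host (`ar5iv.labs.arxiv.org`) is an assumption, since the text only says "ar5iv fallback".

```python
def candidate_urls(paper_id: str, fmt: str = "markdown") -> list:
    """Return the URLs to try, in order, for a given read_paper format."""
    if fmt in ("markdown", "html"):
        return [
            f"https://arxiv.org/html/{paper_id}",            # official LaTeXML HTML
            f"https://ar5iv.labs.arxiv.org/html/{paper_id}",  # fallback (assumed host)
        ]
    if fmt == "latex":
        # Source tarball; the main .tex file still has to be located inside it.
        return [f"https://arxiv.org/e-print/{paper_id}"]
    raise ValueError(f"unsupported format: {fmt!r}")
```

A caller would walk the list and stop at the first URL that returns a usable page.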

Medical / evidence-graded (1)

Tool	Purpose
`search_medical(query, study_types='rct,meta-analysis,systematic-review', year_from=0, max_results=10, fetch_fulltext=True)`	Clinical literature search. Queries PubMed, filters by research type via Publication-Type tags and re-ranks by the evidence pyramid (meta-analysis / systematic review > RCT > cohort > ...), so real trials surface above high-cited reviews/guidelines that pure-citation ranking floats up. Open-access full text is attached from Europe PMC by PMID. If the type filter yields nothing it auto-relaxes (flagged `filter_relaxed`). `query` is English keyword/boolean text; do NL/multilingual query understanding upstream. Backed by NCBI E-utilities + Europe PMC (both free, no key required).
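
The evidence-pyramid re-ranking can be sketched as a stable sort on a tier number. The tier table, the `study_types` field on each hit, and the function name are assumptions for illustration. The server derives study types from PubMed Publication-Type tags, and its actual ordering rules may be finer-grained.

```python
# Lower tier number = stronger evidence. Values mirror the ordering in the text:
# meta-analysis / systematic review > RCT > cohort > everything else.
EVIDENCE_TIER = {"meta-analysis": 0, "systematic-review": 0, "rct": 1, "cohort": 2}
UNRANKED = 99  # reviews, guidelines, and anything unrecognized


def rerank_by_evidence(hits: list) -> list:
    """Stable sort: best tier first, upstream order preserved within a tier."""
    def tier(hit: dict) -> int:
        types = hit.get("study_types", [])
        return min((EVIDENCE_TIER.get(t, UNRANKED) for t in types), default=UNRANKED)

    return sorted(hits, key=tier)
```

Because Python's sort is stable, hits in the same tier keep whatever relevance order PubMed returned.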

Image → LaTeX (3)

Turn a formula or table image back into LaTeX (e.g. a figure cropped from a paper) without needing your own vision model. Backed by the co-located recognize service (PaddleOCR-VL / DeepSeek-OCR / texify).

Tool	Purpose
`recognize_formula(image_url=... or image_base64=..., model='deepseek-ocr')`	Formula image → LaTeX. `image_url` is downloaded server-side (with SSRF guards). Returns `{latex, model, elapsed_ms}`.
`recognize_table(image_url=... or image_base64=..., model='deepseek-ocr')`	Table image → LaTeX `tabular`.
`list_ocr_models()`	Available OCR models (`deepseek-ocr`, `paddleocr-vl`, `texify`).
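
For a client that is not using an MCP SDK, a recognize_formula call is an MCP `tools/call` JSON-RPC request. The sketch below only builds the request body from raw image bytes. The helper name is hypothetical, and it assumes `image_base64` takes plain base64 rather than a data URI, which the table does not specify. A real session also needs the MCP initialize handshake before this request is accepted.

```python
import base64
import json


def formula_request(image_bytes: bytes, model: str = "deepseek-ocr", request_id: int = 1) -> str:
    """Build the JSON-RPC body for a recognize_formula tool call."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "recognize_formula",
            "arguments": {
                # Assumption: plain base64 text, no "data:image/png;base64," prefix.
                "image_base64": base64.b64encode(image_bytes).decode("ascii"),
                "model": model,
            },
        },
    }
    return json.dumps(payload)
```

Passing `image_url` instead avoids shipping the bytes at all, at the cost of the server having to reach the URL.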

LaTeX tooling (3)

Companions to the LaTeX/PDF web tools at latex-tools.online — same backends, exposed over MCP.

Tool	Purpose
`lint_latex(code)`	Check a LaTeX snippet for errors and return an auto-fixed version. Returns `{errors, fixed_code, summary_en, summary_zh, elapsed_ms}`.
`extract_pdf(pdf_url=... or pdf_base64=..., formula=True, table=True)`	PDF → clean Markdown/LaTeX text via MinerU (useful for papers with no open-access full text). `pdf_url` is downloaded server-side (SSRF-guarded). Content-addressed + cached: a recently-seen or small PDF returns `content` in one call; a fresh PDF (MinerU is GPU-heavy, minutes) returns `status='running'` + a `task_id`.
`extract_pdf_result(task_id)`	Fetch an `extract_pdf` job by `task_id`. Returns `content` once `status='done'`; while `'running'`, `content` is null — call again shortly.
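
The two PDF tools form a submit-then-poll pair. A minimal client-side loop might look like the sketch below. `call_tool` is a hypothetical callable that stands in for whatever MCP client you use and returns the tool result as a dict. The poll interval and timeout are arbitrary choices, not server recommendations.

```python
import time
from typing import Callable


def wait_for_pdf(
    call_tool: Callable[[str, dict], dict],
    pdf_url: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a PDF and poll until extracted content is available."""
    result = call_tool("extract_pdf", {"pdf_url": pdf_url})
    task_id = result.get("task_id")  # only needed when the job is still running
    deadline = time.monotonic() + timeout_seconds
    while result.get("status") == "running":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extract_pdf task {task_id} still running")
        sleep(poll_seconds)
        result = call_tool("extract_pdf_result", {"task_id": task_id})
    content = result.get("content")
    if content is None:
        raise RuntimeError(f"extract_pdf finished without content: {result}")
    return content
```

A cached or small PDF returns `content` on the first call, so the loop body never runs in that case.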

OpenAlex (8)

Works: get_openalex_work · get_openalex_citations · get_openalex_references · search_openalex_works (filters: year range, open-access, min-citations, institution)
Authors/Institutions: search_openalex_authors · search_openalex_institutions
Analytics: get_openalex_trends · list_openalex_topics

Semantic Scholar (18)

Graph: get_paper_citations · get_paper_references · get_paper_authors
Lookup: match_paper_title · autocomplete_papers
Bulk: search_papers_bulk (≤1000, sortable, token paging) · get_papers_batch
Authors: search_authors · get_author · get_author_papers · get_authors_batch
Full-text: search_snippets (search inside paper body)
Recommend: recommend_papers_for_paper · recommend_papers_from_examples
Datasets: list_dataset_releases · get_dataset_release · get_dataset_download_links · get_dataset_diffs

Layout

paper_mcp/
  server.py            FastMCP server (tool registrations + instructions)
  models.py            normalized Paper model
  aggregate.py         cross-source fusion (dedup + Reciprocal Rank Fusion)
  sources/
    base.py            source registry (get_source / list_sources)
    arxiv.py           arXiv Atom API + read_paper (HTML/markdown/latex)
    semanticscholar.py Semantic Scholar full API surface
    openalex.py        OpenAlex REST API (works/authors/institutions/topics)
    recognize.py       image→LaTeX client over the co-located recognize service
    latextools.py      lint + PDF-extract clients over the latex-tools services
pyproject.toml

Run locally

cd paper-mcp
python -m venv .venv && . .venv/bin/activate
pip install -e .
PAPER_MCP_PORT=9400 python -m paper_mcp.server
# MCP endpoint at http://127.0.0.1:9400/mcp (JSON-RPC; a plain GET returns 406)

Env

Var	Default	Notes
`PAPER_MCP_HOST`	`127.0.0.1`
`PAPER_MCP_PORT`	`9400`
`PAPER_MCP_PATH`	`/mcp`
`SEMANTIC_SCHOLAR_API_KEY`	—	optional; raises S2 rate limit. Set via `/etc/paper-mcp.env` in prod.
`MCP_MAX_PER_HOUR`	`300`	Direct-client JSON-RPC POST budget per IP.
`MCP_WORKER_MAX_PER_HOUR`	`300`	Trusted reverse-proxy Worker budget per HMAC-derived connection key. Raw keys are not retained.
`MCP_WORKER_SHARED_MAX_PER_HOUR`	`2400`	Shared ceiling across all trusted Worker connections.
`MCP_RATE_COOLDOWN_SEC`	`300`	Minimum fast-rejection cooldown after a bucket reaches its limit.
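
One way to read the table above is as a settings object with defaults. The sketch below is illustrative and not taken from server.py: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    path: str
    s2_api_key: Optional[str]
    max_per_hour: int
    worker_max_per_hour: int
    worker_shared_max_per_hour: int
    rate_cooldown_sec: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        host=env.get("PAPER_MCP_HOST", "127.0.0.1"),
        port=int(env.get("PAPER_MCP_PORT", "9400")),
        path=env.get("PAPER_MCP_PATH", "/mcp"),
        s2_api_key=env.get("SEMANTIC_SCHOLAR_API_KEY") or None,  # optional
        max_per_hour=int(env.get("MCP_MAX_PER_HOUR", "300")),
        worker_max_per_hour=int(env.get("MCP_WORKER_MAX_PER_HOUR", "300")),
        worker_shared_max_per_hour=int(env.get("MCP_WORKER_SHARED_MAX_PER_HOUR", "2400")),
        rate_cooldown_sec=int(env.get("MCP_RATE_COOLDOWN_SEC", "300")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in a test.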

Deployment (latex-tools.online)

Runs as paper-mcp.service on tencent-us (43.130.32.180), WorkingDirectory /opt/paper-mcp, loopback port 9400.
nginx reverse-proxies https://latex-tools.online/mcp → 127.0.0.1:9400/mcp.
Worker-aware buckets activate only when a trusted reverse proxy overwrites X-MCP-Worker after validating the upstream platform. Never pass through a client-supplied value.
uvicorn access logging is disabled because legacy MCP clients may put connection keys and profiles in the endpoint URL. The reverse proxy must also log $uri, not $request, for the MCP route.
Secrets in /etc/paper-mcp.env (SEMANTIC_SCHOLAR_API_KEY).
Runtime systemd/nginx/env files are managed by the tencent-us operations backup, not by this source repository; never commit /etc/paper-mcp.env.

Update flow

This repo is the source of truth. The server runs an independent copy under /opt/paper-mcp (not auto-synced):

# edit here → push → deploy the complete canonical Python package
rsync -a --delete paper_mcp/ tencent-us:/opt/paper-mcp/paper_mcp/
ssh tencent-us 'systemctl restart paper-mcp'
ssh tencent-us 'curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9400/mcp'  # 406 = healthy (needs JSON-RPC handshake)

Production parity verified on 2026-07-23: main@07f6bbe8622aa063f56ee222a40d19c5d4264048 matches all 12 deployed Python source files byte-for-byte. The older copy embedded in latex-tools-deploy/paper-mcp/ is not a deployment source.

Notes

arXiv calls are politely rate-limited + retried (_USER_AGENT, backoff).
read_paper covers roughly 80% or more of papers via official HTML; older scan-only papers may have no full text.
Moved here from the docs repo on 2026-06-07; that copy is gone.

License

MIT © MCPServings. See LICENSE.
