Exposes web scraping through MCP tools that handle everything from plain crawls to validated structured extraction. You get markdown conversion, PDF and DOCX parsing, sitemap walking, and search-to-scrape via Google. The interesting piece is the contract system: pick a contract (real-estate-listing or product-page), point it at your LLM endpoint, and get back typed data plus a validation object that tells you which required fields are missing. Diagnostic mode scores source readiness before you burn tokens. Anti-bot detection flags challenge pages instead of returning garbage HTML, and adaptive crawling tries fast Cheerio first, then escalates to Playwright. Structured error enums let agents branch on DNS failures versus rate limits versus 5xx responses without parsing strings. It runs on Apify at half a cent per page, or you can self-host the engine from GitHub.
claude mcp add --transport stdio manchittlab-thecrawler -- npx -y thecrawler
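
To make the error-enum and validation-object ideas concrete, here is a rough TypeScript sketch of how an agent might branch on a tool result. Every name in it is an assumption made for illustration: the `ErrorCode` values, the `missingRequired` field, the `ScrapeResult` shape, and the `decide` helper are hypothetical and are not taken from thecrawler's actual schema, so check the real tool output before relying on any of them.

```typescript
// Hypothetical error codes; the real enum values may be named differently.
type ErrorCode = "DNS_FAILURE" | "RATE_LIMITED" | "HTTP_5XX" | "BOT_CHALLENGE";

// Hypothetical validation object listing required fields the contract did not get.
interface Validation {
  valid: boolean;
  missingRequired: string[];
}

// Hypothetical result shape: either typed data plus validation, or a structured error.
type ScrapeResult =
  | { ok: true; data: Record<string, unknown>; validation: Validation }
  | { ok: false; error: { code: ErrorCode; message: string } };

type Action = "accept" | "re_extract" | "retry_later" | "retry_with_browser" | "give_up";

// Choose the next step from the enum and the validation object, never from message text.
function decide(result: ScrapeResult): Action {
  if (result.ok) {
    // Typed data came back; accept it only if no required field is missing.
    return result.validation.missingRequired.length === 0 ? "accept" : "re_extract";
  }
  switch (result.error.code) {
    case "RATE_LIMITED":
    case "HTTP_5XX":
      return "retry_later"; // likely transient, so back off and try again
    case "BOT_CHALLENGE":
      return "retry_with_browser"; // a challenge page was flagged
    case "DNS_FAILURE":
      return "give_up"; // the host does not resolve
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a code is unhandled.
      const unhandled: never = result.error.code;
      throw new Error(`Unhandled error code: ${String(unhandled)}`);
    }
  }
}

// Example: a product-page result with a missing required field.
const partial: ScrapeResult = {
  ok: true,
  data: { title: "Example product" },
  validation: { valid: false, missingRequired: ["price"] },
};
console.log(decide(partial));
```

The point of the discriminated union is that adding a new error code forces every caller to handle it at compile time, which is the same benefit the structured enums give an agent at runtime. The mapping from codes to actions is only one possible policy.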