This is a practical implementation for crawling documentation sites and blogs while respecting robots.txt and rate limits. It combines trafilatura for clean content extraction with BeautifulSoup for structure parsing, then converts the result to markdown for RAG ingestion. The code handles common documentation frameworks such as Docusaurus and Sphinx, extracts sidebar navigation and prev/next links, and includes sitemap discovery. It also tracks visited URLs and content hashes to support incremental updates, which matters when you maintain a knowledge base that has to stay current. It includes a robots.txt checker, and the configurable rate limiting (1 second between requests by default) keeps the crawler from overloading the sites it visits.
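To make the robots.txt check, rate limiting, and content-hash tracking concrete, here is a minimal sketch using only the Python standard library. It is not the skill's actual code: the `CrawlState` class, its method names, the `example-crawler` user agent, and the choice of SHA-256 are all assumptions made for illustration. Only the 1-second default delay comes from the description above.

```python
import hashlib
import time
from urllib.robotparser import RobotFileParser


class CrawlState:
    """Hypothetical helper (not from the skill): politeness and change tracking."""

    def __init__(self, robots_lines, user_agent="example-crawler", delay=1.0):
        self.user_agent = user_agent      # assumed user-agent string
        self.delay = delay                # seconds between requests
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)   # parse robots.txt lines already in memory
        self.visited = set()              # URLs seen in this crawl
        self.hashes = {}                  # URL -> hex digest of last seen content
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until at least `delay` seconds have passed since the last call."""
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()

    def record(self, url, text):
        """Mark the URL visited; return True if its content is new or changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self.visited.add(url)
        changed = self.hashes.get(url) != digest
        self.hashes[url] = digest
        return changed
```

In a crawl loop, one plausible order would be to call `allowed(url)` first, then `wait_turn()` before the request, and finally `record(url, extracted_text)` to decide whether the page needs re-ingestion. Persisting `hashes` between runs (for example as JSON) is what would make updates incremental.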
npx skills add https://github.com/mindmorass/reflex --skill site-crawler