This is your go-to when you're writing Spark jobs or debugging why your pipeline is spilling to disk. It walks you through the full workflow from analyzing requirements to validating in the Spark UI, with solid reference guides for DataFrames, RDDs, partitioning, and performance tuning. The examples are practical stuff you'll actually use: broadcast joins for dimension tables, salting patterns for skewed data, proper caching with `unpersist()`. What I appreciate is the constraints section that tells you the real gotchas, like not calling `collect()` on large datasets and why UDFs are 10-100x slower than built-ins. It assumes you know what Spark is but need help doing it right at scale.
npx skills add https://github.com/jeffallan/claude-skills --skill spark-engineer