This is a tier-1 reference for Apache Spark that covers the full stack: RDDs, DataFrames, Spark SQL, streaming, and MLlib. It walks through core concepts such as lazy evaluation, partitioning strategies, and the Catalyst optimizer, with practical guidance on when to cache data and how to choose between RDDs and DataFrames. The skill includes code examples for broadcast variables, accumulators, and connecting to various data sources. It is useful if you're processing terabyte-scale datasets across a cluster or building ETL pipelines that need distributed computing. The source is honest about what Spark isn't good for: small datasets under 100 GB, or ultra-low-latency workloads with requirements under 10 ms, where you'd want a specialized stream processor instead.
npx skills add https://github.com/manutej/luxor-claude-marketplace --skill apache-spark-data-processing