A distributed data processing library built on Ray for ML workloads that actually need to scale beyond a single machine. If you're loading 100GB+ datasets for training, running batch inference across a cluster, or preprocessing multi-modal data (images, video, audio), this handles the streaming execution and GPU acceleration. It integrates cleanly with PyTorch, TensorFlow, and Ray Train, so you can read from S3, transform in parallel, and feed directly into training loops. The API feels like pandas but executes lazily across hundreds of nodes. Companies like Pinterest and Spotify use it for production ML pipelines. Below 100GB you're probably better off with pandas or Dask, but once you need real distributed processing for ML, this is the practical choice in the Ray ecosystem.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill ray-data