This skill handles the full pipeline for training vision models on Hugging Face Jobs cloud GPUs: object detection (D-FINE, RT-DETR, DETR, YOLOS), image classification (timm models such as MobileNetV3, ResNet, ViT), and SAM/SAM2 segmentation. It includes dataset validation scripts that check format compatibility before you burn GPU credits, automatic bbox preprocessing that converts between COCO and Pascal VOC formats, and augmentation with Albumentations. The workflow has four steps: validate your dataset structure, choose between a quick test run and full training, submit the job with your Hub token passed as a secret, and get the trained model pushed to the Hub. One heads-up: the default 30-minute timeout is far too short for real training, so you'll need to raise it based on dataset size.
npx skills add https://github.com/huggingface/skills --skill huggingface-vision-trainer
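
To make the bbox preprocessing mentioned above concrete, here is a minimal sketch of the COCO/Pascal VOC conversion plus a basic sanity check. It is not the skill's actual code; the function names (`coco_to_voc`, `voc_to_coco`, `find_invalid_coco_boxes`) are hypothetical and chosen for illustration. The only facts it relies on are the two conventions themselves: COCO boxes are `[x_min, y_min, width, height]` and Pascal VOC boxes are `[x_min, y_min, x_max, y_max]`, both in absolute pixels.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def coco_to_voc(box: Box) -> Box:
    """Convert COCO [x_min, y_min, width, height] to VOC [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    return (x_min, y_min, x_min + width, y_min + height)


def voc_to_coco(box: Box) -> Box:
    """Convert VOC [x_min, y_min, x_max, y_max] to COCO [x_min, y_min, width, height]."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def find_invalid_coco_boxes(
    boxes: Sequence[Box], image_width: float, image_height: float
) -> List[int]:
    """Return indices of COCO boxes that are degenerate or fall outside the image.

    Hypothetical pre-flight check: a box is flagged if its width or height
    is not positive, or if any edge lies outside the image bounds.
    """
    invalid = []
    for index, (x_min, y_min, width, height) in enumerate(boxes):
        degenerate = width <= 0 or height <= 0
        out_of_bounds = (
            x_min < 0
            or y_min < 0
            or x_min + width > image_width
            or y_min + height > image_height
        )
        if degenerate or out_of_bounds:
            invalid.append(index)
    return invalid
```

Running a check like this locally before submitting a job is cheap compared with discovering malformed annotations partway through a paid GPU run.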