Runpod Flash wraps serverless GPU/CPU deployment into a decorator. You write a function, add `@Endpoint(name="my-worker", gpu=GpuGroup.AMPERE_80, workers=5)`, and it provisions remote hardware automatically. Three modes: decorator style for queue-based jobs, multi-route APIs that share a worker pool, or client bindings for pre-built Docker images. The CLI handles local testing at localhost:8888 before deploy. Useful when you need to scale inference or training without managing infrastructure, though the 500MB artifact limit and cloudpickle requirement (imports must live inside functions) add friction. CPU instances are available if you just need parallel pandas work without spinning up a cluster.
npx -y skills add runpod/skills --skill flash --agent claude-code
Installs into .claude/skills of the current project.
Write code locally and test it with flash run (dev server at localhost:8888). On deploy, Flash provisions the remote GPUs/CPUs in the cloud automatically. The Endpoint class is the single entry point for all three modes.
pip install runpod-flash # requires Python >=3.10
# auth option 1: browser-based login (saves token locally)
flash login
# auth option 2: API key via environment variable
export RUNPOD_API_KEY=your_key
flash init my-project # scaffold a new project in ./my-project
flash run # start local dev server at localhost:8888
flash run --auto-provision # same, but pre-provision endpoints (no cold start)
flash build # package artifact for deployment (500MB limit)
flash build --exclude pkg1,pkg2 # exclude packages from build
flash deploy # build + deploy (auto-selects env if only one)
flash deploy --env staging # build + deploy to "staging" environment
flash deploy --app my-app --env prod # deploy a specific app to an environment
flash deploy --preview # build + launch local preview in Docker
flash env list # list deployment environments
flash env create staging # create "staging" environment
flash env get staging # show environment details + resources
flash env delete staging # delete environment + tear down resources
flash undeploy list # list all active endpoints
flash undeploy my-endpoint # remove a specific endpoint
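Auth option 2 above depends on RUNPOD_API_KEY being exported in the shell that runs the CLI. A minimal preflight sketch, assuming only the variable name given above; `has_api_key` is a hypothetical helper, not part of runpod-flash, and it cannot see a token saved by flash login:

```python
import os


def has_api_key(env=None):
    """Return True if RUNPOD_API_KEY is set to a non-empty value.

    Hypothetical helper: it only inspects the environment mapping and
    knows nothing about tokens stored by `flash login`.
    """
    env = os.environ if env is None else env
    return bool(env.get("RUNPOD_API_KEY", "").strip())


if __name__ == "__main__":
    if not has_api_key():
        print("RUNPOD_API_KEY is not set; export it or use `flash login`.")
```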
One function = one endpoint with its own workers.
from runpod_flash import Endpoint, GpuGroup
@Endpoint(name="my-worker", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"])
async def compute(data):
    import torch # MUST import inside function (cloudpickle)
    return {"sum": torch.tensor(data, device="cuda").sum().item()}
result = await compute([1, 2, 3])
Multiple HTTP routes share one pool of workers.
from runpod_flash import Endpoint, GpuGroup
api = Endpoint(name="my-api", gpu=GpuGroup.ADA_24, workers=(1, 5), dependencies=["torch"])
@api.post("/predict")
async def predict(data: list[float]):
    import torch
    return {"result": torch.tensor(data, device="cuda").sum().item()}
@api.get("/health")
async def health():
    return {"status": "ok"}
Deploy a pre-built Docker image and call it via HTTP.
from runpod_flash import Endpoint, GpuGroup, PodTemplate
server = Endpoint(
    name="my-server",
    image="my-org/my-image:latest",
    gpu=GpuGroup.AMPERE_80,
    workers=1,
    env={"HF_TOKEN": "xxx"},
    template=PodTemplate(containerDiskInGb=100),
)
# LB-style
result = await server.post("/v1/completions", {"prompt": "hello"})
models = await server.get("/v1/models")
# QB-style
job = await server.run({"prompt": "hello"})
await job.wait()
print(job.output)
Connect to an existing endpoint by ID (no provisioning):
ep = Endpoint(id="abc123")
job = await ep.runsync({"input": "hello"})
print(job.output)
| Parameters | Mode |
|---|---|
| name= only | Decorator (your code) |
| image= set | Client (deploys image, then HTTP calls) |
| id= set | Client (connects to existing, no provisioning) |
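The table above can be read as a small decision rule. The sketch below is illustrative only: `endpoint_mode` is a hypothetical function, not a runpod-flash API, and checking id= before image= is an assumption, since the table does not say what happens when both are set.

```python
def endpoint_mode(name=None, image=None, id=None):
    """Classify an Endpoint configuration according to the mode table.

    Hypothetical helper. Assumption: id= takes precedence over image=.
    """
    if id is not None:
        return "client (existing endpoint)"
    if image is not None:
        return "client (deploys image)"
    if name is not None:
        return "decorator"
    # The parameter reference marks name= as required unless id= is set.
    raise ValueError("name= is required unless id= is set")


if __name__ == "__main__":
    print(endpoint_mode(name="my-worker"))
    print(endpoint_mode(name="my-server", image="my-org/my-image:latest"))
    print(endpoint_mode(id="abc123"))
```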
Endpoint(
    name="endpoint-name", # required (unless id= set)
    id=None, # connect to existing endpoint
    gpu=GpuGroup.AMPERE_80, # single GPU type (default: ANY)
    gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80], # or list for auto-select by supply
    cpu=CpuInstanceType.CPU5C_4_8, # CPU type (mutually exclusive with gpu)
    workers=5, # shorthand for (0, 5)
    workers=(1, 5), # explicit (min, max)
    idle_timeout=60, # seconds before scale-down (default: 60)
    dependencies=["torch"], # pip packages for remote exec
    system_dependencies=["ffmpeg"], # apt-get packages
    image="org/image:tag", # pre-built Docker image (client mode)
    env={"KEY": "val"}, # environment variables
    volume=NetworkVolume(...), # persistent storage
    gpu_count=1, # GPUs per worker
    template=PodTemplate(containerDiskInGb=100),
    flashboot=True, # fast cold starts
    execution_timeout_ms=0, # max execution time (0 = unlimited)
)
- gpu= and cpu= are mutually exclusive
- workers=5 means (0, 5). Default is (0, 1)
- idle_timeout default is 60 seconds
- flashboot=True (default) -- enables fast cold starts via snapshot restore
- gpu_count -- GPUs per worker (default 1), use >1 for multi-GPU models

NetworkVolume(name="my-vol", size=100) # size in GB, default 100
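The workers shorthand and the gpu/cpu exclusivity rule noted above can be sketched as plain validation logic. Both functions are hypothetical illustrations of the documented rules, not code from runpod-flash, and the error messages are invented for the example.

```python
def normalize_workers(workers=None):
    """Expand the workers shorthand into an explicit (min, max) tuple.

    Rules taken from the notes above: the default is (0, 1) and an
    int N means (0, N). The range check is an added assumption.
    """
    if workers is None:
        return (0, 1)
    if isinstance(workers, int):
        return (0, workers)
    low, high = workers
    if low < 0 or high < low:
        raise ValueError("workers must satisfy 0 <= min <= max")
    return (low, high)


def check_compute(gpu=None, cpu=None):
    """Reject configurations that set both gpu= and cpu=."""
    if gpu is not None and cpu is not None:
        raise ValueError("gpu= and cpu= are mutually exclusive")


if __name__ == "__main__":
    print(normalize_workers(5))
    print(normalize_workers((1, 5)))
```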
PodTemplate(
    containerDiskInGb=64, # container disk size (default 64)
    dockerArgs="", # extra docker arguments
    ports="", # exposed ports
    startScript="", # script to run on start
)
Returned by ep.run() and ep.runsync() in client mode.
job = await ep.run({"data": [1, 2, 3]})
await job.wait(timeout=120) # poll until done
print(job.id, job.output, job.error, job.done)
await job.cancel()
| Enum | GPU | VRAM |
|---|---|---|
| ANY | any | varies |
| AMPERE_16 | RTX A4000 | 16GB |
| AMPERE_24 | RTX A5000/L4 | 24GB |
| AMPERE_48 | A40/A6000 | 48GB |
| AMPERE_80 | A100 | 80GB |
| ADA_24 | RTX 4090 | 24GB |
| ADA_32_PRO | RTX 5090 | 32GB |
| ADA_48_PRO | RTX 6000 Ada | 48GB |
| ADA_80_PRO | H100 PCIe (80GB) / H100 HBM3 (80GB) / H100 NVL (94GB) | 80GB+ |
| HOPPER_141 | H200 | 141GB |
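When choosing a GPU group for a model of known size, the table above can be turned into a lookup. This is a sketch built only from the VRAM column: the dictionary uses plain strings rather than the real GpuGroup enum, ADA_80_PRO is entered at its 80GB minimum, ANY is left out because its VRAM varies, and `groups_with_vram` is a hypothetical helper.

```python
# VRAM per group in GB, copied from the table above.
GPU_VRAM_GB = {
    "AMPERE_16": 16,
    "AMPERE_24": 24,
    "ADA_24": 24,
    "ADA_32_PRO": 32,
    "AMPERE_48": 48,
    "ADA_48_PRO": 48,
    "AMPERE_80": 80,
    "ADA_80_PRO": 80,  # listed as 80GB+; the minimum is used here
    "HOPPER_141": 141,
}


def groups_with_vram(min_gb):
    """Return group names with at least min_gb of VRAM, smallest first."""
    matches = [name for name, vram in GPU_VRAM_GB.items() if vram >= min_gb]
    # Sort by VRAM, then by name so ties come out in a stable order.
    return sorted(matches, key=lambda name: (GPU_VRAM_GB[name], name))


if __name__ == "__main__":
    print(groups_with_vram(40))
```

The resulting names could then be mapped to GpuGroup members and passed as a list to gpu=, which is the form the parameter reference describes for auto-selection by supply.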
| Enum | vCPU | RAM | Max Disk | Type |
|---|---|---|---|---|
| CPU3G_1_4 | 1 | 4GB | 10GB | General |
| CPU3G_2_8 | 2 | 8GB | 20GB | General |
| CPU3G_4_16 | 4 | 16GB | 40GB | General |
| CPU3G_8_32 | 8 | 32GB | 80GB | General |
| CPU3C_1_2 | 1 | 2GB | 10GB | Compute |
| CPU3C_2_4 | 2 | 4GB | 20GB | Compute |
| CPU3C_4_8 | 4 | 8GB | 40GB | Compute |
| CPU3C_8_16 | 8 | 16GB | 80GB | Compute |
| CPU5C_1_2 | 1 | 2GB | 15GB | Compute (5th gen) |
| CPU5C_2_4 | 2 | 4GB | 30GB | Compute (5th gen) |
| CPU5C_4_8 | 4 | 8GB | 60GB | Compute (5th gen) |
| CPU5C_8_16 | 8 | 16GB | 120GB | Compute (5th gen) |
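The enum names in the table follow a visible pattern: a generation digit, G or C for the family, then vCPU count and RAM in GB. The sketch below decodes that pattern from the name alone. `parse_cpu_type` is a hypothetical helper, and reading the digit as a hardware generation is an assumption that the table only confirms for the "5th gen" rows.

```python
import re

# Pattern inferred from the table: CPU<gen><G|C>_<vcpu>_<ram_gb>
_CPU_NAME = re.compile(r"CPU(\d)([GC])_(\d+)_(\d+)")


def parse_cpu_type(name):
    """Decode a CPU instance type name into its parts."""
    match = _CPU_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized CPU instance type: {name!r}")
    generation, family, vcpu, ram_gb = match.groups()
    return {
        "generation": int(generation),
        "family": "General" if family == "G" else "Compute",
        "vcpu": int(vcpu),
        "ram_gb": int(ram_gb),
    }


if __name__ == "__main__":
    print(parse_cpu_type("CPU5C_4_8"))
```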
from runpod_flash import Endpoint, CpuInstanceType
@Endpoint(name="cpu-work", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"])
async def process(data):
    import pandas as pd
    return pd.DataFrame(data).describe().to_dict()
from runpod_flash import Endpoint, GpuGroup, CpuInstanceType
@Endpoint(name="preprocess", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"])
async def preprocess(raw):
    import pandas as pd
    return pd.DataFrame(raw).to_dict("records")
@Endpoint(name="infer", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"])
async def infer(clean):
    import torch
    # float dtype is required: mean() is not defined for integer tensors
    t = torch.tensor([list(r.values()) for r in clean], dtype=torch.float32, device="cuda")
    return {"predictions": t.mean(dim=1).tolist()}
async def pipeline(data):
    return await infer(await preprocess(data))
import asyncio
results = await asyncio.gather(compute(a), compute(b), compute(c))
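The gather call above assumes `compute` and the inputs a, b, c already exist. The sketch below is a self-contained version of the same fan-out pattern that runs locally: `compute` here is a stand-in coroutine, not a Flash endpoint, and the semaphore limit of 5 is an assumption chosen to mirror workers=5 in the earlier examples.

```python
import asyncio


async def compute(data):
    """Local stand-in for a remote endpoint function; sums its input."""
    await asyncio.sleep(0)  # yield to the event loop, as a remote call would
    return {"sum": sum(data)}


async def fan_out(batches, limit=5):
    """Run compute() on every batch with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(batch):
        async with semaphore:
            return await compute(batch)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


if __name__ == "__main__":
    print(asyncio.run(fan_out([[1, 2], [3, 4], [5]])))
```

Capping in-flight calls is optional; it is included here on the assumption that submitting far more requests than there are workers mostly adds queueing.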
- Remote calls are async: call endpoint functions and client methods with await.
- List the pip packages the function needs in dependencies=[].
- image=/id= set = client mode. Otherwise = decorator mode.
- For GPU auto-selection, pass a list (gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80]) and set workers=5 or higher. The platform only auto-switches GPU types based on supply when max workers is at least 5.
- runsync timeout is 60s, and cold starts can exceed 60s. Use ep.runsync(data, timeout=120) for first requests or use ep.run() + job.wait() instead.
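The timeout advice above amounts to: try the synchronous path with a bounded wait and fall back to the queued path if a cold start overruns it. A minimal local sketch of that control flow using asyncio.wait_for follows. The three coroutines are stand-ins, not Flash calls, and `call_with_fallback` is a hypothetical helper.

```python
import asyncio


async def call_with_fallback(sync_call, queued_call, timeout=60.0):
    """Await sync_call(); if it exceeds `timeout` seconds, use queued_call()."""
    try:
        return await asyncio.wait_for(sync_call(), timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the local awaitable on timeout; fall back here.
        return await queued_call()


async def slow_sync():
    """Stand-in for a synchronous request that hits a cold start."""
    await asyncio.sleep(0.2)
    return "sync"


async def fast_sync():
    """Stand-in for a synchronous request served by a warm worker."""
    return "sync"


async def queued():
    """Stand-in for submitting a job and waiting for its output."""
    return "queued"


if __name__ == "__main__":
    print(asyncio.run(call_with_fallback(slow_sync, queued, timeout=0.01)))
    print(asyncio.run(call_with_fallback(fast_sync, queued, timeout=1.0)))
```

With a real endpoint, a request that timed out on the client may still be running remotely, so the fallback could execute the work twice unless the job is idempotent or cancelled.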