Flash

748 installs31 stars

Summary

Runpod Flash wraps serverless GPU/CPU deployment into a decorator. You write a function, add `@Endpoint(name="my-worker", gpu=GpuGroup.AMPERE_80, workers=5)`, and it provisions remote hardware automatically. Three modes: decorator style for queue-based jobs, multi-route APIs that share a worker pool, or client bindings for pre-built Docker images. The CLI handles local testing at localhost:8888 before deploy. Useful when you need to scale inference or training without managing infrastructure, though the 500MB artifact limit and cloudpickle requirement (imports must live inside functions) add friction. CPU instances are available if you just need parallel pandas work without spinning up a cluster.

Install to Claude Code

npx -y skills add runpod/skills --skill flash --agent claude-code

Installs into .claude/skills of the current project.

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

Put your SEO on autopilot

An agent that runs the SEO playbooks that move rankings and ships PRs you control.

Get founding access →

Vibe Prospecting MCP

Connect Claude to +800M contacts, +150M companies. Find & Enrich leads in chat.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

Put your SEO on autopilot

An agent that runs the SEO playbooks that move rankings and ships PRs you control.

Get founding access →

Vibe Prospecting MCP

Connect Claude to +800M contacts, +150M companies. Find & Enrich leads in chat.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Files

SKILL.mdView on GitHub

Featured

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

Put your SEO on autopilot

An agent that runs the SEO playbooks that move rankings and ships PRs you control.

Get founding access →

Vibe Prospecting MCP

Connect Claude to +800M contacts, +150M companies. Find & Enrich leads in chat.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

First SeenMay 16, 2026

View on GitHub

Runpod Flash

Write code locally, test with flash run (dev server at localhost:8888), and flash automatically provisions and deploys to remote GPUs/CPUs in the cloud. Endpoint handles everything.

Setup

pip install runpod-flash                 # requires Python >=3.10

# auth option 1: browser-based login (saves token locally)
flash login

# auth option 2: API key via environment variable
export RUNPOD_API_KEY=your_key

flash init my-project                    # scaffold a new project in ./my-project

CLI

flash run                                # start local dev server at localhost:8888
flash run --auto-provision               # same, but pre-provision endpoints (no cold start)
flash build                              # package artifact for deployment (500MB limit)
flash build --exclude pkg1,pkg2          # exclude packages from build
flash deploy                             # build + deploy (auto-selects env if only one)
flash deploy --env staging               # build + deploy to "staging" environment
flash deploy --app my-app --env prod     # deploy a specific app to an environment
flash deploy --preview                   # build + launch local preview in Docker
flash env list                           # list deployment environments
flash env create staging                 # create "staging" environment
flash env get staging                    # show environment details + resources
flash env delete staging                 # delete environment + tear down resources
flash undeploy list                      # list all active endpoints
flash undeploy my-endpoint               # remove a specific endpoint

Endpoint: Three Modes

Mode 1: Your Code (Queue-Based Decorator)

One function = one endpoint with its own workers.

from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="my-worker", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"])
async def compute(data):
    import torch  # MUST import inside function (cloudpickle)
    return {"sum": torch.tensor(data, device="cuda").sum().item()}

result = await compute([1, 2, 3])

Mode 2: Your Code (Load-Balanced Routes)

Multiple HTTP routes share one pool of workers.

from runpod_flash import Endpoint, GpuGroup

api = Endpoint(name="my-api", gpu=GpuGroup.ADA_24, workers=(1, 5), dependencies=["torch"])

@api.post("/predict")
async def predict(data: list[float]):
    import torch
    return {"result": torch.tensor(data, device="cuda").sum().item()}

@api.get("/health")
async def health():
    return {"status": "ok"}

Mode 3: External Image (Client)

Deploy a pre-built Docker image and call it via HTTP.

from runpod_flash import Endpoint, GpuGroup, PodTemplate

server = Endpoint(
    name="my-server",
    image="my-org/my-image:latest",
    gpu=GpuGroup.AMPERE_80,
    workers=1,
    env={"HF_TOKEN": "xxx"},
    template=PodTemplate(containerDiskInGb=100),
)

# LB-style
result = await server.post("/v1/completions", {"prompt": "hello"})
models = await server.get("/v1/models")

# QB-style
job = await server.run({"prompt": "hello"})
await job.wait()
print(job.output)

Connect to an existing endpoint by ID (no provisioning):

ep = Endpoint(id="abc123")
job = await ep.runsync({"input": "hello"})
print(job.output)

How Mode Is Determined

Parameters	Mode
`name=` only	Decorator (your code)
`image=` set	Client (deploys image, then HTTP calls)
`id=` set	Client (connects to existing, no provisioning)

Endpoint Constructor

Endpoint(
    name="endpoint-name",                  # required (unless id= set)
    id=None,                               # connect to existing endpoint
    gpu=GpuGroup.AMPERE_80,               # single GPU type (default: ANY)
    gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80],  # or list for auto-select by supply
    cpu=CpuInstanceType.CPU5C_4_8,        # CPU type (mutually exclusive with gpu)
    workers=5,                             # shorthand for (0, 5)
    workers=(1, 5),                        # explicit (min, max)
    idle_timeout=60,                       # seconds before scale-down (default: 60)
    dependencies=["torch"],                # pip packages for remote exec
    system_dependencies=["ffmpeg"],        # apt-get packages
    image="org/image:tag",                 # pre-built Docker image (client mode)
    env={"KEY": "val"},                    # environment variables
    volume=NetworkVolume(...),             # persistent storage
    gpu_count=1,                           # GPUs per worker
    template=PodTemplate(containerDiskInGb=100),
    flashboot=True,                        # fast cold starts
    execution_timeout_ms=0,                # max execution time (0 = unlimited)
)

gpu= and cpu= are mutually exclusive
workers=5 means (0, 5). Default is (0, 1)
idle_timeout default is 60 seconds
flashboot=True (default) -- enables fast cold starts via snapshot restore
gpu_count -- GPUs per worker (default 1), use >1 for multi-GPU models

NetworkVolume

NetworkVolume(name="my-vol", size=100)  # size in GB, default 100

PodTemplate

PodTemplate(
    containerDiskInGb=64,    # container disk size (default 64)
    dockerArgs="",           # extra docker arguments
    ports="",                # exposed ports
    startScript="",          # script to run on start
)

EndpointJob

Returned by ep.run() and ep.runsync() in client mode.

job = await ep.run({"data": [1, 2, 3]})
await job.wait(timeout=120)        # poll until done
print(job.id, job.output, job.error, job.done)
await job.cancel()

GPU Types (GpuGroup)

Enum	GPU	VRAM
`ANY`	any	varies
`AMPERE_16`	RTX A4000	16GB
`AMPERE_24`	RTX A5000/L4	24GB
`AMPERE_48`	A40/A6000	48GB
`AMPERE_80`	A100	80GB
`ADA_24`	RTX 4090	24GB
`ADA_32_PRO`	RTX 5090	32GB
`ADA_48_PRO`	RTX 6000 Ada	48GB
`ADA_80_PRO`	H100 PCIe (80GB) / H100 HBM3 (80GB) / H100 NVL (94GB)	80GB+
`HOPPER_141`	H200	141GB

CPU Types (CpuInstanceType)

Enum	vCPU	RAM	Max Disk	Type
`CPU3G_1_4`	1	4GB	10GB	General
`CPU3G_2_8`	2	8GB	20GB	General
`CPU3G_4_16`	4	16GB	40GB	General
`CPU3G_8_32`	8	32GB	80GB	General
`CPU3C_1_2`	1	2GB	10GB	Compute
`CPU3C_2_4`	2	4GB	20GB	Compute
`CPU3C_4_8`	4	8GB	40GB	Compute
`CPU3C_8_16`	8	16GB	80GB	Compute
`CPU5C_1_2`	1	2GB	15GB	Compute (5th gen)
`CPU5C_2_4`	2	4GB	30GB	Compute (5th gen)
`CPU5C_4_8`	4	8GB	60GB	Compute (5th gen)
`CPU5C_8_16`	8	16GB	120GB	Compute (5th gen)

from runpod_flash import Endpoint, CpuInstanceType

@Endpoint(name="cpu-work", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"])
async def process(data):
    import pandas as pd
    return pd.DataFrame(data).describe().to_dict()

Common Patterns

CPU + GPU Pipeline

from runpod_flash import Endpoint, GpuGroup, CpuInstanceType

@Endpoint(name="preprocess", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"])
async def preprocess(raw):
    import pandas as pd
    return pd.DataFrame(raw).to_dict("records")

@Endpoint(name="infer", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"])
async def infer(clean):
    import torch
    t = torch.tensor([[v for v in r.values()] for r in clean], device="cuda")
    return {"predictions": t.mean(dim=1).tolist()}

async def pipeline(data):
    return await infer(await preprocess(data))

Parallel Execution

import asyncio
results = await asyncio.gather(compute(a), compute(b), compute(c))

Gotchas

Imports outside function -- most common error. Everything inside the decorated function.
Forgetting await -- all decorated functions and client methods need await.
Missing dependencies -- must list in dependencies=[].
gpu/cpu are exclusive -- pick one per Endpoint.
idle_timeout is seconds -- default 60s, not minutes.
10MB payload limit -- pass URLs, not large objects.
Client vs decorator -- image=/id= = client. Otherwise = decorator.
Auto GPU switching requires workers >= 5 -- pass a list of GPU types (e.g. gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80]) and set workers=5 or higher. The platform only auto-switches GPU types based on supply when max workers is at least 5.
runsync timeout is 60s -- cold starts can exceed 60s. Use ep.runsync(data, timeout=120) for first requests or use ep.run() + job.wait() instead.

Flash

Install to Claude Code

Flash

Install to Claude Code

Runpod Flash

Setup

CLI

Endpoint: Three Modes

Mode 1: Your Code (Queue-Based Decorator)

Mode 2: Your Code (Load-Balanced Routes)

Mode 3: External Image (Client)

How Mode Is Determined

Endpoint Constructor

NetworkVolume

PodTemplate

EndpointJob

GPU Types (GpuGroup)

CPU Types (CpuInstanceType)

Common Patterns

CPU + GPU Pipeline

Parallel Execution

Gotchas

Recommended

Runpod Flash

Setup

CLI

Endpoint: Three Modes

Mode 1: Your Code (Queue-Based Decorator)

Mode 2: Your Code (Load-Balanced Routes)

Mode 3: External Image (Client)

How Mode Is Determined

Endpoint Constructor

NetworkVolume

PodTemplate

EndpointJob

GPU Types (GpuGroup)

CPU Types (CpuInstanceType)

Common Patterns

CPU + GPU Pipeline

Parallel Execution

Gotchas

Recommended