jobd

STDIO · registry active

Summary

A job broker for running GPU workloads across a few personal machines without standing up Slurm or reaching for cloud orchestration. Exposes tools to submit jobs with VRAM requirements, check queue status, stream logs, and inspect completed runs. The broker routes each job to whichever worker has enough free VRAM, handles preemption with graceful checkpointing, and persists everything to SQLite so jobs survive across sessions. Workers poll the broker over HTTP rather than accepting inbound connections, which keeps the security surface small on a Tailscale mesh. Useful if you have a workstation and a server or two with GPUs and you want agents to be able to fire off training runs, watch them land on the right box, and get results back without manually ssh-ing around.

jobd

Python

A self-hostable, GPU-aware job broker for your own machines — with native MCP/agent integration.

Like task-spooler or pueue, but across all your machines — and VRAM-aware.

jobd in action: job fleet status shows four workers and their versions; a GPU job routes to the worker with enough free VRAM and streams back; a stdin batch submits two jobs at once; job logs -f follows one to completion

You have a couple of boxes with GPUs — a workstation, a server, maybe a laptop — wired together over Tailscale or a LAN. You want to fire off training runs, data pipelines, and long batch jobs from anywhere, have them land on whichever machine actually has the VRAM free, survive across sessions, and get preempted cleanly when something more important shows up. You don't have a cloud, a Kubernetes cluster, or a Slurm install, and you don't want one.

jobd is that missing piece: a small broker that turns a handful of personal machines into a single queue — and an LLM agent can drive it directly.

# from any machine on your tailnet:
job submit --project myproj --gpu --vram-required 16 --wait -- python train.py
# → routed to whichever worker has ≥16 GB VRAM free, streamed back to your terminal

Why it exists

Most schedulers assume a datacenter. The lightweight ones that don't (a bare nohup, a tmux session, an ssh-and-pray script) give you nothing: no queue, no VRAM-aware routing, no preemption, no record of what ran where. jobd fills the gap between "ssh in and run it" and "stand up Slurm":

VRAM-fit routing. The broker matches each job against live worker capacity (free VRAM / RAM / CPUs, capability tags, arch/OS) and dispatches to a worker that actually fits — instead of you guessing which box is free.
Preempt + checkpoint. A higher-priority job can preempt a running one: the worker sends SIGTERM, the workload gets a grace window to checkpoint, then SIGKILL. A preempted job reaches a terminal preempted state with a durable checkpoint to resume from — it isn't silently re-run. (See docs/preemption.md.)
Survives sessions. Submit, close your laptop, check back tomorrow. Jobs live in the broker, not your shell.
Agent-native. Ships a first-class MCP server so an LLM agent (Claude Code, etc.) can submit, monitor, and babysit jobs as tool calls — the thing most schedulers bolt on as an afterthought, if at all.
Yours. One broker process you run on a machine you own. No accounts, no egress, no per-GPU-hour billing. Tailnet-bound by default.
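The preempt + checkpoint contract above (SIGTERM, a grace window, then SIGKILL) leaves one obligation with the workload: notice SIGTERM and save state before the window closes. A minimal sketch of such a workload follows. The JSON checkpoint file, its path, the step counter, and every function name here are illustrative assumptions, not part of jobd.

```python
import json
import os
import signal
import tempfile
from pathlib import Path
from typing import Callable

# Set by the signal handler; the training loop polls it between steps.
_stop_requested = False


def _on_sigterm(signum, frame):
    """Only record the request; the checkpoint is written from the main loop."""
    global _stop_requested
    _stop_requested = True


def save_checkpoint(path: Path, state: dict) -> None:
    """Write atomically so a SIGKILL mid-write cannot corrupt the last good file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename within one filesystem


def train(path: Path, total_steps: int, step_fn: Callable[[int], None]) -> int:
    """Run until finished or preempted; return the next step that should run."""
    global _stop_requested
    _stop_requested = False
    signal.signal(signal.SIGTERM, _on_sigterm)
    # Resume from an earlier checkpoint if one exists (hypothetical format).
    start = json.loads(path.read_text())["step"] if path.exists() else 0
    for step in range(start, total_steps):
        step_fn(step)
        if _stop_requested:
            save_checkpoint(path, {"step": step + 1})
            return step + 1
    return total_steps
```

Keeping the handler trivial and checkpointing from the main loop avoids writing files from inside a signal handler. The step granularity has to be short enough that one step plus one checkpoint write fits inside the grace window.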

Why not just use…?

| Tool | What it gives you | Why jobd instead |
| --- | --- | --- |
| `nohup` / `tmux` / ssh-and-pray | Runs a command on one box | No queue, no VRAM-aware routing, no preemption, no record of what ran where |
| task-spooler | A real job queue, on a single machine | jobd queues across all your machines and routes by live VRAM/CPU fit |
| Pueue | The best single-machine command queue daemon | Pueue's own README declares distributed execution out of scope; jobd is that missing layer, plus GPU awareness |
| HyperQueue | Multi-machine task scheduling with HPC roots, single binary | HQ counts GPUs but doesn't track VRAM, and has no preemption/checkpoint contract or agent interface |
| Slurm | Datacenter-grade scheduling | Heavy to stand up and operate for 2–3 personal boxes; jobd is one process + a poller per host |
| SkyPilot / Modal / dstack | Provision and run on clouds + your own machines | SkyPilot's "existing machines" mode installs a k3s cluster on your boxes; dstack wants Docker + passwordless sudo on every host. jobd is one process + a poller: no containers, no sudo, no K8s |
| Ray | A distributed-compute framework | jobd is a job queue, not a programming model: submit any command, no code changes, GPU-fit routing built in |

Closest in spirit are Pueue and task-spooler (single-machine by design) and HyperQueue (multi-machine, HPC-shaped). jobd's niche is the 2–5-GPU homelab: multi-machine live VRAM-fit routing + preempt/checkpoint + a native MCP interface — a combination none of the above offers — with nothing heavier than a Python process per host.

Architecture

Architecture: job CLI, jobd-mcp MCP tools, and HTTP/SSE clients talk to the jobd broker (FastAPI — queue, VRAM matcher, priorities, SQLite) over one tailnet; the broker dispatches via long-poll claims and heartbeats to workers A (24 GB GPU), B (8 GB GPU), and C (CPU-only)

Diagram source (mermaid)

flowchart TD
    CLI["job CLI"]:::client --> B
    MCP["jobd-mcp<br/>MCP tools"]:::client --> B
    API["HTTP · SSE"]:::client --> B
    B["<b>jobd broker</b> — FastAPI<br/>queue · matcher · priorities · SQLite"]:::broker
    B <-->|poll · dispatch| WA["worker A<br/>24 GB GPU"]:::worker
    B <-->|poll · dispatch| WB["worker B<br/>8 GB GPU"]:::worker
    B <-->|poll · dispatch| WC["worker C<br/>CPU-only"]:::worker
    classDef client fill:#1f2937,stroke:#4b5563,color:#e5e7eb;
    classDef broker fill:#0e7490,stroke:#155e75,color:#ecfeff;
    classDef worker fill:#14532d,stroke:#166534,color:#dcfce7;

Workers poll the broker (pull model — no inbound connection to a worker); the broker matches each job against live capacity and hands it back on the poll. One broker process, one poller per host.

Broker — a FastAPI + SQLite service. Holds the queue, runs the matcher, resolves per-project priorities and defaults, exposes a small HTTP API and an SSE stream. Single source of truth.
Workers — lightweight polling agents, one per host. Each advertises live capacity via heartbeat, claims jobs it can run, executes them (shell=False, no shell-injection surface), streams logs back, and honors preemption signals.
Clients — the job CLI, the jobd-mcp MCP server, or anything that speaks the HTTP API.
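The pull model above can be pictured as one function the broker runs each time a worker polls: filter the queue to jobs that fit the worker's advertised capacity, then pick the highest-priority one. The sketch below is only an illustration of that idea. The dataclass fields, the tie-break rule, and the function names are assumptions and do not mirror jobd's real schema or matcher.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class WorkerAd:
    """Capacity a worker reports on heartbeat (field names are illustrative)."""
    host: str
    free_vram_gb: float
    free_ram_gb: float
    idle_cpus: int
    tags: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Job:
    id: int
    priority: int
    vram_gb: float = 0.0
    ram_gb: float = 0.0
    cpus: int = 1
    needs: frozenset = field(default_factory=frozenset)


def fits(job: Job, ad: WorkerAd) -> bool:
    """A job fits when every requested resource and capability tag is available."""
    return (
        job.vram_gb <= ad.free_vram_gb
        and job.ram_gb <= ad.free_ram_gb
        and job.cpus <= ad.idle_cpus
        and job.needs <= ad.tags  # subset test on capability tags
    )


def next_job_for(ad: WorkerAd, queue: list) -> Optional[Job]:
    """Called when a worker polls: hand back the best queued job that fits."""
    candidates = [job for job in queue if fits(job, ad)]
    # Highest priority first; ties go to the older (lower-id) job. Assumed policy.
    return min(candidates, key=lambda j: (-j.priority, j.id), default=None)
```

With the three workers from the diagram, a job that requests 16 GB of VRAM is only ever returned to worker A's poll, while a CPU-only job can be claimed by any of them.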

Install

pip install jobd               # broker + CLI
pip install "jobd[mcp]"        # adds the MCP server
pip install "jobd[worker]"     # adds the worker daemon (jobd-worker)

Requires Python ≥ 3.11. Everything ships in the one jobd package: the broker (jobd), the CLI (job), the MCP server (jobd-mcp), and the worker (jobd-worker). The worker's runtime deps (httpx, psutil, pyyaml, nvidia-ml-py) live behind the [worker] extra since they're only needed on machines that actually run jobs. scripts/install-worker.sh sets a worker up under ~/jobd-worker with its own venv and a generated config.

Quickstart (single host)

# 1. start the broker (binds 127.0.0.1:8765 by default)
JOBD_ALLOW_NO_AUTH=1 jobd          # no-auth is fine for a loopback-only broker

# 2. in another shell, install + start a worker pointed at it
pip install "jobd[worker]"
JOBD_URL=http://127.0.0.1:8765 JOBD_WORKER_HOST=local jobd-worker

# 3. submit a job and wait for it
job submit --project demo --wait -- echo hello
job list
job logs <id>

For a real multi-host deployment (Docker broker + systemd workers, Tailscale binding, shared auth token), see docs/security.md and the templates in docker-compose.yml and scripts/. Adding a worker to a running fleet is one command:

job fleet add user@newbox      # ssh in, install pinned to the broker's version,
                               # wire systemd units + the self-update timer,
                               # verify it registers. `job fleet status` shows drift.

Day-2 operations (health, draining a worker, upgrades, token rotation, backups) are in docs/runbook.md.

Supported platforms

Python 3.11+ everywhere.

| Component | Linux | macOS | Windows |
| --- | --- | --- | --- |
| Broker (`jobd`) | ✅ | ☑️ | ☑️ (WSL recommended) |
| CLI (`job`) / MCP (`jobd-mcp`) | ✅ | ☑️ | ☑️ |
| Worker (`jobd-worker`) | ✅ full | ⚠️ degraded | ⚠️ degraded |

✅ = CI-tested (the test matrix runs on Linux). ☑️ = pure-Python and expected to work, but not exercised by CI — please file an issue if something is broken there.

The worker runs best on Linux with a systemd user instance: memory caps, process reaping, and preemption use systemd-run --user scopes and cgroups. On non-systemd hosts the worker still executes jobs but silently drops those guarantees, which is fine for a single trusted box but not for hard resource isolation. GPU features need NVIDIA + nvidia-ml-py. The broker, CLI, and MCP server are pure-Python and portable.

CLI

job submit -p PROJ [--gpu] [--vram-required N] [--needs TAG]... [--count N | --sweep K=v1,v2]... [--wait] -- CMD...
job list [--state STATE] [--project P] [--array A<id>]   # queue + recent jobs
job status ID | A<id> [--watch]             # one job, or an array's aggregate
job logs ID [-f] [-n BYTES]                 # tail captured output; -f follows
job wait ID                                 # block until terminal
job cancel ID  /  job preempt ID            # stop a job
job workers                                 # fleet snapshot + health
job projects list | set NAME PRI | nudge NAME DELTA
job audit [--project P] [--since 24h]       # event history

job submit --explain dry-runs the resolution (priority, profile, project defaults, host pin) and prints the effective config without enqueuing anything.
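For scripts that drive the CLI rather than the MCP server, a small wrapper can assemble the `job submit` argument list. This is a sketch under the assumption that `job` is on PATH; it uses only flags from the synopsis above, and the helper names `submit_argv` and `submit` are hypothetical.

```python
import subprocess
from typing import Iterable, Optional, Sequence


def submit_argv(
    project: str,
    command: Sequence[str],
    *,
    gpu: bool = False,
    vram_required: Optional[int] = None,
    needs: Iterable[str] = (),
    wait: bool = False,
) -> list:
    """Build the argv for `job submit`; the command goes after `--` untouched."""
    argv = ["job", "submit", "-p", project]
    if gpu:
        argv.append("--gpu")
    if vram_required is not None:
        argv += ["--vram-required", str(vram_required)]
    for tag in needs:
        argv += ["--needs", tag]  # repeatable flag, one per capability tag
    if wait:
        argv.append("--wait")
    return argv + ["--", *command]


def submit(project: str, command: Sequence[str], **options) -> int:
    """Run the CLI and return its exit status (requires `job` to be installed)."""
    return subprocess.run(submit_argv(project, command, **options)).returncode
```

Passing the command as a list keeps it out of any shell, which matches the worker's own shell=False execution.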

Job arrays

Submit N jobs from one template with --count N. Each member is a normal job — it routes, runs, preempts, and checkpoints independently — and {i} in the command is replaced by the member's 0-based index:

job submit -p train --count 8 -- python train.py --fold {i}
# → Submitted array A42: 8 jobs (ids 42..49)

job list --array A42         # the members, with their index annotations
job status A42               # aggregate: state tally + per-member rollup

The array is identified as A<id> (the first member's job id). job status A42 exits non-zero if any member ended in a non-completed terminal state, so it composes with shell &&.
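The exit-status rule for an array can be stated in a few lines. The sketch below assumes a set of terminal state names: `completed` and `preempted` appear in this README, while `failed` and `cancelled` are guesses made for illustration, as is the function name `rollup`.

```python
from collections import Counter

# "completed" and "preempted" come from the README; the other names are assumed.
TERMINAL_STATES = {"completed", "failed", "cancelled", "preempted"}


def rollup(states: list) -> tuple:
    """Return (state tally, exit status) for an array's member states.

    The status is non-zero only when some member has already ended in a
    terminal state other than "completed"; members still queued or running
    do not make it non-zero by themselves.
    """
    tally = Counter(states)
    bad = any(s in TERMINAL_STATES and s != "completed" for s in states)
    return tally, (1 if bad else 0)
```

Under this reading, a shell chain such as `job status A42 && next-step` proceeds only when no member has failed, been cancelled, or been preempted.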

For a grid search, use --sweep KEY=v1,v2,v3 (repeatable) instead of --count. The broker fans out the cartesian product of all axes, substituting {KEY} per member; {i} (the flat member index) is also available:

job submit -p train --sweep lr=0.1,0.01 --sweep seed=1,2,3 \
  -- python train.py --lr {lr} --seed {seed} --out run-{i}
# → Submitted array A50: 6 jobs (ids 50..55)   # 2 × 3 = 6 members

--sweep and --count are mutually exclusive, the product is capped at 1000 members, and i is reserved as an axis key. Substitution is a literal {key} replace (not str.format), so JSON literals and shell braces in the command pass through untouched.
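The fan-out rules above (cartesian product, literal `{key}` replacement, reserved `i`, 1000-member cap) can be sketched as follows. This is not the broker's code: the function name is hypothetical, and the member ordering, with the last axis varying fastest, is an assumption.

```python
from itertools import product

MAX_MEMBERS = 1000  # cap stated in the README


def expand_sweep(command: list, axes: dict) -> list:
    """Expand a command template into one argv per point of the sweep grid.

    `axes` maps an axis key to its list of string values, in --sweep order.
    """
    if "i" in axes:
        raise ValueError("'i' is reserved for the flat member index")
    total = 1
    for values in axes.values():
        total *= len(values)
    if total > MAX_MEMBERS:
        raise ValueError(f"sweep has {total} members; the cap is {MAX_MEMBERS}")

    members = []
    for index, combo in enumerate(product(*axes.values())):
        subs = dict(zip(axes, combo))
        subs["i"] = str(index)
        argv = []
        for arg in command:
            # Literal replace, not str.format: unknown braces pass through.
            for key, value in subs.items():
                arg = arg.replace("{" + key + "}", value)
            argv.append(arg)
        members.append(argv)
    return members
```

Because only exact `{key}` tokens are replaced, an argument such as `{"a": 1}` or a shell brace expression is left untouched, which is the behaviour a `str.format` call would break.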

Coming from pueue or task-spooler?

The verbs map directly — what changes is that the queue spans every machine you own:

| You ran… | With jobd |
| --- | --- |
| `tsp <cmd>` / `pueue add -- <cmd>` | `job submit -p <project> -- <cmd>` |
| `tsp -w` / `pueue follow <id>` | `job logs -f <id>` (or `job wait <id>`): streams, exits with the job's own exit code |
| `tsp` / `pueue status` | `job list` |
| `pueue log <id>` | `job logs <id>` |
| commands piped to `simple_gpu_scheduler` | `... \| job submit -p <project> --stdin`: one job per line, fleet-wide |
| `pueue kill <id>` | `job cancel <id>` |
| `pueue group` / parallelism limits | projects + priorities (`projects.yaml`); per-worker slots via `JOBD_WORKER_MAX_CONCURRENT_JOBS` |

What you gain on top: jobs route to whichever machine actually has the VRAM/CPU free, survive any single box rebooting, can be preempted with a checkpoint window instead of killed, and are drivable by an LLM agent over MCP. What you lose: nothing — a one-machine deployment (broker + one worker on the same host) behaves like a network-reachable pueue.

MCP / agent integration

jobd ships an MCP server (jobd-mcp) exposing the queue as nine tools — jobd_submit, jobd_status, jobd_logs, jobd_list, jobd_cancel, jobd_preempt, jobd_events, jobd_workers, jobd_worker_delete. docs/agent-cookbook.md is the worked tour: fire-and-babysit polling, surviving preemption with checkpoints, sweeps, and asking the broker why a job won't schedule.

One-liner for Claude Code:

claude mcp add jobd --env JOBD_URL=http://127.0.0.1:8765 --env JOBD_API_TOKEN=<your-token> -- jobd-mcp

Or point any other MCP client at it:

{
  "mcpServers": {
    "jobd": {
      "command": "jobd-mcp",
      "env": {
        "JOBD_URL": "http://127.0.0.1:8765",
        "JOBD_API_TOKEN": "<your-token>"
      }
    }
  }
}

JOBD_API_TOKEN must match the broker's token, or every call returns 401. Omit it only when the broker runs with JOBD_ALLOW_NO_AUTH=1.

Now an agent can "run this overnight," check on it next session, and route GPU work through the broker instead of colliding on a shared card. The examples/claude-code-hooks/ directory has optional Claude Code hooks that nudge (or hard-block) an agent toward submitting heavy commands through jobd — including a VRAM-aware GPU guard with # NO_GPU / # CONCURRENT_OK / # VRAM=NGB override markers.

Configuration

Three optional YAML files under JOBD_CONFIG_DIR (defaults shipped in config/):

projects.yaml — per-project base priority and submit defaults (preemptibility, wall/idle timeouts, host pins, capability requirements). See docs/plans/projects-yaml.md for the full resolution model.
profiles.yaml — named resource bundles (--profile gpu-train-large) the matcher uses to size a job.
classifier.yaml — rules that auto-suggest a profile from the command string.

All three are optional; with none present, every job runs at the global default priority.

Everything else is environment variables — the complete JOBD_* catalog (broker, worker, CLI/MCP, and the vars provided to workloads) lives in docs/configuration.md, and a CI test keeps it in lockstep with the source in both directions.

Concurrency (multislotting)

By default each worker runs one job at a time (JOBD_WORKER_MAX_CONCURRENT_JOBS=1). Raise it to let a worker bin-pack several jobs that fit side by side:

JOBD_WORKER_MAX_CONCURRENT_JOBS=3 jobd-worker

The matcher is resource-aware, so this is not blind N-up oversubscription. Each in-flight job reserves its vram_gb / ram_gb / cpus footprint, and the worker's heartbeat advertises only what's left (free_vram = raw − Σ in-flight). The broker won't place a job that doesn't fit the remaining headroom. The practical payoff: a CPU-only job and a GPU job run at the same time. The CPU job reserves 0 VRAM, so it never blocks the GPU slot, and vice versa. Two GPU jobs co-run only if both fit live VRAM (the /next-job admission gate is the final safety net against an overstated capacity advertisement).
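The reservation arithmetic can be made concrete with a short sketch. It models only the formula quoted above (advertised headroom = raw capacity minus the sum of in-flight footprints) plus the slot limit; the `Footprint` type and function names are hypothetical, and a real worker would also consult live GPU readings.

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    """Resources a job reserves, or a worker's capacity (illustrative fields)."""
    vram_gb: float = 0.0
    ram_gb: float = 0.0
    cpus: int = 0


def headroom(raw: Footprint, in_flight: list) -> Footprint:
    """What the worker would advertise: raw capacity minus in-flight reservations."""
    return Footprint(
        vram_gb=raw.vram_gb - sum(j.vram_gb for j in in_flight),
        ram_gb=raw.ram_gb - sum(j.ram_gb for j in in_flight),
        cpus=raw.cpus - sum(j.cpus for j in in_flight),
    )


def admits(raw: Footprint, in_flight: list, job: Footprint, max_concurrent: int) -> bool:
    """True when a free slot exists and the job fits the remaining headroom."""
    if len(in_flight) >= max_concurrent:
        return False
    free = headroom(raw, in_flight)
    return (
        job.vram_gb <= free.vram_gb
        and job.ram_gb <= free.ram_gb
        and job.cpus <= free.cpus
    )
```

On a 24 GB worker already running a 16 GB job, a CPU-only job is admitted and an 8 GB GPU job still fits, but a 12 GB GPU job is refused until the first one finishes.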

job workers reports each worker's slot usage — running jobs out of max_concurrent — alongside the live resource ad:

// job workers
{ "host": "desktop", "state": "online", "running": 2, "max_concurrent": 3,
  "free_vram_gb": 9.1, "idle_cpus": 6, ... }

Set the limit per worker from its environment (systemd unit, shell, or worker.yaml env) — it's a worker-local knob, not a broker setting.

Retention

By default jobd keeps every job record and .log file forever — history is never lost. On a long-running broker, opt into pruning:

JOBD_JOB_RETENTION_DAYS=30 jobd   # delete terminal jobs + their logs after 30 days

The sweeper deletes jobs in a terminal state whose finished_at is older than the horizon, unlinks their per-job .log, and emits a jobs_pruned event. Freed SQLite pages are reused under WAL, so the DB file stays bounded without a global-locking VACUUM. The default (0) keeps everything; pruning old terminal parents is safe for any still-pending dependents.
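A retention sweep of this shape can be sketched against a toy schema. Everything structural here is assumed for illustration: the `jobs` table with `id`, `state`, and `finished_at` (Unix seconds) columns, the `<id>.log` naming, and the terminal state names other than `completed` and `preempted`. The `jobs_pruned` event is omitted.

```python
import sqlite3
import time
from pathlib import Path
from typing import Optional

# Assumed terminal state names; only "completed" and "preempted" are documented.
TERMINAL_STATES = ("completed", "failed", "cancelled", "preempted")


def prune(
    conn: sqlite3.Connection,
    log_dir: Path,
    retention_days: int,
    now: Optional[float] = None,
) -> list:
    """Delete terminal jobs older than the horizon and unlink their logs."""
    if retention_days <= 0:
        return []  # 0 means keep everything, matching the documented default
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    marks = ",".join("?" * len(TERMINAL_STATES))
    rows = conn.execute(
        f"SELECT id FROM jobs WHERE state IN ({marks}) "
        "AND finished_at IS NOT NULL AND finished_at < ? ORDER BY id",
        (*TERMINAL_STATES, cutoff),
    ).fetchall()
    ids = [row[0] for row in rows]
    with conn:  # one transaction for the whole batch
        conn.executemany("DELETE FROM jobs WHERE id = ?", [(i,) for i in ids])
    for job_id in ids:
        (log_dir / f"{job_id}.log").unlink(missing_ok=True)
    return ids
```

Rows are deleted before the log files are unlinked, so an interruption leaves at worst an orphaned log and never a job record that points at a missing file.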

Security

The broker's only authentication is a shared bearer token, so it is meant to run on a trusted network (loopback or a Tailscale tailnet), never on a public interface. Two stacked controls:

Interface binding — JOBD_HOST must be 127.0.0.1 or a Tailscale CGNAT address (100.64.0.0/10), never 0.0.0.0. A CI lint (tests/test_deploy_lint.py) enforces this on the Docker deployment.
Bearer token — set JOBD_API_TOKEN (≥32 random bytes) on every broker/worker/CLI/MCP host. The broker refuses to start without it unless you explicitly set JOBD_ALLOW_NO_AUTH=1. JOBD_ALLOW_NO_AUTH=1 is for a loopback-only broker (JOBD_HOST=127.0.0.1) — for local dev/tests. Combined with a non-loopback JOBD_HOST it exposes an unauthenticated RCE endpoint to your whole tailnet; the broker logs a startup warning if you do this. Don't.
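The two controls can be checked before a deployment goes live. The sketch below is a standalone pre-flight check, not jobd's own validation: the function names are hypothetical, and it accepts only literal IP addresses for the bind.

```python
import ipaddress
import secrets
from typing import Optional

LOOPBACK = ipaddress.ip_address("127.0.0.1")
TAILSCALE_CGNAT = ipaddress.ip_network("100.64.0.0/10")


def bind_is_allowed(host: str) -> bool:
    """True for 127.0.0.1 or an address inside 100.64.0.0/10."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames are rejected in this sketch
    return addr == LOOPBACK or (addr.version == 4 and addr in TAILSCALE_CGNAT)


def new_token() -> str:
    """32 random bytes from the OS CSPRNG, URL-safe base64 encoded."""
    return secrets.token_urlsafe(32)


def check_config(host: str, token: Optional[str], allow_no_auth: bool) -> list:
    """Return a list of problems; an empty list means the combination is sane."""
    problems = []
    if not bind_is_allowed(host):
        problems.append("JOBD_HOST must be 127.0.0.1 or a 100.64.0.0/10 address")
    if token is None:
        if not allow_no_auth:
            problems.append("JOBD_API_TOKEN is required unless JOBD_ALLOW_NO_AUTH=1")
        elif host != "127.0.0.1":
            problems.append("JOBD_ALLOW_NO_AUTH=1 is only safe on a loopback bind")
    return problems
```

A token produced by `new_token()` can be exported as JOBD_API_TOKEN on the broker and on every worker, CLI, and MCP host.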

Three endpoints skip authentication: /livez, /readyz and /metrics answer with no bearer token and no source-IP check, because a generic HTTP monitor cannot send a token. /metrics is the one that matters: it publishes the broker version, job counts by state, and every worker's hostname and version. It exposes no commands, cwd, env or project names, but it does fingerprint the fleet. That is why the JOBD_HOST bind above is load-bearing rather than defence-in-depth: port-forward the broker and you publish that inventory. Full table: Unauthenticated surface.

Full threat model, env-var reference, and token rotation: docs/security.md.

License

MIT — see LICENSE.

Configuration

JOBD_URL

Base URL of the jobd broker the MCP server talks to. Defaults to http://127.0.0.1:8765.

Registry: active

Package: jobd

Transport: STDIO

Updated: Jun 7, 2026
