Connects Claude to the Foresea forecasting API, which runs a 120B-parameter model trained on Metaculus-style questions. You get structured predictions with confidence scores, automatic evidence retrieval from GDELT and news sources, and market-vs-model edge analysis for Polymarket and Kalshi. The server exposes binary and multiple-choice forecasts, returns both the prediction and the ranked evidence articles that informed it, and calculates whether the model is bullish or bearish relative to current market prices. Useful when you need probabilistic forecasts grounded in recent news or want to scan prediction markets for potential mispricings. The underlying research studies how explicit reasoning instructions affect LLM forecast accuracy.
Conference artifact for studying how explicit rationale instructions affect LLM forecasting behavior on Metaculus-style binary forecasting questions. The codebase contains the prompt variants, batch inference runner, generated result tables, and plotting/analysis scripts used for the paper figures. The live Foresea API also supports prediction-market intelligence: typed forecasts, evidence retrieval, and model-vs-market edge analysis for binary and multiple-choice markets.
Deployed on Google Cloud Run — model gpt-oss-120b, variant variant0_neutral_baseline:
https://foresea.ink
(The URL is printed in the GitHub Actions deploy-step output after the first push to main.)
# Health check
curl https://foresea.ink/health
# Single-record prediction
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will X happen by date Y?",
"question_type": "binary",
"description": "Context here.",
"news_articles": [],
"attach_evidence": true,
"evidence_top_k": 5,
"market_platform": "Polymarket",
"market_probability": 0.42,
"variant": "variant0_neutral_baseline"
}'
When attach_evidence is true and no news_articles are supplied, /predict
fetches and ranks current news evidence from GDELT, Google News RSS, and Stooq by
default, injects it into the model prompt, and returns the selected
evidence_articles with the forecast. Supplying news_articles skips automatic
retrieval and uses the caller-provided evidence.
The response includes both the forecast and the evidence used by the model:
{
"question_type": "binary",
"predicted_answer": "Yes",
"confidence": 0.86,
"options": [],
"range_forecast": null,
"rationale": "Model-generated explanation for the forecast.",
"model_rationale": "Model-generated explanation for the forecast.",
"variant": "variant0_neutral_baseline",
"model_key": "gpt-oss-120b",
"evidence_sources": [
{
"source": "Reuters",
"title": "Article headline",
"url": "https://example.com/article",
"publish_date": "2026-05-29T00:00:00Z",
"relevance_score": 0.82
}
],
"evidence_articles": [
{
"title": "Article headline",
"summary": "Cleaned article summary.",
"source": "Reuters",
"url": "https://example.com/article",
"publish_date": "2026-05-29T00:00:00Z",
"relevance_score": 0.82,
"search_query": "query used for retrieval"
}
],
"evidence_error": null,
"market_analysis": {
"platform": "Polymarket",
"market_url": "https://example.com/market",
"outcome": "Yes",
"market_probability": 0.42,
"model_probability": 0.86,
"edge": 0.44,
"stance": "model_above_market",
"summary": "Foresea is 44 percentage points above the market on Yes."
}
}
Use evidence_sources when a client only needs the source list and links. Use
evidence_articles when a client needs the article-level details that were
attached to the model prompt. rationale and model_rationale are generated by
gpt-oss-120b and explain why the model chose its answer and confidence.
When market_probability is supplied, market_analysis is computed
deterministically from the model probability and the market-implied probability.
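As a rough illustration of that deterministic step, here is a minimal Python sketch. The function names, the rule that values above 1 are treated as percentages, and every stance label other than `model_above_market` are assumptions for illustration; only the subtraction of market probability from model probability and the summary wording are taken from the example response.

```python
def normalize_probability(value: float) -> float:
    """Accept 0.42 or 42 and return a probability in [0, 1] (assumed rule)."""
    p = value / 100.0 if value > 1 else value
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {value}")
    return p


def market_analysis(model_probability: float, market_probability: float,
                    outcome: str = "Yes") -> dict:
    """Compare a model probability with a market-implied probability."""
    market = normalize_probability(market_probability)
    # Round to avoid float noise such as 0.43999999999999995.
    edge = round(model_probability - market, 4)
    points = round(abs(edge) * 100)
    # Stance labels other than model_above_market are hypothetical.
    if edge > 0:
        stance = "model_above_market"
        summary = f"Foresea is {points} percentage points above the market on {outcome}."
    elif edge < 0:
        stance = "model_below_market"
        summary = f"Foresea is {points} percentage points below the market on {outcome}."
    else:
        stance = "model_matches_market"
        summary = f"Foresea matches the market on {outcome}."
    return {
        "outcome": outcome,
        "market_probability": market,
        "model_probability": model_probability,
        "edge": edge,
        "stance": stance,
        "summary": summary,
    }


print(market_analysis(0.86, 42)["summary"])
```

Because the calculation has no model call in it, a client can recompute the edge locally to sanity-check a response.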
Production is served from the custom domain:
https://foresea.ink
The Cloud Run service name, project ID, and region are set at deploy time via gcloud run deploy.
Required runtime environment:
- SCADS_AI_API_KEY: Secret Manager secret used by hosted model calls.
- MODEL_DEVICE=cpu: production Cloud Run runs the CPU image.
- CUSTOM_DOMAIN=foresea.ink: redirects *.run.app requests to the public domain.
- GOOGLE_CLIENT_ID: Google OAuth web client ID used by /auth/config.
- GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET: GitHub OAuth app credentials. The OAuth app's callback URL must be the site origin (e.g. https://foresea.ink/). When unset, the "Continue with GitHub" button is hidden and /auth/github returns 503. Sign-in also works with Google and email/password.
- SESSION_SECRET: long random string used to sign browser session JWTs.
The OAuth client must allow these JavaScript origins:
https://foresea.ink
https://www.foresea.ink
https://<cloud-run-service-url>.run.app
To update non-secret environment variables without replacing the existing
SESSION_SECRET, use --update-env-vars:
gcloud run services update <service-name> \
--region <region> \
--project <project-id> \
--update-env-vars MODEL_DEVICE=cpu,CUSTOM_DOMAIN=foresea.ink,GOOGLE_CLIENT_ID='<your-google-client-id>'
Verify the deployed auth config and health endpoint:
curl https://foresea.ink/auth/config
curl https://foresea.ink/health
The server is built to scale horizontally on Cloud Run:
- Email/password accounts (/auth/register, /auth/login). Passwords are stored as salted PBKDF2-HMAC-SHA256 hashes; accounts live in Cloud Datastore.
- Rate limits and caches use Redis when REDIS_URL is set, so they are shared across instances; otherwise they fall back to per-instance in-memory state and fail open. /predict (non-personalised requests), evidence retrieval, and /extract URL fetches are cached; public GETs send Cache-Control.
| Var | Default | Description |
|---|---|---|
| REDIS_URL | unset | Memorystore/Redis URL. Shares cache + rate limits across instances. |
| PREDICT_CACHE_TTL | 600 | Cache TTL (s) for non-personalised /predict responses. 0 disables. |
| EVIDENCE_CACHE_TTL | 900 | Cache TTL (s) for evidence retrieval. |
| EXTRACT_CACHE_TTL | 3600 | Cache TTL (s) for /extract URL fetches. |
| LOCAL_CACHE_MAX | 1024 | Max entries in the in-memory fallback cache. |
| SEARXNG_URL / TAVILY_API_KEY / SERPER_API_KEY / BRAVE_API_KEY | unset | Enable web search as an evidence source. A self-hosted SearXNG is preferred when set, then Tavily, Serper, Brave. Tavily/Serper have free no-card tiers. When none is set, evidence comes from GDELT, Google News, and RSS. |
| NEWSAPI_KEY | unset | Enables NewsAPI as an evidence source. |
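The TTL and size settings in the table can be pictured with a small in-memory cache. This is a sketch under stated assumptions, not the server's code: the class name `TTLCache`, the injectable clock, and the least-recently-used eviction policy are all hypothetical; the table only says that a TTL of 0 disables the /predict cache and that the fallback cache has a maximum entry count.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Bounded in-memory cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_s: float, max_entries: int = 1024, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        if self.ttl_s <= 0:  # a TTL of 0 disables caching
            return None
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # drop the stale entry
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        if self.ttl_s <= 0:
            return
        self._data[key] = (self._clock() + self.ttl_s, value)
        self._data.move_to_end(key)
        # Assumed policy: evict the least recently used entry when full.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)


cache = TTLCache(ttl_s=600, max_entries=1024)
cache.set("predict:example-question", {"confidence": 0.86})
print(cache.get("predict:example-question"))
```

A per-instance cache like this is what makes REDIS_URL matter once more than one instance is running: each instance would otherwise hold its own copy.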
GET /track-record serves the public forecast track record. The heavy tick loop
does not run on Cloud Run: .github/workflows/track-record-tick.yml runs hourly
on GitHub Actions, updates data/track_record_store.json as the source-of-truth
entity store, writes the public aggregate to static/track_record_live.json, and
commits both files back to main. At runtime, Cloud Run fetches the committed
aggregate from raw GitHub, falling back to the bundled file and then the static
backtest in static/track_record.json.
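The fallback order described above (committed aggregate from raw GitHub, then the bundled file, then the static backtest) can be sketched as a chain of loaders. The helper names `first_available` and `load_json_file`, and the simulated remote fetch, are hypothetical; the real server's fetch and error handling may differ.

```python
import json
from pathlib import Path
from typing import Any, Callable, Optional

Loader = Callable[[], Optional[Any]]


def load_json_file(path: str) -> Optional[Any]:
    """Return parsed JSON from a local file, or None when the file is missing."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else None


def first_available(loaders: list[tuple[str, Loader]]) -> tuple[str, Any]:
    """Try each loader in order and return the first one that yields data."""
    for name, load in loaders:
        try:
            data = load()
        except (OSError, ValueError):  # network/file errors or invalid JSON
            continue
        if data is not None:
            return name, data
    raise RuntimeError("no track-record source available")


def fetch_remote_aggregate() -> Any:
    # Stand-in for the raw GitHub fetch; it fails here to show the fallback.
    raise OSError("simulated network failure")


source, record = first_available([
    ("github-raw", fetch_remote_aggregate),
    ("bundled", lambda: load_json_file("static/track_record_live.json")),
    ("static-backtest", lambda: {"note": "placeholder for static/track_record.json"}),
])
print(source)
```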
The Action discovers short-to-medium-horizon Polymarket/Kalshi markets in
separate close-date bands (2-7, 7-14, 14-30, 30-60 days by default) and
calls /predict once per newly snapshotted market/model. If /predict is
protected, set the GitHub secret PREDICT_API_KEY; no server-side
/track-record/tick endpoint is required. TRACK_RECORD_TOKEN is optional and
only enables the agent-enrolled market bridge.
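To make the close-date bands concrete, the sketch below assigns a market to a band by days until close. The function name `band_for` and the half-open boundary rule (a market closing in exactly 7 days falls in 7-14, not 2-7) are assumptions; the text above gives only the default band edges.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default bands from the text above, in days until market close.
DEFAULT_BANDS = [(2, 7), (7, 14), (14, 30), (30, 60)]


def band_for(close_time: datetime, now: datetime,
             bands=DEFAULT_BANDS) -> Optional[tuple]:
    """Return the (low, high) band containing the market, or None if outside."""
    days = (close_time - now).total_seconds() / 86400
    for low, high in bands:
        if low <= days < high:  # assumed: lower edge inclusive, upper exclusive
            return (low, high)
    return None


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(band_for(now + timedelta(days=10), now))
```

Keeping the bands separate lets a discovery pass sample each horizon independently rather than filling the quota with the nearest-closing markets.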
GET /radar serves the public niche-market radar used by the web app's Radar
view. .github/workflows/radar-tick.yml runs hourly on GitHub Actions, fetches
Polymarket/Kalshi/Reddit candidates, calls /predict for fresh Foresea
probabilities, writes static/radar.json, and commits it back to main.
Cloud Run only serves that committed JSON artifact, so the Radar endpoint avoids
out-of-memory failures while its probabilities still come from real /predict inference.
Radar items include market price, Foresea probability, edge, credibility score,
evidence links, tracking status, and tags such as thin liquidity,
resolution-rule risk, news catalyst, crowded sports market, near-term,
and tracked live. The endpoint is cached with RADAR_TTL (default 900
seconds).
Raise the Cloud Run throughput ceiling (no idle cost while min-instances=0):
gcloud run services update analyzing-llm-rationale --region us-central1 \
--max-instances 20 --concurrency 40 --memory 1Gi
Once max-instances > 1, provision Memorystore for Redis (billable) and set
REDIS_URL so rate limiting and caching stay correct across instances:
gcloud services enable redis.googleapis.com vpcaccess.googleapis.com compute.googleapis.com
gcloud redis instances create foresea-cache --size=1 --region=us-central1 --tier=basic
gcloud compute networks vpc-access connectors create foresea-vpc \
--region=us-central1 --range=10.8.0.0/28
gcloud run services update analyzing-llm-rationale --region us-central1 \
--vpc-connector foresea-vpc \
--update-env-vars REDIS_URL=redis://<instance-host>:6379
The public Cloud Run API is the easiest integration target. It accepts forecasting questions and returns a typed forecast, model rationale, and optional evidence articles. It is built for resolvable forecasts, not general Q&A.
- GET /health: service health check.
- GET /track-record: public live track record, falling back to the static backtest.
- GET /track-record/digest: shareable markdown summary of the live track record.
- GET /pr-agent: opt-in agent-to-agent outreach packet for Foresea discovery.
- POST /predict: public prediction endpoint.
- GET /markets/polymarket: fetch a live Polymarket quote (see below).
- GET /markets/kalshi: fetch a live Kalshi quote (see below).
- POST /agent/analyze: orchestrated end-to-end analysis of a live question (see below).
- GET /agent/scan: scan a venue for mispriced markets, ranked by edge (see below).
- GET /trading/accounts: authenticated trading-readiness status, no secrets returned.
- POST /trading/preview: authenticated dry-run order normalization.
- POST /trading/orders: authenticated live order submission with explicit confirmation.
POST /agent/analyze runs the whole pipeline autonomously: resolve the market
(fetch a live Polymarket/Kalshi price when an identifier is given) → gather
evidence + forecast → price the edge → run any custom skills →
recommend. It returns one structured report.
curl -X POST https://foresea.ink/agent/analyze \
-H "Content-Type: application/json" \
-d '{
"platform": "polymarket",
"slug": "will-the-fed-cut-rates-in-2026",
"skills": [
{"name": "Base rate check", "instruction": "Compare to historical base rates."},
{"name": "Risk", "instruction": "What would most change this forecast?"}
]
}'
Custom skills are your own analysis steps — each runs as an extra model pass
over the question, forecast, and evidence, and comes back as a named section in
the report. Provide a question directly, or a platform + market identifier
(slug/market_id for Polymarket, ticker for Kalshi). Pass history (prior
turns) for multi-turn follow-ups — with history, short follow-ups like "why?" or
"what about June?" are answered in context. BYOK fields (openrouter_api_key,
openrouter_model, provider_base_url) apply here too.
The report includes recommendation (buy_yes/buy_no/hold/no_market_price),
edge, model_probability, market_probability, thesis, evidence_sources,
and pipeline (the ordered steps that ran).
GET /agent/scan lists live markets on a venue, forecasts each, and returns the
ones whose model-vs-market gap clears min_edge, ranked by |edge|.
curl "https://foresea.ink/agent/scan?platform=polymarket&limit=4&min_edge=0.1"
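The filter-and-rank step behind that call can be sketched as follows. This is an illustration, not the server's implementation: the function name `rank_opportunities`, the inclusive threshold, and the choice to skip markets without a price are assumptions.

```python
def rank_opportunities(markets: list, min_edge: float = 0.1) -> list:
    """Keep markets whose |model - market| gap clears min_edge, largest first."""
    kept = []
    for market in markets:
        model_p = market.get("model_probability")
        market_p = market.get("market_probability")
        if model_p is None or market_p is None:
            continue  # assumed: unpriced markets cannot have an edge
        edge = round(model_p - market_p, 4)
        if abs(edge) >= min_edge:  # assumed: threshold is inclusive
            kept.append({**market, "edge": edge})
    return sorted(kept, key=lambda item: abs(item["edge"]), reverse=True)


sample = [
    {"question": "A", "model_probability": 0.70, "market_probability": 0.40},
    {"question": "B", "model_probability": 0.45, "market_probability": 0.50},
    {"question": "C", "model_probability": 0.20, "market_probability": 0.60},
]
for item in rank_opportunities(sample):
    print(item["question"], item["edge"])
```

Ranking by absolute value means a strongly negative edge (model well below market) sorts ahead of a smaller positive one.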
Params: platform (polymarket or kalshi), limit (markets to analyse, max 8),
min_edge (default 0.1), evidence_top_k. Each market runs a full forecast, so
it's bounded by limit and the result is cached briefly. Response: {platform, scanned, opportunities: [{question, market_url, market_probability, model_probability, edge, recommendation}]}. In the web app, the desk's
"⚡ Scan Polymarket for mispriced markets" button calls this.
Foresea exposes a public remote MCP server at:
https://foresea.ink/mcp/
It is advertised for discovery at:
https://foresea.ink/.well-known/mcp/server.json
The remote MCP server is a thin tool layer over the public API. It exposes:
- foresea_forecast: calls POST /predict.
- foresea_analyze_market: calls POST /agent/analyze.
- foresea_scan_markets: calls GET /agent/scan.
- foresea_track_record: calls GET /track-record.
- foresea_edge_board: calls GET /edge-board; live model-vs-market disagreements ranked, each tagged with the resolved track record of gaps that size (by_edge calibration + lead_lag).
- foresea_pr_agent: calls GET /pr-agent; concise copy and install metadata for agents/catalogs that ask how to describe Foresea.
- Resources: foresea://track-record, foresea://pr-agent, and foresea://openapi.json.
GET /pr-agent?audience=mcp returns an opt-in outreach packet that other agents,
MCP catalogs, and tool directories can quote when introducing Foresea. It includes
the one-liner, install command, MCP/OpenAPI links, talking points, and an explicit
no-spam policy.
For operator-run cold outreach to explicit agent endpoints, prepare a target list
and use the local runner. It dry-runs by default and only sends with --send:
python scripts/pr_agent_outreach.py --targets outreach-targets.json
python scripts/pr_agent_outreach.py --targets outreach-targets.json --send
Target file shape:
{
"targets": [
{
"name": "Example Agent Directory",
"endpoint": "https://agent-directory.example/inbox",
"audience": "catalog",
"headers": {"Authorization": "Bearer ..."}
}
]
}
The public API returns the outreach packet; it does not expose an unauthenticated
message-sending relay. The scheduled GitHub Action
.github/workflows/pr-agent-outreach.yml runs every 5 minutes against
data/pr_outreach_targets.json, sends with --send, and records contacted
targets in data/pr_outreach_state.json so repeated scheduled runs do not
re-contact the same agent. For a literal always-running local process, run:
python scripts/pr_agent_outreach.py \
--targets data/pr_outreach_targets.json \
--state data/pr_outreach_state.json \
--send --watch --interval-s 300
Header values can reference GitHub Actions secrets via environment variables, for
example "Authorization": "$PR_AGENT_TARGET_AUTH".
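The two behaviours described above, resolving `$NAME` header values from the environment and skipping targets that were already contacted, might look like the sketch below. The function names and the idea that the state is a set of endpoint URLs are assumptions; the actual layout of data/pr_outreach_state.json is not documented here.

```python
import os
from typing import Mapping, Optional


def expand_headers(headers: Mapping[str, str],
                   env: Optional[Mapping[str, str]] = None) -> dict:
    """Replace header values of the form "$NAME" with env["NAME"]."""
    env = os.environ if env is None else env
    resolved = {}
    for key, value in headers.items():
        if value.startswith("$"):
            name = value[1:]
            if name not in env:
                raise KeyError(f"missing environment variable: {name}")
            resolved[key] = env[name]
        else:
            resolved[key] = value
    return resolved


def pending_targets(targets: list, contacted_endpoints: set) -> list:
    """Drop targets whose endpoint is already recorded (assumed state shape)."""
    return [t for t in targets if t["endpoint"] not in contacted_endpoints]


targets = [
    {"name": "Example Agent Directory",
     "endpoint": "https://agent-directory.example/inbox",
     "headers": {"Authorization": "$PR_AGENT_TARGET_AUTH"}},
]
fake_env = {"PR_AGENT_TARGET_AUTH": "Bearer example-token"}
for target in pending_targets(targets, contacted_endpoints=set()):
    print(target["name"], expand_headers(target["headers"], fake_env))
```

Failing loudly on a missing variable, as this sketch does, avoids sending a literal `$PR_AGENT_TARGET_AUTH` string to a third party.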
Seeded automated targets:
- https://agentndx.ai/api/submit: public MCP/A2A/x402 review form.
- https://mcp.directory/api/submit-server: public JSON submit route.
- https://mcpub.dev/mcp: public MCP JSON-RPC submit tool.
Additional listing work that is not suitable for the scheduled HTTP sender lives
in data/pr_manual_targets.json. Current manual/GitHub target: mcp.so issue
https://github.com/daodao97/chatmcp/issues/213.
It's a remote, anonymous Streamable-HTTP server — no key, no install. Point any MCP client at the URL:
# Claude Code
claude mcp add --transport http foresea https://foresea.ink/mcp/
// Cursor / Cline / Claude Desktop (mcp.json)
{ "mcpServers": { "foresea": { "url": "https://foresea.ink/mcp/" } } }
# Python: official MCP SDK (3.10+)
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
async def main():
    # Open a Streamable HTTP connection, then run an MCP session over it.
    async with streamablehttp_client("https://foresea.ink/mcp/") as (r, w, _):
        async with ClientSession(r, w) as s:
            await s.initialize()
            print(await s.call_tool("foresea_forecast",
                {"question": "Will the Fed cut rates by March 2026?", "market_probability": 0.4}))
asyncio.run(main())
# LangChain (langchain-mcp-adapters): Foresea tools in any LangGraph agent
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient
async def main():
    client = MultiServerMCPClient({"foresea": {"url": "https://foresea.ink/mcp/", "transport": "streamable_http"}})
    tools = await client.get_tools()  # foresea_forecast, foresea_analyze_market, ...
    print([tool.name for tool in tools])
asyncio.run(main())
A runnable end-to-end demo (scan → forecast → edge) is in
examples/foresea_agent_demo.py.
Use https://foresea.ink/mcp/ directly in MCP clients that support remote
Streamable HTTP servers. For clients that still require a local stdio command,
run the wrapper locally.
The repo targets Python 3.10+ because the official MCP Python SDK requires it.
To create a repo-local Python 3.11 MCP environment with uv:
uv venv --python 3.11 .venv-mcp
uv pip install --python .venv-mcp/bin/python --no-deps -e .
uv pip install --python .venv-mcp/bin/python "mcp>=1.27.1" requests pyyaml pip
source .venv-mcp/bin/activate
analyze-llm-rationale mcp-server
That lightweight install avoids pulling the full inference dependency stack
(notably Torch/CUDA) when all you need is the MCP wrapper. In a full development
environment, pip install -e ".[mcp]" is also valid.
MCP client config example:
{
"mcpServers": {
"foresea": {
"url": "https://foresea.ink/mcp/"
}
}
}
For a local HTTP MCP endpoint:
.venv-mcp/bin/analyze-llm-rationale mcp-server \
--transport streamable-http \
--host 127.0.0.1 \
--port 8787
Connect MCP clients to http://127.0.0.1:8787/mcp. If a private deployment
requires auth, set FORESEA_API_KEY or pass --api-key; the wrapper forwards it
as X-API-Key.
Quick verification:
.venv-mcp/bin/python - <<'PY'
import importlib.metadata as md
from analyzing_llm_rationale.mcp_server import create_mcp_server
print(md.version("mcp"))
print(create_mcp_server().name)
PY
Pull the current market-implied probability straight from a venue, then feed it
into /predict as market_probability to compute an edge.
# Polymarket — by market slug (or ?id=<numeric id>)
curl "https://foresea.ink/markets/polymarket?slug=will-the-fed-cut-rates-in-2026"
# Kalshi — by market ticker
curl "https://foresea.ink/markets/kalshi?ticker=KXFED-26SEP-C"
Both return a normalised quote:
{
"platform": "Polymarket",
"question": "Will the Fed cut rates in 2026?",
"market_url": "https://polymarket.com/market/...",
"outcome": "Yes",
"probability": 0.54,
"outcomes": [
{"label": "Yes", "probability": 0.54},
{"label": "No", "probability": 0.46}
]
}
probability is null for unpriced/illiquid markets. Quotes are cached briefly
(MARKET_CACHE_TTL, default 30s).
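A small helper can turn a normalised quote into a /predict request body, leaving the market fields out when the quote has no price. The helper name `predict_payload_from_quote` is hypothetical; the request fields it sets (market_platform, market_url, market_outcome, market_probability) are the ones documented for /predict.

```python
def predict_payload_from_quote(quote: dict, question_type: str = "binary") -> dict:
    """Build a /predict request body from a normalised market quote."""
    payload = {
        "question": quote["question"],
        "question_type": question_type,
        "market_platform": quote["platform"],
    }
    if quote.get("market_url"):
        payload["market_url"] = quote["market_url"]
    # probability is null for unpriced/illiquid markets, so no edge is requested.
    if quote.get("probability") is not None:
        payload["market_outcome"] = quote.get("outcome", "Yes")
        payload["market_probability"] = quote["probability"]
    return payload


quote = {
    "platform": "Polymarket",
    "question": "Will the Fed cut rates in 2026?",
    "market_url": "https://polymarket.com/market/...",
    "outcome": "Yes",
    "probability": 0.54,
}
print(predict_payload_from_quote(quote))
```

The resulting dictionary can be posted as JSON to /predict; without market_probability the response would simply carry no market comparison.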
Foresea can submit guarded prediction-market orders, but live execution is
disabled by default. Keep this separate from /agent/analyze: the agent can
recommend buy_yes/buy_no, but order submission requires a signed-in user,
server-side exchange credentials, FORESEA_ENABLE_TRADING=true, execute=true,
and the exact confirmation phrase PLACE REAL ORDER.
Credentials are read only from the server environment, so use Cloud Run Secret Manager mounts or environment secrets. Do not collect private keys in the browser or store exchange secrets in Datastore.
# Global guardrails
export FORESEA_ENABLE_TRADING=false # must be true for live orders
export FORESEA_MAX_ORDER_NOTIONAL=50 # local cap per order, USD
export FORESEA_ALLOW_MARKET_ORDERS=false # separate gate for IOC/FOK-style orders
# Kalshi authenticated REST (RSA-PSS signing)
export KALSHI_API_KEY_ID=<kalshi-key-id>
export KALSHI_PRIVATE_KEY_FILE=/secrets/kalshi-private-key.pem
export KALSHI_BASE_URL=https://external-api.kalshi.com/trade-api/v2
# Polymarket CLOB SDK
export POLYMARKET_PRIVATE_KEY=<wallet-private-key>
export POLYMARKET_API_KEY=<clob-api-key>
export POLYMARKET_API_SECRET=<clob-api-secret>
export POLYMARKET_API_PASSPHRASE=<clob-api-passphrase>
export POLYMARKET_FUNDER_ADDRESS=<optional-funder-address>
export POLYMARKET_SIGNATURE_TYPE=<optional-signature-type>
Install the optional SDKs in production with:
pip install -e ".[serve,trading]"
The Docker image installs trading, so Cloud Run only needs secrets/env vars.
Check configured venues:
curl https://foresea.ink/trading/accounts \
-H "Authorization: Bearer $FORESEA_SESSION"
Preview a Kalshi order without execution:
curl -X POST https://foresea.ink/trading/preview \
-H "Authorization: Bearer $FORESEA_SESSION" \
-H "Content-Type: application/json" \
-d '{
"platform": "kalshi",
"ticker": "KXFED-26SEP-C",
"action": "buy",
"outcome": "yes",
"price": 0.42,
"quantity": 1
}'
Submit a live order only after reviewing the preview:
curl -X POST https://foresea.ink/trading/orders \
-H "Authorization: Bearer $FORESEA_SESSION" \
-H "Content-Type: application/json" \
-d '{
"platform": "kalshi",
"ticker": "KXFED-26SEP-C",
"action": "buy",
"outcome": "yes",
"price": 0.42,
"quantity": 1,
"execute": true,
"confirmation": "PLACE REAL ORDER"
}'
For Polymarket, pass the CLOB token_id for the exact outcome, or pass
slug/market_id plus outcome and Foresea will resolve the token id from the
public market record. Limit orders use quantity as shares. Market-buy orders
use max_cost as USD spend when supplied and remain blocked unless
FORESEA_ALLOW_MARKET_ORDERS=true.
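The layered guardrails can be summarised as a single check that reports every reason a live order would be refused. This is a sketch, not the server's logic: the function name `live_order_blockers`, the `order_type` field, the notional formula (price times quantity for limit orders), and the default cap of 50 are assumptions drawn from the example values above.

```python
from typing import Mapping

CONFIRMATION_PHRASE = "PLACE REAL ORDER"


def live_order_blockers(order: Mapping, env: Mapping[str, str],
                        signed_in: bool, has_credentials: bool) -> list:
    """Return the reasons a live order would be refused (empty list = allowed)."""
    blockers = []
    if not signed_in:
        blockers.append("user is not signed in")
    if not has_credentials:
        blockers.append("exchange credentials are not configured")
    if env.get("FORESEA_ENABLE_TRADING", "false").lower() != "true":
        blockers.append("FORESEA_ENABLE_TRADING is not true")
    if order.get("execute") is not True:
        blockers.append("execute is not true")
    if order.get("confirmation") != CONFIRMATION_PHRASE:
        blockers.append("confirmation phrase does not match")
    # Assumed: order_type marks market orders; the real field name may differ.
    if (order.get("order_type") == "market"
            and env.get("FORESEA_ALLOW_MARKET_ORDERS", "false").lower() != "true"):
        blockers.append("market orders are not allowed")
    # Assumed: notional for a limit order is price * quantity, in USD.
    cap = float(env.get("FORESEA_MAX_ORDER_NOTIONAL", "50"))
    if order.get("price", 0) * order.get("quantity", 0) > cap:
        blockers.append("order exceeds FORESEA_MAX_ORDER_NOTIONAL")
    return blockers


order = {"platform": "kalshi", "ticker": "KXFED-26SEP-C", "action": "buy",
         "outcome": "yes", "price": 0.42, "quantity": 1,
         "execute": True, "confirmation": "PLACE REAL ORDER"}
print(live_order_blockers(order, {"FORESEA_ENABLE_TRADING": "false"}, True, True))
```

Collecting all blockers rather than stopping at the first makes a preview response more useful, since the caller sees everything that must change before resubmitting.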
Required:
- question: forecasting question, such as "Will X happen by date Y?", "Who will win X?", "What will X be?", or "When will X happen?".
Optional:
- question_type: binary, multiple_choice, numeric, or date. If omitted, the model attempts to infer the type.
- options: answer choices for multiple_choice questions.
- description: extra context for the question.
- resolution_criteria: how the question should resolve or be measured.
- categories: list of topic labels.
- news_articles: caller-supplied evidence articles. If provided, automatic evidence retrieval is skipped.
- attach_evidence: defaults to true. When true and news_articles is empty, the API fetches current evidence from GDELT, Google News RSS, and Stooq.
- evidence_top_k: number of evidence articles to attach, capped by the server.
- market_platform: prediction market venue such as Polymarket, Kalshi, Manifold, or Metaculus.
- market_url: URL for the market being analyzed.
- market_outcome: outcome whose market price is supplied. Defaults to Yes for binary markets.
- market_probability: current market-implied probability for market_outcome. Use 0.42 or 42; the API normalizes percentages.
- variant: prompt variant. Defaults to variant0_neutral_baseline.
- created_time, publish_time, resolve_time, days_open: optional forecasting metadata.
- openrouter_api_key + openrouter_model: run the forecast on your own model instead of the server default (see "Bring your own model" below).
- provider_base_url: optional OpenAI-compatible /chat/completions endpoint to use with your key/model instead of OpenRouter. Must be public HTTPS.
By default /predict runs on the server's hosted model. To use your own:
- Pass openrouter_api_key and openrouter_model (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5). The request is proxied through OpenRouter.
- Or pass provider_base_url (e.g. https://api.openai.com/v1/chat/completions) with the matching openrouter_model (here just the provider's model ID, e.g. gpt-4o) and your key.
For safety, provider_base_url must be public HTTPS; loopback, private,
link-local, and cloud-metadata hosts are rejected. In the web app, the sidebar's
"Use your own model" panel exposes the provider, endpoint, key, and model.
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will X happen by 2027?",
"question_type": "binary",
"openrouter_api_key": "YOUR_KEY",
"openrouter_model": "gpt-4o",
"provider_base_url": "https://api.openai.com/v1/chat/completions"
}'
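The provider_base_url safety rule can be approximated with the standard library. This is a minimal sketch, assuming a hypothetical `is_public_https` helper and a small hostname blocklist; it inspects only the literal host in the URL, whereas a production check would also resolve DNS names and re-check the resolved addresses.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical blocklist; real deployments may block more metadata hostnames.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_public_https(url: str) -> bool:
    """Return True when the URL is HTTPS and its host is not obviously internal."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    if host in BLOCKED_HOSTNAMES or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a DNS name that is not on the blocklist.
        return True
    return not (ip.is_loopback or ip.is_private or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


for candidate in ("https://api.openai.com/v1/chat/completions",
                  "http://api.openai.com/v1/chat/completions",
                  "https://169.254.169.254/latest/meta-data/"):
    print(candidate, is_public_https(candidate))
```

The link-local range covers the common cloud metadata address 169.254.169.254, which is why it is rejected without needing a separate rule.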
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
"question_type": "binary",
"market_platform": "Polymarket",
"market_probability": 42
}'
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Who will win the 2026 Formula 1 drivers championship?",
"question_type": "multiple_choice",
"options": ["Max Verstappen", "Lando Norris", "Charles Leclerc", "Lewis Hamilton", "Other"],
"attach_evidence": false
}'
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "What will US CPI inflation be in December 2026?",
"question_type": "numeric",
"resolution_criteria": "Use the year-over-year CPI-U inflation rate for December 2026."
}'
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will Company X report positive net income in Q4 2026?",
"description": "Resolve using the company earnings release.",
"resolution_criteria": "Yes if reported GAAP net income is positive.",
"attach_evidence": false,
"news_articles": [
{
"title": "Company X raises full-year guidance",
"source": "Example Business News",
"url": "https://example.com/company-x-guidance",
"publish_date": "2026-05-29",
"summary": "Company X raised revenue guidance and reported margin expansion."
}
]
}'
import requests
payload = {
"question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
"question_type": "binary",
"attach_evidence": True,
"evidence_top_k": 3,
"market_platform": "Polymarket",
"market_probability": 42,
}
response = requests.post(
"https://foresea.ink/predict",
json=payload,
timeout=180,
)
response.raise_for_status()
prediction = response.json()
print(prediction["predicted_answer"], prediction["confidence"])
print(prediction["model_rationale"])
if prediction.get("market_analysis"):
    print(prediction["market_analysis"]["summary"])
for source in prediction["evidence_sources"]:
    print(source["source"], source["url"])
Response fields:
- question_type: detected or requested type: binary, multiple_choice, numeric, or date.
- predicted_answer: "Yes", "No", the top multiple-choice option, or the median numeric/date estimate.
- confidence: model confidence as a number from 0 to 1 for binary and multiple-choice forecasts; null for numeric/date forecasts.
- options: per-option probabilities for multiple-choice forecasts.
- range_forecast: p10, p50, p90, and optional unit for numeric/date forecasts.
- rationale: model-generated explanation.
- model_rationale: alias for the model-generated explanation, intended for API clients.
- evidence_sources: compact source list with article title, URL, publication date, and relevance score.
- evidence_articles: full evidence records attached to the prompt.
- evidence_error: retrieval error message, or null when evidence retrieval succeeds.
- market_analysis: optional comparison against a supplied market price: market_probability, model_probability, edge, stance, and a short summary. edge is model_probability - market_probability.
Repo layout:
- src/analyzing_llm_rationale/: packaged inference, provider, validation, and CLI logic.
- configs/: model and rationale-variant definitions.
- prompts/: system prompt and the nine rationale-variant prompts.
- scripts/: evaluation, recovery, SHAP, plotting, and utility scripts.
- slurm/: HPC launchers for the variant/temperature sweeps.
- results/: model outputs and run metadata.
- analysis/: aggregate metric tables and rationale-analysis outputs.
- paper/: paper figures, Draw.io sources, PDFs, and qualitative case studies.
- tests/: unit tests for the package and metric parsing.
See ARTIFACT_MANIFEST.md for the submission checklist and file-level notes.
python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,analysis]"
Use .[dev] for the core runner and tests only. Use .[analysis] when
regenerating plots or SHAP analyses.
PYTHONPATH=src python -m analyzing_llm_rationale validate-dataset
python -m unittest discover -s tests
ruff check src tests scripts/*.py
PYTHONPATH=src is useful when the repository has not been installed yet or an
older user-local install shadows the working tree.
Run the variant 3 pipeline with the packaged CLI:
analyze-llm-rationale run-batch --variant variant3_reasoning_type
For a remote OpenAI-compatible provider:
export PROVIDER_API_KEY=your_token
analyze-llm-rationale run-batch --variant variant3_reasoning_type --model llama-3.3-70b-instruct
If you do not want to install the package into the environment, invoke it directly:
PYTHONPATH=src python -m analyzing_llm_rationale run-batch --variant variant3_reasoning_type
Useful options:
- --variant variant6_step_by_step_reasoning: choose the prompt/output contract.
- --model qwen2.5-7b-instruct: choose a configured model definition.
- --temperature 0.7: control generation temperature and output directory.
- --max-records 10: process only a bounded number of records.
- --reprocess-nulls: rerun existing rows with predicted_answer = null.
- --drop-article-text: remove raw article text from prompts before inference.
- --device auto: select cuda when available, otherwise cpu.
- verify-results --variant ...: verify completeness, duplicates, malformed rows, and missing IDs.
- validate-dataset: validate the dataset schema before a run.
Foresea has a Karpathy-style autoresearch harness for prompt experiments: edit
one candidate prompt, run a fixed benchmark slice, score one metric, and append
an auditable experiment log. The research surface is
autoresearch/candidate_prompt.txt; agent instructions live in
autoresearch/program.md. The default --model gpt-oss-120b uses the
SCADS-hosted OpenAI-compatible endpoint from configs/models.yaml
(SCADS_AI_API_KEY or SCADS_AI_API_KEY.txt).
Run one candidate experiment:
PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
--model gpt-oss-120b \
--candidate-prompt-path autoresearch/candidate_prompt.txt \
--max-records 50 \
--metric brier_score
Compare against a baseline and promote only if the candidate improves:
PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
--model gpt-oss-120b \
--candidate-prompt-path autoresearch/candidate_prompt.txt \
--baseline-results-path results/GPT-OSS-120B/temperature_00/results_variant0_neutral_baseline.json \
--promote-to prompts/variant0_neutral_baseline.txt \
--max-records 50 \
--metric brier_score \
--min-delta 0.001
Each run writes analysis/autoresearch/runs/<run_id>/score.json and appends a
machine-readable row to analysis/autoresearch/experiments.jsonl.
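For orientation, the metric and the promotion rule can be written out in a few lines. This is a sketch of the general technique, assuming that `--min-delta` means "the candidate's Brier score must be lower than the baseline's by at least this much"; the function names are hypothetical and the harness may define the comparison differently.

```python
def brier_score(probabilities: list, outcomes: list) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


def should_promote(candidate: float, baseline: float,
                   min_delta: float = 0.001) -> bool:
    """Lower Brier is better; promote only on a clear improvement (assumed rule)."""
    return (baseline - candidate) >= min_delta


# Toy data: probability assigned to "Yes" and whether "Yes" resolved (1) or not (0).
candidate = brier_score([0.9, 0.2, 0.7], [1, 0, 1])
baseline = brier_score([0.6, 0.4, 0.6], [1, 0, 1])
print(candidate, baseline, should_promote(candidate, baseline))
```

A minimum delta guards against promoting a prompt on a difference that is within noise for a 50-record slice.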
Validate an existing result file:
PYTHONPATH=src python -m analyzing_llm_rationale verify-results \
--model qwen2.5-7b-instruct \
--variant variant3_reasoning_type \
--temperature 0.0 \
--temperature-tag temperature_00
Regenerate aggregate metrics from results/:
python scripts/evaluate_metrics.py
Run the DuckDB SQL analytics suite over the real Metaculus-style dataset and saved model outputs:
python scripts/sql_analytics.py \
--db analysis/forecasting_analytics.duckdb \
--ingest --replace \
--output-dir analysis/sql_analytics
This writes a markdown report plus one CSV per query for 10 medium-level SQL problems: model accuracy, best variants, calibration bins, Brier score, consensus/disagreement cases, prompt lift over baseline, temperature sensitivity, overconfident errors, and category difficulty.
Run the LangChain-powered news retrieval wrapper:
PYTHONPATH=src analyze-llm-rationale fetch-and-rank \
--question "Will X happen by date Y?" \
--source gdelt \
--source google-news \
--source stooq \
--top-k 5
The news pipeline uses LangChain for a query-planning step, article
summarization, and embedding-based relevance ranking before inference. Evidence
sources are configurable with --source for the CLI and --evidence-source
when serving the API.
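Embedding-based relevance ranking generally reduces to scoring each article vector against the question vector and keeping the top results. The sketch below shows that idea in plain Python with toy vectors; it does not use LangChain, and the function names, the `embedding` field, and the choice of cosine similarity are assumptions rather than a description of the pipeline's actual ranking code.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_relevance(question_vec: list, articles: list, top_k: int = 5) -> list:
    """Attach a relevance_score to each article and return the top_k by score."""
    scored = [
        {**article,
         "relevance_score": round(cosine(question_vec, article["embedding"]), 4)}
        for article in articles
    ]
    scored.sort(key=lambda item: item["relevance_score"], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings; real embeddings have hundreds of dimensions.
articles = [
    {"title": "Unrelated story", "embedding": [0.0, 1.0]},
    {"title": "Directly on topic", "embedding": [1.0, 0.0]},
    {"title": "Partly related", "embedding": [1.0, 1.0]},
]
for item in rank_by_relevance([1.0, 0.0], articles, top_k=2):
    print(item["title"], item["relevance_score"])
```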
Run or schedule the Prefect DAG for RSS/news fetch, inference, and DuckDB logging:
# One question
python flows/forecasting_flow.py --question-id 124 --top-k 5
# Small batch from the dataset
python flows/forecasting_flow.py --limit 3 --top-k 5
# Daily scheduled deployment at 06:00 UTC
prefect server start
python flows/forecasting_flow.py --deploy --limit 3 --cron "0 6 * * *"
Regenerate paper figures after metrics are present:
python scripts/plot_model_variant_metric_heatmap.py
python scripts/plot_variant_delta_from_v0.py
python scripts/plot_temperature_frontier.py
python scripts/plot_frs_ablation_slopegraph.py
python scripts/plot_uncertainty_language_calibration_disconnect.py
python scripts/plot_shap_importance_attribute_gaps.py
Common runner and verification commands:
python scripts/run_variant.py --variant variant5_key_conditions
python scripts/run_variant.py --variant variant3_reasoning_type --temperature 0.7 --temperature-tag temperature_07
python scripts/run_variant.py --variant variant4_credibility --model llama-3.3-70b-instruct
python scripts/verify_results.py --variant variant3_reasoning_type
python download_qwen_model.py
python test_local_inference.py
Repo layout:
- scripts/: modular runner entrypoint
- slurm/: batch launchers
Auditability:
- run_metadata_<variant>.json next to the results file.
python -m unittest discover -s tests
ruff check src tests scripts/*.py
The included dataset is forecasting_qa_news_metaculus_2025-02-01_to_today.metaculus_frs_format.json.
Model access is configured in configs/models.yaml. Open-weight Qwen models run
locally through Hugging Face; hosted models use OpenAI-compatible endpoints and
require API keys through environment variables or local key files.
Never commit key files or tokens. Large local caches (.cache/, envs/, .venv/)
are intentionally ignored and excluded from source archives.
If this repository supports a publication, cite the artifact with the metadata in
CITATION.cff and cite the upstream datasets/models according to their licenses.