LiteLLM gateway
Every LLM call from the template routes through an OpenAI-compatible endpoint (`api_base: http://gateway:4000/v1`). The template assumes there's a LiteLLM gateway somewhere on the network. Why?
The problem this solves
Fleet of agents, each wants to specify a model. Options:
- Each agent hardcodes `model="claude-opus-4-6"` and imports `langchain-anthropic`.
- Each agent reads a model name from env, but still imports the provider SDK.
- Each agent talks to a single OpenAI-compatible endpoint, and the endpoint routes.
Option 3 wins because:
- Model upgrades happen in one place (gateway config). No cascading PRs across every agent in the fleet.
- New provider support (Gemini, DeepSeek, local vLLM) doesn't require each agent to add an SDK.
- A/B testing a new model is a gateway-level config change with rollback.
- Per-agent cost / rate-limit policies are enforced at the gateway, not per-agent.
- The OpenAI-compatible surface is the lowest-common-denominator every agent framework understands.
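The "lowest-common-denominator" point is concrete: every agent emits the same wire format no matter what sits behind the gateway. A minimal sketch, using only the stdlib and the gateway address from this page (the request is built but deliberately not sent; the bearer token is a placeholder for whatever key the gateway expects, not a provider key):

```python
import json
import urllib.request

GATEWAY = "http://gateway:4000/v1"  # the template's api_base

# Build (but don't send) an OpenAI-style chat-completions request.
# The body is identical whether the gateway routes this alias to
# Anthropic, OpenAI, or a local vLLM instance.
payload = {
    "model": "protolabs/agent",  # gateway alias, not a real model
    "messages": [{"role": "user", "content": "ping"}],
}
req = urllib.request.Request(
    f"{GATEWAY}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-anything",  # gateway key placeholder
    },
)
```

Swapping the underlying provider never touches this code; only the gateway's mapping for `protolabs/agent` changes.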
The alias pattern
The template points at `model.name: protolabs/agent`. Two things to know:
- `protolabs/<name>` is a gateway alias, not a real model. The gateway config maps `protolabs/agent` → whichever real model (e.g. `claude-opus-4-6`, `gpt-4o`) you want.
- Each agent gets its own alias. Quinn uses `protolabs/quinn`; a researcher agent might use `protolabs/researcher`. Same gateway, different underlying models, different rate limits, different cost tracking.
To swap a model for an agent:
```yaml
# In the gateway's config.yaml
model_list:
  - model_name: protolabs/agent
    litellm_params:
      model: anthropic/claude-opus-4-6  # ← was claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
```

Reload the gateway. No agent restart needed; the next request picks up the new mapping.
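Why a reload is enough can be sketched in a few lines. This is NOT LiteLLM's actual implementation, just an illustration of the indirection: agents keep sending the same alias while the target swaps underneath them.

```python
# Gateway-side state: alias -> real model. Agents never see the right side.
routes = {"protolabs/agent": "anthropic/claude-sonnet-4-6"}

def route(alias: str) -> str:
    """Resolve a gateway alias to the real provider/model string."""
    return routes[alias]

assert route("protolabs/agent") == "anthropic/claude-sonnet-4-6"

# "Reload": replace the mapping. The next lookup (i.e. the next request)
# picks up the new target, with no change on the agent side.
routes["protolabs/agent"] = "anthropic/claude-opus-4-6"
assert route("protolabs/agent") == "anthropic/claude-opus-4-6"
```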
Why OpenAI-compatible specifically
LangChain has first-class support for `ChatOpenAI` with a `base_url` override. Pointing it at LiteLLM "just works" — no custom provider adapter needed on the agent side.
LangChain also has native provider clients (`ChatAnthropic`, `ChatGoogleGenerativeAI`, etc.), but using those re-couples the agent to a specific provider, which is exactly what the gateway is there to avoid.
What you trade off
Provider-specific features get harder. If Anthropic releases a new API feature that OpenAI's spec doesn't map to cleanly (prompt caching, computer use, extended thinking output), LiteLLM's translation layer may not expose it — or may expose it via a non-standard extension field that ChatOpenAI ignores.
For most agent work this doesn't matter. When it does, the escape hatch is to import the provider SDK directly for that one call, bypassing the gateway — losing the centralization for that call, but only for that call.
You pay a hop. LiteLLM → provider adds one network hop per request. In practice this is negligible (sub-10ms on a local docker network), but it's real. If you're building latency-critical real-time inference, you might route around the gateway.
What about `usage_metadata`?
LiteLLM is well-behaved about normalizing Anthropic's `usage.input_tokens` and OpenAI's `usage.prompt_tokens` into a single shape. The template's `on_chat_model_end` cost capture works identically whether the gateway is routing to Anthropic, OpenAI, or something self-hosted.
The one gotcha: `stream_usage=True` (passed in `graph/llm.py`) is required to get usage on streaming responses. See Cost & trace for why.
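The normalization can be sketched as follows. This is illustrative only (LiteLLM does the real work server-side); the two input shapes use the field names the providers actually emit.

```python
def normalize(usage: dict) -> dict:
    """Collapse provider-native usage shapes into one OpenAI-style shape."""
    prompt = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    completion = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": usage.get("total_tokens", prompt + completion),
    }

anthropic_usage = {"input_tokens": 11, "output_tokens": 42}   # Anthropic shape
openai_usage = {"prompt_tokens": 11, "completion_tokens": 42,
                "total_tokens": 53}                            # OpenAI shape

# Both collapse to the same shape, so downstream cost capture is uniform.
assert normalize(anthropic_usage) == normalize(openai_usage)
```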
What about cost tracking at the gateway?
LiteLLM exposes per-call cost in its callback hooks. The template doesn't capture that today — `cost-v1` emission includes token usage and duration, not USD. Forks that want to include `costUsd` on the `cost-v1` payload can plumb `response_cost` from a LiteLLM callback into `TaskRecord.usage`. It's on the roadmap but not shipped.
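A hedged sketch of that plumbing: `TaskRecord` here is a stand-in for the template's record type, not its real definition, and the callback signature mirrors the shape of a LiteLLM success hook (which receives its computed USD cost in `kwargs["response_cost"]`). The invocation at the bottom is simulated; in production LiteLLM would call the hook.

```python
from dataclasses import dataclass, field

@dataclass
class TaskRecord:  # stand-in for the template's record type
    usage: dict = field(default_factory=dict)

record = TaskRecord()

def on_success(kwargs: dict, response_obj, start_time, end_time) -> None:
    """Success hook: copy LiteLLM's per-call USD cost onto the record."""
    cost = kwargs.get("response_cost")
    if cost is not None:
        record.usage["costUsd"] = cost

# Simulated callback invocation with a per-call cost of $0.0123.
on_success({"response_cost": 0.0123}, None, None, None)
```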
Why not just use an OpenAI key directly?
Fine for a single agent. Breaks down when you have a fleet because:
- API keys proliferate. Every agent has its own, each rotated independently.
- Cost aggregation requires parsing N provider billing pages.
- Switching a single agent to a different model requires code + deploy, not config + reload.
- Rate limits hit individual agents in isolation; cross-agent orchestration of limited quota is impossible.
The gateway solves all of these centrally. For a fleet, it's worth the hop.
Related
- Configuration reference — the `model.*` keys
- Environment variables — `OPENAI_API_KEY` points at the gateway
- Quinn's README — example of a real gateway alias config