# The chat agent
protoBanana's `/v1/chat/completions` path is, by default, a tool-use chat agent: an LLM is the brain. It decides whether to respond conversationally, call an image tool, or chain multiple tools to satisfy a user request.
This was a deliberate architectural choice. The earlier dispatch was a deterministic keyword classifier: every chat turn ran one image op and returned the image. That worked for simple commands but couldn't handle:
- Conversational replies (`thanks!` got an image of "thanks!" generated)
- Clarifying questions (`make it red` was ambiguous (red *what*?) but the router had no way to ask)
- Chained operations (`remove the bg, then put a sunset behind` needed two ops in one reply)
- Natural feedback (`actually I prefer the previous one` had nowhere to land)
The agent owns the entire chat surface; image ops are just tools it can call.
## How the loop works
```
user msg
  ↓
ProtoBanana.acompletion(messages)
  ↓
build SYSTEM prompt + chat history (with image data URLs elided)
  ↓
loop (max 3 iterations by default):
    resp = LM.chat.completions.create(messages, tools=IMAGE_TOOLS)
    if resp.tool_calls:
        for each call:
            result = execute the tool against ComfyUI
            append { role: "tool", content: <{success, size}> } to messages
        continue
    else:
        return resp.content + markdown embed of last image (if any tool ran)
```

The LLM sees text only, never image bytes. Server-side state tracks "the most recent image" (and a list of all in-conversation images for `multi_ref_compose`). Tool results returned to the LLM are tiny dicts (`{"success": true, "image_size_bytes": 487123}`) so the conversation history stays cheap.
## Tools
| Tool | When the LLM should call it |
|---|---|
| `generate_image(prompt, size?)` | New image from text. No input image needed. |
| `edit_image(instruction)` | Whole-image transformation of the most recent image. |
| `region_edit(region, edit_prompt)` | Change a NAMED sub-region (e.g. "the man's tie"). |
| `remove_background()` | Sticker / alpha PNG of the most recent image. |
| `multi_ref_compose(prompt)` | Blend 2-3 reference images. |
| `outpaint(left/top/right/bottom, fill_prompt)` | Extend the canvas. |
The full JSON schema for each tool lives in `protobanana/tools.py`; that's the source of truth the LLM reads at routing time.
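Since the agent speaks the OpenAI function-calling protocol, each entry in `tools.py` presumably follows the standard tool schema. A sketch of what the `region_edit` entry might look like; the descriptions and field wording here are illustrative, not copied from the real file:

```python
# Hypothetical sketch of one IMAGE_TOOLS entry in OpenAI function-calling
# format. Only the tool name and parameter names come from the docs above.
REGION_EDIT_TOOL = {
    "type": "function",
    "function": {
        "name": "region_edit",
        "description": "Change a named sub-region of the most recent image.",
        "parameters": {
            "type": "object",
            "properties": {
                "region": {
                    "type": "string",
                    "description": 'The sub-region to change, e.g. "the man\'s tie".',
                },
                "edit_prompt": {
                    "type": "string",
                    "description": "What the region should become.",
                },
            },
            "required": ["region", "edit_prompt"],
        },
    },
}
```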
## Configuration
| Env var | Purpose | Default |
|---|---|---|
| `PROTOBANANA_AGENT_BASE` | OpenAI-compatible LLM URL | unset = agent disabled |
| `PROTOBANANA_AGENT_KEY` | API key (or whatever your LM expects) | none |
| `PROTOBANANA_AGENT_MODEL` | Model id at the LM endpoint | `protolabs/fast` |
| `PROTOBANANA_AGENT_MAX_ITERS` | Cap on tool-call iterations per turn | 3 |
Install with the `[agent]` extra to pull in the OpenAI client dependency:

```shell
pip install 'protobanana[agent]'
```

When `PROTOBANANA_AGENT_BASE` is unset, the agent is disabled and the provider falls back to the deterministic keyword classifier path. The package keeps working without an LM, which matters for HF Spaces, dev, and standalone use.
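The four env vars map to agent settings in a straightforward way. A minimal sketch of that mapping, with the documented defaults; `load_agent_config` is an illustrative name, not the protoBanana API:

```python
import os

def load_agent_config():
    # PROTOBANANA_AGENT_BASE unset => agent disabled entirely.
    base = os.environ.get("PROTOBANANA_AGENT_BASE")
    return {
        "enabled": base is not None,
        "base_url": base,
        "api_key": os.environ.get("PROTOBANANA_AGENT_KEY"),  # default: none
        "model": os.environ.get("PROTOBANANA_AGENT_MODEL", "protolabs/fast"),
        "max_iters": int(os.environ.get("PROTOBANANA_AGENT_MAX_ITERS", "3")),
    }
```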
## Choosing a model
`protolabs/fast` is the recommended default: Qwen3.6-35B-A3B-FP8 (uncensored heretic), ~226 tok/s, ~500 ms to a tool-call decision on a typical 200-token prompt. No thinking tokens.
`protolabs/smart` (Qwen3.6-27B-FP8 thinking, ~150 tok/s with thinking) is a better choice when:

- The conversation requires planning across multiple operations (`make him taller, then put him in a winter coat`)
- Ambiguity needs careful resolution (`change his hat` when there are multiple men)
- You're willing to pay 1-2 s of thinking time for richer routing
Set `PROTOBANANA_AGENT_MODEL=protolabs/smart` to switch.
## Fallback behavior
`agent.run()` returns `None` in three cases:

- `PROTOBANANA_AGENT_BASE` is unset (agent disabled)
- The OpenAI client can't be imported (`pip install 'protobanana[agent]'` not done)
- The very first LM call raised (network, auth, model unavailable)
The provider checks for `None` and falls through to the keyword classifier path that ships baked into protoBanana. The user gets a slightly degraded experience (no conversational replies, image-only output) but the system stays up.
If the LM call fails AFTER tools have already run, the agent returns the last produced image with a soft "I had a problem mid-conversation" note rather than dropping the work.
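The fallback contract above can be summarized in a few lines; `handle_chat` and `keyword_classify` are illustrative names for the provider-side check, not the real protoBanana functions:

```python
def keyword_classify(message):
    # Stand-in for the deterministic keyword path: always one image op.
    return {"op": "generate", "prompt": message}


def handle_chat(message, agent_run):
    # agent_run returns None when the agent is disabled, the OpenAI client
    # is missing, or the very first LM call raised.
    result = agent_run(message)
    if result is None:
        return keyword_classify(message)  # degraded but alive
    return result
```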
## Tracing
When Langfuse tracing is on, every agent turn produces:
```
protobanana.acompletion          [parent; has metadata.path = "agent" | "keyword"]
└── protobanana.agent
    ├── protobanana.agent.iter_0
    │   └── protobanana.tool.<name>   (when the LLM called a tool)
    ├── protobanana.agent.iter_1      (next iteration after a tool result)
    │   └── protobanana.tool.<name>
    └── protobanana.agent.iter_2
```

Useful filters in the Langfuse UI:

- `metadata.path = "agent"`: only chat turns that went through the agent (filters out keyword fallbacks)
- `protobanana.agent.iter_*`: sort by duration to find slow LM calls
- `metadata.hit_max_iterations = true`: turns where the LLM got stuck and the loop was capped
- `metadata.agent_fallback = true`: turns where the agent failed and the keyword path took over
## Multi-step examples
The agent's value shows up most clearly on chained operations the keyword router can't express:
| User says | Agent does |
|---|---|
| `thanks!` | Reply "you're welcome, anything else?" (no tools) |
| `make it red` (after a previous image) | Reply "the whole image red, or a specific part?" (no tools) |
| `change his hat to red` | Call `region_edit(region="his hat", edit_prompt="a red hat")` |
| `remove the bg, then make a sunset behind` | Call `remove_background()`, then `outpaint(fill_prompt="sunset sky")` |
| `make her shirt blue and his tie green` | Call `region_edit` twice in one turn |
| `actually use the previous one` | Ask for clarification or recall the prior image (no tools) |
These are the experiences the keyword router fundamentally couldn't ship.
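For the chained row above ("remove the bg, then make a sunset behind"), the message history the loop builds would look roughly like this. The values are invented to show the shape of the transcript, not captured from a real run:

```python
# Hypothetical transcript for a two-tool turn: each tool round is an
# assistant tool_call followed by a tiny role:"tool" result, then a final
# text reply that the server pairs with a markdown embed of the last image.
transcript = [
    {"role": "user", "content": "remove the bg, then make a sunset behind"},
    {"role": "assistant", "tool_calls": [{"name": "remove_background", "args": {}}]},
    {"role": "tool", "content": '{"success": true, "image_size_bytes": 312000}'},
    {"role": "assistant", "tool_calls": [{"name": "outpaint", "args": {"fill_prompt": "sunset sky"}}]},
    {"role": "tool", "content": '{"success": true, "image_size_bytes": 487123}'},
    {"role": "assistant", "content": "Done: background removed and a sunset painted behind."},
]
```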