# JOURNEY: how protoBanana came to exist
The full backfill from "let's research the OSS nano-banana" to a published standalone repo. Read this if you want the why behind the architecture decisions, or if you're tracing why a particular config is the way it is.
## The premise
After the compound-rlm research wrapped to a natural pause, the next experiment surface was chosen: chat-native image generation + editing.
The trigger was a release-notes generator whose `<think>` block leaked into a published artifact, exposing that our gateway routing for heretic 35B-A3B-FP8 (alias: `protolabs/fast`) wasn't stripping reasoning content properly. While fixing that, we noticed a bigger gap: we had a gateway alias for Google's nano-banana 2 (`protolabs/nano-banana-2` → `gemini/nano-banana-pro-preview`), but no local equivalent.
Goal: build the OSS counterpart. One gateway alias. Same UX as nano-banana 2.
## The research phase
Mapped the OSS image gen + edit ecosystem (May 2026):
Top open-weight models (closest to nano-banana paradigm):
| Model | Notes |
|---|---|
| Qwen-Image-2512 (FP8) | Unified gen+edit, 20B MMDiT, native 2K, #1 AA Arena both gen+edit |
| HunyuanImage 3.0 Instruct | 80B MoE (13B active), CoT reasoning, top Elo β too big for our GPU |
| BAGEL-7B-MoT | ByteDance, 7B active / 14B total, has world-modeling |
| FLUX.1 Kontext | 7s/edit, best context preservation, weaker on text rendering |
Decision: Qwen-Image-2512 (gen) + Qwen-Image-Edit-2511 (edit). Already on disk via the protoLabs lab `experiments/image-gen-eval`. Best fit for our 96 GB/GPU budget.
Top chat UIs that natively render image-in-chat:
- Open WebUI (mature, already running on the ava node)
- LibreChat (alternative, more code-artifact focused)
Decision: Open WebUI (already deployed; supports inline image responses).
## The first attempt: direct ComfyUI integration
### PR #49: wire Open WebUI → ComfyUI
Architecture: Open WebUI's built-in IMAGE_GENERATION_ENGINE=comfyui mode. Set COMFYUI_BASE_URL to the protolabs node, mount a workflow JSON, expect inline image generation in chat.
```yaml
# stacks/ai/docker-compose.yml (open-webui service)
- ENABLE_IMAGE_GENERATION=true
- IMAGE_GENERATION_ENGINE=comfyui
- COMFYUI_BASE_URL=http://protolabs:8188
- COMFYUI_WORKFLOW=/app/workflows/qwen_image_2512.json
- COMFYUI_WORKFLOW_NODES=[]   # <- bug
```

PR merged. First test in chat: error.
### PR #50: fix the workflow JSON metadata key
Found two bugs:
- The workflow JSON had a top-level `_meta` key for documentation. ComfyUI iterates ALL top-level keys as node IDs and validates each has a `class_type`; `_meta` triggered `missing_node_type` → HTTP 500. Open WebUI then failed parsing the response with `Expecting value: line 1 column 1 (char 0)`.
- `COMFYUI_WORKFLOW_NODES=[]` was empty. Open WebUI doesn't do string substitution; it edits node `inputs` fields by ID per the mapping. With `[]`, every prompt rendered the workflow's static defaults ("a beautiful landscape").
PR #50 fixed both. Merged. Tested again: another error.
### The schema-mismatch fix
```
10 validation errors for ComfyUIWorkflow
nodes.0.node_ids
  Input should be a valid list [type=list_type, input_value='6', input_type=str]
```

Open WebUI v0.9.2 expects `node_ids` as a `list[str]`, not a CSV string. Earlier docs described the CSV format. One-line fix: wrap each value in `[]`.
Pushed. Tested again. Still errored, but with the original `Expecting value` symptom. Confusing.
### The hallucination realization
Logs showed no image-gen request ever reached ComfyUI. The error message in chat was the LLM hallucinating a system error: the request was just a normal chat completion (no image-gen toggle had been clicked), and the model knew it couldn't actually draw images.
The user had been typing "draw me a cat" expecting the system to infer image intent. Open WebUI's image-gen requires an explicit trigger (the image button in the input bar, or a click on an assistant message's image icon). Without that trigger, no request hits the image pipeline.
This was the fork in the road. We could:
(a) Document that users must click the image button explicitly. (b) Build inference into the routing: make the gateway smart about what's an image request vs a chat request.
We chose (b). It's the actual nano-banana experience.
## The pivot: gateway-routed image generation
### Why this is the right shape
Open WebUI's native IMAGE_GENERATION_ENGINE=comfyui had three problems:
- Brittle contract. Workflow node-ID mapping mismatches between Open WebUI versions; the env-var schema kept shifting.
- No intent inference. Required explicit UI gesture to trigger.
- Coupled to a single client. protoCLI, raw curl, or any other OpenAI SDK couldn't get image gen without re-implementing the same dance.
The clean architecture: expose ComfyUI as an OpenAI-compatible image endpoint via a LiteLLM CustomLLM provider. One gateway alias, every client gets the same UX.
```
client → gateway (LiteLLM) → CustomLLM provider → ComfyUI
```

### PR #52: initial CustomLLM provider
Built `comfyui_image.py` with an `aimage_generation` handler. LiteLLM's CustomLLM class natively supports custom image providers. Server-side workflow substitution (we mutate the workflow's nodes by ID in Python) so clients don't need to know ComfyUI's prompt schema.
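The shipped provider implements the async `aimage_generation`; the sketch below shows the same shape in its sync form, with signatures collapsed into `**kwargs` and the node ID, polling, and response construction simplified. It assumes ComfyUI's `/prompt`, `/history/<id>`, and `/view` HTTP endpoints; treat it as orientation, not the real `comfyui_image.py`.

```python
# Minimal sketch of a LiteLLM CustomLLM image provider for ComfyUI.
# Illustrative only: node ID "6", polling, and error handling are simplified.
import base64
import json
import time
import urllib.request

import litellm
from litellm import CustomLLM


class ComfyUIQwenImage(CustomLLM):
    def image_generation(self, model, prompt, api_base=None, **kwargs):
        base = api_base or "http://protolabs:8188"

        # Server-side substitution: clients never see ComfyUI's node schema.
        with open("/app/workflows/qwen_image_2512.json") as f:
            workflow = json.load(f)
        workflow["6"]["inputs"]["text"] = prompt  # "6" = positive-prompt node (illustrative)

        # Submit the graph, then poll /history until the job shows up as finished.
        req = urllib.request.Request(
            f"{base}/prompt",
            data=json.dumps({"prompt": workflow}).encode(),
            headers={"Content-Type": "application/json"},
        )
        prompt_id = json.load(urllib.request.urlopen(req))["prompt_id"]
        while True:
            history = json.load(urllib.request.urlopen(f"{base}/history/{prompt_id}"))
            if prompt_id in history:
                break
            time.sleep(1)

        # Fetch the first output image and hand it back OpenAI-images style.
        outputs = history[prompt_id]["outputs"]
        img = next(o["images"][0] for o in outputs.values() if "images" in o)
        view = (f"{base}/view?filename={img['filename']}"
                f"&subfolder={img.get('subfolder', '')}&type={img['type']}")
        png = urllib.request.urlopen(view).read()
        return litellm.ImageResponse(  # dict-to-ImageObject coercion assumed
            data=[{"b64_json": base64.b64encode(png).decode()}]
        )


handler = ComfyUIQwenImage()
```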
```yaml
# stacks/ai/config/litellm/config.yaml
model_list:
  - model_name: protolabs/qwen-image
    litellm_params:
      model: comfyui-qwen-image/qwen_image_2512
      api_base: http://protolabs:8188
    model_info: { mode: image_generation }

litellm_settings:
  custom_provider_map:
    - { provider: "comfyui-qwen-image", custom_handler: providers.comfyui_image.handler }
```

Open WebUI swapped from the broken `IMAGE_GENERATION_ENGINE=comfyui` setup (8 env vars) to the standard `=openai` engine pointing at the gateway (4 env vars).
### PR #53: edit + chat-completions
Extended the provider with an `aimage_edit` (for `/v1/images/edits`) and an `acompletion`: chat completions with image output. The latter is where the nano-banana UX lives:
- Walks chat history newest → oldest
- Finds the latest user text (the instruction)
- Finds the latest image (user-attached, or in a prior assistant turn's markdown-embedded data URL)
- Image found → edit mode; no image → gen mode
- Returns an assistant message with a markdown-embedded `data:image/png;base64,...` image
This bypasses Open WebUI's image-gen toggle entirely. You just chat with the model alias and follow-ups auto-edit the previous turn's image.
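A condensed sketch of that history walk, assuming OpenAI-style message dicts; the regex and field handling are illustrative, not the shipped extraction code:

```python
import re
from typing import Optional

DATA_URL = re.compile(r"data:image/(?:png|jpe?g|webp);base64,[A-Za-z0-9+/=]+")

def extract_intent(messages: list[dict]) -> tuple[str, Optional[str], str]:
    """Walk newest -> oldest; return (instruction, image_or_none, mode)."""
    instruction, image = "", None
    for msg in reversed(messages):
        content = msg.get("content") or ""
        if isinstance(content, list):  # multimodal parts (user-attached image)
            for part in content:
                if part.get("type") == "image_url" and image is None:
                    image = part["image_url"]["url"]
                elif part.get("type") == "text" and msg["role"] == "user" and not instruction:
                    instruction = part["text"]
        else:
            if msg["role"] == "user" and not instruction:
                instruction = content
            if image is None:
                m = DATA_URL.search(content)  # prior assistant turn's embedded image
                if m:
                    image = m.group(0)
        if instruction and image:
            break
    return instruction, image, "edit" if image else "generate"
```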
Plus: aspect-ratio inference from the prompt. Chat completions have no `size` field, so we extract intent from keywords ("landscape", "hero banner", "16:9", "portrait", etc.) mapped to Qwen-Image's native sweet-spot dimensions.
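The keyword map is roughly this shape; the exact keyword list and resolutions shipped in protoBanana may differ:

```python
# Illustrative keyword -> (width, height) table, aimed at Qwen-Image-friendly sizes.
ASPECT_HINTS = {
    ("landscape", "hero banner", "widescreen", "16:9"): (1664, 928),
    ("portrait", "phone wallpaper", "9:16"):            (928, 1664),
    ("square", "avatar", "icon", "1:1"):                (1328, 1328),
}

def infer_size(prompt: str, default=(1328, 1328)) -> tuple[int, int]:
    text = prompt.lower()
    for keywords, size in ASPECT_HINTS.items():
        if any(k in text for k in keywords):
            return size
    return default
```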
Tested via direct curl against the gateway: worked. Open WebUI: a `node_ids` list-validation failure (pushed fix 6ba7ace). Then a separate user request that did NOT go through the image button still produced a hallucinated "system error", because the gateway path needs the `protolabs/qwen-image-chat` model selected rather than the default text routing.
## VRAM and the GPU planning fight
ComfyUI on the protolabs node is pinned to GPU 1 (CUDA_VISIBLE_DEVICES=1). GPU 1 already hosts:
- vllm-fast.service (heretic 35B-A3B-FP8, alias `protolabs/fast`): was at 71.8 GB
- Fish S2 TTS (protoVoice): 19.8 GB
- Qwen3-Embedding-0.6B (embed server): 2.0 GB
Total used: ~94 GB / 96 GB. Free: 3 GB. Qwen-Image-2512 needs ~30 GB peak. Won't fit.
### The fix
Dropped vllm-fast `--gpu-memory-utilization` from 0.73 → 0.42 and `--max-num-seqs` from 512 → 128 (the Mamba-cache-block limit said "max 137" at 0.42 util; 128 leaves margin).
| Service | Before | After |
|---|---|---|
| vllm-fast | 71.8 GB | 41.5 GB |
| Total used | 94 GB | 64 GB |
| Free | 3 GB | 33 GB |
Image gen verified end-to-end: Qwen-Image-2512 1024×1024 20-step generation in 28s, peak 91.5 GB / 96 GB on GPU 1 (5.5 GB headroom). First image saved at `/mnt/data/comfyui/output/compound-test_00001_.png`.
Trade-off: heretic's KV cache budget went from ~250K tokens → ~54K tokens. For typical chat (4-32K context), zero impact. For occasional long-context requests, chunked prefill handles it.
### `--kv-cache-dtype fp8` would unlock 2× capacity; it's still broken on Blackwell
Tried adding --kv-cache-dtype fp8 to vllm-fast. Failed:
```
RuntimeError: FlashInfer requires GPUs with sm75 or higher
```

A misleading error: Blackwell is sm120, way higher than sm75, but FlashInfer's sm-version check rejects sm120 specifically. Re-verified broken 2026-05-03 in the lab CLAUDE.md. Reverted.
## The destructive-ops cleanup gotcha
Before we settled on the model layout, we downloaded Qwen-Image-2512 to `/mnt/models/qwen-image-2512/` via `huggingface-cli download --local-dir`, not realizing it was already in the cache (symlinks from March 18). We "cleaned up" the duplicate with `rm -rf` and broke 6 ComfyUI symlinks, because `huggingface-cli download --local-dir` shares hardlinked blob inodes with the cache. Deleting the local dir cascaded into the cache.
Lesson learned (now in lab CLAUDE.md):
`huggingface-cli download --local-dir <path>` shares blob inodes with the cache. The cache snapshots and your `--local-dir` files are hardlinks to the same underlying blob. `rm -rf` on either path can decrement inode refs and orphan the other. Two safe patterns:

- Cache-only download (`HF_HOME=/mnt/models/huggingface huggingface-cli download <repo> <files>`, no `--local-dir`)
- If using `--local-dir`, also pass `--local-dir-use-symlinks False` so files are full copies, decoupled from the cache.
Restored from a fresh cache-only download.
## The research-deep-and-wide pass
After Phase 1 (gen + edit + chat-completions) shipped, the user asked: "how do we match GPT Image 2 / Nano Banana 2's full capability?"
Mapped each capability to OSS:
| Capability | OSS replica |
|---|---|
| Text-to-image | Qwen-Image-2512 ✓ |
| Instruction edit | Qwen-Image-Edit-2511 ✓ |
| Multi-reference compose | Qwen-Image-Edit-2511 (3-ref cap) |
| Background removal / sticker | BiRefNet (commercial use OK) or RMBG-2.0 (non-commercial) |
| Text-region edit ("change the man's tie") | Florence-2 + SAM 2.1 |
| Inpaint with brushed mask | LanPaint |
| Outpaint | Same engine + edge mask |
| Multi-image fusion | Qwen-Image-Edit-2511 (cap 3) |
Key discovery: ComfyUI-RMBG bundles RMBG-2.0, BiRefNet, BEN/BEN2, INSPYRENET, SDMatte, SAM/SAM2/SAM3, AND GroundingDINO under one custom-node install. One install lights up Phase 2, Phase 4, Phase 6 dependencies.
License gotcha: RMBG-2.0 is CC BY-NC 4.0 (non-commercial). BiRefNet is open. Defaulted to BiRefNet; offered RMBG-2.0 as opt-in.
Hard ceiling: Qwen-Image-Edit-2511 maxes at 3 reference images. Nano-Banana 2 supports 14. We can't compete on this axis until Qwen ships a higher-ref variant. Documented; offered cloud-fallback path for >3-ref.
## The repo extraction
The provider was originally inline in protoLabsAI/homelab-iac at stacks/ai/config/litellm/providers/comfyui_image.py. As Phases 2-7 came into scope, the inline approach hit limits:
- Hard to test (no isolated test suite)
- Hard to share (locked in private homelab-iac repo)
- Hard to publish (no clear artifact)
- Hard to track (commits mixed with infra changes)
Extracted to standalone protoLabsAI/protoBanana:
- Apache-2.0 license
- pip-installable (`pip install git+https://github.com/protoLabsAI/protoBanana`)
- Workflows shipped alongside the package
- Test suite (46 unit tests covering the intent classifier, workflow loader, and chat extraction)
- Full docs: README, PROPOSAL, PHASES, JOURNEY (this file), HOWTO, ARCHITECTURE, INSTALLATION, OPERATING, WORKFLOWS-COOKBOOK, INTENT-ROUTER, API, BENCHMARKS, DECISIONS, CHANGELOG
The homelab-iac integration changes from mounting the inline provider directory to `pip install protobanana` and setting `custom_handler: protobanana.handler`.
## What we did NOT solve and why
- 3-reference cap. Qwen-Image-Edit-2511 ceiling. Wait for upstream.
- `--kv-cache-dtype fp8` on Blackwell. FlashInfer sm120 support. Wait for upstream.
- LM-based intent classifier. Phase 7. The keyword classifier covers ~95%; the LM router is polish, not a blocker.
- Phase 4-6 implementation. Specced; not built. Each is ~half day to 1-2 days of work; sequenced behind validation that Phases 1-3 actually stick in production.
## Lessons that survived
- One stable contract beats many integrations. The gateway alias works for every client. We don't have to chase Open WebUI's image-gen schema shifts, protoCLI's, or anyone else's.
- Server-side substitution is the only durable path. Letting clients format ComfyUI workflows is a bug factory; clients change too fast.
- Strip metadata keys from workflow JSONs. ComfyUI iterates top-level keys; one stray `_meta` crashes the worker.
- `--local-dir` on huggingface-cli is dangerous. Cache-only downloads are the safe default.
- VRAM planning is real engineering. "Just add more memory" doesn't exist on a fixed-budget homelab; smart memory swapping matters.
- Hallucinated error messages from a chat LLM look exactly like real errors. Always check the upstream service's logs before chasing the client's error text.
## The conditioning bug
After 0.1.0a1 shipped the Gradio app and we walked away thinking it was solid, the user opened the Edit tab, uploaded an image, sent a prompt, and got back an image with no relationship to the input. Multi-ref had the same problem.
The workflows passed every test we had. The static validator (added the same day to gate exactly this kind of issue) said all 5 workflows were fine. ComfyUI accepted the submissions, returned prompt_ids, executed to completion, and emitted images. The 46 unit tests stayed green.
The bug was structural, not syntactic. Both edit-class workflows wired the input image two ways:
- `LoadImage → ImageScale → VAEEncode → latent_image` for KSampler
- ...nothing else.

CLIPTextEncode was wired into positive/negative, but CLIPTextEncode only sees text. With `denoise=1.0` (correct for instruction edit), the `latent_image` gets fully overwritten with random noise. So the model ran pure text-to-image with the input's spatial dimensions. Hence: a fresh, unrelated image.
The right node is `TextEncodeQwenImageEditPlus`: it takes prompt + clip + vae + image1/image2/image3 and pipes the image into Qwen2.5-VL's vision tower. Both positive and negative get the same image so the model has consistent visual context.
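In ComfyUI API-format JSON, the corrected wiring looks roughly like this; node IDs, sampler settings, and the negative branch are illustrative:

```python
# Sketch of the corrected edit graph (ComfyUI API format, expressed as a Python dict).
corrected_nodes = {
    "10": {"class_type": "LoadImage",
           "inputs": {"image": "input.png"}},
    "20": {"class_type": "TextEncodeQwenImageEditPlus",   # sees text AND pixels
           "inputs": {"prompt": "change the white circle to a yellow star",
                      "clip": ["38", 0], "vae": ["39", 0],
                      "image1": ["10", 0]}},
    "30": {"class_type": "KSampler",
           "inputs": {"model": ["37", 0],                  # the Qwen-Image-Edit checkpoint
                      "positive": ["20", 0],               # image-aware conditioning
                      "negative": ["21", 0],               # same image on the negative side
                      "latent_image": ["40", 0],           # VAEEncode of the input
                      "denoise": 1.0, "steps": 20, "cfg": 4.0,
                      "seed": 0, "sampler_name": "euler", "scheduler": "simple"}},
}
```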
Verification was straightforward once the fix was in place:
- Input: red 768×768 + white circle
- Prompt: "change the white circle to a yellow star, keep the red background"
- Output: red 768×768 + yellow star (avg RGB 225,49,29)
The fix landed in protoBanana#3 and homelab-iac#55. ADR 0011 documents the encoder choice; ADR 0012 documents why we now treat "static validation passes" as necessary-but-not-sufficient and added an e2e smoke pattern.
The lesson that mattered most: schema validation answers "will the runtime accept this graph"; only end-to-end answers "will the model do the work." We had been operating as if those were the same question. They are not. The validator we'd just built passed the broken workflow because the broken workflow was structurally valid β the bug was in what the workflow meant relative to the model loaded at node 37.
That's also why the bug survived through PR review, validator runs, and 46 unit tests. The whole chain was checking spelling on a grammatically-valid sentence with the wrong meaning.
## Phases 4-7 land, and the chat path becomes an agent
Once the conditioning fix shipped and the gateway started serving Edit / Sticker / Multi-ref correctly, the natural next step was the remaining ChatGPT-image-2 capabilities: region edit, inpaint, outpaint. What we discovered along the way changed the shape of the chat path itself.
### Phase 4 didn't go where we planned
The plan was Florence-2 + SAM 2.1 for text-grounded region edit ("change the man's tie"). Both nodes existed in the ComfyUI install, and both crashed at runtime: `BertModel.get_head_mask` had been removed in a newer transformers, taking GroundingDINO with it, and Florence-2's bundled config dropped `forced_bos_token_id` in the same release. Our ComfyUI's transformers version was newer than the grounding nodes expected.
Three options sat on the table: pin transformers in ComfyUI's venv (risk of breaking Qwen-Image-Edit), find a different grounding stack, or skip auto-grounding and ship brushed-mask UX as Phase 5 first.
The user reframed: brushed mask is not the priority; the agent should generate the mask, not the user. That kept Phase 4 alive but needed a working text → mask grounder. SAM 3 (SAM3Segment in ComfyUI-RMBG) turned out to bundle its own grounding without DINO or Florence dependencies. One probe, "white circle" → clean polygon mask in 30 seconds. Wired that into a new region_edit workflow:
```
LoadImage → SAM3Segment(prompt=grounding_text) → GrowMask →
TextEncodeQwenImageEditPlus(prompt + image) →
InpaintModelConditioning(noise_mask=true) → KSampler →
ImageCompositeMasked(destination=init, source=edited, mask=sam_mask)
```

The composite at the end is what makes the technique honest: outside-mask pixels come from the original at byte level, not through the VAE roundtrip. Verified with a corner-pixel diff: 0 drift on a "change the white circle to a yellow star" run.
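That check is a few lines with Pillow; paths and patch size are illustrative:

```python
from PIL import Image, ImageChops

init   = Image.open("region_edit_input.png").convert("RGB")
edited = Image.open("region_edit_output.png").convert("RGB")

def corner_drift(a: Image.Image, b: Image.Image, size: int = 32) -> int:
    """Max per-channel byte difference across four corner patches (outside the mask)."""
    w, h = a.size
    boxes = [(0, 0, size, size), (w - size, 0, w, size),
             (0, h - size, size, h), (w - size, h - size, w, h)]
    return max(
        channel_max
        for box in boxes
        for (_, channel_max) in ImageChops.difference(a.crop(box), b.crop(box)).getextrema()
    )

assert corner_drift(init, edited) == 0  # outside-mask pixels untouched at byte level
```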
### The chat-path direction the user actually wanted
Phase 6 (outpaint) and Phase 7 went in cleanly back-to-back. Phase 7 as built was an "optional LM second-pass classifier": fire the LM only when the keyword router picked the catch-all EDIT/GEN, and let it refine to something more specific. It worked but felt minor. The user pushed back:
> wait, no. i want an llm to be the router, we have protolabs/fast to get quick feedback from and this would power the chat portion of the agent as well, yeah?
That was the real ask. The keyword router had been the chat path since day one β every chat turn ran one image op and returned an image. It couldn't say "you're welcome", couldn't ask "did you mean her hat or his?", couldn't chain remove_background then outpaint in one reply. Those weren't fixable with smarter keywords. The chat path needed an LLM that owned the conversation surface, with image ops as tools.
Built that. Tool definitions in `protobanana/tools.py`, agent loop in `protobanana/agent.py`. The LLM never sees image bytes: server-side state holds the most recent image, and tool results returned to the LLM are tiny `{success, size}` dicts. Three-iteration cap. The keyword classifier is kept as a fallback when no LM endpoint is configured.
ADR 0013 captures the full context.
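A compressed sketch of that loop. The module layout matches the real files, but the tool schema, dispatch function, and config constants here are illustrative stand-ins, not the shipped `protobanana/agent.py`:

```python
# Sketch of the agent loop: the LLM owns the conversation, image ops are tools.
import json
from openai import AsyncOpenAI

AGENT_BASE  = "http://gateway:4000/v1"   # PROTOBANANA_AGENT_BASE in the real deployment
AGENT_MODEL = "protolabs/fast"
MAX_ITERATIONS = 3

async def run_agent(messages: list[dict], state: dict, tool_defs, dispatch) -> str:
    client = AsyncOpenAI(base_url=AGENT_BASE, api_key="sk-anything")
    for _ in range(MAX_ITERATIONS):
        resp = await client.chat.completions.create(
            model=AGENT_MODEL, messages=messages, tools=tool_defs,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:            # plain text answer ("you're welcome") -> done
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            # Tools read/write the image via server-side state; the LLM never sees bytes.
            result = await dispatch(call.function.name, args, state)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),   # tiny {"success": ..., "size": ...} dict
            })
    return "Hit the tool-iteration cap; returning the latest image."
```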
### Two production bugs the build didn't catch
First deploy: chat hung for 2+ minutes per turn, no error, no progress. Diagnosis took longer than the fix because the proxy logs showed nothing; the request just sat. The root cause was embarrassingly simple: the agent used the synchronous OpenAI client inside an async coroutine, and its `PROTOBANANA_AGENT_BASE` points at the same gateway. The sync call held the event loop while waiting on a response that needed the same loop to be served. Instant deadlock, invisible to logs. Fix: AsyncOpenAI + await. ADR 0014.
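The failure in miniature (names illustrative; the point is the blocking call inside the very event loop that has to serve it):

```python
from openai import OpenAI, AsyncOpenAI

AGENT_BASE = "http://localhost:4000/v1"   # the same LiteLLM gateway process (illustrative)

# BEFORE: a sync client inside an async handler. The call blocks the event loop,
# and because it targets the gateway that is itself waiting on this coroutine,
# the response can never be scheduled -> silent deadlock, nothing in the logs.
async def agent_turn_broken(messages):
    client = OpenAI(base_url=AGENT_BASE, api_key="sk-anything")
    return client.chat.completions.create(model="protolabs/fast", messages=messages)

# AFTER: async client + await. The event loop stays free to serve the request
# this call depends on.
async def agent_turn(messages):
    client = AsyncOpenAI(base_url=AGENT_BASE, api_key="sk-anything")
    return await client.chat.completions.create(model="protolabs/fast", messages=messages)
```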
Second deploy: the agent ran but Langfuse traces stopped emitting. LiteLLM's own callback failed at boot:
```
Langfuse.__init__() got an unexpected keyword argument 'sdk_integration'
```

LiteLLM hard-pins langfuse v2.59.7 and constructs `Langfuse(sdk_integration=...)`. Our `[tracing]` extra had requested `langfuse>=3.0,<4`, and v3 dropped the kwarg. Pip resolved to v3, and LiteLLM's callback registration broke. Pinned `[tracing]` back to v2; our v3-only `_tracing.py` fails the import gracefully on v2 and falls back to no-op spans. Trade: only LiteLLM's per-request traces emit until a v2 adapter ships in protoBanana. ADR 0015.
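The graceful degradation in `_tracing.py` is essentially a guarded import; a simplified sketch, assuming langfuse v3's `get_client` / `start_as_current_span` API:

```python
# Sketch of the _tracing.py fallback: real spans on langfuse v3, no-ops otherwise.
from contextlib import nullcontext

try:
    from langfuse import get_client   # v3-only entry point; missing/different on v2
    _lf = get_client()

    def span(name: str, **attrs):
        return _lf.start_as_current_span(name=name, **attrs)

except Exception:                      # langfuse v2 (LiteLLM's pin) or not installed
    def span(name: str, **attrs):
        return nullcontext()
```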
Both bugs slipped past my smoke imports: sync vs async is runtime semantics invisible to `import openai`, and the Langfuse kwarg mismatch fires at LiteLLM's callback registration, not at module import. The lesson: a real boot-and-poke smoke (spin up litellm and hit an endpoint) would have caught both. Build-time imports give a false sense of safety on the integration surface.
### The prompt-anchoring bug
The third deploy hit the actual user-facing UX:
> close, it was able to infer what was i asking for, but it didn't call the edit, it just created a new cat in a hat
The user had typed "draw a cat in a hat", gotten an image, then said "make it a bowling cap". The agent saw the prior assistant image and still picked `generate_image`. The prompt rules said the right thing in English but framed the image-in-conversation as informational ("the recent assistant image is available for edit_image..."). At T=0, Qwen3.6-35B doesn't promote informational context to a routing constraint; it's just background.
Rewrote the prompt as a directive contract: "If the conversation already contains an image, the user's next message is almost always about THAT image." Plus few-shot examples for the actually-confusing cases ("make it a bowling cap" → `region_edit`, "make it watercolor" → `edit_image`, "now draw a dog instead" → `generate_image`, "thanks!" → text reply, no tool). Verified live against vLLM local-fast: 7/8 cases route deterministically.
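The live verification was essentially a table-driven loop against the gateway at temperature 0; a sketch (the tool schemas and the seeded image history are assumed to exist elsewhere):

```python
# Routing regression cases mirroring the few-shot examples; None = text reply, no tool.
CASES = [
    ("make it a bowling cap",  "region_edit"),
    ("make it watercolor",     "edit_image"),
    ("now draw a dog instead", "generate_image"),
    ("thanks!",                None),
]

async def check_routing(client, model, history_with_image, tool_defs):
    passed = 0
    for user_text, expected in CASES:
        resp = await client.chat.completions.create(
            model=model,
            messages=history_with_image + [{"role": "user", "content": user_text}],
            tools=tool_defs,
            temperature=0,             # deterministic routing
        )
        calls = resp.choices[0].message.tool_calls
        got = calls[0].function.name if calls else None
        passed += int(got == expected)
    return passed, len(CASES)
```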
The lesson worth keeping: smaller routing models need anchored prompts + examples; larger models don't. When protolabs/fast isn't enough, `PROTOBANANA_AGENT_MODEL=protolabs/smart` (Qwen3.6-27B-FP8 thinking) is one env var away; no rebuild.
By the end of this session β a few hours of iteration, not days β the chat agent owned the chat surface, all seven phases had shipped, and the deploy-after-deploy stack of fixes had made the gateway integration robust. The architecture reached a stable shape: keyword classifier as guaranteed fallback, LM agent as the production default, Langfuse tracing through LiteLLM's callback for now, fine-grained sub-spans deferred behind the v2 adapter.
The compressed timeline matters. Everything from the conditioning fix through the agent deploy through the prompt-anchoring follow-up landed inside a single working session, with the user running real chats against the live gateway between iterations and reporting back. That tight loop is why the JOURNEY reads less like a project history and more like a debugging conversation β because that's what it was.