# ARCHITECTURE
Component breakdown + extension points. For why this shape, see PROPOSAL.md. For setup, see INSTALLATION.md.
## System diagram
```
┌───────────────────────────────────────────────────┐
│ Client surface                                    │
│  • Open WebUI                                     │
│  • protoCLI                                       │
│  • raw curl / OpenAI SDK                          │
└────────────────────────┬──────────────────────────┘
                         │ /v1/chat/completions
                         │ /v1/images/generations
                         │ /v1/images/edits
                         ▼
┌───────────────────────────────────────────────────┐
│ LiteLLM gateway                                   │
│  • auth, retries                                  │
│  • Langfuse + Prometheus observability            │
│  • routes by `model_name` to providers            │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│ ProtoBananaProvider (this package)                │
│                                                   │
│  provider.py                                      │
│   ├── aimage_generation                           │
│   ├── aimage_edit                                 │
│   └── acompletion  ← the chat UX                  │
│          │                                        │
│          ▼                                        │
│  intents/keywords.py                              │
│   classify_operation(prompt, has_image, …)        │
│   → Operation.{GEN|EDIT|MULTIREF|BGREMOVE|        │
│                REGION_EDIT|INPAINT|OUTPAINT}      │
│          │                                        │
│          ▼                                        │
│  routes/{gen,edit,multiref,bgremove}.py           │
│   ├── load workflow JSON                          │
│   ├── substitute(prompt, seed, …)                 │
│   ├── client.upload_image (if needed)             │
│   ├── client.submit_prompt                        │
│   ├── client.wait_for_completion                  │
│   └── client.fetch_image_bytes                    │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│ ComfyUIClient (HTTP transport, no logic)          │
│  client.py                                        │
│   ├── upload_image → POST /upload/image           │
│   ├── submit_prompt → POST /prompt                │
│   ├── wait_for_completion → poll /history/<id>    │
│   └── fetch_image_bytes → GET /view               │
└────────────────────────┬──────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│ ComfyUI server                                    │
│  • workflow execution                             │
│  • smart memory: swaps UNets between calls        │
│  • models on disk in models/{diffusion_models,    │
│    text_encoders, vae, ...}                       │
└───────────────────────────────────────────────────┘
```

## Module responsibilities
### protobanana.client
Pure HTTP transport. Knows ComfyUI's `/upload/image`, `/prompt`, `/history/<id>`, and `/view` endpoints. Async via httpx. No business logic, no workflow knowledge; anyone can use this independently.
Single class: ComfyUIClient. Reusable across contexts (the LiteLLM provider, integration tests, custom scripts).
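A minimal usage sketch; the `base_url` kwarg and the exact arguments to `fetch_image_bytes` are assumptions here, so check `client.py` for the real signatures:

```python
import asyncio
from protobanana.client import ComfyUIClient
from protobanana.workflows.loader import WorkflowLoader

async def main():
    # base_url kwarg assumed; check client.py for the real constructor.
    client = ComfyUIClient(base_url="http://localhost:8188")
    loader = WorkflowLoader()

    workflow = loader.load("qwen_image_2512")         # deep copy, safe to mutate
    prompt_id = await client.submit_prompt(workflow)  # POST /prompt
    await client.wait_for_completion(prompt_id)       # poll /history/<id>
    png = await client.fetch_image_bytes(prompt_id)   # GET /view; exact args assumed
    print(f"{len(png)} bytes")

asyncio.run(main())
```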
### protobanana.workflows.loader
Loads JSON workflow templates from disk. Caches templates and returns a deep copy on each `load()` call, so callers can mutate without polluting the cache. Strips top-level keys without `class_type`, which protects against metadata-key crashes (see DECISIONS.md §0003).
Single class: WorkflowLoader. Initialized with a workflows dir path (env-overridable via PROTOBANANA_WORKFLOWS_DIR).
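A sketch of the load-and-copy behaviour described above, assuming a simple dict cache; everything beyond the `WorkflowLoader` name, `load()`, and the env var is illustrative:

```python
import copy
import json
import os
from pathlib import Path

class WorkflowLoader:
    """Caches parsed templates; hands out deep copies. Sketch only."""

    def __init__(self, workflows_dir: str | None = None):
        self._dir = Path(workflows_dir
                         or os.environ.get("PROTOBANANA_WORKFLOWS_DIR", "workflows"))
        self._cache: dict[str, dict] = {}

    def load(self, stem: str) -> dict:
        if stem not in self._cache:
            raw = json.loads((self._dir / f"{stem}.json").read_text())
            # Keep only real nodes; drop top-level metadata keys without a
            # class_type (the DECISIONS.md §0003 crash guard).
            self._cache[stem] = {k: v for k, v in raw.items()
                                 if isinstance(v, dict) and "class_type" in v}
        return copy.deepcopy(self._cache[stem])  # callers may mutate freely
```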
### protobanana.intents.keywords
Operation classifier + aspect-ratio inference. Pure functions, deterministic, no LM calls. Phase 7 may add an LM-based classifier in `intents/llm.py`; the provider would then try the keyword classifier first and fall back to the LM.
Public API:

- `Operation` enum (GEN, EDIT, MULTIREF, BGREMOVE, REGION_EDIT, INPAINT, OUTPAINT)
- `classify_operation(prompt, has_init_image, n_ref_images, explicit_mask)` → `Operation`
- `infer_size_from_prompt(prompt, default)` → `(width, height)`
Priority order (top wins) inside `classify_operation`, sketched in code below:

1. `explicit_mask=True` → INPAINT
2. `has_init_image` AND bgremove keyword → BGREMOVE
3. `has_init_image` AND outpaint keyword → OUTPAINT
4. `has_init_image` AND inpaint keyword → INPAINT
5. `has_init_image` AND sub-object pattern → REGION_EDIT
6. `n_ref_images >= 2` → MULTIREF
7. `has_init_image` → EDIT
8. otherwise → GEN
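A condensed sketch of that dispatch order. The keyword tuples, the sub-object regex, and the enum values are illustrative stand-ins, not the shipped ones:

```python
import re
from enum import Enum

class Operation(str, Enum):  # string values assumed for the sketch
    GEN = "gen"; EDIT = "edit"; MULTIREF = "multiref"; BGREMOVE = "bgremove"
    REGION_EDIT = "region_edit"; INPAINT = "inpaint"; OUTPAINT = "outpaint"

# Illustrative keyword lists -- the shipped ones live in intents/keywords.py.
_BGREMOVE_KW = ("remove background", "transparent background")
_OUTPAINT_KW = ("outpaint", "extend the image", "zoom out")
_INPAINT_KW = ("inpaint", "fill in the")
_SUBOBJECT_RE = re.compile(r"\b(change|recolor|replace) the \w+\b")  # stand-in

def classify_operation(prompt: str, has_init_image: bool = False,
                       n_ref_images: int = 0,
                       explicit_mask: bool = False) -> Operation:
    p = prompt.lower()
    if explicit_mask:
        return Operation.INPAINT
    if has_init_image and any(kw in p for kw in _BGREMOVE_KW):
        return Operation.BGREMOVE
    if has_init_image and any(kw in p for kw in _OUTPAINT_KW):
        return Operation.OUTPAINT
    if has_init_image and any(kw in p for kw in _INPAINT_KW):
        return Operation.INPAINT
    if has_init_image and _SUBOBJECT_RE.search(p):
        return Operation.REGION_EDIT
    if n_ref_images >= 2:
        return Operation.MULTIREF
    if has_init_image:
        return Operation.EDIT
    return Operation.GEN
```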
### protobanana.routes.<op>
Per-operation modules. Each owns:

- A workflow stem (`DEFAULT_STEM`) → file in `workflows/<stem>.json`
- A `substitute(workflow, ...)` function → knows the workflow's node-ID conventions
- An `async run(client, loader, ...)` coroutine that executes end-to-end and returns image bytes
Routes don't know about LiteLLM, OpenAI, chat history, or other operations. They're isolated, testable, swappable.
| Route | Stem | Substitution | Returns |
|---|---|---|---|
| gen | `qwen_image_2512` | prompt, neg_prompt, seed, width, height (nodes 6/7/3/5) | bytes |
| edit | `qwen_image_edit_2511` | prompt, neg_prompt, seed, image filename (nodes 6/7/3/4) | bytes |
| multiref | `multiref_qwen_image_2511` | prompt, neg_prompt, seed, up to 3 image filenames (nodes 6/7/3/100/101/102) | bytes |
| bgremove | `bgremove_birefnet` | image filename only (node 4) | bytes (PNG with alpha) |
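To make the per-route contract concrete, here is roughly what `routes/gen.py` could look like. Node IDs follow the table above; the input field names (`text`, `seed`, `width`, `height`) are assumptions about the workflow JSON:

```python
# routes/gen.py -- illustrative sketch, not the shipped module
import random

DEFAULT_STEM = "qwen_image_2512"

def substitute(workflow, *, prompt, neg_prompt="", seed=None,
               width=1024, height=1024):
    """Mutate a loaded (deep-copied) workflow in place; nodes 6/7/3/5 per the table."""
    workflow["6"]["inputs"]["text"] = prompt      # positive prompt
    workflow["7"]["inputs"]["text"] = neg_prompt  # negative prompt
    workflow["3"]["inputs"]["seed"] = (
        seed if seed is not None else random.getrandbits(32)
    )
    workflow["5"]["inputs"]["width"] = width      # latent size
    workflow["5"]["inputs"]["height"] = height
    return workflow

async def run(client, loader, *, prompt, **kwargs) -> bytes:
    workflow = loader.load(DEFAULT_STEM)
    substitute(workflow, prompt=prompt, **kwargs)
    prompt_id = await client.submit_prompt(workflow)
    await client.wait_for_completion(prompt_id)
    return await client.fetch_image_bytes(prompt_id)  # exact args assumed
```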
### protobanana.provider
The LiteLLM `CustomLLM` subclass. Three async entry points:

- `aimage_generation(model, prompt, …)` → direct text-to-image
- `aimage_edit(model, prompt, image, …)` → direct edit
- `acompletion(model, messages, …)` → the chat UX, auto-routes per turn
Plus helpers:

- `_extract_chat_request(messages)` → walks history, returns `(latest_user_text, all_images[:3])`
- `_coerce_image_to_bytes(image)` → bytes / file-like / str / data URL / path → bytes
- `_image_response`, `_chat_response` → build OpenAI-shaped responses
The provider is thin: pick op, call route's run(), format response. ~300 LOC.
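Condensed, the `acompletion` body reduces to an excerpt like this (a sketch, not the shipped code; `cy` stands for the `ComfyUIClient` instance as in the extension example below, the `ref_images` kwarg name is assumed, and sizes, seeds, and error handling are omitted):

```python
# Condensed excerpt from ProtoBananaProvider.acompletion() -- illustrative only.
# Assumes: from .intents.keywords import classify_operation, Operation
#          from .routes import gen, edit, multiref
text, images = self._extract_chat_request(messages)
init_images = [self._coerce_image_to_bytes(img) for img in images]

op = classify_operation(text,
                        has_init_image=bool(init_images),
                        n_ref_images=len(init_images))

if op == Operation.GEN:
    img_bytes = await gen.run(cy, self._loader, prompt=text)
elif op == Operation.MULTIREF:
    img_bytes = await multiref.run(cy, self._loader, prompt=text,
                                   ref_images=init_images)
elif op == Operation.EDIT:
    img_bytes = await edit.run(cy, self._loader, prompt=text,
                               init_image_bytes=init_images[0])
# ... remaining ops dispatch the same way
return self._chat_response(img_bytes, model=model)
```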
### workflows/
Static JSON workflows. One file per operation/model combination:
```
workflows/
├── qwen_image_2512.json           # Phase 1 - text-to-image
├── qwen_image_edit_2511.json      # Phase 1 - single-image edit
├── multiref_qwen_image_2511.json  # Phase 3 - 2-3 image compose
├── bgremove_birefnet.json         # Phase 2 - bg removal (commercial)
├── bgremove_rmbg2.json            # Phase 2 - bg removal (NC)
└── (Phase 4-6 workflows TBD)
```

Each file is a valid ComfyUI workflow that runs standalone in the ComfyUI UI for debugging. Static defaults + per-request mutations from `routes/`.
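For orientation, a loaded workflow is a flat node-ID → node mapping in ComfyUI's API format, which is what routes mutate. Roughly, with illustrative wiring and default values (not copied from the shipped files):

```python
# Rough shape of workflows/qwen_image_2512.json once loaded: node-ID -> node.
workflow = {
    "6": {  # positive prompt -- gen's substitute() rewrites inputs["text"]
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "placeholder prompt", "clip": ["38", 0]},
    },
    "3": {  # sampler -- substitute() rewrites inputs["seed"]
        "class_type": "KSampler",
        "inputs": {"seed": 0, "steps": 20, "model": ["37", 0]},
    },
    "5": {  # latent size -- substitute() rewrites width/height
        "class_type": "EmptySD3LatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 1},
    },
}
```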
## Extension points: adding a new operation
Example: adding "edge detection" as a debug operation.
1. Add to the `Operation` enum (`intents/keywords.py`):

   ```python
   EDGE_DETECT = "edge_detect"
   ```

2. Add keyword triggers + a dispatch arm in `classify_operation`:

   ```python
   _EDGE_KEYWORDS = ["show edges", "edge map", "canny edges"]
   ...
   if has_init_image and any(kw in p for kw in _EDGE_KEYWORDS):
       return Operation.EDGE_DETECT
   ```

3. Add tests (`tests/test_intents_keywords.py`):

   ```python
   def test_edge_detect():
       assert classify_operation("show edges", has_init_image=True) == Operation.EDGE_DETECT
   ```

4. Build the workflow JSON (`workflows/edge_canny.json`): a ComfyUI workflow that takes an image, runs Canny, saves the result.

5. Add the route (`protobanana/routes/edge.py`):

   ```python
   DEFAULT_STEM = "edge_canny"

   def substitute(workflow, *, image_filename): ...

   async def run(client, loader, *, init_image_bytes, **kwargs): ...
   ```

6. Register in `routes/__init__.py` + add a dispatch arm in `provider.acompletion()`:

   ```python
   elif op == Operation.EDGE_DETECT:
       img_bytes = await edge.run(cy, self._loader,
                                  init_image_bytes=init_images[0])  # plus shared kwargs
   ```

7. Optional: add a model_list entry in your gateway config:

   ```yaml
   - model_name: protolabs/qwen-image-edge
     litellm_params: { model: protobanana/edge_canny, api_base: http://comfy:8188 }
     model_info: { mode: image_edit }
   ```
That's it. ~50 LOC + 1 JSON.
## Trade-offs and why
### Markdown image embed in chat output
We return assistant content as a string with `` rather than OpenAI's multimodal content list. Trade: harder for clients to programmatically detect "this is an image", but it renders inline in any markdown UI without per-client work. See DECISIONS.md §0008.
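Concretely, building that content string takes a few lines (a sketch; the helper name is hypothetical):

```python
import base64

def _markdown_image_content(img_bytes: bytes) -> str:
    """Assistant content as a plain string; renders inline in any markdown UI."""
    b64 = base64.b64encode(img_bytes).decode("ascii")
    return f"![generated image](data:image/png;base64,{b64})"
```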
### Server-side workflow substitution
The provider mutates ComfyUI node IDs in Python, not the client. Trade: the provider must know each workflow's node-ID conventions, but every client gets the same UX with zero per-client code. See DECISIONS.md §0006.
### Per-route modules vs shared substitution
Each operation has its own `routes/<op>.py` with its own `substitute()`. We could instead share a generic `substitute(workflow, mapping)` helper. We chose per-route because:
- Conventions vary (gen uses `EmptySD3LatentImage` for size; edit uses `LoadImage` for input; multiref uses parallel chains)
- One module is easier to evolve than a shared substrate when ops diverge
- Tests stay focused (each route's tests cover only that route)
Trade: 4 modules of ~50 LOC each instead of 1 module of ~150 LOC. Acceptable.
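For contrast, the rejected shared helper would have looked roughly like this (hypothetical code, never shipped):

```python
# Hypothetical shared helper -- considered and rejected in favor of per-route modules.
def substitute(workflow: dict, mapping: dict) -> dict:
    """mapping: (node_id, input_name) -> value, e.g. {("6", "text"): prompt}."""
    for (node_id, input_name), value in mapping.items():
        workflow[node_id]["inputs"][input_name] = value
    return workflow
```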
### 3-reference cap
Hard-coded in `routes/multiref.py` (`MAX_REFS = 3`), matching Qwen-Image-Edit-2511's spec ceiling. Easy to bump if upstream changes; documented in PHASES.md as a known limitation vs nano-banana 2.
### Workflows as JSON files, not Python builders
Workflows are committed as .json files (matching ComfyUI's native format) rather than constructed by Python builders. Trade:
- ✅ Workflows can be authored/debugged in ComfyUI's UI directly
- ✅ Hot-swappable without code deploy
- ✅ Visible diffs in PRs
- ❌ Some duplication across similar workflows
- ❌ No type safety on node references
We can add a Python DSL later if duplication becomes painful. For now the JSONs are short and clear.