# PHASES: protoBanana roadmap
Each phase = one new operation. Adding a phase = one route module + one workflow JSON + one keyword set + one Operation enum value. See PROPOSAL.md §3.5 for the cost model.
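A minimal sketch of that wiring; the module layout and the exact enum members for later phases are assumptions, not the shipped code:

```python
# Illustrative wiring only; names and layout are assumptions.
from enum import Enum, auto

class Operation(Enum):
    GEN = auto()
    EDIT = auto()
    BGREMOVE = auto()     # Phase 2
    MULTIREF = auto()     # Phase 3
    REGION_EDIT = auto()  # Phase 4
    INPAINT = auto()      # Phase 5
    OUTPAINT = auto()     # Phase 6

# One workflow JSON per operation; adding a phase adds one row here
# plus its route module and keyword set.
WORKFLOWS = {
    Operation.GEN: "qwen_image_2512.json",
    Operation.EDIT: "qwen_image_edit_2511.json",
    Operation.BGREMOVE: "bgremove_birefnet.json",
}
```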
## Status legend

- ✅ shipped: code merged, tests pass, integrated end-to-end
- 🚧 in flight: PR open
- 📋 queued: spec written, waiting on prior phase or model dep
- 🚫 blocked: external dep missing (pinned model, upstream bug)
- ❌ deferred: ruled out for v1
## ✅ Phase 1: Gen + edit + chat-completions
Operation: GEN, EDIT (auto-routes via chat-completions)
Workflows: `qwen_image_2512.json`, `qwen_image_edit_2511.json`
Models:
- `Comfy-Org/Qwen-Image_ComfyUI/split_files/diffusion_models/qwen_image_2512_fp8_e4m3fn.safetensors` (~20 GB)
- `Comfy-Org/Qwen-Image-Edit_ComfyUI/split_files/diffusion_models/qwen_image_edit_2511_fp8mixed.safetensors` (~21 GB)
- Shared text encoder: `qwen_2.5_vl_7b_fp8_scaled.safetensors` (~8.8 GB)
- Shared VAE: `qwen_image_vae.safetensors` (~243 MB)
Acceptance:
- ✅ `/v1/images/generations` returns a base64 image
- ✅ `/v1/chat/completions` with a text-only message → image inline
- ✅ Follow-up turn with a prior assistant image → edit, not regenerate
- ✅ Aspect-ratio inference from prompt ("portrait", "16:9", "hero", etc.; sketched after this list)
- ✅ 46 unit tests pass
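For illustration, aspect-ratio inference can be a ratio regex plus a keyword table; the hint words come from the acceptance item above, but the target sizes below are assumptions, not the shipped defaults:

```python
import re

# Assumed size table; the real defaults live in the provider.
RATIO_HINTS = {
    "portrait": (832, 1216),
    "landscape": (1216, 832),
    "hero": (1344, 768),
    "square": (1024, 1024),
}

def infer_size(prompt: str, default=(1024, 1024)) -> tuple[int, int]:
    # Explicit ratios like "16:9" win over soft keywords.
    m = re.search(r"\b(\d{1,2}):(\d{1,2})\b", prompt)
    if m:
        w, h = int(m.group(1)), int(m.group(2))
        scale = (1024 * 1024 / (w * h)) ** 0.5  # keep ~1 MP total
        return int(w * scale) // 8 * 8, int(h * scale) // 8 * 8
    for hint, size in RATIO_HINTS.items():
        if hint in prompt.lower():
            return size
    return default
```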
Lessons learned:
- Open WebUI's native `IMAGE_GENERATION_ENGINE=comfyui` is brittle (workflow JSON node-ID mappings mismatch between Open WebUI versions). Server-side substitution via this provider is the durable path.
- ComfyUI iterates ALL top-level keys as nodes → metadata keys (`_meta`, `_doc`) crash the worker. The loader strips them (sketch below).
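A minimal sketch of that loader fix (the real loader may do more):

```python
import json
from pathlib import Path

def load_workflow(path: str) -> dict:
    """Load an API-format workflow, dropping non-node metadata keys.

    ComfyUI treats every top-level key as a node, so documentation
    keys like "_meta" or "_doc" would crash the worker.
    """
    graph = json.loads(Path(path).read_text())
    return {k: v for k, v in graph.items() if not k.startswith("_")}
```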
## ✅ Phase 2: Background removal / sticker
Operation: BGREMOVE
Workflows:
- `bgremove_birefnet.json`: default, commercial-safe (BiRefNet, ~85% quality vs SOTA)
- `bgremove_rmbg2.json`: opt-in, non-commercial only (RMBG-2.0, ~90% quality, CC BY-NC 4.0)
Models (auto-downloaded by ComfyUI-RMBG node pack on first use):
- BiRefNet-general (~440 MB)
- RMBG-2.0 (~177 MB), only if used
Custom node dependency: ComfyUI-RMBG (1038lab, GPL-3.0). Bundles BiRefNet, RMBG-2.0, BEN/BEN2, INSPYRENET, SDMatte, SAM/SAM2/SAM3, GroundingDINO under one install, which also covers the Phase 4 dependencies.
Intent triggers: "remove background", "transparent png", "as a sticker", "alpha background", "knock out the background", etc. (see `intents/keywords.py`)
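An illustrative subset of the check; the shipped list lives in `intents/keywords.py`:

```python
# Illustrative subset, not the full trigger list.
BGREMOVE_KEYWORDS = (
    "remove background", "transparent png", "as a sticker",
    "alpha background", "knock out the background",
)

def wants_bgremove(prompt: str) -> bool:
    p = prompt.lower()
    return any(k in p for k in BGREMOVE_KEYWORDS)
```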
Acceptance:
- ✅ Init image + bg-remove keyword in chat → transparent PNG output
- ✅ No init image → falls back to GEN (we don't generate stickers from text alone in v1)
- ✅ Workflow ships as both BiRefNet (default) and RMBG-2.0 (opt-in)
- ✅ Intent classifier tests cover all keywords
## ✅ Phase 3: Multi-reference (2-3 images)
Operation: MULTIREF
Workflow: `multiref_qwen_image_2511.json`
Model: Qwen-Image-Edit-2511 (already loaded for Phase 1).
Acceptance:
- ✅ Provider walks the ENTIRE chat history collecting images, not just the latest (sketched after this list)
- ✅ Hard cap at 3 (Qwen-Image-Edit-2511 ceiling)
- ✅ ≥2 images present → routes to MULTIREF, not single EDIT
- ✅ Workflow uses parallel LoadImage → ImageScale → VAEEncode chains, conditioning-stacked
- ✅ Tests verify image collection order and cap
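A sketch of the history walk, assuming OpenAI-style multimodal message parts; which end of the history the cap keeps is a detail of the real provider:

```python
def collect_reference_images(messages: list[dict], cap: int = 3) -> list[str]:
    """Walk the whole chat history, oldest first, gathering image parts.

    Keeps at most `cap` images (Qwen-Image-Edit-2511 accepts 3 refs).
    Payload shape assumed: OpenAI-style "image_url" content parts.
    """
    images = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    images.append(part["image_url"]["url"])
    return images[-cap:]  # assumption: keep the most recent refs
```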
Limitation: Qwen-Image-Edit-2511 maxes out at 3 reference images; Nano-Banana 2 supports 14. We don't compete on this axis until Qwen ships a higher-ref variant; recommend cloud fallback for >3-ref tasks.
## 📋 Phase 4: Region edit by text (Florence-2 + SAM 2.1)
Operation: REGION_EDIT
The killer feature. User says "change the man's tie to red" → Florence-2 finds the bounding box from the text, SAM 2.1 generates a pixel-precise mask, and Qwen-Image-Edit inpaints just that region.
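The grounding chain itself lives in the workflow JSON; per the Phase 1 lesson, the provider just substitutes the target phrase, instruction, and image server-side. A sketch with placeholder node IDs (the real IDs come from `region_edit_florence2_sam2_qwen.json`):

```python
import copy

def build_region_edit(workflow: dict, target: str, instruction: str,
                      image_name: str) -> dict:
    """Fill the Phase 4 workflow's inputs. Node IDs ("florence2",
    "edit_prompt", "load_image") are placeholders, not the real ones."""
    wf = copy.deepcopy(workflow)  # don't mutate the loaded template
    wf["florence2"]["inputs"]["text_input"] = target    # e.g. "the man's tie"
    wf["edit_prompt"]["inputs"]["text"] = instruction   # e.g. "make it red"
    wf["load_image"]["inputs"]["image"] = image_name
    return wf
```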
Workflows:
- `region_edit_florence2_sam2_qwen.json`: single-shot text-grounded edit
Models (added on top of Phase 1+2):
- `microsoft/Florence-2-large` (~770 MB): text → bounding box
- `facebook/sam2-hiera-base-plus` or smaller (~150 MB-2.6 GB): bbox → mask
Custom node: ComfyUI-RMBG already includes Florence-2-SAM2 nodes (one install, multiple capabilities).
Intent triggers (already in keyword classifier):
"change the X to Y","change her X","change his X""replace the X","replace her X""just the X","only the X"
VRAM impact: Florence-2 + SAM 2.1 base together add ~3 GB peak when invoked. ComfyUI's smart memory swaps them in and out, so per-request peak stays in the 30-33 GB range (see the VRAM table under Cross-phase notes).
Acceptance criteria:
- [ ] `region_edit_florence2_sam2_qwen.json` workflow ships
- [ ] Provider routes correctly: prompt has sub-object pattern + has init image → REGION_EDIT
- [ ] Mask quality verified: 5 hand-crafted "change the X" prompts produce visibly correct masks (eyeball-tested)
- [ ] Edit fidelity: edited region looks like the request, surrounding pixels unchanged
- [ ] Latency: ≤60s per region edit on a cold model (≤20s warm)
- [ ] Tests: 6 new region-edit cases in `test_intents_keywords.py`, plus an integration test against ComfyUI in `tests/integration/`
Open questions:
- Should we expose mask-output mode (return the mask alongside the image) for client-side debugging?
- Florence-2 vs Grounding-DINO 1.5: Florence-2 is smaller and already integrated; Grounding-DINO 1.5 is more accurate. Default to Florence-2; allow Grounding-DINO via workflow swap.
Risks:
- Florence-2's text-to-bbox accuracy on small/occluded objects is uneven. The Phase 7 LM router could pre-validate the target.
- SAM 2.1's mask quality on transparent/glassy materials is poor. Document as a known limitation.
## 📋 Phase 5: Inpaint with brushed mask (LanPaint)
Operation: INPAINT
Use case: Open WebUI lets users brush a mask over a generated image, then prompt for what to fill. Our provider receives the mask in the multimodal request payload and routes to LanPaint.
Workflow: `inpaint_lanpaint.json`
Model dependency: LanPaint, a universal training-free inpainting approach that works with any Qwen-Image variant. Custom node install; no separate model file (uses the already-loaded Qwen-Image-Edit).
Intent triggers:
- Brushed mask present in the request → INPAINT regardless of wording (this rule wins; see the sketch after this list)
- Keyword fallback: "inpaint", "fill in", "fill the masked area"
Acceptance:
- [ ] LanPaint node installed in ComfyUI
- [ ] Workflow accepts (image, mask, prompt), produces a seamless fill
- [ ] Provider extracts the mask from the multimodal payload (Open WebUI sends it as a separate `image_url` part with a role hint, or as a discrete file in the multipart request)
- [ ] Tests cover mask extraction and fallback behavior
Open questions:
- Does Open WebUI's image-mask UI emit OpenAI-standard mask payloads? May need a small adapter for their specific format.
## 📋 Phase 6: Outpaint
Operation: OUTPAINT
Use case: "extend this scene to the left", "make this wider", "uncrop".
Approach: No new model. Pad the canvas in the requested direction; create a feathered edge mask covering the new area; route through the inpaint workflow (Phase 5) to fill.
Workflow: `outpaint_qwen.json`, which composes canvas-pad + edge-mask + LanPaint.
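A sketch of the canvas-pad + feathered-mask step, assuming a numpy HxWxC image; the 32 px feather width is an assumption, not a shipped value:

```python
import numpy as np

def pad_and_mask(img: np.ndarray, left=0, right=0, top=0, bottom=0,
                 feather=32):
    """Pad the canvas and build the fill mask for the inpaint workflow.

    Mask is 1.0 over new pixels and fades to 0.0 across `feather` px
    inside the original image so the seam blends.
    """
    h, w, c = img.shape
    canvas = np.zeros((top + h + bottom, left + w + right, c), img.dtype)
    canvas[top:top + h, left:left + w] = img
    mask = np.ones(canvas.shape[:2], np.float32)
    mask[top:top + h, left:left + w] = 0.0
    r = np.linspace(1.0, 0.0, feather, endpoint=False)  # seam -> interior
    if left:
        mask[:, left:left + feather] = np.maximum(
            mask[:, left:left + feather], r)
    if right:
        mask[:, left + w - feather:left + w] = np.maximum(
            mask[:, left + w - feather:left + w], r[::-1])
    if top:
        mask[top:top + feather, :] = np.maximum(
            mask[top:top + feather, :], r[:, None])
    if bottom:
        mask[top + h - feather:top + h, :] = np.maximum(
            mask[top + h - feather:top + h, :], r[:, None][::-1])
    return canvas, mask
```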
Intent triggers: "extend [direction]", "make this wider", "outpaint", "uncrop", "show more of".
Acceptance:
- [ ] Workflow exists; tested on all 4 outpaint directions (left/right/up/down)
- [ ] Direction parsed from prompt ("extend left 256px" → 256-pixel left pad; parser sketched after this list)
- [ ] Default extension: 25% of the original dimension if unspecified
- [ ] Tests cover direction-parsing edge cases
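An illustrative direction parser with the 25% default; the regexes are assumptions, not the shipped ones:

```python
import re

DIRECTIONS = ("left", "right", "up", "down")

def parse_outpaint(prompt: str, width: int, height: int) -> dict[str, int]:
    """Turn "extend left 256px"-style requests into per-side padding.

    Falls back to 25% of the relevant dimension when no pixel amount
    is given for a mentioned direction.
    """
    pads = dict.fromkeys(DIRECTIONS, 0)
    for d in DIRECTIONS:
        m = re.search(rf"\b{d}\b(?:\s+(\d+)\s*px)?", prompt.lower())
        if m:
            base = width if d in ("left", "right") else height
            pads[d] = int(m.group(1)) if m.group(1) else base // 4
    return pads
```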
## 📋 Phase 7: LM-based intent classifier
Operation: routing layer (no new operation; replaces keyword classifier on ambiguous inputs)
Use case: The keyword classifier is deterministic and ~95% correct. The missed 5% are ambiguous instructions like "swap the roles", "do something fun", or domain-specific phrasing the keywords don't cover.
Approach: Optional small VLM call to `protolabs/fast` (heretic 35B-A3B, 226 tok/s) with a structured-output JSON schema:
```json
{
  "operation": "gen | edit | multiref | bgremove | region_edit | inpaint | outpaint",
  "confidence": 0.0-1.0,
  "target_phrase": "the man's tie | null",
  "instruction": "make it red"
}
```

If confidence < threshold, fall back to the keyword classifier.
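A sketch of the LM leg, assuming a local OpenAI-compatible endpoint (the URL, prompt wording, and temperature are assumptions; the real module is `intents/llm.py`):

```python
import json
import requests

SYSTEM = (
    "Classify the image request. Reply with only JSON: "
    '{"operation": "gen|edit|multiref|bgremove|region_edit|inpaint|outpaint", '
    '"confidence": 0.0-1.0, "target_phrase": "... or null", "instruction": "..."}'
)

def classify_via_lm(prompt: str, has_image: bool, n_refs: int) -> dict:
    """Sketch only: endpoint URL and request shape are assumptions."""
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # assumed local endpoint
        json={
            "model": "protolabs/fast",
            "temperature": 0,
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user",
                 "content": f"{prompt}\n(has_image={has_image}, n_refs={n_refs})"},
            ],
        },
        timeout=10,
    )
    return json.loads(resp.json()["choices"][0]["message"]["content"])

# Caller: if out["confidence"] < threshold, use the keyword classifier.
```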
Acceptance:
- [ ] `intents/llm.py` module with `classify_via_lm(prompt, has_image, n_refs) → Operation`
- [ ] Configurable: `PROTOBANANA_INTENT_MODE = keyword | lm | hybrid`
- [ ] Latency benchmark: keyword ≈ 0 ms, LM ≈ 500 ms; hybrid uses keywords first, invoking the LM only on `Operation.GEN` fallback for ambiguous EDIT-like prompts
- [ ] Quality benchmark: 50 ambiguous prompts hand-classified; LM router improves accuracy by ≥10 pp over keywords
Open questions:
- Which model? `protolabs/fast` is fast, but heretic doesn't always hold structured-output discipline. Try `protolabs/smart` (27B, slower but better at structured outputs).
- Cache classifications? Same prompt twice = same intent; an in-memory LRU could halve LM calls (sketched below).
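The cache idea in the last bullet, sketched over the `classify_via_lm` stub above:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def classify_cached(prompt: str, has_image: bool, n_refs: int) -> dict:
    # Identical (prompt, context) tuples skip the LM call entirely.
    return classify_via_lm(prompt, has_image, n_refs)
```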
## ❌ Deferred
### >3 reference images
Qwen-Image-Edit-2511's hard cap is 3 reference images; Nano-Banana 2 supports 14. To compete, we'd need one of:
- Wait for Qwen to release a higher-ref variant (unknown ETA)
- Build a pre-mux step that pairs/selects refs intelligently (hacky, lossy)
- Route 4+ ref requests to the cloud `protolabs/nano-banana-2` alias (defeats the local-first thesis for those tasks)

For v1: document the limitation and recommend cloud fallback. Revisit when Qwen ships the next major version.
### Streaming chat-completions
Currently we buffer until the image is ready. Streaming a markdown image chunk-by-chunk adds no value (the data URL is one indivisible blob). We could stream "Generating..." placeholders for UX, but Phase 7's intent-classifier output would be a more useful early-stream signal. Defer until a real client demands it.
### Per-org workflow publishing
The compound-rlm library-publishing pattern would map cleanly: per-org ComfyUI workflow bundles, versioned and pip-installable as overlays. Defer until the first downstream consumer asks for it.
## Cross-phase notes
Single ComfyUI install handles everything. ComfyUI-RMBG (Phase 2) + LanPaint (Phase 5) + Florence-2/SAM2 (Phase 4) are all custom nodes installed once via ComfyUI-Manager or manual git clone. Models auto-download on first use. No per-phase infrastructure churn.
VRAM budget across all phases. With ComfyUI's smart memory manager:
| Phase | Models loaded for that op | Peak VRAM |
|---|---|---|
| Gen | Qwen-Image-2512 + qwen_2.5_vl + VAE | ~30 GB |
| Edit | Qwen-Image-Edit-2511 + qwen_2.5_vl + VAE | ~30 GB |
| Multiref | Qwen-Image-Edit-2511 + qwen_2.5_vl + VAE | ~32 GB (multi VAE encodes) |
| BGremove | BiRefNet alone (text models offloaded) | ~3 GB |
| Region edit | Florence-2 + SAM 2.1 + Qwen-Image-Edit-2511 | ~33 GB |
| Inpaint / Outpaint | Qwen-Image-Edit-2511 + LanPaint | ~30 GB |
Mode switches cost 5-10s of model load; same-mode steady state stays warm. Verified against our vllm-fast (heretic) co-tenancy on GPU 1: fits in the 33 GB free we budgeted (heretic at 0.42 util, Fish TTS at ~20 GB, embed at ~2 GB, ComfyUI peak ~30 GB).