PHASES – protoBanana roadmap

Each phase = one new operation. Adding a phase = one route module + one workflow JSON + one keyword set + one Operation enum value. See PROPOSAL.md §3.5 for the cost model.
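For orientation, a minimal sketch of those four touch points. The module layout and names below are illustrative assumptions, not the actual code:

```python
# Illustrative only: what "one Operation enum value + one keyword set
# + one workflow JSON" looks like. Names and paths are assumptions.
from enum import Enum, auto

class Operation(Enum):
    GEN = auto()
    EDIT = auto()
    BGREMOVE = auto()       # Phase 2
    MULTIREF = auto()       # Phase 3
    REGION_EDIT = auto()    # Phase 4 adds exactly one new value here

# one keyword set per operation (intents/keywords.py)
KEYWORDS = {
    Operation.BGREMOVE: {"remove background", "transparent png", "as a sticker"},
    Operation.REGION_EDIT: {"change the", "replace the", "just the", "only the"},
}

# one workflow JSON per operation
WORKFLOWS = {
    Operation.BGREMOVE: "bgremove_birefnet.json",
    Operation.REGION_EDIT: "region_edit_florence2_sam2_qwen.json",
}
```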

Status legend

  • ✅ shipped: code merged, tests pass, integrated end-to-end
  • 🚧 in flight: PR open
  • 📋 queued: spec written, waiting on prior phase or model dep
  • 🛑 blocked: external dep missing (pinned model, upstream bug)
  • ❌ deferred: ruled out for v1

✅ Phase 1 – Gen + edit + chat-completions

Operation: GEN, EDIT (auto-routes via chat-completions)

Workflows: qwen_image_2512.json, qwen_image_edit_2511.json

Models:

  • Comfy-Org/Qwen-Image_ComfyUI/split_files/diffusion_models/qwen_image_2512_fp8_e4m3fn.safetensors (~20 GB)
  • Comfy-Org/Qwen-Image-Edit_ComfyUI/split_files/diffusion_models/qwen_image_edit_2511_fp8mixed.safetensors (~21 GB)
  • Shared text encoder: qwen_2.5_vl_7b_fp8_scaled.safetensors (~8.8 GB)
  • Shared VAE: qwen_image_vae.safetensors (~243 MB)

Acceptance:

  • ✅ /v1/images/generations returns base64 image
  • ✅ /v1/chat/completions with text-only message → image inline
  • ✅ Follow-up turn with prior assistant image → edit, not regenerate
  • ✅ Aspect-ratio inference from prompt ("portrait", "16:9", "hero", etc.); see the sketch below
  • ✅ 46 unit tests pass
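The aspect-ratio inference above could look roughly like this sketch; the keyword-to-size table (including what "hero" maps to) is assumed for illustration, not the provider's actual values:

```python
import re

# Assumed keyword-to-size hints; the real mapping lives in the provider.
ASPECT_HINTS = {
    "portrait": (832, 1216),
    "landscape": (1216, 832),
    "square": (1024, 1024),
    "hero": (1536, 640),   # assumption: "hero" means a wide banner
}

def infer_size(prompt: str, default=(1024, 1024)) -> tuple[int, int]:
    # explicit ratios like "16:9" win over soft keywords
    m = re.search(r"\b(\d+)\s*:\s*(\d+)\b", prompt)
    if m:
        w, h = int(m.group(1)), int(m.group(2))
        scale = (1024 * 1024 / (w * h)) ** 0.5       # keep roughly 1 MP total
        return int(w * scale) // 8 * 8, int(h * scale) // 8 * 8
    lowered = prompt.lower()
    for hint, size in ASPECT_HINTS.items():
        if hint in lowered:
            return size
    return default
```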

Lessons learned:

  • Open WebUI's native IMAGE_GENERATION_ENGINE=comfyui is brittle (workflow JSON node-ID mappings drift between Open WebUI versions). Server-side substitution via this provider is the durable path.
  • ComfyUI iterates ALL top-level keys as nodes, so metadata keys (_meta, _doc) crash the worker. The loader strips them (sketch below).
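A minimal sketch of that loader fix, assuming a plain workflow JSON file on disk:

```python
import json

METADATA_KEYS = {"_meta", "_doc"}

def load_workflow(path: str) -> dict:
    with open(path) as f:
        workflow = json.load(f)
    # ComfyUI treats every top-level key as a node, so drop anything
    # that isn't a node definition before submitting the prompt.
    return {k: v for k, v in workflow.items() if k not in METADATA_KEYS}
```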

✅ Phase 2 – Background removal / sticker

Operation: BGREMOVE

Workflows:

  • bgremove_birefnet.json – default, commercial-safe (BiRefNet, ~85% quality vs SOTA)
  • bgremove_rmbg2.json – opt-in, non-commercial only (RMBG-2.0, ~90% quality, CC BY-NC 4.0)

Models (auto-downloaded by ComfyUI-RMBG node pack on first use):

  • BiRefNet-general (~440 MB)
  • RMBG-2.0 (~177 MB) – only if used

Custom node dependency: ComfyUI-RMBG (1038lab, GPL-3.0). It bundles BiRefNet, RMBG-2.0, BEN/BEN2, INSPYRENET, SDMatte, SAM/SAM2/SAM3, and GroundingDINO under one install, which also covers the Phase 4 dependencies.

Intent triggers: "remove background", "transparent png", "as a sticker", "alpha background", "knock out the background", etc. (See intents/keywords.py)

Acceptance:

  • ✅ Init image + bg-remove keyword in chat → transparent PNG output
  • ✅ No init image → falls back to GEN (we don't generate stickers from text alone in v1)
  • ✅ Workflow ships as both BiRefNet (default) and RMBG-2.0 (opt-in)
  • ✅ Intent classifier tests cover all keywords

✅ Phase 3 – Multi-reference (2-3 images)

Operation: MULTIREF

Workflow: multiref_qwen_image_2511.json

Model: Qwen-Image-Edit-2511 (already loaded for Phase 1).

Acceptance:

  • ✅ Provider walks the ENTIRE chat history collecting images, not just the latest turn (sketch below)
  • ✅ Hard cap at 3 (Qwen-Image-Edit-2511 ceiling)
  • ✅ ≥2 images present → routes to MULTIREF, not single EDIT
  • ✅ Workflow uses parallel LoadImage → ImageScale → VAEEncode chains, conditioning-stacked
  • ✅ Tests verify image collection order and cap
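A sketch of the history walk and cap, assuming messages arrive in OpenAI chat-completions shape; whether the cap keeps the newest or oldest references is a policy choice (this sketch keeps the newest):

```python
MAX_REFS = 3  # Qwen-Image-Edit-2511 ceiling

def collect_images(messages: list[dict], cap: int = MAX_REFS) -> list[str]:
    images: list[str] = []
    for msg in messages:                          # entire history, not just the last turn
        content = msg.get("content")
        if not isinstance(content, list):
            continue                              # plain-text turns carry no images
        for part in content:
            if part.get("type") == "image_url":
                images.append(part["image_url"]["url"])
    return images[-cap:]                          # enforce the hard cap
```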

Limitation: Qwen-Image-Edit-2511 maxes out at 3 reference images; Nano-Banana 2 supports 14. We don't compete on this axis until Qwen ships a higher-ref variant; recommend cloud-fallback for >3-ref tasks.


📋 Phase 4 – Region edit by text (Florence-2 + SAM 2.1)

Operation: REGION_EDIT

The killer feature. User says "change the man's tie to red"; Florence-2 finds the bounding box from the text, SAM 2.1 generates a pixel-precise mask, and Qwen-Image-Edit inpaints just that region.

Workflows:

  • region_edit_florence2_sam2_qwen.json – single-shot text-grounded edit

Models (added on top of Phase 1+2):

  • microsoft/Florence-2-large (~770 MB) – text → bounding box
  • facebook/sam2-hiera-base+ or smaller (~150 MB to 2.6 GB) – bbox → mask

Custom node: ComfyUI-RMBG already includes Florence-2-SAM2 nodes (one install, multiple capabilities).

Intent triggers (already in keyword classifier; target-extraction sketch after the list):

  • "change the X to Y", "change her X", "change his X"
  • "replace the X", "replace her X"
  • "just the X", "only the X"

VRAM impact: Florence-2 + SAM 2.1 base together add ~3 GB peak when invoked. ComfyUI's smart memory swaps them in and out, so per-request peak stays in the ~30-33 GB range (see the VRAM table under Cross-phase notes).

Acceptance criteria:

  • [ ] region_edit_florence2_sam2_qwen.json workflow ships
  • [ ] Provider routes correctly: prompt has sub-object pattern + has init image β†’ REGION_EDIT
  • [ ] Mask quality verified: 5 hand-crafted "change the X" prompts produce visibly correct masks (eyeball-tested)
  • [ ] Edit fidelity: edited region looks like the request, surrounding pixels unchanged
  • [ ] Latency: ≤ 60s per region edit on cold model (≤ 20s warm)
  • [ ] Tests: 6 new region-edit cases in test_intents_keywords.py, integration test against ComfyUI in tests/integration/

Open questions:

  • Should we expose mask-output mode (return the mask alongside the image) for client-side debugging?
  • Florence-2 vs Grounding-DINO 1.5: Florence-2 is smaller and already integrated; Grounding-DINO 1.5 is more accurate. Default to Florence-2; allow Grounding-DINO via workflow swap.

Risks:

  • Florence-2's text-to-bbox accuracy on small/occluded objects is uneven. The Phase 7 LM router could pre-validate the target.
  • SAM 2.1's mask quality on transparent/glassy materials is poor. Document this as a known limitation.

📋 Phase 5 – Inpaint with brushed mask (LanPaint)

Operation: INPAINT

Use case: Open WebUI lets users brush a mask over a generated image, then prompt for what to fill. Our provider receives the mask in the multimodal request payload and routes to LanPaint.

Workflow: inpaint_lanpaint.json

Model dependency: LanPaint, a universal training-free inpainting method that works with any Qwen-Image variant. Custom node install; no separate model file (it uses the already-loaded Qwen-Image-Edit).

Intent triggers:

  • Brushed mask present in request → INPAINT regardless of wording (this rule always wins; see the sketch after this list)
  • Plus keyword fallback: "inpaint", "fill in", "fill the masked area"
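A sketch of that precedence, in the same style as the keyword matching above:

```python
INPAINT_KEYWORDS = ("inpaint", "fill in", "fill the masked area")

def is_inpaint(prompt: str, has_mask: bool) -> bool:
    if has_mask:
        return True           # brushed mask present: INPAINT wins regardless of wording
    lowered = prompt.lower()
    return any(keyword in lowered for keyword in INPAINT_KEYWORDS)
```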

Acceptance:

  • [ ] LanPaint node installed in ComfyUI
  • [ ] Workflow accepts (image, mask, prompt), produces seamless fill
  • [ ] Provider extracts the mask from the multimodal payload (Open WebUI sends it either as a separate image_url part with a role hint or as a discrete file in the multipart request)
  • [ ] Tests cover mask extraction, fallback behavior

Open questions:

  • Does Open WebUI's image-mask UI emit OpenAI-standard mask payloads? May need a small adapter for their specific format.

📋 Phase 6 – Outpaint

Operation: OUTPAINT

Use case: "extend this scene to the left", "make this wider", "uncrop".

Approach: No new model. Pad the canvas in the requested direction; create a feathered edge mask covering the new area; route through the inpaint workflow (Phase 5) to fill.

Workflow: outpaint_qwen.json – composes canvas-pad + edge-mask + LanPaint.
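Conceptually, the pad-and-mask step is just the sketch below (shown with PIL/numpy for clarity; in the actual workflow it is ComfyUI nodes, and only the left direction is sketched):

```python
import numpy as np
from PIL import Image

def pad_left(image: Image.Image, pixels: int, feather: int = 32):
    """Return (padded canvas, fill mask) for a leftward outpaint."""
    w, h = image.size
    canvas = Image.new("RGB", (w + pixels, h), (127, 127, 127))
    canvas.paste(image, (pixels, 0))

    # mask: 255 = area to fill, 0 = keep, with a feathered seam into the original
    mask = np.zeros((h, w + pixels), dtype=np.uint8)
    mask[:, :pixels] = 255
    mask[:, pixels:pixels + feather] = np.linspace(255, 0, feather, dtype=np.uint8)
    return canvas, Image.fromarray(mask, mode="L")
```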

Intent triggers: "extend [direction]", "make this wider", "outpaint", "uncrop", "show more of".

Acceptance:

  • [ ] Workflow exists; tested on 4 outpaint directions (left/right/up/down)
  • [ ] Direction parsed from prompt ("extend left 256px" → 256-pixel left pad); parsing sketch after this list
  • [ ] Default extension: 25% of original dimension if unspecified
  • [ ] Tests cover direction parsing edge cases
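The direction/extent parsing could be as simple as this sketch; the regex and the 25% default mirror the acceptance items above, everything else is illustrative:

```python
import re

DEFAULT_FRACTION = 0.25   # 25% of the original dimension if unspecified

def parse_outpaint(prompt: str, width: int, height: int) -> tuple[str, int]:
    lowered = prompt.lower()
    direction = next((d for d in ("left", "right", "up", "down") if d in lowered), "right")
    m = re.search(r"(\d+)\s*(?:px|pixels?)", lowered)
    if m:
        return direction, int(m.group(1))       # "extend left 256px" -> ("left", 256)
    base = width if direction in ("left", "right") else height
    return direction, int(base * DEFAULT_FRACTION)
```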

📋 Phase 7 – LM-based intent classifier

Operation: routing layer (no new operation; replaces keyword classifier on ambiguous inputs)

Use case: The keyword classifier is deterministic and roughly 95% accurate. The remaining ~5% are ambiguous instructions like "swap the roles", "do something fun", or domain-specific phrasing the keywords miss.

Approach: Optional small VLM call to protolabs/fast (heretic 35B-A3B, 226 tok/s) with a structured-output JSON schema:

```json
{
  "operation": "gen | edit | multiref | bgremove | region_edit | inpaint | outpaint",
  "confidence": 0.0-1.0,
  "target_phrase": "the man's tie | null",
  "instruction": "make it red"
}
```

If confidence < threshold, fall back to keyword classifier.
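A sketch of what intents/llm.py could look like. Assumptions: an OpenAI-compatible endpoint serves protolabs/fast at the placeholder base_url shown, it honors response_format={"type": "json_object"}, classify_via_keywords and Operation are the existing keyword classifier and enum, and 0.7 is a placeholder threshold.

```python
import json
from openai import OpenAI

from intents.keywords import Operation, classify_via_keywords   # assumed names

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder URL
CONFIDENCE_THRESHOLD = 0.7                                               # placeholder value

def classify_via_lm(prompt: str, has_image: bool, n_refs: int) -> Operation:
    resp = client.chat.completions.create(
        model="protolabs/fast",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content":
                "Classify the image request. Reply with JSON: "
                '{"operation": "...", "confidence": 0.0, "target_phrase": null, "instruction": "..."}'},
            {"role": "user", "content": f"prompt={prompt!r} has_image={has_image} n_refs={n_refs}"},
        ],
    )
    data = json.loads(resp.choices[0].message.content)
    if data.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return classify_via_keywords(prompt, has_image, n_refs)   # keyword fallback
    return Operation[data["operation"].upper()]
```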

Acceptance:

  • [ ] intents/llm.py module with classify_via_lm(prompt, has_image, n_refs) → Operation
  • [ ] Configurable: PROTOBANANA_INTENT_MODE = keyword | lm | hybrid
  • [ ] Latency benchmark: keyword ≈ 0ms, LM ≈ 500ms; hybrid uses keyword first, LM only on Operation.GEN fallback for ambiguous EDIT-like prompts
  • [ ] Quality benchmark: 50 ambiguous prompts hand-classified; LM router improves accuracy by ≥10pp over keyword

Open questions:

  • Which model? protolabs/fast is fast but heretic doesn't always hold structured-output discipline. Try protolabs/smart (27B, slower but better at structured outputs).
  • Cache classifications? Same prompt twice = same intent; in-memory LRU could halve LM calls.
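The LRU idea in its simplest form (cache size is arbitrary; this wraps the classify_via_lm sketch above):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def classify_cached(prompt: str, has_image: bool, n_refs: int) -> Operation:
    return classify_via_lm(prompt, has_image, n_refs)
```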

❌ Deferred

>3 reference images

Qwen-Image-Edit-2511's hard cap is 3; Nano-Banana 2 supports 14. To compete, we'd need one of:

  • Wait for Qwen to release a higher-ref variant (unknown ETA)
  • Build a pre-mux step that pairs/selects refs intelligently (hacky, lossy)
  • Route 4+ ref requests to the cloud protolabs/nano-banana-2 alias (defeats the local-first thesis for those tasks)

For v1: document the limitation, recommend cloud-fallback. Revisit when Qwen ships next major version.

Streaming chat-completions

Currently we buffer until the image is ready. Streaming a markdown image chunk-by-chunk doesn't add value (the data URL is one indivisible blob). We could stream "Generating..." placeholders for UX, but Phase 7's intent classifier output could be a more useful early-stream signal. Defer until there is real client demand.

Per-org workflow publishing

The compound-rlm library-publishing pattern would map cleanly: per-org ComfyUI workflow bundles versioned and pip-installable as overlays. Defer until we have the first downstream consumer asking for it.


Cross-phase notes

Single ComfyUI install handles everything. ComfyUI-RMBG (Phase 2) + LanPaint (Phase 5) + Florence-2/SAM2 (Phase 4) are all custom nodes installed once via ComfyUI-Manager or manual git clone. Models auto-download on first use. No per-phase infrastructure churn.

VRAM budget across all phases. With ComfyUI's smart memory manager:

| Phase | Models loaded for that op | Peak VRAM |
|---|---|---|
| Gen | Qwen-Image-2512 + qwen_2.5_vl + VAE | ~30 GB |
| Edit | Qwen-Image-Edit-2511 + qwen_2.5_vl + VAE | ~30 GB |
| Multiref | Qwen-Image-Edit-2511 + qwen_2.5_vl + VAE | ~32 GB (multi VAE encodes) |
| BGremove | BiRefNet (alone; text models offloaded) | ~3 GB |
| Region edit | Florence-2 + SAM 2.1 + Qwen-Image-Edit-2511 | ~33 GB |
| Inpaint / Outpaint | Qwen-Image-Edit-2511 + LanPaint | ~30 GB |

Mode switches cost 5-10s of model load; same-mode steady state stays warm. Verified against our vllm-fast (heretic) co-tenancy on GPU 1: it fits in the 33 GB free we budgeted (heretic at 0.42 util, Fish TTS at ~20 GB, embed at ~2 GB, ComfyUI peak ~30 GB).

Apache-2.0 licensed. Docs follow the Diátaxis framework.