# PROPOSAL: protoBanana
Build the OSS counterpart to Nano-Banana 2 / GPT-Image-2: chat-native image gen + edit, multi-reference compose, background removal, region-aware editing, inpaint, and outpaint, exposed as a single LiteLLM gateway alias that any OpenAI client can call.
Status: alpha. Phases 1-3 implemented (gen + edit + multi-ref + bg-remove); Phases 4-7 specced and queued. See PHASES.md.
Author: protoLabsAI, May 2026.
Targets:
- Day 30: Phases 1-3 ship; Open WebUI runs the full chat-native UX through the gateway against Qwen-Image, with multi-ref + sticker working
- Day 60: Phase 4 ships (Florence-2 + SAM 2.1 region editing)
- Day 90: Phases 5-7 ship; first public blog post on protolabs.studio
- Q4 2026: workshop paper or technical report on the routing architecture
## 0. TL;DR
We're not building a new model. We're synthesizing six existing components into one productionized stack and shipping the missing piece: a typed, chat-native, gateway-routed orchestration layer that picks the right ComfyUI workflow per chat turn and returns OpenAI-shaped responses.
| Ingredient | Source |
|---|---|
| Unified gen + edit + multi-ref | Qwen-Image-2512 / Qwen-Image-Edit-2511 (Alibaba) |
| Background removal (commercial-safe) | BiRefNet |
| Background removal (best quality, NC) | RMBG-2.0 (BRIA) |
| Text → bbox (region edit) | Florence-2 (Microsoft) |
| Bbox → mask (region edit) | SAM 2.1 (Meta) |
| Universal inpaint | LanPaint |
| Bundled ComfyUI nodes | ComfyUI-RMBG (RMBG / BiRefNet / SAM2 / SAM3 / GroundingDINO) |
| LLM gateway | LiteLLM |
| Image runtime | ComfyUI |
The novelty: the gateway becomes the only contact surface for clients. Open WebUI, protoCLI, raw curl, any OpenAI SDK: they all talk to one endpoint. The provider classifies intent per turn and dispatches to the right ComfyUI workflow. New operations land as one Python module + one workflow JSON; clients change nothing.
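Concretely, the single contact surface is one gateway config entry. A sketch of what it could look like (the alias, model string, and custom-provider wiring are illustrative assumptions, not the shipped config):

```yaml
# litellm proxy config -- illustrative sketch, not the shipped file
model_list:
  - model_name: protolabs/protobanana     # the single alias every client calls
    litellm_params:
      model: protobanana/qwen-image       # routed to the custom provider below

litellm_settings:
  custom_provider_map:
    - provider: protobanana               # assumed provider name
      custom_handler: protobanana.provider.ProtoBananaProvider
```

Any OpenAI SDK pointed at the proxy with `model="protolabs/protobanana"` then reaches the provider with no client-side changes.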
## 1. The problem this solves

### 1.1 The closed-source benchmark
Nano-Banana 2 (Google's gemini-2.5-flash-image-pro / gemini-3-image) and GPT-Image-2 (OpenAI's autoregressive image model) made conversational image editing mainstream in 2026. The UX is:
```
user: draw a cat in a hat, watercolor
[image]
user: now make it blue
[edited image]
user: remove the background
[transparent png]
user: change just the hat to red
[masked edit]
```

One model. One context. Multi-turn. Multi-reference. Region-aware. Background removal. Text inside images. Style transfer.
### 1.2 The OSS gap
Open-source has the components but no integration:
- Qwen-Image-2512 + Qwen-Image-Edit-2511 cover gen + edit + multi-ref (cap 3)
- BiRefNet/RMBG cover background removal
- Florence-2 + SAM 2.1 cover region segmentation
- LanPaint covers inpaint with arbitrary masks
- ComfyUI runs all of these as workflow graphs
- LiteLLM exposes it all behind OpenAI-compatible endpoints
But to give an end user the nano-banana UX, you have to:
- Route requests to the right workflow per intent
- Translate OpenAI chat-completions into ComfyUI prompt API
- Manage multi-image conversation history (prior assistant images become next turn's edit init)
- Handle UNet swapping in ComfyUI between gen / edit / segmentation models
- Wire all of it through a single OpenAI-shaped endpoint
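The history-management bullet is plain bookkeeping; a minimal sketch of the walk (function name and regex are mine, not the package's): pull the latest user instruction and up to three images, newest first, from user `image_url` parts and from data URLs embedded in prior assistant markdown.

```python
import re

# matches a base64 data URL embedded in assistant markdown (illustrative)
DATA_URL = re.compile(r"data:image/\w+;base64,[A-Za-z0-9+/=]+")

def extract_turn(messages, max_images=3):
    """Walk messages newest -> oldest; return (instruction, images)."""
    instruction, images = None, []
    for msg in reversed(messages):
        content = msg.get("content") or ""
        if msg["role"] == "user":
            if isinstance(content, list):  # multimodal content parts
                for part in content:
                    if part["type"] == "text" and instruction is None:
                        instruction = part["text"]
                    elif part["type"] == "image_url":
                        images.append(part["image_url"]["url"])
            elif instruction is None:
                instruction = content
        elif msg["role"] == "assistant" and isinstance(content, str):
            # prior assistant images come back as markdown data URLs
            images.extend(DATA_URL.findall(content))
        if instruction and len(images) >= max_images:
            break
    return instruction, images[:max_images]
```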
This is what protoBanana does. It's the last 5% that turns a pile of SOTA OSS models into a product.
### 1.3 Customer fit

For organizations that can't or won't send their data to Google or OpenAI (compliance, IP sensitivity, sovereignty, cost), protoBanana provides bit-for-bit the same call shape with all data and weights local.
## 2. Antagonistic review

Before committing, six adversarial criticisms. Four hold up fully, two partially:
| # | Criticism | Verdict |
|---|---|---|
| 1 | "Just call nano-banana via gateway; you already have protolabs/nano-banana-2" | Valid for non-sensitive workflows. Defense: many users can't send data to Google. We're orthogonal, not competitive. |
| 2 | "OSS image quality is 6-12 months behind frontier" | Valid. Defense: text rendering is actually our strength (Qwen leads). For most use cases the gap is acceptable. |
| 3 | "The 3-ref ceiling kills compose ambitions vs nano-banana's 14" | Valid. Defense: 95% of use cases need ≤3 refs. Document the limitation; route 4+ to the cloud alias when needed. |
| 4 | "Building infra someone else will commodify (e.g., LiteLLM ships native ComfyUI support)" | Partial. Even if LiteLLM ships ComfyUI, our intent-routing + multi-workflow orchestration is the layer above. |
| 5 | "Open WebUI's native ComfyUI integration could improve, obviating us" | Partial. Even improved, it's a UI-coupled solution; the gateway alias is reusable. |
| 6 | "Why a separate repo and not just inline in homelab-iac?" | Valid. Decision: standalone repo for publication discipline + drop-in installability for non-protoLabs users. |
Net effect: we ship as a published OSS package, document the cloud-fallback path for >3 refs, and keep the architecture decoupled from any one client UI.
## 3. Architecture

### 3.1 Component layout
```
┌──────────────────────────────────────────────────────────────────┐
│                          Client surface                          │
│        Open WebUI · protoCLI · raw curl · any OpenAI SDK         │
└───────────────────────────────┬──────────────────────────────────┘
                                │ /v1/chat/completions (image output)
                                │ /v1/images/generations
                                │ /v1/images/edits
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                         LiteLLM gateway                          │
│  - Auth, observability (Langfuse), retries                       │
│  - Routes by `model_name` to ProtoBananaProvider                 │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  ProtoBananaProvider (this package)                              │
│  - Parses OpenAI request (chat / images / images-edits)          │
│  - Walks message history → (text, [image, ...])                  │
│  - classify_operation → Operation enum                           │
│  - Dispatches to one of routes/{gen, edit, multiref, bgremove}   │
│  - Phases 4-6: routes/{region_edit, inpaint, outpaint}           │
│  - Returns OpenAI-shaped response (b64 image OR markdown image)  │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  ComfyUIClient (HTTP transport, no business logic)               │
│  - upload_image, submit_prompt, wait_for_completion, fetch_image │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  ComfyUI                                                         │
│  - Loads JSON workflows from workflows/                          │
│  - Phase 1-3 models: Qwen-Image-2512, Qwen-Image-Edit-2511,      │
│    BiRefNet, optional RMBG-2.0                                   │
│  - Phase 4-6 models: Florence-2, SAM 2.1, LanPaint               │
│  - Smart memory manager: swaps UNets between gen / edit / seg    │
└──────────────────────────────────────────────────────────────────┘
```

### 3.2 Per-turn operation routing
For each chat-completion request, the provider:

1. Walks `messages` newest → oldest to extract:
   - the latest user text (the instruction)
   - all accessible images (max 3): user-attached `image_url` parts, plus prior assistant turns' markdown-embedded data URLs
2. Classifies the operation via `intents.keywords.classify_operation`:
   - brushed mask present → INPAINT (Phase 5)
   - init image + bg-remove keywords → BGREMOVE
   - init image + outpaint keywords → OUTPAINT (Phase 6)
   - init image + inpaint keywords (without an explicit mask) → INPAINT (Phase 5)
   - init image + sub-object reference patterns → REGION_EDIT (Phase 4)
   - ≥2 images → MULTIREF
   - 1 image → EDIT
   - no image → GEN
3. Dispatches to `routes/<op>.run()`; each route knows its workflow stem and node-ID conventions. Routes are isolated; adding one is one new module + one workflow JSON.
4. Returns an OpenAI-shaped response: `image/png` base64 for `/v1/images/*`, a markdown-embedded data URL in chat content for `/v1/chat/completions`.
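The precedence above reads naturally as one pure function. A sketch with illustrative keyword sets (the shipped lists live in `intents/keywords.py`; names here are assumptions):

```python
from enum import Enum, auto

class Operation(Enum):
    GEN = auto()
    EDIT = auto()
    MULTIREF = auto()
    BGREMOVE = auto()
    REGION_EDIT = auto()
    INPAINT = auto()
    OUTPAINT = auto()

# illustrative keyword sets -- the shipped ones live in intents/keywords.py
BG_KEYWORDS = {"remove the background", "transparent", "sticker"}
OUTPAINT_KEYWORDS = {"extend", "outpaint", "zoom out"}
INPAINT_KEYWORDS = {"fill", "erase", "inpaint"}
REGION_PATTERNS = {"just the", "only the"}

def classify_operation(text, n_images, has_mask=False):
    """Mirror the routing precedence: mask > bg-remove > outpaint >
    inpaint > region edit > multi-ref > edit > gen."""
    t = text.lower()
    if has_mask:
        return Operation.INPAINT
    if n_images >= 1:
        if any(k in t for k in BG_KEYWORDS):
            return Operation.BGREMOVE
        if any(k in t for k in OUTPAINT_KEYWORDS):
            return Operation.OUTPAINT
        if any(k in t for k in INPAINT_KEYWORDS):
            return Operation.INPAINT
        if any(k in t for k in REGION_PATTERNS):
            return Operation.REGION_EDIT
    if n_images >= 2:
        return Operation.MULTIREF
    if n_images == 1:
        return Operation.EDIT
    return Operation.GEN
```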
### 3.3 Why the markdown embed for chat output

OpenAI's image-output protocol (a multimodal content list with `image_url` parts) is supported by some clients but not universally. Markdown  works in every markdown-rendering client we tested (Open WebUI, the Slack rich-text bridge, Discord markdown). Trade-off: it is harder for clients to programmatically detect "this content is an image"; they have to regex-extract. We accept that trade for ubiquity.
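What "regex-extract" means for a client, in a short sketch (the pattern is an assumption about the embed shape, not a published contract):

```python
import base64
import re

# assumed embed shape: ![...](data:image/png;base64,<payload>)
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/png;base64,([A-Za-z0-9+/=]+)\)")

def image_bytes_from_content(content):
    """Pull the first markdown-embedded PNG out of chat content, or None."""
    m = MD_IMAGE.search(content)
    return base64.b64decode(m.group(1)) if m else None
```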
Phase 7 may add a config flag to switch output format per model_list entry.
### 3.4 Workflow contract

Every workflow JSON in `workflows/` follows three rules:

1. All top-level keys are nodes with `class_type` set. ComfyUI iterates top-level keys as node IDs and validates each; metadata keys (e.g. `_doc`) crash the worker with `missing_node_type`. The loader strips them.
2. Static defaults for all node inputs. The route's `substitute()` mutates per-request fields, which means the workflow file is runnable standalone in ComfyUI's UI for debugging.
3. Convention: `class_type` determines the node's semantic role, not the node ID; multiple workflows can use ID `6` for different things. Routes inspect `class_type` before substituting.
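Rule 1 makes the loader a small filter. A sketch, assuming a metadata key is any top-level value that lacks a `class_type`:

```python
import json

def load_workflow(path):
    """Load a workflow JSON and drop metadata keys (e.g. _doc) that ComfyUI
    would otherwise reject with missing_node_type: every surviving top-level
    value must be a dict carrying a class_type."""
    with open(path) as f:
        graph = json.load(f)
    return {
        node_id: node
        for node_id, node in graph.items()
        if isinstance(node, dict) and "class_type" in node
    }
```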
### 3.5 Extension model

Adding a new operation:

1. A new module under `routes/<op>.py` with `substitute()` + `run()` (~50 LOC)
2. A new workflow JSON under `workflows/<op>_<model>.json`
3. A new keyword set in `intents/keywords.py` (or an LM-router prompt update for Phase 7)
4. A new entry in the `Operation` enum
5. A new dispatch arm in `provider.acompletion()`
That's the extension cost. Each phase below adds one operation along these lines. The boundaries between gen/edit/multiref/bgremove established the shape; phases 4-6 follow the pattern.
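To make the shape concrete, a hypothetical sketch of a queued route (the node `class_type`s, workflow stem, and client interface are illustrative assumptions, not the shipped code):

```python
# routes/outpaint.py -- hypothetical sketch of the ~50 LOC extension shape
import copy

WORKFLOW_STEM = "outpaint_qwen"  # assumed workflows/outpaint_qwen.json

def substitute(workflow, prompt, init_image_name):
    """Copy the static workflow and patch per-request fields, keyed by
    class_type rather than node ID (per the workflow contract)."""
    wf = copy.deepcopy(workflow)
    for node in wf.values():
        if node["class_type"] == "CLIPTextEncode":
            node["inputs"]["text"] = prompt
        elif node["class_type"] == "LoadImage":
            node["inputs"]["image"] = init_image_name
    return wf

async def run(client, workflow, prompt, init_image):
    """Upload the init image, submit the patched graph, return PNG bytes.
    `client` is the ComfyUIClient transport from section 3.1."""
    name = await client.upload_image(init_image)
    job = await client.submit_prompt(substitute(workflow, prompt, name))
    await client.wait_for_completion(job)
    return await client.fetch_image(job)
```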
## 4. Phased plan
See PHASES.md for status, model dependencies, and acceptance criteria. Summary:
| Phase | Adds | Effort | Status |
|---|---|---|---|
| 1 | Gen + edit + chat-completions + size inference | 2 days | ✓ done |
| 2 | Background removal / sticker | half day | ✓ done |
| 3 | Multi-reference (2-3 images) | half day | ✓ done |
| 4 | Region edit by text (Florence-2 + SAM 2.1) | 1-2 days | queued |
| 5 | Inpaint with brushed mask (LanPaint) | half day | queued |
| 6 | Outpaint | half day | queued |
| 7 | LM-based intent classifier | 1 day | queued (optional polish) |
## 5. Benchmarks

We compare against three references:

- nano-banana 2 via the `protolabs/nano-banana-2` gateway alias (cloud)
- GPT-Image-2 via the OpenAI API (cloud)
- FLUX.1 Kontext via replicate.com (cloud, the OSS-leaning alternative)

Our test suite: 25 representative prompts × 4 categories (gen, edit, multi-ref, region-edit once Phase 4 ships). Methodology + raw scores in `docs/BENCHMARKS.md`.
We accept being 5-15pp behind frontier quality. The win is data locality + cost: roughly $0.0001 of electricity per generation vs $0.04+ per metered API call.
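A quick sanity check on the electricity figure, under stated assumptions (250 W GPU draw, 10 s per image, $0.15/kWh; these are round numbers, not measurements):

```python
# assumptions, not measurements: tune for your GPU and tariff
GPU_WATTS = 250
SECONDS_PER_IMAGE = 10
USD_PER_KWH = 0.15

kwh = GPU_WATTS * SECONDS_PER_IMAGE / 3_600_000  # watt-seconds -> kWh
cost = kwh * USD_PER_KWH
print(f"${cost:.5f} per image")  # on the order of $0.0001
```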
## 6. Risks and what we'd do
| Risk | Likelihood | Mitigation |
|---|---|---|
| Qwen-Image quality plateau | Med | Phase swappable; can route to nano-banana cloud alias for hard cases |
| ComfyUI workflow API breaks | Low | Pin ComfyUI version; integration tests against pinned version |
| Open WebUI changes IMAGE_GENERATION_ENGINE protocol | Med | We're not on that protocol; we're on `=openai`, which is stable |
| RMBG-2.0 license confusion (NC) | Med | Default workflow uses BiRefNet (commercial-safe); RMBG opt-in only |
| LiteLLM `aimage_edit` not supported for custom providers | Med | Stub raises `NotImplementedError`; chat-completions path covers edit |
| GPU pressure with multiple UNets resident | Med | ComfyUI's smart memory swaps; verified 30 GB peak fits in budgeted 33 GB free |
| 3-ref ceiling becomes the marketing wedge competitors use against us | Low | Document explicitly; offer cloud fallback for ≥4 refs |
| Markdown embed breaks in some client | Low | Phase 7 adds format-flag per model_list entry |
## 7. Repo extraction strategy

Standalone repo (this one), extracted from the inline implementation in protoLabsAI/homelab-iac (PRs #52, #53). The homelab-iac PR `feat/protobanana-package` swaps the inline `providers/comfyui_image.py` for `pip install protobanana`, mounting the workflows dir from the package install.

Reproducibility commitments:

- Locked deps via `uv.lock` (committed)
- Workflows versioned alongside code; bump the workflow filename when changing node conventions
- CI runs tests + lint on every PR
- Trajectories archived to `trajectories/` (LFS) for reproducing benchmarks
- Library snapshots under `libraries/` (LFS): versioned ComfyUI workflow bundles for sharing
## 8. Brand fit and positioning

- protoLabs identity: local-first AI for organizations that care about data sovereignty. protoBanana is the image axis of that thesis (RLM / compound-rlm is the long-context axis; the voice stack is the conversational axis).
- protolabs.studio publishing: every shipped phase produces a blog draft in `docs/content/`. The benchmark numbers + architecture diagrams become the content surface.
- HuggingFace presence: workflow bundles and benchmark prompts published as the `protoLabsAI/protobanana-workflows` HF dataset.
## 9. Open questions

- Streaming chat completion: currently buffered until the image is ready. Do clients (Open WebUI in particular) gain anything from a streamed `delta.content` of the markdown image? Probably not, but worth verifying.
- LM-based intent classifier (Phase 7) latency budget: adds ~500 ms per turn. Worth it on ambiguous prompts, harmful on simple ones. Decide based on Phase 4 data.
- Per-org library publishing: should organizations be able to publish their own protoBanana workflow bundles to HuggingFace and pip-install them as overlays? Mirrors the compound-rlm library-publishing pattern.
- Benchmark methodology: should we use LLM-as-judge for image-quality scoring, or human eval at small N? Defer until Phase 4 results.
- >3 reference images: wait for the Qwen ceiling to lift, or build a pre-mux step that pairs/selects refs intelligently? Probably wait.