
BENCHMARKS

Methodology and how to reproduce it. Numbers will land here as Phases 1-7 ship.


Status

Phases 1-3 (gen + edit + multi-ref + bgremove) — implemented; benchmarking pipeline scaffolded, but full numbers not yet committed.

Phases 4-7 — methodology defined; runs blocked on those phases shipping.


What we're measuring

For each model under test (protoBanana, nano-banana 2, GPT-Image-2, FLUX.1 Kontext, Qwen-Image direct via Replicate), we score:

| Dimension | How |
| --- | --- |
| Prompt adherence | Hand-rated 0-5: does the image show the prompt's content? |
| Composition / aesthetics | Hand-rated 0-5: framing, lighting, balance |
| Text rendering | Binary: is rendered text in the image legible + correct? |
| Edit fidelity (edit suite only) | Hand-rated 0-5: does the edit target the right region without disturbing the rest? |
| Identity preservation (multi-turn edit) | Hand-rated 0-5: does the subject still look like the original after N edits? |
| Latency p50 | Wall-clock seconds per image, median across N=20 samples |
| Cost per image | $/image — list price for cloud APIs; for local, electricity at ~$0.20/kWh × ~600 W × per-image runtime |
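The local cost-per-image figure follows directly from the electricity formula above. A minimal sketch, assuming the table's ~$0.20/kWh rate and ~600 W draw (the runtime argument is illustrative):

```python
# Local cost per image = price per kWh x draw (kW) x runtime (hours).
# The 600 W draw and $0.20/kWh defaults come from the table above;
# the 3-second runtime below is illustrative, not a measured number.
def local_cost_per_image(runtime_s: float, watts: float = 600.0,
                         usd_per_kwh: float = 0.20) -> float:
    kwh = (watts / 1000.0) * (runtime_s / 3600.0)
    return usd_per_kwh * kwh

# A ~3 s generation works out to roughly $0.0001 per image.
print(round(local_cost_per_image(3.0), 6))  # 0.0001
```

This is why the local figure is listed at ~$0.0001: a few seconds of GPU time at household electricity rates rounds to a hundredth of a cent.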

Test suites

Gen suite — 25 prompts

1.  a watercolor of a cat in a hat
2.  cinematic landscape of misty mountains at dawn, ultra-wide
3.  portrait of an elderly woman with deep wrinkles, soft Rembrandt light
4.  a hero banner for a SaaS product page, minimalist
5.  a poster that says "GRAND OPENING" in bold serif type
6.  isometric illustration of a small mountain town
7.  a 35mm film photo of a vintage 1970s diner interior
8.  abstract geometric pattern, art deco, gold and navy
9.  a friendly cartoon robot with one eye, 3D rendered
10. dense forest in autumn, top-down satellite view
11. a chef plating a dish in a high-end restaurant kitchen
12. a single perfect tomato on a marble countertop, food photography
13. cyberpunk cityscape at night with rain and neon
14. medical illustration of a human heart, anatomical accuracy
15. a watercolor map of an imaginary island
16. studio photo of a vintage Leica camera
17. a hand-drawn architectural sketch of a Victorian house
18. matte painting of a dragon flying over a castle
19. logo design for a coffee shop called "Slow Mornings"
20. a child's crayon drawing of a family at the beach
21. a black-and-white documentary photo of a marketplace
22. flat-design illustration of a person hiking a mountain
23. a steaming bowl of ramen, photorealistic, top-down
24. surreal Dali-style melting clocks in a desert
25. a single autumn leaf with morning dew, macro photography

Coverage: photorealism, illustration, text rendering, technical accuracy, abstract art, design styles.

Edit suite — 15 prompts

10 single-image edits:

  • Color change ("make the hat red")
  • Object add ("add a scarf")
  • Object remove ("remove the umbrella")
  • Style transfer ("make this anime style")
  • Background replace ("put this on a beach")
  • Material change ("velvet texture on the chair")
  • Lighting change ("sunset lighting")
  • Pose change ("turn to face the camera")
  • Expression ("smile slightly")
  • Time of day ("nighttime")

5 multi-turn (edit on prior edit):

  • 4-step progressive edits to test identity preservation drift
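The multi-turn protocol feeds each edit's output back in as the next edit's input, keeping every intermediate so identity drift can be rated per step. A minimal sketch of that loop — `edit_image` is a hypothetical stand-in for whatever edit endpoint is under test, not an actual API in this repo:

```python
# Each edit takes the previous step's output as its input; every intermediate
# is kept so identity preservation can be hand-rated at each step.
# `edit_image` is a placeholder callable, not a real endpoint.
def run_multi_turn(edit_image, source, prompts):
    steps, current = [], source
    for i, prompt in enumerate(prompts, start=1):
        current = edit_image(current, prompt)
        steps.append({"step": i, "prompt": prompt, "image": current})
    return steps

# Toy stand-in: record the edit history as a string instead of a real image.
chain = run_multi_turn(lambda img, p: f"{img}+[{p}]", "cat.png",
                       ["add a scarf", "make it red",
                        "sunset lighting", "anime style"])
print(chain[-1]["image"])
# cat.png+[add a scarf]+[make it red]+[sunset lighting]+[anime style]
```

The point of keeping all four intermediates is that drift is usually gradual; a single before/after comparison would hide at which step the subject stopped looking like the original.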

Multi-ref suite — 10 prompts

5× 2-image:

  • Character + outfit
  • Subject + background
  • Foreground + style
  • Object + lighting
  • Person + pose reference

5× 3-image:

  • Character + outfit + setting
  • Three style refs blended
  • Logo + colors + typography
  • Three character refs (group shot)
  • Foreground + midground + background

Region edit suite (Phase 4) — 10 prompts

  • "change the man's tie to red"
  • "replace her hat with a top hat"
  • "make just the dog smaller"
  • ... (etc.)

Inpaint suite (Phase 5) — 5 prompts with brushed masks

Outpaint suite (Phase 6) — 5 directional extensions


Reference systems

| System | Access | Cost (per image) |
| --- | --- | --- |
| protoBanana (us) | local gateway | ~$0.0001 (electricity) |
| Nano-Banana 2 | Google API via gateway alias | ~$0.04 |
| GPT-Image-2 | OpenAI API | ~$0.05 |
| FLUX.1 Kontext | Replicate | ~$0.03 |
| Qwen-Image (Replicate) | Replicate | ~$0.02 |

Reproducing

```bash
# 1. Set credentials
export LITELLM_API_KEY=...
export OPENAI_API_KEY=...
export REPLICATE_API_KEY=...
export GEMINI_API_KEY=...

# 2. Run the benchmark
uv run python benchmarks/run_benchmark.py \
  --suites gen,edit,multiref \
  --models protobanana,nano-banana-2,gpt-image-2,flux-kontext \
  --output benchmarks/results/$(date +%Y-%m-%d)/

# 3. Score
uv run python benchmarks/score.py \
  benchmarks/results/2026-05-XX/ \
  --output benchmarks/scored/2026-05-XX.csv
```

Scoring is currently a hand-rating UI; Phase 7 may add LLM-as-judge auto-scoring for prompt adherence.
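Whether ratings come from the hand-rating UI or a future auto-scorer, the roll-up to headline numbers is the same: mean hand-rating per model per suite, and the latency p50 defined above (median across samples). A sketch, with illustrative column names rather than the actual scored-CSV schema:

```python
# Roll scored rows up into headline numbers: mean rating and latency p50
# (median across samples) per model. Column names are illustrative only;
# the real scored CSV schema may differ.
from statistics import mean, median

rows = [
    {"model": "protobanana", "adherence": 4, "latency_s": 2.9},
    {"model": "protobanana", "adherence": 3, "latency_s": 3.4},
    {"model": "gpt-image-2", "adherence": 5, "latency_s": 11.0},
]

def summarize(rows, model):
    picked = [r for r in rows if r["model"] == model]
    return {
        "adherence_mean": mean(r["adherence"] for r in picked),
        "latency_p50": median(r["latency_s"] for r in picked),
    }

print(summarize(rows, "protobanana"))
```

Using the median rather than the mean for latency keeps a single cold-start outlier from distorting the per-model number.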


Honest framing

Going in:

  • We expect to lose 5-15pp to nano-banana 2 / GPT-Image-2 on raw quality (multi-source comparisons consistently rank Qwen-Image-Edit behind the frontier models).
  • We expect to win on text rendering (Qwen's strength).
  • We expect to lose badly on >3-reference compose tasks (a 3-reference cap vs 14).
  • We expect to win on cost-per-image and data locality (definitionally).
  • We expect comparable or better latency vs cloud APIs on local infra (warm models < network round-trip + cold-start in the cloud).

The bet is that for organizations whose acceptable quality threshold is "good" rather than "best", and whose data sensitivity is high, the trade-off is favorable.


Results format

When numbers land, they go in benchmarks/results/<date>/:

```
benchmarks/results/2026-05-15/
├── summary.md          # headline numbers per model per suite
├── gen.csv             # per-prompt scores
├── edit.csv
├── multiref.csv
├── samples/            # the actual generated images for spot-check
│   ├── protobanana/
│   ├── nano-banana-2/
│   └── ...
└── methodology.md      # date-stamped run config (model versions, etc.)
```

A summary.md lands in this directory linking to the latest run.


Why this matters

Without numbers we're hand-waving. With numbers we can:

  • Honestly tell users when to use protoBanana vs cloud
  • Track quality vs latency trade-offs as we iterate
  • Reproduce results when comparing PRs
  • Anchor blog posts in measurable claims

Apache-2.0 licensed. Docs follow the Diátaxis framework.