Browser-first inference (deferred research)
Research snapshot from 2026-04-22. Deferred in favor of the R3F orb migration. When we come back to this, start here — every link and model identity was verified on that date.
Headline
Moving protoVoice's inference stack into the browser via WebGPU is viable for the LLM on modern desktop hardware (Gemma 4 E2B, ~20 tok/s on M3), marginal on iPhone 15 Pro, painful on mid-range Android [speculative — no direct benchmarks at the time]. The full pipeline cannot go client-side without rewriting pipecat's processor primitives in TypeScript, which would forfeit the voice-pipeline moat (BackchannelController, DeliveryController, BargeInGate, MicroAckInjector, async-tool inbox). A hybrid where the LLM runs in the browser but STT/TTS/pipeline stay server-side is the pragmatic path.
Models confirmed
google/gemma-4-E2B-it — the headline candidate
- Released 2026-04-02, Apache-2.0
- 2.3 B effective / 5.1 B with embeddings, 128K context
- Native text + image + audio (audio ≤ 30 s) — audio modality means Whisper can be skipped on the happy path
- ~1.3–1.5 GB at q4; day-one transformers.js v3 support; LiteRT-LM web-optimized variant published
- Speculative throughput: ~20–25 tok/s M3 MacBook, ~10–15 tok/s iPhone 15 Pro, ~4–8 tok/s mid-range Android, ~50–80 tok/s laptop RTX 4060
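To put the speculative throughput figures in voice terms, the sketch below converts a tok/s range into the seconds needed to decode one reply. The 40-token reply length is an illustrative assumption, not a measured protoVoice figure, and the names `speculativeTokPerSec` and `decodeSeconds` are hypothetical.

```typescript
// Speculative decode-throughput ranges (tok/s) copied from the figures above.
type Range = { low: number; high: number };

const speculativeTokPerSec: Record<string, Range> = {
  "M3 MacBook": { low: 20, high: 25 },
  "iPhone 15 Pro": { low: 10, high: 15 },
  "mid-range Android": { low: 4, high: 8 },
  "laptop RTX 4060": { low: 50, high: 80 },
};

// Seconds to decode `tokens` tokens; the best case uses the high end of the range.
function decodeSeconds(tokens: number, range: Range): { best: number; worst: number } {
  if (tokens < 0 || range.low <= 0 || range.high < range.low) {
    throw new RangeError("invalid token count or throughput range");
  }
  return { best: tokens / range.high, worst: tokens / range.low };
}

// ASSUMPTION: a short spoken reply is about 40 tokens.
const replyTokens = 40;
Object.keys(speculativeTokPerSec).forEach((device) => {
  const { best, worst } = decodeSeconds(replyTokens, speculativeTokPerSec[device]);
  console.log(`${device}: ${best.toFixed(1)}-${worst.toFixed(1)} s for ${replyTokens} tokens`);
});
```

Under that assumption, mid-range Android would need roughly 5 to 10 s to decode a single reply, which is consistent with calling that tier painful.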
prism-ml/Bonsai-1.7B-gguf — 1-bit, not a router
- ~290 MB on disk (1-bit quantized), working WebGPU demo at webml-community/bonsai-webgpu
- 6–12 tok/s on integrated GPU
- Quality: simple Q&A / rephrase. Not viable as a ReAct/tool-call router. Good as an offline-fallback tier.
deepgrove/Bonsai (the original 0.5B ternary base) is a research artifact — base model only, no instruct. Skip.
Browser runtimes surveyed
| Runtime | Best for | Strengths | Weaknesses |
|---|---|---|---|
| WebLLM (MLC) | LLM chat | Fastest WebGPU LLM (71–80% of native); OpenAI-compat API | Custom model format; per-arch recompile |
| MediaPipe LiteRT-LM | Gemma specifically | Google-authored, KV-cache tuned for web, Web Worker ergonomics | Google-model-centric |
| transformers.js v3 | Audio/vision/multi-modal, STT, TTS | 100× speedup on WebGPU vs WASM; huge ONNX zoo | Slower LLM decode than WebLLM |
| ONNX Runtime Web | Custom pipelines | Max control, smallest runtime | Ergonomics poor vs transformers.js |
WebGPU shipped by default across Chrome/Firefox/Edge/Safari on 2025-11-25; coverage ~82.7% as of the 2026 inference analyses.
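Since coverage is not universal, the client needs a capability check before loading any model. A minimal sketch follows, assuming the server fall-through option is chosen for unsupported clients (one of the candidates listed under the open questions, not a decision). `NavigatorLike` and `pickBackend` are hypothetical names; the only real API relied on is `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter exists.

```typescript
// Minimal structural type so the sketch compiles without @webgpu/types.
type NavigatorLike = { gpu?: { requestAdapter(): Promise<unknown> } };

type Backend = "webgpu" | "server";

// Returns "webgpu" only when an adapter is actually granted.
async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "server"; // WebGPU API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "server"; // null means no suitable adapter
  } catch {
    return "server"; // treat any failure as "unavailable"
  }
}
```

In a browser this would be called with the global `navigator` (cast to `NavigatorLike` if the WebGPU typings are not installed). Passing the object in keeps the check testable outside a browser.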
STT / TTS in the browser
- Moonshine Web — ~60 MB, streaming-first, ~75 ms latency claim. Best for weak hardware.
- Whisper large-v3-turbo via transformers.js — ~800 MB; q8 decoder has a known WebGPU bug, fall back to q4 or WASM.
- Kokoro.js (kokoro-js, 82 M params, ~150 MB) — recommended default for TTS; matches server-side Kokoro voices.
- Piper WASM — WASM-only fallback, 904 voices.
- @ricky0123/vad — Silero VAD in the browser via onnxruntime-web + AudioWorklet. Same model pipecat uses server-side.
- Fish S2 in browser — not viable. 4.4B params, 22GB VRAM server-side with --half --compile; no browser port exists.
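The sizes quoted in this note add up to a first-load download budget. The sketch below sums them for two candidate stacks. It assumes 1 GB = 1000 MB for the Gemma figure and leaves out the VAD model and runtime libraries because their sizes are not stated here. `sizes` and `totalDownloadMB` are hypothetical names.

```typescript
// Approximate artifact sizes in MB, taken from the figures quoted in this note.
type SizeMB = { low: number; high: number };

const sizes: Record<string, SizeMB> = {
  "gemma-4-E2B-it q4": { low: 1300, high: 1500 },
  "Moonshine Web": { low: 60, high: 60 },
  "Whisper large-v3-turbo": { low: 800, high: 800 },
  "Kokoro.js": { low: 150, high: 150 },
};

// Sums the low and high estimates for the named components.
function totalDownloadMB(components: string[]): SizeMB {
  return components.reduce<SizeMB>((acc, name) => {
    const size = sizes[name];
    if (!size) throw new Error(`unknown component: ${name}`);
    return { low: acc.low + size.low, high: acc.high + size.high };
  }, { low: 0, high: 0 });
}

const light = totalDownloadMB(["gemma-4-E2B-it q4", "Moonshine Web", "Kokoro.js"]);
const heavy = totalDownloadMB(["gemma-4-E2B-it q4", "Whisper large-v3-turbo", "Kokoro.js"]);
console.log(`Moonshine stack: ${light.low}-${light.high} MB`);
console.log(`Whisper stack: ${heavy.low}-${heavy.high} MB`);
```

On these numbers the Moonshine stack lands around 1.5 to 1.7 GB and the Whisper stack around 2.25 to 2.45 GB, so the LLM weights dominate either way.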
Pipecat pipeline reality
@pipecat-ai/client-js is a client transport SDK, not a client-side pipeline engine. The processor primitives (Pipeline([...]), FrameProcessor, aggregators, observers, BackchannelController, DeliveryController, BargeInGate, MicroAckInjector, EchoGuard*, ProsodyTagStripper) live in the Python pipecat-ai package. No pipecat-js exists.
Implication: "full browser" requires rewriting those primitives in TypeScript. That's the moat. Don't do it for a marginal privacy/cost gain.
Recommended architecture (when revisited)
Option B — hybrid with client LLM in three phases:
- Spike (1–2w): standalone React route /lab/browser-llm running Gemma 4 E2B via WebLLM. Text-in/text-out harness against existing skill prompts. Measure tok/s + TTFT on target devices.
  - Kill gate: <15 tok/s on M3 or TTFT >1 s → abort.
- Relay (3–4w): FastAPI WebSocket shim (/api/local-llm/relay) exposing an OpenAI-compatible API. Server's OpenAILLMService posts to the shim; shim forwards to the connected browser client over the existing RTVI data channel. New plugins/local-llm/ plugin runs WebLLM, streams deltas back. Feature-flagged.
  - De-risk: inline pre-tool preambles depend on token-streaming fidelity through the relay. Prototype that first.
- STT/TTS swap (2–3w, optional): Moonshine Web + Kokoro.js + @ricky0123/vad. Keep Fish + Whisper if voice-clone is the differentiator.
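The Spike phase's measurement and kill gate can be sketched as a small harness. This is a minimal sketch, not the planned implementation. It assumes the runtime exposes decoded tokens as an async iterable and that the clock is injected (`performance.now` in the browser). `measureStream`, `passesKillGate`, and `SpikeResult` are hypothetical names.

```typescript
type SpikeResult = { ttftMs: number; tokPerSec: number };

// Consumes a stream of decoded tokens and times it with an injectable clock.
async function measureStream(
  tokens: AsyncIterable<string>,
  nowMs: () => number,
): Promise<SpikeResult> {
  const start = nowMs();
  let first: number | null = null;
  let last = start;
  let count = 0;
  for await (const _token of tokens) {
    last = nowMs();
    if (first === null) first = last; // timestamp of the first token
    count += 1;
  }
  if (first === null) throw new Error("stream produced no tokens");
  // Decode rate excludes TTFT: (count - 1) intervals between first and last token.
  const decodeSpanSec = (last - first) / 1000;
  const tokPerSec = decodeSpanSec > 0 ? (count - 1) / decodeSpanSec : 0;
  return { ttftMs: first - start, tokPerSec };
}

// Kill gate from the Spike phase: abort below 15 tok/s or above 1 s TTFT.
function passesKillGate(r: SpikeResult): boolean {
  return r.tokPerSec >= 15 && r.ttftMs <= 1000;
}
```

Separating TTFT from the decode rate keeps a slow prefill from being averaged into tok/s, so each half of the kill gate is tested against its own number.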
Risks to de-risk early
- WebRTC data-channel bandwidth for LLM streaming (likely fine; RTVI already carries tens of KB/s).
- Mobile thermal throttling — plan a "server reclaim" hand-off when a hot client falls back.
- Gemma 4 E2B tool-calling quality vs Qwen-35B-A3B — run existing tool-schema fixtures in the spike.
- Voice-clone skills (Fish reference id) pin TTS server-side — auto-disable local-LLM mode when active skill uses a custom Fish reference.
- role: developer quirk — vLLM rejects; we set llm.supports_developer_role = False. Validate WebLLM's OpenAI shim accepts both.
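Two of these risks, thermal throttling and voice-clone pinning, reduce to a per-turn decision about where the LLM runs. A minimal sketch of that decision follows. Every name in it (`RoutingInput`, `chooseLlmLocation`, the field names) is hypothetical, and the reclaim threshold is an assumed tunable, not a figure from this research.

```typescript
type LlmLocation = "browser" | "server";

interface RoutingInput {
  skillUsesCustomFishReference: boolean; // voice-clone skill is active
  recentTokPerSec: number[];             // rolling throughput samples from the client
  minTokPerSec: number;                  // ASSUMPTION: threshold for "server reclaim"
}

function chooseLlmLocation(input: RoutingInput): LlmLocation {
  // Voice-clone skills auto-disable local-LLM mode.
  if (input.skillUsesCustomFishReference) return "server";
  const samples = input.recentTokPerSec;
  if (samples.length === 0) return "browser"; // nothing measured yet
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  // A throttled (hot) client drops below the floor and hands the LLM back.
  return mean < input.minTokPerSec ? "server" : "browser";
}

console.log(chooseLlmLocation({
  skillUsesCustomFishReference: false,
  recentTokPerSec: [22, 14, 9], // throughput decaying as the device heats up
  minTokPerSec: 15,
}));
```

A mean over a short window is the simplest possible signal. A real hand-off would probably also need hysteresis so the session does not flap between browser and server.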
Open questions for when this restarts
- Primary model: Gemma 4 E2B (recommended) vs Bonsai 1.7B?
- Router quality floor — willing to accept slightly worse ReAct than Qwen-35B?
- WebGPU-unavailable fallback — refuse / server-fall-through / WASM at 10× slower?
- Desktop-first or mobile-first?
- Privacy framing — is "conversations never leave your device" load-bearing marketing? (If yes, Option B isn't enough.)
- Voice-clone deprecation appetite — Fish pinning keeps TTS server-side forever.