# Use LocalAI / LiteLLM Gateway / OpenAI
protoVoice's STT, LLM, and TTS are all swappable via env. Point them at LocalAI, a LiteLLM proxy, OpenRouter, OpenAI, vllm-omni, or any mix — protoVoice doesn't need a single GPU on the host if you've already got those services running elsewhere.
## All-API setup (no in-process model loads)
If you already have LocalAI hosting Whisper STT + Kokoro TTS + an LLM, drop these lines into `.env` at the repo root (copy from `.env.example`):
```env
# LLM
START_VLLM=0
LLM_URL=http://localai:8080/v1
LLM_SERVED_NAME=qwen3.5-4b-instruct
LLM_API_KEY=

# STT
STT_BACKEND=openai
STT_URL=http://localai:8080/v1
STT_MODEL=whisper-1

# TTS
TTS_BACKEND=openai
TTS_OPENAI_URL=http://localai:8080/v1
TTS_OPENAI_MODEL=kokoro
TTS_OPENAI_VOICE=af_heart
TTS_OPENAI_SAMPLE_RATE=24000
```

Restart the server — protoVoice now does no GPU work itself. It's a thin orchestration layer on top of LocalAI, and can run on a CPU box.
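The effect of these settings can be sketched as a tiny resolver. This is a hypothetical helper, not protoVoice's actual startup code; the function name, the `"in-process"` sentinel, and the default for `START_VLLM` are assumptions for illustration:

```python
import os

def resolve_backends(env=os.environ):
    """Sketch of env-driven wiring: with the all-API settings above,
    every stage resolves to a remote URL and nothing loads locally."""
    stages = {
        # LLM_URL set -> remote endpoint; otherwise a vLLM process would be spawned
        "llm": env.get("LLM_URL", "in-process"),
        "stt": env.get("STT_URL") if env.get("STT_BACKEND") == "openai" else "in-process",
        "tts": env.get("TTS_OPENAI_URL") if env.get("TTS_BACKEND") == "openai" else "in-process",
    }
    # GPU is only needed if vLLM starts locally or any stage runs in-process
    needs_gpu = env.get("START_VLLM", "1") == "1" or "in-process" in stages.values()
    return stages, needs_gpu
```

With the all-API `.env` above, `needs_gpu` comes out `False`; flipping any one stage back to a local backend flips it to `True`.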
## Mixing providers
There's no requirement that all three use the same endpoint. Common splits:
```env
# Local STT + LLM, OpenAI TTS for premium voice
STT_BACKEND=local          # in-process Whisper on this host's GPU
LLM_URL=http://gateway:4000/v1
TTS_BACKEND=openai
TTS_OPENAI_URL=https://api.openai.com/v1
TTS_OPENAI_API_KEY=sk-...
TTS_OPENAI_MODEL=tts-1-hd
TTS_OPENAI_VOICE=nova
```

```env
# OpenAI for STT, local Fish for cloned voice, gateway LLM
STT_BACKEND=openai
STT_URL=https://api.openai.com/v1
STT_API_KEY=sk-...
LLM_URL=http://gateway:4000/v1
TTS_BACKEND=fish           # voice cloning
```

## What works with which backend
| Feature | local Whisper | OpenAI STT | Fish TTS | Kokoro TTS | OpenAI TTS |
|---|---|---|---|---|---|
| Streaming voice in/out | ✓ | ✓ | ✓ | ✓ | ✓ |
| One-shot transcribe (clone endpoint) | ✓ | ✓ | — | — | — |
| Voice cloning (`/api/voice/clone`) | — | — | ✓ | — | — |
| Inline prosody tags (`[hmm]`, `[pause]`) | — | — | ✓ | — | depends on model |
The clone endpoint guards itself — if `TTS_BACKEND` isn't `fish`, it returns an error explaining that cloning needs Fish.
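The guard logic can be sketched as a pure function. This is a hypothetical illustration, not protoVoice's actual handler; the function name, return shape, and error wording are assumptions:

```python
def clone_voice(tts_backend: str, audio_bytes: bytes) -> dict:
    """Sketch of the clone endpoint's self-guard: only the Fish backend
    can clone a voice, so any other backend gets a descriptive error."""
    if tts_backend != "fish":
        return {
            "error": "voice cloning requires TTS_BACKEND=fish; "
                     f"current backend is '{tts_backend}'",
        }
    # Placeholder success path; the real endpoint would run Fish cloning here.
    return {"status": "ok", "voice_id": "cloned-voice"}
```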
## Health check
```bash
curl http://localhost:7867/healthz
# {"status":"ok","stt_backend":"openai","tts_backend":"openai", ...}
```

`stt_backend` and `tts_backend` reflect what's actually wired.
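A deployment script can assert the wiring from that response. A minimal sketch, assuming only the field names shown in the example response (`check_health` is a hypothetical helper):

```python
import json

def check_health(payload: str, want_stt: str, want_tts: str) -> bool:
    """Parse a /healthz response body and confirm the wired backends
    match what this deployment expects."""
    body = json.loads(payload)
    return (body.get("status") == "ok"
            and body.get("stt_backend") == want_stt
            and body.get("tts_backend") == want_tts)
```

Useful as a smoke test after editing `.env`: a typo'd backend name shows up here instead of at the first voice turn.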
## Latency considerations
Going all-API trades local-GPU latency for network round-trips. Rough budgets:
| Path | Local (GPU on this box) | API (LAN gateway) | API (cloud) |
|---|---|---|---|
| STT (8s utterance) | ~55 ms | ~150-300 ms | ~600-1500 ms |
| LLM TTFB (first byte) | ~40 ms | ~100-200 ms | ~300-800 ms |
| TTS TTFA (first audio, Fish-quality) | ~580 ms | ~200-400 ms (LocalAI) | ~400-800 ms (OpenAI tts-1) |
Sub-second turn-around is realistic with a LAN gateway. Cloud-only paths push to 1-2 seconds.
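Adding the three budgets gives a naive upper bound per path. This is arithmetic on the table above, not measured data, and real pipelines overlap stages (LLM tokens stream into TTS), so actual turns land below these sums:

```python
# Additive turn budgets from the table above, in milliseconds.
# (lo, hi) ranges; the single-number local column is a degenerate range.
BUDGETS_MS = {
    "local": {"stt": (55, 55),    "llm_ttfb": (40, 40),    "tts_ttfa": (580, 580)},
    "lan":   {"stt": (150, 300),  "llm_ttfb": (100, 200),  "tts_ttfa": (200, 400)},
    "cloud": {"stt": (600, 1500), "llm_ttfb": (300, 800),  "tts_ttfa": (400, 800)},
}

def turn_range_ms(path: str) -> tuple[int, int]:
    """Worst-case (no-overlap) end-to-end range for one voice turn."""
    lo = sum(v[0] for v in BUDGETS_MS[path].values())
    hi = sum(v[1] for v in BUDGETS_MS[path].values())
    return lo, hi
```

Even without overlap, the LAN path sums to 450-900 ms, so sub-second is credible; the cloud path bottoms out around 1.3 s, which is where the 1-2 second figure comes from once streaming overlap trims the upper end.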
## Examples
Joe's setup (LocalAI, Whisper + Kokoro + an LLM on the same box) — drop in `.env`:
```env
START_VLLM=0
LLM_URL=http://localai:8080/v1
LLM_SERVED_NAME=qwen3.5-4b
STT_BACKEND=openai
STT_URL=http://localai:8080/v1
STT_MODEL=whisper-1
TTS_BACKEND=openai
TTS_OPENAI_URL=http://localai:8080/v1
TTS_OPENAI_MODEL=kokoro
TTS_OPENAI_VOICE=af_heart
```

Then `docker compose up -d protovoice`. The protovoice container has no GPU work to do in this configuration; you can drop `runtime: nvidia` and `device_ids` from the compose file.
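A compose override along these lines strips the GPU reservation. This is a sketch; the service name and which GPU keys your file actually uses (`runtime:` vs. a `deploy` device reservation) must be checked against your real `docker-compose.yml`:

```yaml
# docker-compose.override.yml (hypothetical sketch)
services:
  protovoice:
    # runtime: nvidia   <- delete; no in-process models remain
    # a deploy.resources.reservations.devices block (device_ids) can go too
    environment:
      - START_VLLM=0
```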
LiteLLM Proxy (e.g. the protoLabs `gateway:4000`) — same shape, single master key, model names from your `litellm_config.yaml`:
```env
START_VLLM=0
LLM_URL=http://gateway:4000/v1
LLM_SERVED_NAME=claude-opus-4-6
LLM_API_KEY=${LITELLM_MASTER_KEY}
STT_BACKEND=openai
STT_URL=http://gateway:4000/v1
STT_MODEL=whisper-1
STT_API_KEY=${LITELLM_MASTER_KEY}
TTS_BACKEND=openai
TTS_OPENAI_URL=http://gateway:4000/v1
TTS_OPENAI_MODEL=tts-1
TTS_OPENAI_VOICE=alloy
TTS_OPENAI_API_KEY=${LITELLM_MASTER_KEY}
LITELLM_MASTER_KEY=sk-...
```

LiteLLM is route-shaped — it dispatches model names to whatever provider you've configured behind it (Anthropic, OpenAI, Bedrock, Vertex, your own vLLM, …). A single `LLM_SERVED_NAME=claude-opus-4-6` might hit Anthropic while `tts-1` hits OpenAI. Pick model names from your `litellm_config.yaml`.
> **TIP**
> LiteLLM requires the master key on every request by default. The same key works for all three endpoints — there's no separate auth per modality. For tighter scoping, issue one virtual key per service.
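Issuing a scoped key goes through LiteLLM's key management API. A sketch under assumptions: the `/key/generate` route and `models`/`key_alias` fields are LiteLLM's, but the gateway host and model list here are illustrative — match them to your deployment:

```bash
# One virtual key scoped to the TTS model only
curl http://gateway:4000/key/generate \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"models": ["tts-1"], "key_alias": "protovoice-tts"}'
# Put the returned key in TTS_OPENAI_API_KEY instead of the master key.
```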