Running with Docker Compose

protoVoice ships three services:

protovoice — WebRTC + Whisper + routing vLLM + Kokoro fallback (GPU 0)
fish-speech — Fish Audio S2-Pro TTS sidecar (GPU 1)
fish-openai-shim — tiny FastAPI proxy on :8093 that rewrites OpenAI /v1/audio/speech requests into Fish's native /v1/tts and forwards them to fish-speech. Lets LiteLLM and any OpenAI SDK client route TTS to Fish without knowing Fish's wire format. The container is pure HTTP + ffmpeg — it does no ML and needs no GPU. See TTS Backends → OpenAI-compat.

Default config assumes two GPUs on one host. Adjust via env if you have fewer.

Basic start

bash

docker compose up -d
docker compose logs -f protovoice

Health check: curl http://localhost:7867/healthz.

Common overrides

All env vars are documented in the Environment Variables reference. The ones you'll reach for most:

bash

# Pin to specific GPUs
PROTOVOICE_GPU=0 FISH_GPU=1 docker compose up -d

# Use only Kokoro (no Fish sidecar)
TTS_BACKEND=kokoro docker compose up -d protovoice

# Point at an external LLM instead of starting our own vllm
START_VLLM=0 LLM_URL=http://10.0.0.10:8000/v1 docker compose up -d

Volume mounts

HF_HOME → /models — HuggingFace cache. Default: /mnt/models/huggingface.
FISH_REFERENCES_DIR → /app/references — persists saved voice references across restarts.
../fish-speech/.venv → /app/.venv (read-only) — bind-mounted because Fish's .dockerignore excludes .venv/, so the image ships without it. Without this mount the container crashes at startup with /app/.venv/bin/python: No such file or directory.
../fish-speech/checkpoints → /app/checkpoints (read-only) — same rationale; S2-Pro weights are too large to bake into the image. Keep them in the host checkout and mount.

GPU allocation

Service	Default GPU	VRAM
protovoice (Whisper + routing vLLM + Kokoro)	0	~23 GB
fish-speech (S2-Pro with `--half --compile`)	1	~22 GB

On a single-GPU host, set TTS_BACKEND=kokoro to skip Fish entirely and keep everything on GPU 0.

Cold start

Fish Audio's first call triggers a ~2-minute torch.compile codegen. protoVoice's prewarm() on startup sends a single silent utterance to absorb this — you should never see that hit in a real turn.

The sidecar's healthcheck uses start_period: 600s so Docker doesn't mark the container unhealthy and restart-loop while torch.compile is still running. If you observe the container repeatedly bouncing on first boot, confirm that value hasn't been lowered.

Fish image needs a C toolchain + Python headers

Dockerfile.fish installs build-essential and python3-dev on top of nvidia/cuda:runtime. Both are required — torch.compile's Inductor backend shells out to gcc at synth time to build CUDA kernels as Python extension modules. Missing either produces Failed to find C compiler or CalledProcessError ... cuda_utils.c on first synth, followed by an infinite restart loop. If you ever strip these from the Dockerfile for image size, Fish will stop working on any torch.compile-gated build.

Whisper takes ~2 s to load + warm.

vLLM takes 30-60 s to load + warm depending on model size.

Stopping cleanly

bash

docker compose down

down tears down both containers. The Fish torch.compile cache lives in-container and will re-run on next start — mount /tmp/torchinductor_* to persist it if you want a faster restart, though this is currently undocumented upstream.

Running with Docker Compose ​

Basic start ​

Common overrides ​

Volume mounts ​

GPU allocation ​

Cold start ​

Fish image needs a C toolchain + Python headers ​

Stopping cleanly ​

Running with Docker Compose

Basic start

Common overrides

Volume mounts

GPU allocation

Cold start

Fish image needs a C toolchain + Python headers

Stopping cleanly