Pipeline Shape
The full Pipecat pipeline as it exists in app.py::run_bot.
Order
Pipeline([
transport.input(), # SmallWebRTCInputTransport — mic RTP → AudioRawFrame
EchoGuardSuppressor(...), # drops mic audio while bot speaks + 300ms tail
stt, # LocalWhisperSTT — HF Whisper large-v3-turbo
user_agg, # LLMUserAggregator — VAD turn-taking + context build
BargeInGate(), # adaptive barge-in: grace window on VAD-fired interrupts
MicroAckInjector(...), # "mm" after 500ms if pipeline is slow (Vapi pattern)
backchannel, # BackchannelController — "mm-hmm" during long user turns
delivery, # DeliveryController — priority-routed push deliveries
llm, # OpenAILLMService — vLLM / external
tts, # FishAudioTTS or LocalKokoroTTS / OpenAI
transport.output(), # SmallWebRTCOutputTransport — TTS → RTP → speaker
ProsodyTagStripper(), # cleans Fish tags out of LLM history
assistant_agg, # LLMAssistantAggregator + built-in LLMContextSummarizer
])- EchoGuardSuppressor drops mic audio while the bot is speaking +
ECHO_GUARD_MStail so VAD never sees it. See Audio Handling. - BargeInGate buffers VAD-fired interrupts during bot speech for a ~350 ms grace window; swallows coughs / backchannels, flushes real interrupts.
- MicroAckInjector fires a quiet "mm" / "hm" if the main pipeline hasn't produced bot audio within ~500 ms of UserStoppedSpeaking. Cancels on BotStartedSpeaking.
- BackchannelController watches
UserStartedSpeakingFrame/UserStoppedSpeakingFrameto fire brief listener-acks during long user utterances. See Backchannels. - DeliveryController watches the same VAD frames +
TranscriptionFrameto decide when to drain queued push results. See Delivery Policies. - ProsodyTagStripper removes Fish-style prosody tags (
[softly],[pause:300]) from TextFrames before they reach LLM context. Non-Fish TTS services strip at the adapter level viatext_filters=kwarg. - assistant_agg has pipecat's
LLMContextSummarizerbaked in; it auto-compresses old turns once token or message thresholds fire. See Memory.
Pre-tool acknowledgements
Pipecat's OpenAILLMService streams text deltas to TTS before running function calls — this is what makes inline pre-tool preambles work. The model emits "hmm, let me check" as delta.content tokens, those tokens flow through TTS to the user, then the function call executes, then post-tool tokens flow back through TTS as the answer. One LLM call, one stream, no race conditions.
The TOOL USE block in the system prompt (built by agent/filler.tool_use_block) instructs the model to do this on every tool call. See Natural-Sounding Fillers.
Key frame types
Upstream (user → agent):
InputAudioRawFrame— 20 ms PCM chunks from the micUserStartedSpeakingFrame/UserStoppedSpeakingFrame— from VADInterruptionFrame— broadcast when VAD sees the user start mid-bot-responseTranscriptionFrame— final STT output, flows into the user aggregator
Downstream (agent → user):
LLMRunFrame— optional "kick off the LLM" signal (we don't use it; user turns trigger the run automatically)LLMFullResponseStartFrame— LLM response opening markerLLMTextFrame— per-token text from the LLM streamTTSTextFrame— aggregated sentence headed to TTSTTSSpeakFrame— "speak this text now" (our duplex primitive for filler)TTSAudioRawFrame— int16 PCM output from the TTS serviceOutputAudioRawFrame— resampled to the transport's SR; the browser plays thisLLMFullResponseEndFrame— LLM response close marker
Control:
StartFrame— pipeline init (sets sample rates on services)EndFrame— graceful shutdown (uninterruptible)CancelFrame— hard stop
Latency budget (steady-state)
| Stage | Typical |
|---|---|
| STT (Whisper large-v3-turbo, 8 s audio) | 55 ms |
| LLM TTFB (Qwen 4B local / 35B local) | 100-200 ms |
| LLM first sentence | +200-500 ms |
| TTS TTFA (Fish streaming PCM) | 400-800 ms |
| TTS TTFA (Kokoro) | ~50 ms |
| Transport output + browser playback | ~30 ms |
End-to-end TTFA: ~600-1500 ms with Fish, ~200-400 ms with Kokoro.
Adding a processor
Drop it into the Pipeline([...]) list at the right position. For example, a debugging frame tracer between LLM and TTS:
class Tracer(FrameProcessor):
async def process_frame(self, frame, direction):
await super().process_frame(frame, direction)
logger.info(f"{type(frame).__name__}")
await self.push_frame(frame, direction)
pipeline = Pipeline([... llm, Tracer(), tts, ...])Processors see every frame. Remember to await super().process_frame(...) first and await self.push_frame(...) at the end, or the pipeline stalls.
Tools + function calling
agent/tools.register_tools(llm, on_finish=...) registers handlers on the LLM service. Tool schemas attach to the LLMContext:
context = LLMContext(messages, tools=tools_schema)When the LLM emits a tool call, pipecat runs the handler, takes the result, re-enters the LLM for the final spoken response. on_function_calls_started / on_function_calls_cancelled are the lifecycle hooks. There is no on_function_calls_finished — cancel long-running work from inside the handler itself.