Generate Speech (Kokoro)
1 version
Generate speech using Kokoro TTS models
Use This When
- Building voice assistants that need fast, natural-sounding text-to-speech responses
- Creating conversational UIs where low-latency TTS improves user experience
- Implementing audio feedback systems for accessibility or hands-free operation
- Generating voice narration for automated content creation or alerts
What It Does
- Converts text strings to audio using the Kokoro neural TTS pipeline
- Supports multiple voice presets via configurable voice parameter (e.g., af_heart)
- Returns an AudioFrame at a 24 kHz sample rate (tensor shape [samples, channels], typically [N, 1])
- Handles empty input gracefully by returning a silent audio frame
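The frame shape and empty-input behavior described above can be sketched in pure Python. The `to_frame` helper below is hypothetical, standing in for the component's wrapping of the Kokoro pipeline output:

```python
SAMPLE_RATE = 24_000  # Kokoro's fixed output sample rate

def to_frame(samples, silence_secs=0.25):
    """Wrap mono samples as an [N, 1] frame; emit silence for empty input.

    `samples` is a flat list of floats, a stand-in for the tensor the real
    pipeline returns (hypothetical helper, for illustration only).
    """
    if not samples:
        # Empty text produced no audio: return a short silent frame
        samples = [0.0] * int(SAMPLE_RATE * silence_secs)
    return [[s] for s in samples]  # shape [N, 1]: one channel per sample

frame = to_frame([])
print(len(frame), len(frame[0]))  # 6000 silent samples, 1 channel
```

Downstream consumers can thus rely on always receiving a non-empty [N, 1] frame, even for blank input text.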
Works Best With
- query-llm → this component → output-audio-file or audio playback for voice responses
- Chatbot systems → this component → real-time audio streaming to users
- Integration with detect-voice-activity for bidirectional voice conversations
- Alert systems that need spoken notifications rather than visual displays
Caveats
- Fixed 24kHz output sample rate; resampling required if downstream expects different rate
- Voice quality and availability depend on the Kokoro model; verify the voice parameter is valid
- Real-time factor (RTF) varies by text length and hardware; GPU strongly recommended
- Language code parameter affects pronunciation; ensure alignment with input text language
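The resampling caveat above can be illustrated with a minimal linear-interpolation resampler in pure Python. This is a sketch only; production pipelines would use a proper DSP library with anti-aliasing filtering:

```python
def resample_linear(samples, src_rate=24_000, dst_rate=16_000):
    """Naively resample mono audio by linear interpolation (no filtering)."""
    if not samples:
        return []
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio                      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

half_second = [0.0] * 12_000                 # 0.5 s of audio at 24 kHz
print(len(resample_linear(half_second)))     # 8000 samples at 16 kHz
```

Linear interpolation is adequate for a quick demo, but downsampling without a low-pass filter aliases high frequencies, so prefer a library resampler for real deployments.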
Versions
- db738b77 (latest, default) · linux/amd64
Automated release