Generate Speech (Kokoro)
1 version
Generate speech using Kokoro TTS models
Use This When
- Building voice assistants that need fast, natural-sounding text-to-speech responses
- Creating conversational UIs where low-latency TTS improves user experience
- Implementing audio feedback systems for accessibility or hands-free operation
- Generating voice narration for automated content creation or alerts
What It Does
- Converts text strings to audio using the Kokoro neural TTS pipeline
- Supports multiple voice presets via configurable voice parameter (e.g., af_heart)
- Returns an AudioFrame at a 24 kHz sample rate (tensor shape [samples, channels], typically [N, 1])
- Handles empty input gracefully by returning a silent audio frame
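The frame shape and empty-input behavior described above can be sketched in pure Python. The `to_frame` helper below is hypothetical, standing in for the component's wrapping of the Kokoro pipeline output:

```python
SAMPLE_RATE = 24_000  # Kokoro's fixed output sample rate

def to_frame(samples, silence_secs=0.25):
    """Wrap mono samples as an [N, 1] frame; emit silence for empty input.

    `samples` is a flat list of floats, a stand-in for the tensor the real
    pipeline returns (hypothetical helper, for illustration only).
    """
    if not samples:
        # Empty text produced no audio: return a short silent frame
        samples = [0.0] * int(SAMPLE_RATE * silence_secs)
    return [[s] for s in samples]  # shape [N, 1]: one channel per sample

frame = to_frame([])
print(len(frame), len(frame[0]))  # 6000 silent samples, 1 channel
```

Downstream consumers can thus rely on always receiving a non-empty [N, 1] frame, even for blank input text.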
Works Best With
- query-llm → this component → output-audio-file or audio playback for voice responses
- Chatbot systems → this component → real-time audio streaming to users
- Integration with detect-voice-activity for bidirectional voice conversations
- Alert systems that need spoken notifications rather than visual displays
Caveats
- Fixed 24kHz output sample rate; resampling required if downstream expects different rate
- Voice quality and availability depend on the Kokoro model; verify the voice parameter is valid
- Real-time factor (RTF) varies by text length and hardware; GPU strongly recommended
- Language code parameter affects pronunciation; ensure alignment with input text language
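The resampling caveat above can be illustrated with a minimal linear-interpolation resampler in pure Python. This is a sketch only; production pipelines would use a proper DSP library with anti-aliasing filtering:

```python
def resample_linear(samples, src_rate=24_000, dst_rate=16_000):
    """Naively resample mono audio by linear interpolation (no filtering)."""
    if not samples:
        return []
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio                      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

half_second = [0.0] * 12_000                 # 0.5 s of audio at 24 kHz
print(len(resample_linear(half_second)))     # 8000 samples at 16 kHz
```

Linear interpolation is adequate for a quick demo, but downsampling without a low-pass filter aliases high frequencies, so prefer a library resampler for real deployments.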
Versions
- db738b77 (latest, default) · linux/amd64
Automated release