
Detect Voice Activity (Silero VAD)
Frame-level voice activity detection using the Silero VAD model. Returns true whenever any speech region is detected inside the audio frame — typical use is gating ASR or downstream language stages.
How it fits
AudioFrame ──► detect_voice_activity_silero_vad ──► Bool (speech present)
Pick this when you want a cheap "is anyone talking?" signal to avoid spending compute on silence. For full turn-end detection use detect_turn_speech.
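As a sketch of the gating idea, with pure-Python stand-ins rather than the worker's actual API (the names `gate_asr`, `vad`, and `asr` are illustrative): run the ASR stage only on frames the VAD flags as speech, so silence costs nothing downstream.

```python
from typing import Callable, Iterable, List

def gate_asr(
    frames: Iterable[bytes],
    vad: Callable[[bytes], bool],   # stand-in for detect_voice_activity_silero_vad
    asr: Callable[[bytes], str],    # stand-in for a transcribe_* worker
) -> List[str]:
    """Run ASR only on frames the VAD flags as speech; skip silent frames."""
    return [asr(f) for f in frames if vad(f)]

# Toy demo with stub stages: frames prefixed b"s:" count as speech.
frames = [b"silence..", b"s:hello", b"quiet", b"s:world"]
texts = gate_asr(
    frames,
    vad=lambda f: f.startswith(b"s:"),
    asr=lambda f: f[2:].decode(),
)
# texts == ["hello", "world"]
```

The same shape applies whatever the real stages are: the boolean from the VAD simply decides whether the expensive stage runs at all.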
Typical pipelines
- Speech-gated transcription: input_audio_file → denoise_audio_mp_senet → detect_voice_activity_silero_vad → filter → transcribe_audio_faster_whisper → send_http
- Conversation segment collector: live audio → detect_voice_activity_silero_vad → collect(gate) → transcribe_audio_faster_whisper → generate_text_ollama
- Recording auto-trim: input_audio_file → detect_voice_activity_silero_vad → filter → output_audio_file (only speech)
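The segment-collector pattern can be sketched as a small accumulator (again with pure-Python stand-ins, not the collect worker's real interface): buffer frames while the VAD reports speech, and emit the buffered segment once speech ends.

```python
from typing import Iterable, List, Tuple

def collect_segments(flagged: Iterable[Tuple[bytes, bool]]) -> List[bytes]:
    """Group consecutive speech-flagged frames into segments for downstream ASR."""
    segments: List[bytes] = []
    buf: List[bytes] = []
    for frame, is_speech in flagged:
        if is_speech:
            buf.append(frame)
        elif buf:                       # speech just ended: emit the segment
            segments.append(b"".join(buf))
            buf = []
    if buf:                             # flush a segment still open at stream end
        segments.append(b"".join(buf))
    return segments

stream = [(b"he", True), (b"llo", True), (b"--", False), (b"world", True)]
# collect_segments(stream) == [b"hello", b"world"]
```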
Caveats
- The boolean fires on ANY speech-flagged region in the frame, no matter how short — clicks, coughs, or sneezes can register as speech.
- Silero is trained at 16 kHz. The worker resamples internally; downsampling from higher rates is fine, but upsampling 8 kHz input cannot restore the missing high-frequency content, so accuracy suffers.
- No exposed tuning knobs (threshold, min/max speech duration, padding). If you need to filter by duration, accumulate VAD outputs downstream.
- On CPU the worker disables PyTorch's NNPACK backend (Silero needs it disabled for predictable behaviour). On GPU this is irrelevant.
- Only device is configurable in this worker — for fine-grained Silero tuning you'll need a custom variant or post-processing.
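Since no minimum-speech-duration knob is exposed, a downstream debounce can suppress one-frame blips (clicks, coughs) before they trigger ASR. A minimal sketch, assuming you accumulate the worker's per-frame booleans yourself (the function name and threshold are illustrative):

```python
from typing import Iterable, List

def debounce(flags: Iterable[bool], min_frames: int = 3) -> List[bool]:
    """Drop speech runs shorter than min_frames; approximates the
    min-speech-duration threshold the worker itself does not expose."""
    flags = list(flags)
    out = [False] * len(flags)
    i = 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j < len(flags) and flags[j]:
                j += 1                      # scan to the end of this speech run
            if j - i >= min_frames:         # keep only sufficiently long runs
                out[i:j] = [True] * (j - i)
            i = j
        else:
            i += 1
    return out

# A lone cough-like frame is dropped; a sustained run survives.
cleaned = debounce([False, True, False, True, True, True, False], min_frames=2)
# cleaned == [False, False, False, True, True, True, False]
```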
Related components
- detect_turn_speech — paired component for "speaker has just finished" detection.
- denoise_audio_mp_senet — typical pre-VAD step on noisy inputs.
- transcribe_audio_faster_whisper, transcribe_audio_moonshine, transcribe_audio_parakeet, transcribe_audio_sensevoice — typical downstream ASR.
- filter, collect — typical control-flow consumers.
Versions
- 6a34b255 (latest, default) — linux/amd64 — automated release
- 182b7f25 — linux/amd64 — automated release