Detect Voice Activity (Silero VAD)

1 version

Detect voice activity using Silero VAD

Use This When

Building voice assistants that should only process speech segments to save compute
Implementing push-to-talk alternatives with automatic speech detection
Reducing ASR and LLM costs by filtering out silence and non-speech audio
Creating audio recording systems that skip silent intervals

What It Does

Detects the presence of human speech in audio frames using the lightweight Silero VAD model
Resamples audio to 16 kHz and returns a boolean indicating whether speech is present
Generates timestamped speech segments that locate speech within the recording
Runs an efficient neural model suitable for real-time streaming applications
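
The relationship between the per-frame boolean and the timestamped segments can be illustrated with a minimal sketch. It is not the component's implementation: the function name `frames_to_segments` and the 32 ms frame length used in the usage note are assumptions made for illustration only.

```python
from typing import List, Tuple


def frames_to_segments(flags: List[bool], frame_ms: float) -> List[Tuple[float, float]]:
    """Collapse per-frame speech flags into (start_s, end_s) segments.

    `flags[i]` is True when frame i was classified as speech.
    `frame_ms` is the frame length in milliseconds (hypothetical parameter).
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current speech run
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i  # a speech run begins
        elif not is_speech and start is not None:
            # the run ended at the start of frame i
            segments.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None:
        # speech continued to the end of the audio
        segments.append((start * frame_ms / 1000.0, len(flags) * frame_ms / 1000.0))
    return segments
```

Called as `frames_to_segments([False, True, True, False], frame_ms=32)`, this would yield a single segment covering the second and third frames, with times expressed in seconds.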

Works Best With

Audio inputs → this component → transcribe-audio to gate ASR on speech segments only
denoise-audio → this component → ASR, so that speech detection runs on cleaned audio
Voice assistant pipelines where VAD triggers wake word detection or command processing
Recording systems that need automatic silence removal or speech-only archival
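
Gating a downstream step on speech segments can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual wiring: `gate_on_speech` is a hypothetical helper, `transcribe` stands in for whatever callable wraps the ASR step, and the segments are assumed to be (start, end) pairs in seconds at the 16 kHz rate mentioned above.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

SAMPLE_RATE = 16_000  # the component resamples audio to 16 kHz


def gate_on_speech(
    samples: Sequence[float],
    segments: List[Tuple[float, float]],
    transcribe: Callable[[Sequence[float]], T],
    sample_rate: int = SAMPLE_RATE,
) -> List[T]:
    """Call `transcribe` only on speech slices; silence is never sent downstream."""
    results: List[T] = []
    for start_s, end_s in segments:
        # convert seconds to sample indices and clamp to the buffer
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(samples), int(end_s * sample_rate))
        if hi > lo:  # skip empty or out-of-range segments
            results.append(transcribe(samples[lo:hi]))
    return results
```

The same slicing can serve a speech-only archive: concatenating the slices instead of transcribing them gives a recording with the silent intervals removed.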

Caveats

Very low SNR or loud music can cause false-positive speech detections
The model is optimized for 16 kHz; quality degrades if the original recording has a lower sample rate
Frame-level decisions may miss utterances shorter than the analysis window
Background babble or TV dialogue may trigger false positives in multi-speaker environments
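
A common way to reason about the trade-off between spurious detections and missed short utterances is to post-process the segments. The sketch below is an assumption-laden illustration, not something this component is documented to do: `smooth_segments` and its two thresholds are hypothetical names and values.

```python
from typing import List, Tuple


def smooth_segments(
    segments: List[Tuple[float, float]],
    min_speech_s: float = 0.25,   # hypothetical threshold: shortest segment to keep
    min_silence_s: float = 0.10,  # hypothetical threshold: gaps shorter than this are bridged
) -> List[Tuple[float, float]]:
    """Merge segments separated by short gaps, then drop very short segments.

    `segments` must be sorted by start time, in seconds.
    """
    merged: List[Tuple[float, float]] = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_silence_s:
            # gap is too short to count as silence: extend the previous segment
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # remove blips that are too short to be plausible speech
    return [(s, e) for s, e in merged if e - s >= min_speech_s]
```

Raising `min_speech_s` suppresses more brief false positives but also discards genuine short utterances, which is the same tension the caveats above describe.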

Versions

182b7f25 latest default linux/amd64
Automated release
4/7/2026