
Detect Voice Activity (Silero VAD)
Silero VAD frame-level detector: downmixes each AudioFrame to mono at 16 kHz and emits speech spans as [{start: Double, end: Double}] in seconds relative to the frame. Use when downstream needs local speech intervals for trimming or routing.
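Because the spans are relative to each frame, a consumer that needs stream time has to add the duration of all earlier frames. The following is a minimal Python sketch of that bookkeeping. It assumes frames arrive in order with no gaps; `Span` and `AbsoluteTimeline` are hypothetical names for illustration and are not part of the component's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds


class AbsoluteTimeline:
    """Hypothetical helper: shifts frame-relative spans onto stream time."""

    def __init__(self) -> None:
        # Total duration, in seconds, of all frames seen so far.
        self.offset_s = 0.0

    def shift(self, spans: List[Span], num_samples: int, sample_rate: int) -> List[Span]:
        if sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
        # Add the running offset to every span from the current frame.
        shifted = [Span(s.start + self.offset_s, s.end + self.offset_s) for s in spans]
        # Advance the offset by this frame's duration for the next call.
        self.offset_s += num_samples / sample_rate
        return shifted
```

Call `shift` once per frame, passing the frame's per-channel sample count and its original sample rate. A sibling timestamp carried with each frame is the sturdier option when frames can be dropped or reordered.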
How it fits
Typical backends
Speech-present routing from file.
Pre-ASR gate after denoising.
Speech-turn aggregation into a turn detector.
Caveats
- I/O contract: Output spans are in seconds relative to the start of the input `AudioFrame`, not absolute stream time. Carry a sibling timestamp or accumulate a frame offset downstream if absolute times are needed.
- Hard constraint: An empty input `AudioFrame` returns an empty list without invoking the model. A non-positive sample rate aborts mid-tick with an error.
- I/O contract: Multi-channel input is mixed to mono before resampling.
- Accuracy: Silero is trained at 16 kHz. The component resamples internally; upsampling input recorded below 16 kHz hurts accuracy noticeably.
- Parameter interaction: `threshold`, `min_speech_duration_ms`, `min_silence_duration_ms`, and `speech_pad_ms` are re-read on every tick, so live tuning takes effect on the next call without restarting.
- Parameter interaction: `threshold` controls the per-frame speech probability cutoff. `min_speech_duration_ms` drops too-short spans after merging. `min_silence_duration_ms` controls how much silence is required to split an utterance into two spans. `speech_pad_ms` pads the final spans after all other processing.
- Fallback: `device` is captured once at startup; switching CPU/GPU requires a redeploy. A `device` value starting with `cuda` falls back to CPU with a warning when CUDA is unavailable.
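
The order in which the four tuning parameters apply can be sketched in Python as below. This is a simplified illustration of the sequence described in the caveats (threshold, merge across short silences, drop short spans, pad), not the Silero implementation or this component's code. The function name `probs_to_spans`, the `window_s` argument, and the default values are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def probs_to_spans(
    probs: Sequence[float],
    window_s: float,
    threshold: float = 0.5,
    min_speech_duration_ms: int = 250,
    min_silence_duration_ms: int = 100,
    speech_pad_ms: int = 30,
) -> List[Dict[str, float]]:
    """Turn per-window speech probabilities into padded speech spans."""
    # 1. threshold: collect runs of consecutive windows at or above the cutoff.
    raw: List[List[float]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            raw.append([start * window_s, i * window_s])
            start = None
    if start is not None:
        raw.append([start * window_s, len(probs) * window_s])

    # 2. min_silence_duration_ms: merge runs separated by too little silence.
    min_silence_s = min_silence_duration_ms / 1000.0
    merged: List[List[float]] = []
    for s, e in raw:
        if merged and s - merged[-1][1] < min_silence_s:
            merged[-1][1] = e
        else:
            merged.append([s, e])

    # 3. min_speech_duration_ms: drop spans that are still too short.
    min_speech_s = min_speech_duration_ms / 1000.0
    kept = [(s, e) for s, e in merged if e - s >= min_speech_s]

    # 4. speech_pad_ms: pad the survivors, clamped to the frame boundaries.
    pad_s = speech_pad_ms / 1000.0
    total_s = len(probs) * window_s
    return [
        {"start": max(0.0, s - pad_s), "end": min(total_s, e + pad_s)}
        for s, e in kept
    ]
```

In this sketch, padding is applied last and does not re-merge spans, so two neighboring spans can overlap when `speech_pad_ms` is large relative to `min_silence_duration_ms`.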
Versions
- 6a34b255 (default, latest, linux/amd64)
live-test prerelease 2026-05-24T22:06:38Z

