Detect Voice Activity (Silero VAD)

1 version · Detect

Silero VAD frame-level detector: downmixes each AudioFrame to mono, resamples it to 16 kHz, and emits speech spans as [{start: Double, end: Double}] in seconds relative to the start of the frame. Use it when downstream components need local speech intervals for trimming or routing.
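As an illustration of the trimming use case, the sketch below cuts a frame down to its speech spans. It is a minimal sketch, not the component's code: `trim_to_speech` is a hypothetical helper, and it assumes the frame is already mono, represented as a plain sequence of samples with a known sample rate.

```python
from typing import Dict, List, Sequence


def trim_to_speech(
    samples: Sequence[float],
    sample_rate: int,
    spans: List[Dict[str, float]],
) -> List[float]:
    """Keep only the samples that fall inside the given speech spans.

    Hypothetical helper: `spans` uses the frame-relative
    [{start, end}] shape (seconds) described above.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    kept: List[float] = []
    for span in spans:
        # Convert frame-relative seconds to sample indices, clamped to the frame.
        start = max(0, int(round(span["start"] * sample_rate)))
        end = min(len(samples), int(round(span["end"] * sample_rate)))
        if end > start:
            kept.extend(samples[start:end])
    return kept


# A 1-second mono frame at 16 kHz with speech from 0.25 s to 0.75 s.
frame = [0.0] * 16000
speech_only = trim_to_speech(frame, 16000, [{"start": 0.25, "end": 0.75}])
```

Because the spans are relative to the frame, the indices can be applied directly to that frame's samples without any stream-level bookkeeping.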

How it fits

Typical backends

Speech-present routing from file.

Pre-ASR gate after denoising.

Speech-turn aggregation into turn detector.
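Aggregating spans across frames, as in the speech-turn case, needs stream-level times, while the detector reports times relative to each frame. The sketch below shows one way to accumulate a frame offset downstream. It assumes frames arrive contiguously and that each frame's duration is known; `to_absolute` and its input shape are hypothetical, not part of the component.

```python
from typing import Dict, Iterable, List, Tuple

Span = Dict[str, float]


def to_absolute(frames: Iterable[Tuple[float, List[Span]]]) -> List[Span]:
    """Shift frame-relative spans onto a shared stream timeline.

    `frames` yields (frame_duration_seconds, frame_relative_spans).
    Assumption: frames are contiguous, so the offset of a frame is the
    sum of the durations of all frames before it.
    """
    offset = 0.0
    absolute: List[Span] = []
    for duration, spans in frames:
        for span in spans:
            absolute.append(
                {"start": offset + span["start"], "end": offset + span["end"]}
            )
        # Advance the offset even when the frame contained no speech.
        offset += duration
    return absolute


stream = [
    (1.0, [{"start": 0.25, "end": 0.75}]),
    (1.0, []),
    (1.0, [{"start": 0.0, "end": 0.5}]),
]
absolute_spans = to_absolute(stream)
```

If frames can be dropped or arrive with gaps, carrying a sibling timestamp per frame is the safer choice than summing durations.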

Caveats

I/O contract: Output spans are in seconds relative to the start of the input AudioFrame, NOT absolute stream time. Carry a sibling timestamp or accumulate a frame offset downstream if absolute times are needed.
Hard constraint: An empty input AudioFrame returns an empty list without invoking the model. A non-positive sample rate aborts the tick with an error.
I/O contract: Multi-channel input is mixed to mono before resampling.
Accuracy: Silero is trained at 16 kHz. The component resamples internally; input sampled below 16 kHz has to be upsampled, which hurts accuracy noticeably.
Parameter interaction: threshold, min_speech_duration_ms, min_silence_duration_ms, and speech_pad_ms are re-read on every tick, so live tuning takes effect on the next call without a restart.
Parameter interaction: threshold sets the per-frame speech probability cutoff. min_silence_duration_ms sets how much silence is required to split an utterance into two spans. min_speech_duration_ms drops too-short spans after merging. speech_pad_ms pads the final spans after all other processing.
Fallback: device is captured once at startup; switching between CPU and GPU requires a redeploy. A device value starting with `cuda` falls back to CPU with a warning when CUDA is unavailable.
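The order in which the four tuning parameters apply can be sketched as follows. This is a simplified illustration of the sequence described in the caveats (threshold, merge across short silences, drop short spans, pad), not Silero's actual implementation. The function `probs_to_spans`, the `window_ms` parameter, and the default values are assumptions made for the example.

```python
from typing import Dict, List

Span = Dict[str, float]


def probs_to_spans(
    probs: List[float],
    window_ms: int,                      # hypothetical: duration of one model window
    threshold: float = 0.5,              # illustrative defaults, not confirmed
    min_speech_duration_ms: int = 250,
    min_silence_duration_ms: int = 100,
    speech_pad_ms: int = 30,
) -> List[Span]:
    """Turn per-window speech probabilities into frame-relative spans (seconds)."""
    total_ms = len(probs) * window_ms

    # 1. threshold: collect raw runs of windows at or above the cutoff.
    raw: List[List[int]] = []
    run_start = None
    for i, p in enumerate(probs):
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            raw.append([run_start * window_ms, i * window_ms])
            run_start = None
    if run_start is not None:
        raw.append([run_start * window_ms, total_ms])

    # 2. min_silence_duration_ms: merge runs separated by too little silence.
    merged: List[List[int]] = []
    for start, end in raw:
        if merged and start - merged[-1][1] < min_silence_duration_ms:
            merged[-1][1] = end
        else:
            merged.append([start, end])

    # 3. min_speech_duration_ms: drop spans that are still too short after merging.
    kept = [m for m in merged if m[1] - m[0] >= min_speech_duration_ms]

    # 4. speech_pad_ms: pad last, clamped to the frame. This simple version
    #    does not resolve overlaps that padding may create between neighbours.
    return [
        {
            "start": max(0, start - speech_pad_ms) / 1000,
            "end": min(total_ms, end + speech_pad_ms) / 1000,
        }
        for start, end in kept
    ]


probs = [0.9] * 3 + [0.1] + [0.9] * 3 + [0.1] * 5 + [0.9] + [0.1] * 2
spans = probs_to_spans(
    probs,
    window_ms=100,
    min_speech_duration_ms=250,
    min_silence_duration_ms=200,
    speech_pad_ms=50,
)
# spans == [{"start": 0.0, "end": 0.75}]
```

In the example, the 100 ms dip is shorter than min_silence_duration_ms, so the first two runs merge into one span. The isolated 100 ms run is dropped by min_speech_duration_ms, and padding is applied only to the surviving span.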

Versions

6a34b255 · default · latest · linux/amd64
live-test prerelease 2026-05-24T22:06:38Z
9.5.2026