Detect Voice Activity (Silero VAD)

1 Version · Detect

Frame-level voice activity detection using the Silero VAD model. Each {{type:AudioFrame}} is downmixed to mono, resampled to 16 kHz, and fed through Silero VAD; the worker emits {{type:[{start:Double,end:Double}]}} carrying per-speech-span start and end times in SECONDS relative to the input frame's beginning.

How it fits

{{type:AudioFrame}} -> {{component:detect_voice_activity_silero_vad}} -> {{type:[{start:Double,end:Double}]}}
                          |
                          +-- weights pulled at startup from the bundled Silero VAD runtime; loaded onto {{param:device}}
                          +-- mono mixdown + resample to 16 kHz -> Silero inference gated by {{param:threshold}}
                          +-- spans merged across {{param:min_silence_duration_ms}}, trimmed by {{param:min_speech_duration_ms}}, padded by {{param:speech_pad_ms}}
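
The mixdown and resample step in the diagram can be sketched in plain Python. This is an illustration under assumptions, not the worker's code: the input layout (a list of per-sample channel tuples), the helper names, and the linear-interpolation resampler are all hypothetical stand-ins, and the order of the peak rescale relative to resampling is a guess. The peak rescale and the error message follow the Caveats section.

```python
from typing import List, Sequence

TARGET_RATE = 16_000  # Silero VAD operates at 16 kHz, per this page


def mix_to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of every sample (assumed layout: one tuple per sample)."""
    return [sum(frame) / len(frame) for frame in frames]


def limit_peak(audio: List[float]) -> List[float]:
    """Divide by max-abs only when the peak exceeds 1.0; otherwise leave untouched."""
    peak = max((abs(x) for x in audio), default=0.0)
    return [x / peak for x in audio] if peak > 1.0 else list(audio)


def resample_linear(audio: List[float], sample_rate: int,
                    target_rate: int = TARGET_RATE) -> List[float]:
    """Linear-interpolation resampler (a simple stand-in, not the worker's resampler)."""
    if not audio or sample_rate == target_rate:
        return list(audio)
    n_out = max(1, round(len(audio) * target_rate / sample_rate))
    step = sample_rate / target_rate  # input samples advanced per output sample
    last = len(audio) - 1
    out = []
    for i in range(n_out):
        pos = min(i * step, last)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(audio[lo] * (1.0 - frac) + audio[hi] * frac)
    return out


def preprocess(frames: Sequence[Sequence[float]], sample_rate: int) -> List[float]:
    """Mono mixdown, peak limit, then resample to 16 kHz."""
    # Checking the rate before the empty case is an assumption of this sketch.
    if sample_rate <= 0:
        raise ValueError("AudioFrame sample_rate must be positive")
    if not frames:
        return []  # empty input: nothing to feed the model
    return resample_linear(limit_peak(mix_to_mono(frames)), sample_rate)
```

A real deployment would normally use a band-limited resampler; linear interpolation is used here only to keep the sketch dependency-free.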

Pick this when downstream needs local speech intervals for trimming, indexing, routing, or visualisation. For full turn-end detection prefer {{component:detect_turn_speech_smart_turn_v3}}.
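
Because the emitted intervals are local to each frame, indexing across a whole stream needs an accumulated offset, as the Caveats section notes. A minimal sketch of that bookkeeping follows. The class name `FrameOffsetTracker` and its method are hypothetical, and it assumes frames arrive in order and are contiguous (no gaps or overlaps).

```python
from typing import Dict, List


class FrameOffsetTracker:
    """Turns frame-relative spans into stream-relative spans (hypothetical helper)."""

    def __init__(self, stream_start_s: float = 0.0) -> None:
        # Time, in seconds, at which the next frame is assumed to begin.
        self.offset_s = stream_start_s

    def to_absolute(self, spans: List[Dict[str, float]],
                    num_samples: int, sample_rate: int) -> List[Dict[str, float]]:
        """Shift one frame's spans by the running offset, then advance the offset."""
        if sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
        shifted = [
            {"start": s["start"] + self.offset_s, "end": s["end"] + self.offset_s}
            for s in spans
        ]
        # Advance by the duration of the frame that produced these spans.
        self.offset_s += num_samples / sample_rate
        return shifted
```

For example, two consecutive one-second frames that each report speech at 0.25 to 0.5 s map to 0.25 to 0.5 s and 1.25 to 1.5 s on the stream timeline. If frames can be dropped or overlap, a sibling timestamp per frame is the safer option.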

Typical backends

Speech-present routing: {{component:input_audio_file}} -> {{component:detect_voice_activity_silero_vad}} -> {{component:evaluate_expression}}.
Pre-ASR gate: {{component:denoise_audio_mp_senet}} -> {{component:detect_voice_activity_silero_vad}} -> {{component:transcribe_audio_faster_whisper}}.
Speech-turn aggregation: {{component:detect_voice_activity_silero_vad}} -> {{component:detect_turn_speech_smart_turn_v3}} -> {{component:collect_speech_turn}}.
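
How these components hand spans to one another is not shown on this page. As a standalone illustration of what a downstream trimming step might do with the output, the sketch below converts each span from seconds to sample indices and slices the original audio. The function name `crop_speech` and the flat mono sample list are assumptions. Note that the spans are in seconds, so they apply at the frame's own sample rate, not at the internal 16 kHz.

```python
import math
from typing import Dict, List, Sequence


def crop_speech(samples: Sequence[float], sample_rate: int,
                spans: List[Dict[str, float]]) -> List[List[float]]:
    """Return one chunk of samples per speech span (hypothetical helper)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    chunks = []
    for span in spans:
        # Round outward so no speech is clipped, then clamp to the frame bounds.
        first = max(0, math.floor(span["start"] * sample_rate))
        last = min(len(samples), math.ceil(span["end"] * sample_rate))
        if last > first:
            chunks.append(list(samples[first:last]))
    return chunks
```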

Caveats

Output spans are in SECONDS relative to the start of the input {{type:AudioFrame}}, NOT absolute stream time. {{type:AudioFrame}} has no absolute timestamp; carry a sibling timestamp or accumulate a frame offset downstream if absolute times are needed.
An empty input {{type:AudioFrame}} returns an empty list without invoking the model. A non-positive sample rate aborts mid-tick with the error "AudioFrame sample_rate must be positive".
Multi-channel input is mixed to mono before resampling.
Silero is trained at 16 kHz. The worker resamples internally; upsampling from a source rate below 16 kHz hurts accuracy noticeably.
Audio with peak absolute magnitude > 1.0 is rescaled by its max-abs before inference; pre-normalised audio is preserved as-is.
{{param:threshold}}, {{param:min_speech_duration_ms}}, {{param:min_silence_duration_ms}}, and {{param:speech_pad_ms}} are re-read on EVERY tick, so live tuning takes effect on the next call without restarting.
{{param:threshold}} controls Silero's per-frame speech probability cutoff (higher = stricter). {{param:min_speech_duration_ms}} drops too-short spans AFTER merging. {{param:min_silence_duration_ms}} controls how much silence is required to split a single utterance into two spans. {{param:speech_pad_ms}} adds padding around the final spans, applied AFTER everything else.
{{param:device}} is captured ONCE at startup; switching CPU / GPU requires a redeploy. When {{param:device}} starts with cuda but CUDA is unavailable, the worker falls back to CPU and only prints a warning to stderr; no error is raised.
On CPU the worker disables PyTorch's NNPACK backend, because Silero needs it off for predictable behaviour. On GPU this is irrelevant.
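
The ordering of the span parameters described above (merge across short silences, drop short spans, pad last) can be sketched as a pure function over raw speech intervals. This mirrors only the order stated on this page, not Silero's actual implementation, which works on per-frame probabilities. The function name, the raw-span input, and the clamping of padded spans to the frame bounds are assumptions; no defaults are given because this page does not state them.

```python
from typing import Dict, List


def postprocess_spans(raw_spans: List[Dict[str, float]], frame_duration_s: float, *,
                      min_speech_duration_ms: float,
                      min_silence_duration_ms: float,
                      speech_pad_ms: float) -> List[Dict[str, float]]:
    """Merge, filter, then pad spans given in seconds (hypothetical helper)."""
    min_speech_s = min_speech_duration_ms / 1000.0
    min_silence_s = min_silence_duration_ms / 1000.0
    pad_s = speech_pad_ms / 1000.0

    # 1. Merge neighbours whose separating silence is shorter than the minimum.
    merged: List[Dict[str, float]] = []
    for span in sorted(raw_spans, key=lambda s: s["start"]):
        if merged and span["start"] - merged[-1]["end"] < min_silence_s:
            merged[-1]["end"] = max(merged[-1]["end"], span["end"])
        else:
            merged.append({"start": span["start"], "end": span["end"]})

    # 2. Drop spans that are still too short AFTER merging.
    kept = [s for s in merged if s["end"] - s["start"] >= min_speech_s]

    # 3. Pad last, clamped to the frame (the clamp is an assumption of this sketch).
    return [
        {"start": max(0.0, s["start"] - pad_s),
         "end": min(frame_duration_s, s["end"] + pad_s)}
        for s in kept
    ]
```

One consequence of filtering after merging: two fragments that are each shorter than {{param:min_speech_duration_ms}} can still survive if the gap between them is shorter than {{param:min_silence_duration_ms}} and their merged length is long enough.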

Versions

6a34b255 · default · latest · linux/amd64
live-test prerelease 2026-05-24T22:06:38Z
09.05.2026