
Detect Voice Activity (Silero VAD)
Frame-level voice activity detection using the Silero VAD model. Each {{type:AudioFrame}} is downmixed to mono, resampled to 16 kHz, and fed through Silero VAD; the worker emits {{type:[{start:Double,end:Double}]}} carrying per-speech-span start and end times in SECONDS relative to the input frame's beginning.
How it fits
{{type:AudioFrame}} -> {{component:detect_voice_activity_silero_vad}} -> {{type:[{start:Double,end:Double}]}}
|
+-- weights pulled at startup from the bundled Silero VAD runtime; loaded onto {{param:device}}
+-- mono mixdown + resample to 16 kHz -> Silero inference gated by {{param:threshold}}
+-- spans merged across {{param:min_silence_duration_ms}}, trimmed by {{param:min_speech_duration_ms}}, padded by {{param:speech_pad_ms}}
Pick this when downstream needs local speech intervals for trimming, indexing, routing, or visualisation. For full turn-end detection prefer {{component:detect_turn_speech_smart_turn_v3}}.
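The span post-processing shown in the diagram can be sketched roughly as follows. This is a minimal illustration, assuming the order described on this page (merge across short silences, drop short spans, then pad). The function `postprocess_spans`, its tuple-based span layout, and the clamping to the frame bounds are hypothetical and are not taken from the worker's actual code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds, relative to the frame


def postprocess_spans(
    raw: List[Span],
    min_silence_duration_ms: float,
    min_speech_duration_ms: float,
    speech_pad_ms: float,
    frame_duration_s: float,
) -> List[Span]:
    """Merge, filter, and pad raw speech spans (illustrative only)."""
    min_silence = min_silence_duration_ms / 1000.0
    min_speech = min_speech_duration_ms / 1000.0
    pad = speech_pad_ms / 1000.0

    # 1. Merge neighbouring spans whose gap is shorter than min_silence.
    merged: List[Span] = []
    for start, end in sorted(raw):
        if merged and start - merged[-1][1] < min_silence:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # 2. Drop spans that are still too short AFTER merging.
    kept = [(s, e) for s, e in merged if e - s >= min_speech]

    # 3. Pad last. Clamping to the frame bounds is an assumption of this sketch.
    return [(max(0.0, s - pad), min(frame_duration_s, e + pad)) for s, e in kept]
```

With a 100 ms silence threshold, two raw spans separated by 50 ms collapse into one, while an isolated 20 ms blip is discarded once the minimum speech duration is applied.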
Typical backends
- Speech-present routing: {{component:input_audio_file}} -> {{component:detect_voice_activity_silero_vad}} -> {{component:evaluate_expression}}.
- Pre-ASR gate: {{component:denoise_audio_mp_senet}} -> {{component:detect_voice_activity_silero_vad}} -> {{component:transcribe_audio_faster_whisper}}.
- Speech-turn aggregation: {{component:detect_voice_activity_silero_vad}} -> {{component:detect_turn_speech_smart_turn_v3}} -> {{component:collect_speech_turn}}.
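As an illustration of how a downstream consumer might use the emitted spans (for example in the pre-ASR gate above), the sketch below slices a mono sample buffer by the speech intervals. The helper `extract_speech` and the plain-sequence buffer are assumptions made for this example. The in-memory layout of {{type:AudioFrame}} is not specified on this page.

```python
from typing import Dict, List, Sequence


def extract_speech(
    samples: Sequence[float],
    sample_rate: int,
    spans: List[Dict[str, float]],
) -> List[Sequence[float]]:
    """Return one chunk of samples per speech span (illustrative only)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    chunks = []
    for span in spans:
        # Spans are in seconds relative to the start of this buffer.
        lo = max(0, int(round(span["start"] * sample_rate)))
        hi = min(len(samples), int(round(span["end"] * sample_rate)))
        if hi > lo:  # skip spans that fall outside the buffer
            chunks.append(samples[lo:hi])
    return chunks
```

Spans that extend past the end of the buffer are truncated rather than rejected, which keeps the helper tolerant of padding added by {{param:speech_pad_ms}}.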
Caveats
- Output spans are in SECONDS relative to the start of the input {{type:AudioFrame}}, NOT absolute stream time. {{type:AudioFrame}} has no absolute timestamp; carry a sibling timestamp or accumulate a frame offset downstream if absolute times are needed.
- An empty input {{type:AudioFrame}} returns an empty list without invoking the model. A non-positive sample rate aborts mid-tick with an "AudioFrame sample_rate must be positive" error.
- Multi-channel input is mixed to mono before resampling.
- Silero is trained at 16 kHz. The worker resamples internally; input sampled below 16 kHz has to be upsampled, which hurts accuracy noticeably.
- Audio with peak absolute magnitude > 1.0 is rescaled by its max-abs before inference; pre-normalised audio is preserved as-is.
- {{param:threshold}}, {{param:min_speech_duration_ms}}, {{param:min_silence_duration_ms}}, and {{param:speech_pad_ms}} are re-read on EVERY tick, so live tuning takes effect on the next call without restarting.
- {{param:threshold}} controls Silero's per-frame speech probability cutoff (higher = stricter). {{param:min_speech_duration_ms}} drops too-short spans AFTER merging. {{param:min_silence_duration_ms}} controls how much silence is required to split a single utterance into two spans. {{param:speech_pad_ms}} adds padding around the final spans, applied AFTER everything else.
- {{param:device}} is captured ONCE at startup; switching CPU / GPU requires a redeploy. A {{param:device}} value starting with "cuda" falls back to CPU, logging only a stderr warning, when CUDA is unavailable.
- On CPU the worker disables PyTorch's NNPACK backend (Silero requires that for predictable behaviour). On GPU this is irrelevant.
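The first caveat above suggests accumulating a frame offset when absolute times are needed. A minimal sketch of that bookkeeping follows, assuming frames arrive contiguously and in order and that each frame's sample count and sample rate are available to the consumer. `OffsetTracker` is a hypothetical helper, not part of this component.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OffsetTracker:
    """Tracks elapsed stream time across consecutive frames (illustrative)."""

    offset_s: float = 0.0

    def to_absolute(
        self,
        spans: List[Dict[str, float]],
        num_samples: int,
        sample_rate: int,
    ) -> List[Dict[str, float]]:
        if sample_rate <= 0:
            raise ValueError("sample_rate must be positive")

        # Shift frame-relative spans by the time consumed so far.
        absolute = [
            {"start": s["start"] + self.offset_s, "end": s["end"] + self.offset_s}
            for s in spans
        ]
        # Advance the offset by this frame's duration for the next call.
        self.offset_s += num_samples / sample_rate
        return absolute
```

The tracker must see every frame, including frames that yield no spans, otherwise the accumulated offset drifts from the real stream position.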
Versions
- 6a34b255 default latest linux/amd64
  live-test prerelease 2026-05-24T22:06:38Z

