Detect Voice Activity (Silero VAD)

1 Version · Detect

Silero VAD frame-level detector: downmixes each AudioFrame to mono at 16 kHz and emits speech spans as [{start: Double, end: Double}] in seconds relative to the frame. Use when downstream needs local speech intervals for trimming or routing.
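As an illustration of the trimming use case, here is a minimal sketch that keeps only the samples inside the emitted spans. It assumes the frame is already available as a flat mono sample sequence with a known sample rate; `trim_to_speech` and its arguments are hypothetical names, not part of the component.

```python
from typing import Dict, List, Sequence


def trim_to_speech(samples: Sequence[float], sample_rate: int,
                   spans: List[Dict[str, float]]) -> List[float]:
    """Keep only the samples that fall inside the detected speech spans."""
    kept: List[float] = []
    for span in spans:
        # Spans are in seconds relative to the frame start, so multiplying
        # by the sample rate gives indices into this frame's samples.
        start = max(0, int(round(span["start"] * sample_rate)))
        end = min(len(samples), int(round(span["end"] * sample_rate)))
        if end > start:
            kept.extend(samples[start:end])
    return kept


# A 1-second mono frame at 8 samples per second (tiny rate, for readability).
frame = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
spans = [{"start": 0.25, "end": 0.5}, {"start": 0.75, "end": 1.0}]
trimmed = trim_to_speech(frame, 8, spans)
```

The same index conversion works for routing: instead of concatenating the slices, each slice can be forwarded separately.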

How it fits

AudioFrame → Detect Voice Activity… → [{start: Double, end:…

Typical backends

Speech-turn aggregation feeding a turn detector.
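Because spans are reported relative to each frame, a downstream aggregator typically needs a running offset before it can place them on a common timeline. The sketch below shows one way to do that, assuming the consumer knows each frame's duration in seconds; `to_absolute` and the `(duration, spans)` tuple shape are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Span = Dict[str, float]


def to_absolute(frames: Iterable[Tuple[float, List[Span]]]) -> List[Span]:
    """Shift frame-relative spans onto a single stream timeline.

    `frames` yields (frame_duration_seconds, frame_relative_spans) in order.
    """
    offset = 0.0  # start time of the current frame within the stream
    absolute: List[Span] = []
    for duration, spans in frames:
        for span in spans:
            absolute.append({
                "start": offset + span["start"],
                "end": offset + span["end"],
            })
        # Advance by the frame length even when the frame held no speech.
        offset += duration
    return absolute


stream = [
    (0.5, [{"start": 0.25, "end": 0.5}]),
    (0.5, []),
    (0.5, [{"start": 0.0, "end": 0.25}]),
]
absolute_spans = to_absolute(stream)
```

This sketch does not join a span that ends at a frame boundary with one that starts at the beginning of the next frame; a turn detector would likely want to merge those.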

Caveats

  • I/O contract: Output spans are in seconds relative to the start of the input AudioFrame, NOT absolute stream time. Carry a sibling timestamp or accumulate a frame offset downstream if absolute times are needed.
  • Hard constraint: An empty input AudioFrame returns an empty list without invoking the model. A non-positive sample rate aborts mid-tick with an error.
  • I/O contract: Multi-channel input is mixed to mono before resampling.
  • Accuracy: Silero is trained at 16 kHz. The component resamples internally; input below 16 kHz has to be upsampled, which hurts accuracy noticeably.
  • Parameter interaction: threshold, min_speech_duration_ms, min_silence_duration_ms, and speech_pad_ms are re-read on every tick, so live tuning takes effect on the next call without restarting.
  • Parameter interaction: threshold sets the per-frame speech-probability cutoff. min_speech_duration_ms drops too-short spans after merging. min_silence_duration_ms sets how much silence is required to split an utterance into two spans. speech_pad_ms pads the final spans after all other processing.
  • Fallback: device is captured once at startup; switching CPU/GPU requires a redeploy. A device value starting with `cuda` falls back to CPU and logs a warning when CUDA is unavailable.
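The order in which the four tuning parameters apply can be sketched as a small post-processing pass over per-window speech probabilities. This is a simplified illustration under stated assumptions, not the component's or Silero's actual implementation: `spans_from_probs`, the `probs` list, and the fixed `window_s` window length are hypothetical, and the sketch does not resolve overlaps that padding may create between neighbouring spans.

```python
from typing import Dict, List


def spans_from_probs(probs: List[float], window_s: float, *,
                     threshold: float,
                     min_speech_duration_ms: float,
                     min_silence_duration_ms: float,
                     speech_pad_ms: float) -> List[Dict[str, float]]:
    total = len(probs) * window_s

    # 1. threshold: collect runs of consecutive windows at or above the cutoff.
    raw: List[List[float]] = []
    run_start = None
    for i, p in enumerate(probs):
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            raw.append([run_start * window_s, i * window_s])
            run_start = None
    if run_start is not None:
        raw.append([run_start * window_s, total])

    # 2. min_silence_duration_ms: a gap shorter than this does not split
    #    an utterance, so the neighbouring runs are merged.
    merged: List[List[float]] = []
    for start, end in raw:
        if merged and (start - merged[-1][1]) * 1000 < min_silence_duration_ms:
            merged[-1][1] = end
        else:
            merged.append([start, end])

    # 3. min_speech_duration_ms: drop spans that are still too short.
    kept = [m for m in merged if (m[1] - m[0]) * 1000 >= min_speech_duration_ms]

    # 4. speech_pad_ms: pad the surviving spans, clamped to the frame.
    pad = speech_pad_ms / 1000
    return [{"start": max(0.0, s - pad), "end": min(total, e + pad)}
            for s, e in kept]


probs = [0.1, 0.9, 0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.7, 0.1]
result = spans_from_probs(probs, 0.25, threshold=0.5,
                          min_speech_duration_ms=500,
                          min_silence_duration_ms=300,
                          speech_pad_ms=125)
```

In this example the 250 ms dip at the fourth window is shorter than min_silence_duration_ms, so the first two runs merge into one span; the isolated 250 ms run near the end is dropped by min_speech_duration_ms; padding is applied last, giving a single span from 0.125 s to 1.375 s.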

Versions

  • 6a34b255 · default · latest · linux/amd64

    live-test prerelease 2026-05-24T22:06:38Z