Transcribe Audio (SenseVoice)
Summary: CTC-based speech-to-text that cannot hallucinate on silence, ideal for voice assistants and real-time pipelines.
Use This When
- You need transcription that never produces phantom text on silence or noise
- You are building a voice assistant where hallucinated phrases such as "Thank you" or "Bye" break downstream logic
- You need multilingual support (English, Chinese, Cantonese, Japanese, Korean, and more; 50+ languages in total)
What It Does
- Converts audio frames to text using FunAudioLLM SenseVoice (234M params, CTC-based)
- Non-autoregressive architecture: maps audio frames directly to text without generative decoding
- Returns empty string on silence instead of hallucinating words
- Processes 10 seconds of audio in ~70 ms on GPU
- Detects speech events (laughter, applause, music) and speaker emotions internally
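The non-autoregressive behavior described above can be illustrated with a minimal greedy CTC decoding sketch. This is not SenseVoice's actual decoder: the toy vocabulary, the blank index of 0, and the function name `ctc_greedy_decode` are assumptions made for illustration. It shows why frames that are all "blank" collapse to an empty string, with no generative step that could invent words.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token (illustrative convention)

def ctc_greedy_decode(frame_token_ids, vocab, blank=BLANK):
    """Collapse consecutive repeated frame labels, then drop blanks."""
    collapsed = [tok for tok, _ in groupby(frame_token_ids)]
    return "".join(vocab[tok] for tok in collapsed if tok != blank)

# Hypothetical toy vocabulary; a real model uses thousands of subword tokens.
VOCAB = {0: "", 1: "h", 2: "i"}

# One predicted label per audio frame (the argmax of the per-frame output).
speech_frames = [0, 1, 1, 0, 2, 2, 2, 0]
silence_frames = [0] * 8  # the model predicts blank for every frame

print(repr(ctc_greedy_decode(speech_frames, VOCAB)))
print(repr(ctc_greedy_decode(silence_frames, VOCAB)))
```

Because each frame is labeled independently and blanks are simply discarded, the output length is bounded by what the audio frames support, which is the property the bullets above rely on.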
Works Best With
- Voice assistant pipelines (VAD → this → LLM → TTS)
- Any pipeline where silence hallucinations cause cascading errors
- detect-voice-activity, which supplies pre-segmented audio to this step
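As a rough sketch of how an empty transcript can gate the rest of a voice assistant pipeline, the snippet below skips the LLM and TTS stages when transcription returns an empty string. Everything here is hypothetical: `handle_segment`, `fake_transcribe`, and `fake_respond` are stand-ins for the real stages, not part of this component's API.

```python
from typing import Callable, Optional

def handle_segment(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
) -> Optional[str]:
    """Run one VAD segment through STT, calling the responder only on real text."""
    text = transcribe(audio).strip()
    if not text:
        # Silence or noise: nothing is forwarded downstream.
        return None
    return respond(text)

# Hypothetical stubs standing in for the transcription and LLM stages.
def fake_transcribe(audio: bytes) -> str:
    # Treat an all-zero buffer as silence for the purpose of this demo.
    return "" if not any(audio) else "turn on the lights"

def fake_respond(text: str) -> str:
    return f"OK: {text}"

print(handle_segment(b"\x00" * 4, fake_transcribe, fake_respond))
print(handle_segment(b"\x01\x02", fake_transcribe, fake_respond))
```

With a transcriber that returns an empty string on silence, a single emptiness check is enough to stop phantom phrases from reaching the LLM.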
Caveats
- Requires funasr package which pulls in torch and torchaudio
- Model auto-downloads from Hugging Face on first run (~470 MB)
- Slightly lower accuracy than Whisper large-v3 on English benchmarks; the tradeoff is that it does not hallucinate text on silence
Versions
- 7abdbc5e (latest, default, linux/amd64)
Automated release