Transcribe Audio (SenseVoice)

Summary: CTC-based speech-to-text that cannot hallucinate on silence, ideal for voice assistants and real-time pipelines.

Use This When

You need transcription that never produces phantom text on silence or noise
You are building voice assistants where false "Thank you" or "Bye" hallucinations break downstream logic
You need multilingual support (English, Chinese, Japanese, Korean, and Cantonese, among 50+ languages)

What It Does

Converts audio frames to text using FunAudioLLM SenseVoice (234M params, CTC-based)
Non-autoregressive architecture: maps audio frames directly to text without generative decoding
Returns empty string on silence instead of hallucinating words
Processes 10 seconds of audio in ~70ms on GPU
Detects speech events (laughter, applause, music) and speaker emotions internally
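
The silence behavior follows from how CTC output is decoded. The sketch below is a toy greedy CTC decoder, not SenseVoice's actual code: the three-symbol vocabulary, the blank index of 0, and the per-frame scores are all made up for illustration. Each frame is labeled independently, consecutive repeats are merged, and blanks are dropped, so a run of frames that all score highest on blank collapses to an empty string.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol in this toy example


def ctc_greedy_decode(frame_scores, vocab):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks."""
    # Pick the highest-scoring label for every frame independently.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # Merge consecutive duplicates, then discard blank labels.
    collapsed = [label for label, _ in groupby(best) if label != BLANK]
    return "".join(vocab[i] for i in collapsed)


VOCAB = ["", "h", "i"]  # hypothetical vocabulary; index 0 is the blank

speech = [
    [0.10, 0.80, 0.10],  # "h"
    [0.10, 0.70, 0.20],  # "h" again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "i"
]
silence = [[0.95, 0.03, 0.02]] * 5  # every frame scores highest on blank

print(repr(ctc_greedy_decode(speech, VOCAB)))   # 'hi'
print(repr(ctc_greedy_decode(silence, VOCAB)))  # ''
```

Because no step generates text conditioned on previously emitted text, there is no language-model-style continuation to fill silent stretches with plausible words.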

Works Best With

Voice assistant pipelines (VAD → this → LLM → TTS)
Any pipeline where silence hallucinations cause cascading errors
Pair with detect-voice-activity for pre-segmented audio
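
As a rough sketch of how an empty transcript can gate the rest of a VAD → transcription → LLM → TTS pipeline: the function names below (run_turns, fake_transcribe, fake_respond) are hypothetical stand-ins, not part of this tool or of funasr. The only behavior taken from the description above is that silence yields an empty string.

```python
from typing import Callable, Iterable, List


def run_turns(
    segments: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
) -> List[str]:
    """Send only non-empty transcripts on to the downstream stage."""
    replies = []
    for segment in segments:
        text = transcribe(segment).strip()
        if not text:
            continue  # silence or noise: skip the LLM and TTS stages entirely
        replies.append(respond(text))
    return replies


# Stand-ins for the real stages, used only to make the sketch runnable.
def fake_transcribe(segment: bytes) -> str:
    return segment.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"echo: {text}"


replies = run_turns(
    [b"hello", b"", b"   ", b"turn on the light"],
    fake_transcribe,
    fake_respond,
)
print(replies)  # ['echo: hello', 'echo: turn on the light']
```

With a transcriber that can emit phantom phrases on silence, the empty-string check would not be enough and the pipeline would need extra filtering.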

Caveats

Requires the funasr package, which pulls in torch and torchaudio
Model auto-downloads from Hugging Face on first run (~470MB)
Slightly lower accuracy than Whisper large-v3 on English benchmarks, in exchange for not hallucinating on silence

Versions

7abdbc5e (latest, default, linux/amd64)
Automated release
4/8/2026