Transcribe Audio (SenseVoice)

1 version

Summary: CTC-based speech-to-text that cannot hallucinate on silence, ideal for voice assistants and real-time pipelines.

Use This When

  • You need transcription that never produces phantom text on silence or noise
  • Building voice assistants where false "Thank you" or "Bye" hallucinations break downstream logic
  • You need multilingual support (50+ languages, including English, Chinese, Japanese, Korean, and Cantonese)

What It Does

  • Converts audio frames to text using FunAudioLLM SenseVoice (234M params, CTC-based)
  • Non-autoregressive architecture: maps audio frames directly to text without generative decoding
  • Returns empty string on silence instead of hallucinating words
  • Processes 10 seconds of audio in ~70ms on GPU
  • Detects speech events (laughter, applause, music) and speaker emotions internally
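
To illustrate why a CTC-style decoder can return an empty string on silence, here is a minimal sketch of greedy CTC decoding. It is not SenseVoice's actual decoder: the three-symbol vocabulary, the blank index of 0, and the per-frame scores are all made up for illustration. The idea is that each frame votes for one symbol (or the blank), repeated votes are merged, and blanks are dropped, so frames dominated by the blank produce no text at all.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol in this toy vocabulary


def ctc_greedy_decode(frame_scores, vocab):
    """Greedy CTC decoding: best symbol per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring symbol index for every audio frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Collapse runs of the same index (e.g. 1, 1, 0, 2 becomes 1, 0, 2).
    collapsed = [index for index, _ in groupby(best)]
    # 3. Remove blanks; whatever is left is the transcript.
    return "".join(vocab[index] for index in collapsed if index != BLANK)


# Toy vocabulary: index 0 is the blank, then two letters.
vocab = ["", "h", "i"]

# Silence: every frame scores the blank highest, so nothing is emitted.
silence = [[0.9, 0.05, 0.05]] * 5

# Speech: two "h" frames, one blank frame, one "i" frame.
speech = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]

print(repr(ctc_greedy_decode(silence, vocab)))
print(repr(ctc_greedy_decode(speech, vocab)))
```

Because the output is built only from per-frame decisions, there is no generative step that could continue a sentence the audio never contained.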

Works Best With

  • Voice assistant pipelines (VAD → this → LLM → TTS)
  • Any pipeline where silence hallucinations cause cascading errors
  • Pair with detect-voice-activity for pre-segmented audio
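
The VAD → transcription → LLM → TTS flow above can be sketched as a small loop. The function and parameter names (`run_turns`, `transcribe`, `respond`, `speak`) are hypothetical placeholders, not part of this tool's interface. The point is the guard in the middle: because silence comes back as an empty string, an empty transcript can simply be skipped before it reaches the LLM.

```python
from typing import Callable, Iterable, List


def run_turns(
    segments: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    speak: Callable[[str], str],
) -> List[str]:
    """Run pre-segmented audio through STT, LLM, and TTS stages in order.

    All three stage callables are stand-ins supplied by the caller.
    """
    outputs = []
    for segment in segments:
        text = transcribe(segment).strip()
        if not text:
            # Empty transcript means silence or noise: skip the LLM and TTS stages.
            continue
        outputs.append(speak(respond(text)))
    return outputs
```

With stub stages, a segment that transcribes to an empty string produces no downstream call, so a silent segment never triggers a reply.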

Caveats

  • Requires funasr package which pulls in torch and torchaudio
  • Model auto-downloads from Hugging Face on first run (~470MB)
  • Slightly lower accuracy than Whisper large-v3 on English benchmarks; the tradeoff is that it does not hallucinate text on silence
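
For readers who want to check the dependency and download caveats outside this tool, the sketch below calls the upstream funasr library directly. How this tool wraps the model is not shown here, so treat every detail as an assumption: the model id `FunAudioLLM/SenseVoiceSmall`, the `hub="hf"` setting, and the `generate` arguments follow the upstream SenseVoice usage notes and may differ between funasr versions. The import is deferred so that the heavy torch dependency loads only when a model is actually requested.

```python
import importlib.util
from functools import lru_cache


def funasr_available() -> bool:
    """Return True if the funasr package is installed in this environment."""
    return importlib.util.find_spec("funasr") is not None


@lru_cache(maxsize=1)
def load_model(device: str = "cpu"):
    """Load SenseVoice once; the first call downloads the weights."""
    # Imported lazily because funasr pulls in torch and torchaudio.
    from funasr import AutoModel

    return AutoModel(
        model="FunAudioLLM/SenseVoiceSmall",  # assumed Hugging Face model id
        hub="hf",                             # assumed: fetch from Hugging Face
        trust_remote_code=True,               # may be unnecessary on newer versions
        device=device,
    )


def transcribe_file(path: str, device: str = "cpu") -> str:
    """Transcribe one audio file and strip the model's rich-output tags."""
    from funasr.utils.postprocess_utils import rich_transcription_postprocess

    result = load_model(device).generate(input=path, cache={}, language="auto", use_itn=True)
    return rich_transcription_postprocess(result[0]["text"])
```

Calling `funasr_available()` before `transcribe_file(...)` lets a pipeline fail early with a clear message instead of an import error partway through a run.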

Versions

  • 7abdbc5e (latest, default, linux/amd64)

    Automated release