Transcribe Audio (Parakeet)
Summary: High-accuracy speech-to-text using NVIDIA Parakeet-TDT with transducer-based decoding and minimal silence hallucination.
Use This When
- Accuracy is the top priority (6.05% WER on Open ASR Leaderboard)
- You need multilingual European language support (v3: 25 languages)
- You want near-zero hallucination on silence (v3 trained on 36,000 hours of non-speech data)
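For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration and not the leaderboard's scoring code: the function name `word_error_rate` is an assumption, and real evaluations normally apply text normalization (punctuation, casing, number formats) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

In this toy pair, two of the nine reference words are substituted, so the score is 2/9.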
What It Does
- Converts audio frames to text using NVIDIA Parakeet-TDT (600M params, FastConformer + TDT decoder)
- Transducer architecture can emit blank tokens for silence instead of hallucinating
- 3386x realtime throughput on GPU (at that rate, one hour of audio takes roughly one second of compute)
- v2 is English-only with best accuracy; v3 adds multilingual support and better silence handling
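The blank-token behaviour described above can be illustrated with a toy greedy decoder. This is a simplified sketch, not NeMo's implementation: the per-frame `(token, duration)` table stands in for a real model's predictions, the names `BLANK` and `toy_tdt_greedy_decode` are assumptions, and a real TDT decoder also conditions on previously emitted tokens and can emit more than one token per frame.

```python
BLANK = "<blank>"  # placeholder symbol for "no output at this frame"


def toy_tdt_greedy_decode(frame_predictions):
    """Walk the frames, emitting non-blank tokens and skipping ahead by duration.

    frame_predictions[t] is the (token, duration) pair that a hypothetical
    model would predict if the decoder were positioned at frame t.
    """
    words = []
    t = 0
    while t < len(frame_predictions):
        token, duration = frame_predictions[t]
        if token != BLANK:
            words.append(token)
        # Always advance at least one frame so the loop terminates.
        t += max(1, duration)
    return " ".join(words)


frames = [
    ("hello", 2),  # frame 0: emit "hello", jump to frame 2
    (BLANK, 1),    # frame 1: never visited
    (BLANK, 3),    # frame 2: silence, jump to frame 5
    (BLANK, 1),    # frame 3: never visited
    (BLANK, 1),    # frame 4: never visited
    ("world", 1),  # frame 5: emit "world", move to frame 6
    (BLANK, 2),    # frame 6: trailing silence, jump past the end
    (BLANK, 1),    # frame 7: never visited
]
print(toy_tdt_greedy_decode(frames))
print(repr(toy_tdt_greedy_decode([(BLANK, 4)] * 4)))
```

The second call shows the point of the design: a stretch of pure silence decodes to an empty string, because the decoder is allowed to output nothing rather than being forced to produce text.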
Works Best With
- High-fidelity transcription pipelines where accuracy matters most
- Pair with detect-voice-activity for pre-segmented audio
- Meeting transcription, dictation, and professional captioning workflows
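As a rough sketch of the pre-segmentation idea, the snippet below cuts an audio buffer into speech-only chunks before transcription. The output format of detect-voice-activity is not specified here, so the `(start_seconds, end_seconds)` tuples, the `slice_speech_segments` helper, and the padding value are all assumptions made for illustration.

```python
def slice_speech_segments(samples, sample_rate, segments, pad_seconds=0.1):
    """Return one chunk of samples per (start_seconds, end_seconds) segment.

    Each segment is widened by pad_seconds on both sides, then clamped to
    the bounds of the buffer, so that word onsets and endings are not cut.
    """
    chunks = []
    for start_s, end_s in segments:
        start = max(0, round((start_s - pad_seconds) * sample_rate))
        end = min(len(samples), round((end_s + pad_seconds) * sample_rate))
        if end > start:
            chunks.append(samples[start:end])
    return chunks


sample_rate = 16_000
samples = [0.0] * (3 * sample_rate)        # 3 seconds of placeholder audio
segments = [(0.5, 1.0), (2.0, 2.5)]        # hypothetical VAD output
chunks = slice_speech_segments(samples, sample_rate, segments)
print([len(chunk) / sample_rate for chunk in chunks])
```

Each chunk would then be passed to the transcriber on its own, so the model never sees the long silent gaps between segments.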
Caveats
- Large dependency footprint: nemo_toolkit pulls pytorch-lightning, hydra-core, omegaconf, sentencepiece
- Model weights are 2.47 GB
- v2 can produce minor filler words ("Yeah", "Mm-hmm") on silence; v3 handles this better
- Licensed under CC-BY-4.0 (attribution required)
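One possible mitigation for the v2 filler-word caveat is a post-filter that drops transcripts consisting only of fillers. This is a sketch under assumptions, not part of the component: the helper names are hypothetical, and the filler set contains only the two examples named above, so extend it to suit your data.

```python
import re

# Only the fillers named in the caveat; extend as needed.
FILLERS = {"yeah", "mm-hmm"}


def is_filler_only(text: str) -> bool:
    """True when every word in the transcript is a known filler."""
    words = re.findall(r"[a-z'-]+", text.lower())
    return bool(words) and all(word in FILLERS for word in words)


def drop_filler_segments(transcripts):
    """Keep transcripts that contain at least one non-filler word."""
    return [t for t in transcripts if not is_filler_only(t)]


segments = ["Yeah.", "Let's review the budget.", "Mm-hmm", "Yeah, I agree."]
print(drop_filler_segments(segments))
```

A filter like this also removes genuine one-word replies such as a spoken "Yeah", so it is safest when applied only to segments that a voice activity detector has already marked as silence.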
Versions
- bd174c31 (latest, default, linux/amd64)
Automated release