Transcribe Audio (Parakeet)

Summary: High-accuracy speech-to-text using NVIDIA Parakeet-TDT with transducer-based decoding and minimal silence hallucination.

Use This When

Accuracy is the top priority (6.05% word error rate, WER, on the Open ASR Leaderboard)
You need multilingual European language support (v3: 25 languages)
You want near-zero hallucination on silence (v3 trained on 36,000 hours of non-speech data)
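
To make the WER figure above concrete, here is a minimal sketch of how word error rate is commonly computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name word_error_rate is illustrative and not part of this component. The sketch also skips the text normalization (casing, punctuation) that benchmark scoring usually applies, so its numbers are not directly comparable to leaderboard results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 6.05% corresponds to about six word-level errors per hundred reference words, averaged over the benchmark's test sets.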

What It Does

Converts audio frames to text using NVIDIA Parakeet-TDT (600M parameters, FastConformer encoder + TDT decoder)
Transducer architecture can emit blank tokens for silence instead of hallucinating
About 3386x real-time throughput on GPU (roughly one hour of audio per second of compute)
v2 is English-only with best accuracy; v3 adds multilingual support and better silence handling
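
The blank-token behavior described above can be illustrated with a toy greedy decoder. This is a simplified sketch, not the NeMo implementation: it assumes the per-frame (token, duration) predictions are already given, whereas a real TDT decoder computes them with a joint network that also depends on the tokens emitted so far. The names BLANK and greedy_tdt_decode are hypothetical.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical marker for "emit nothing at this frame"


def greedy_tdt_decode(predictions: List[Tuple[str, int]]) -> List[str]:
    """Walk a list of toy per-frame (token, duration) predictions.

    A blank token emits nothing, which is how silence can map to empty
    output. The duration says how many frames to skip before looking at
    the next prediction.
    """
    tokens: List[str] = []
    t = 0
    while t < len(predictions):
        token, duration = predictions[t]
        if token != BLANK:
            tokens.append(token)
        # Toy simplification: always advance at least one frame so the
        # loop is guaranteed to terminate.
        t += max(1, duration)
    return tokens


if __name__ == "__main__":
    frames = [
        ("hel", 1),    # frame 0: emit "hel", move to frame 1
        ("lo", 2),     # frame 1: emit "lo", jump to frame 3
        ("??", 1),     # frame 2: skipped by the jump above
        (BLANK, 3),    # frame 3: silence, emit nothing, jump to frame 6
        ("??", 1),     # frame 4: skipped
        ("??", 1),     # frame 5: skipped
        ("world", 1),  # frame 6: emit "world"
    ]
    print(greedy_tdt_decode(frames))
```

Because a run of silent frames can be covered by a single blank prediction with a long duration, the decoder has an explicit way to output nothing rather than being forced to produce text.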

Works Best With

High-fidelity transcription pipelines where accuracy matters most
Pair with detect-voice-activity for pre-segmented audio
Meeting transcription, dictation, and professional captioning workflows
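
As a rough sketch of the pairing with detect-voice-activity, the helper below slices an audio buffer by speech segments and transcribes each slice. The interfaces are assumptions made for illustration: segments are taken to be (start_seconds, end_seconds) pairs, and transcribe stands in for whatever callable wraps this component. Neither reflects a documented API.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, float]  # assumed shape: (start_seconds, end_seconds)


def transcribe_segments(
    samples: Sequence[float],
    sample_rate: int,
    segments: Sequence[Segment],
    transcribe: Callable[[Sequence[float], int], str],
) -> List[Tuple[float, float, str]]:
    """Transcribe only the speech regions reported by a VAD step."""
    results: List[Tuple[float, float, str]] = []
    for start_s, end_s in segments:
        # Convert seconds to sample indices and clamp to the buffer.
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(samples), int(end_s * sample_rate))
        if hi <= lo:
            continue  # empty or out-of-range segment
        text = transcribe(samples[lo:hi], sample_rate).strip()
        if text:
            results.append((start_s, end_s, text))
    return results


if __name__ == "__main__":
    # Stand-in transcriber that reports the slice length.
    fake = lambda chunk, sr: f"{len(chunk)} samples"
    audio = [0.0] * (16000 * 3)  # three seconds at 16 kHz
    print(transcribe_segments(audio, 16000, [(0.5, 1.5), (2.5, 4.0)], fake))
```

Skipping non-speech regions before transcription reduces the amount of silence the model sees, and keeping the segment timestamps lets the text be aligned back to the original recording.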

Caveats

Large dependency footprint: nemo_toolkit pulls in pytorch-lightning, hydra-core, omegaconf, and sentencepiece
Model weights are 2.47 GB
v2 can produce minor filler words ("Yeah", "Mm-hmm") on silence; v3 handles this better
CC-BY-4.0 license
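
Given the footprint noted above, a small preflight check can be useful before deploying. The sketch below uses only the Python standard library to report which of the listed distributions are installed and whether the target disk has room for the weights. The function names are illustrative, and the 2.47 GB figure is treated as GiB here as a slightly conservative assumption.

```python
import shutil
from importlib import metadata
from typing import Dict, Optional, Sequence

# Distribution names as listed in the caveats above.
REQUIRED_DISTRIBUTIONS = [
    "nemo_toolkit",
    "pytorch-lightning",
    "hydra-core",
    "omegaconf",
    "sentencepiece",
]

# Assumption: interpret 2.47 GB as GiB so the check errs on the safe side.
WEIGHTS_BYTES = int(2.47 * 1024**3)


def installed_versions(
    names: Sequence[str] = REQUIRED_DISTRIBUTIONS,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_room_for_weights(path: str = ".", needed_bytes: int = WEIGHTS_BYTES) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
    print("enough disk for weights:", has_room_for_weights())
```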

Versions

bd174c31 | latest | default | linux/amd64
Automated release
4/8/2026