Transcribe Audio (Moonshine)
Summary: Ultra-lightweight speech-to-text optimized for speed and low resource usage, with hallucination mitigation via max-length capping.
Use This When
- You need the smallest possible ASR model footprint (27M–61M params)
- You are running on constrained GPU memory alongside larger models
- Speed matters more than peak accuracy
What It Does
- Converts audio frames to text using UsefulSensors Moonshine (encoder-decoder architecture)
- Caps max token generation based on audio duration (6.5 tokens/sec) to limit hallucination on silence
- Supports variable-length input with no fixed 30-second window
- Delivers 565x realtime throughput on GPU
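
The duration-based cap described above can be sketched as follows. This is a minimal illustration, not the component's actual code: the function name, the 16 kHz default sample rate, rounding up, and the floor of one token are all assumptions; only the 6.5 tokens/sec rate comes from this page.

```python
import math

TOKENS_PER_SECOND = 6.5  # rate stated on this page


def max_new_tokens(num_samples: int,
                   sample_rate: int = 16000,  # assumed input rate
                   tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Return a generation cap proportional to the audio duration.

    Hypothetical helper: rounds up and never returns less than 1,
    so very short clips can still emit a token.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration_sec = num_samples / sample_rate
    return max(1, math.ceil(duration_sec * tokens_per_second))


# A 10-second clip at 16 kHz gets a cap of 65 tokens.
cap = max_new_tokens(160000)
```

The cap would then be passed as the length limit of whatever decoding call the pipeline uses, so a long stretch of silence cannot produce more text than its duration allows.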
Works Best With
- Pipelines where GPU memory is tight and a tiny ASR model is needed
- detect-voice-activity placed upstream to pre-filter silence before transcription
- Voice command interfaces where utterances are short
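
As a rough sketch of the silence pre-filtering pairing, the snippet below cuts an audio buffer down to the speech regions before anything is sent to the transcriber. The `(start_sec, end_sec)` segment format, the function name, the 16 kHz rate, and the minimum-duration threshold are assumptions for illustration; the actual output format of detect-voice-activity is not documented on this page.

```python
from typing import List, Sequence, Tuple

# Hypothetical VAD output: (start_sec, end_sec) pairs of detected speech.
Segment = Tuple[float, float]


def speech_chunks(samples: Sequence[float],
                  segments: List[Segment],
                  sample_rate: int = 16000,    # assumed input rate
                  min_duration: float = 0.1):  # assumed threshold, seconds
    """Return only the slices of `samples` that fall inside speech segments."""
    chunks = []
    for start, end in segments:
        if end - start < min_duration:
            continue  # drop blips too short to contain an utterance
        lo = max(0, int(start * sample_rate))
        hi = min(len(samples), int(end * sample_rate))
        if hi > lo:
            chunks.append(samples[lo:hi])
    return chunks


# Two seconds of audio; only the 0.5 s to 1.0 s segment is long enough to keep.
audio = [0.0] * 32000
kept = speech_chunks(audio, [(0.5, 1.0), (1.2, 1.25)])
```

Each returned chunk can be transcribed on its own, which keeps silent stretches away from the autoregressive decoder entirely.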
Caveats
- English-only (base and tiny variants)
- Autoregressive decoder can still hallucinate on silence despite max-length cap
- Best results when paired with upstream VAD filtering
- Model auto-downloads from HuggingFace on first run (~400MB for base)
Versions
- 7371e622 (latest, default, linux/amd64)
Automated release