Transcribe Audio (Moonshine)

Summary: Ultra-lightweight speech-to-text optimized for speed and low resource usage, with hallucination mitigation via max-length capping.

Use This When

You need the smallest possible ASR model footprint (27M–61M params)
Running on constrained GPU memory alongside larger models
Speed matters more than peak accuracy

What It Does

Converts audio frames to text using UsefulSensors Moonshine (encoder-decoder architecture)
Caps the number of generated tokens in proportion to audio duration (6.5 tokens per second of audio) to limit hallucination on silence
Supports variable-length input with no fixed 30-second window
Runs at 565x realtime throughput on GPU
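
The duration-based cap described above can be sketched as follows. This is a minimal illustration, not the component's actual implementation: the function name, the 16 kHz default sample rate, and the floor of one token are assumptions made for the example. Only the 6.5 tokens/sec factor comes from this page.

```python
TOKENS_PER_SECOND = 6.5  # cap factor stated on this page


def max_tokens_for_audio(
    num_samples: int,
    sample_rate: int = 16_000,  # assumed sample rate, not stated on this page
    tokens_per_second: float = TOKENS_PER_SECOND,
) -> int:
    """Return a token budget proportional to the audio duration."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration_s = num_samples / sample_rate
    # Truncate to an integer; the floor of 1 is an assumption so that
    # very short clips can still emit at least one token.
    return max(1, int(duration_s * tokens_per_second))


if __name__ == "__main__":
    # A 2-second clip at the assumed 16 kHz sample rate
    print(max_tokens_for_audio(32_000))
```

The resulting budget would typically be passed to the decoder as a generation length limit. That wiring depends on the runtime and is not shown here.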

Works Best With

Pipelines where GPU memory is tight and a tiny ASR model is needed
Pipelines that run detect-voice-activity upstream to pre-filter silence before transcription
Voice command interfaces where utterances are short
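
To illustrate why pre-filtering silence helps, here is a rough sketch of a silence gate placed in front of transcription. It uses a simple RMS energy threshold as a stand-in. It does not reflect how detect-voice-activity works internally, and the function names, frame size, and threshold are all hypothetical.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def drop_silent_frames(
    samples: Sequence[float],
    frame_size: int = 512,    # hypothetical frame length in samples
    threshold: float = 0.01,  # hypothetical energy threshold
) -> List[float]:
    """Keep only frames whose RMS energy reaches the threshold."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if rms(frame) >= threshold:
            kept.extend(frame)
    return kept


if __name__ == "__main__":
    silence = [0.0] * 1024
    speech_like = [0.5] * 512
    filtered = drop_silent_frames(silence + speech_like)
    print(len(filtered))  # only the non-silent frame remains
```

With silent stretches removed before transcription, the decoder sees less empty audio, which is the condition under which it is prone to hallucinate.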

Caveats

English-only (base and tiny variants)
Autoregressive decoder can still hallucinate on silence despite max-length cap
Best results when paired with upstream VAD filtering
Model auto-downloads from HuggingFace on first run (~400MB for base)

Versions

7371e622 | latest | default | linux/amd64
Automated release
4/8/2026