Transcribe Audio (Moonshine)

1 version

Summary: Ultra-lightweight speech-to-text optimized for speed and low resource usage, with hallucination mitigation via max-length capping.

Use This When

  • You need the smallest possible ASR model footprint (27M–61M params)
  • Running on constrained GPU memory alongside larger models
  • Speed matters more than peak accuracy

What It Does

  • Converts audio frames to text using UsefulSensors Moonshine (encoder-decoder architecture)
  • Caps the number of generated tokens in proportion to audio duration (6.5 tokens/sec) to limit hallucination on silence
  • Supports variable-length input with no fixed 30-second window
  • Reported throughput of roughly 565x real time on GPU
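
The duration-based cap can be illustrated with a small sketch. This is not the component's actual code: the function name, the 16 kHz sample rate, and the `floor` parameter are assumptions made for illustration. Only the 6.5 tokens/sec rate comes from the description above.

```python
TOKENS_PER_SECOND = 6.5  # rate stated in the description above


def max_token_cap(num_samples: int, sample_rate: int = 16000, floor: int = 1) -> int:
    """Return a generation cap proportional to the clip's duration.

    sample_rate=16000 and floor=1 are illustrative assumptions,
    not documented settings of this component.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    duration_s = num_samples / sample_rate
    # Truncate toward zero, but always allow at least `floor` tokens.
    return max(floor, int(duration_s * TOKENS_PER_SECOND))


if __name__ == "__main__":
    for seconds in (1, 2, 10):
        print(seconds, "s ->", max_token_cap(seconds * 16000), "tokens")
```

A cap like this would typically be passed to the decoder as its maximum generation length, so a short or silent clip cannot produce a long run of invented text.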

Works Best With

  • Pipelines where GPU memory is tight and a tiny ASR model is needed
  • Pair with detect-voice-activity to pre-filter silence before transcription
  • Voice command interfaces where utterances are short
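
To show why pre-filtering silence helps, here is a deliberately crude energy gate. It is a stand-in for illustration only and is not how detect-voice-activity works; the function names and the 0.01 threshold are hypothetical, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a frame; 0.0 for an empty frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_probably_silence(samples: Sequence[float], threshold: float = 0.01) -> bool:
    """Crude gate: treat a frame as silence when its RMS is below threshold.

    The threshold is an illustrative assumption and would need tuning
    for real microphones and noise conditions.
    """
    return rms(samples) < threshold


if __name__ == "__main__":
    quiet = [0.0] * 1600
    loud = [0.5, -0.5] * 800
    print(is_probably_silence(quiet), is_probably_silence(loud))
```

Frames flagged as silence would simply be skipped rather than sent to transcription, which removes the input on which an autoregressive decoder is most likely to hallucinate.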

Caveats

  • English-only (base and tiny variants)
  • Autoregressive decoder can still hallucinate on silence despite max-length cap
  • Best results when paired with upstream VAD filtering
  • Model weights download automatically from Hugging Face on first run (~400MB for base)

Versions

  • 7371e622 (latest, default, linux/amd64)

    Automated release