
Transcribe Audio (Moonshine)

1 Version · Transcribe

Moonshine ASR transcriber (UsefulSensors). Loads {{param:model_name}} (AutoModelForSpeechSeq2Seq + AutoProcessor) on {{param:device}}, using FP16 on CUDA and FP32 on CPU. For each {{type:AudioFrame}} it resamples the audio to 16 kHz, caps max_length at attention_mask.sum() * 6.5 / 16000 (Moonshine's token-rate heuristic, roughly 6.5 tokens per second of audio), runs model.generate, and decodes the result to a single English {{type:String}}.
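To make the max_length cap concrete, here is a small worked example of the arithmetic. The helper name `moonshine_max_length` is hypothetical, and truncating to an integer is an assumption; the page only gives the formula and the floor of 2.

```python
def moonshine_max_length(valid_samples: int,
                         tokens_per_second: float = 6.5,
                         sample_rate: int = 16_000,
                         floor: int = 2) -> int:
    """Decoder token budget for a clip with `valid_samples` unmasked samples.

    valid_samples / sample_rate is the clip duration in seconds; multiplying
    by 6.5 gives the token budget. Truncation to int is an assumption.
    """
    return max(floor, int(valid_samples * tokens_per_second / sample_rate))


if __name__ == "__main__":
    for seconds in (0.25, 2.0, 10.0):
        n = int(seconds * 16_000)
        print(f"{seconds:>5.2f} s ({n} samples) -> max_length {moonshine_max_length(n)}")
```

A 0.25 s clip (4000 samples) works out to 1.625 tokens and is lifted to the floor of 2. A 2 s clip gets 13 tokens and a 10 s clip gets 65.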

How it fits

{{type:AudioFrame}} -> {{component:transcribe_audio_moonshine}} -> {{type:String}}
                          |
                          +-- load `AutoProcessor.from_pretrained({{param:model_name}})` and `AutoModelForSpeechSeq2Seq.from_pretrained({{param:model_name}}, torch_dtype=fp16 on CUDA)` then `.to({{param:device}}).eval()`
                          +-- warm up the model on 4000 zero samples with `max_length=2`
                          +-- if input `size == 0`, return the EMPTY STRING
                          +-- if the audio sample rate is not 16 kHz, resample via librosa `kaiser_fast`
                          +-- run the processor at `sampling_rate=16000`, move tensors to {{param:device}}
                          +-- compute `max_length = max(2, attention_mask.sum() * 6.5 / 16000)` to bound the decoder
                          +-- `model.generate(...)` and decode the first sequence (`skip_special_tokens=True`), strip whitespace
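The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's actual source. The function names `load_moonshine` and `transcribe` are hypothetical, the input is assumed to be a mono float waveform, and librosa's `kaiser_fast` mode is assumed to have its resampy dependency installed. Heavy imports are deferred so the empty-input shortcut runs without them.

```python
from typing import Any, Sequence, Tuple

TARGET_SR = 16_000  # the page states audio is resampled to 16 kHz


def load_moonshine(model_name: str, device: str) -> Tuple[Any, Any]:
    """Load processor and model once; FP16 on CUDA, FP32 on CPU."""
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    dtype = torch.float16 if device.startswith("cuda") else torch.float32
    processor = AutoProcessor.from_pretrained(model_name)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, torch_dtype=dtype)
    return processor, model.to(device).eval()


def transcribe(samples: Sequence[float], sample_rate: int,
               processor: Any, model: Any, device: str) -> str:
    """Transcribe one mono waveform to an English string."""
    if len(samples) == 0:
        return ""  # empty input never reaches the model

    import librosa
    import numpy as np
    import torch

    audio = np.asarray(samples, dtype=np.float32)
    if sample_rate != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sample_rate,
                                 target_sr=TARGET_SR, res_type="kaiser_fast")

    inputs = processor(audio, sampling_rate=TARGET_SR, return_tensors="pt")
    inputs = inputs.to(device, model.dtype)  # dtype cast applies to float tensors only

    # Token budget: about 6.5 tokens per second of unmasked audio, never below 2.
    valid = int(inputs.attention_mask.sum().item())
    max_length = max(2, int(valid * 6.5 / TARGET_SR))

    with torch.no_grad():
        ids = model.generate(**inputs, max_length=max_length)
    return processor.decode(ids[0], skip_special_tokens=True).strip()


# Usage sketch (the checkpoint id prefix is an assumption):
# processor, model = load_moonshine("UsefulSensors/moonshine-tiny", "cpu")
# transcribe([0.0] * 4000, 16_000, processor, model, "cpu")  # warm-up style call
```

With 4000 zero samples the heuristic yields a budget of 2 tokens, which matches the `max_length=2` used in the warm-up step.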

Pick this when GPU memory is tight or when short voice commands need low latency (Moonshine is much smaller than Whisper variants). English ONLY. For multilingual transcription use {{component:transcribe_audio_faster_whisper}}.

Typical backends

  • Tight-GPU voice command: {{component:input_audio_file}} -> {{component:detect_voice_activity_silero_vad}} -> {{component:transcribe_audio_moonshine}} -> {{component:evaluate_expression}} (intent route).
  • Streaming wake-word: {{component:input_browser_webcam}} (audio) -> {{component:transcribe_audio_moonshine}} ({{param:model_name}} = ...streaming-tiny) -> trigger.

Caveats

  • ENGLISH ONLY. Non-English or mixed-language audio is decoded as nonsense English without warning.
  • {{param:model_name}} accepts the UsefulSensors Moonshine checkpoint ids (moonshine-tiny, moonshine-base, moonshine-streaming-tiny, moonshine-streaming-small, moonshine-streaming-medium). Unknown ids fail in from_pretrained.
  • {{param:device}} set to cuda (or cuda:N) FALLS BACK to cpu when no CUDA runtime is visible. The fallback raises no error and only logs a WARNING; FP16 is also disabled on that path.
  • The max_length heuristic (attention_mask.sum() * 6.5 / 16000) caps decoder length to keep silence hallucinations short. It does NOT eliminate them — pair with {{component:detect_voice_activity_silero_vad}} upstream when silences are common.
  • Empty input (size == 0) returns the EMPTY STRING without invoking the model.
  • The tiny variant trades accuracy for speed. Validate on your specific domain (accent, vocabulary, environment).
  • Both config keys ({{param:model_name}}, {{param:device}}) are captured ONCE at module import; runtime changes have NO effect.
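The device fallback described in the caveats could look roughly like the sketch below. The function name `resolve_device` and the logger name are hypothetical. CUDA availability is passed in as a plain boolean so the logic can be exercised without torch; in a real setup that flag would typically come from `torch.cuda.is_available()`.

```python
import logging
from typing import Tuple

logger = logging.getLogger("moonshine_sketch")  # hypothetical logger name


def resolve_device(requested: str, cuda_available: bool) -> Tuple[str, bool]:
    """Return (device, use_fp16) for a requested device string.

    A cuda or cuda:N request without a visible CUDA runtime degrades to cpu
    with a logged warning instead of an exception, and FP16 is turned off.
    """
    wants_cuda = requested.startswith("cuda")
    if wants_cuda and not cuda_available:
        logger.warning("CUDA requested (%s) but not available; using cpu with FP32",
                       requested)
        return "cpu", False
    return requested, wants_cuda


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(resolve_device("cuda:1", cuda_available=False))
    print(resolve_device("cuda", cuda_available=True))
```

Because the fallback only logs, a pipeline that expects GPU latency should check the warning log or the resolved device at startup rather than rely on an error.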

Versions

  • 7371e622 · default · latest · linux/amd64

    live-test prerelease 2026-05-24T20:08:39Z