
Transcribe Audio (Moonshine)
Moonshine ASR transcriber (UsefulSensors). Loads {{param:model_name}} (AutoModelForSpeechSeq2Seq + AutoProcessor) on {{param:device}}, using FP16 on CUDA and FP32 on CPU. For each {{type:AudioFrame}} it resamples to 16 kHz, caps max_length at attention_mask.sum() * 6.5 / 16000 (Moonshine's token-rate heuristic), runs model.generate, and decodes the result to a single English {{type:String}}.
How it fits
{{type:AudioFrame}} -> {{component:transcribe_audio_moonshine}} -> {{type:String}}
|
+-- load `AutoProcessor.from_pretrained({{param:model_name}})` and `AutoModelForSpeechSeq2Seq.from_pretrained({{param:model_name}}, torch_dtype=fp16 on CUDA)` then `.to({{param:device}}).eval()`
+-- warm up the model on 4000 zero samples with `max_length=2`
+-- if input `size == 0`, return the EMPTY STRING
+-- if the audio sample rate is not 16 kHz, resample via librosa `kaiser_fast`
+-- run the processor at `sampling_rate=16000`, move tensors to {{param:device}}
+-- compute `max_length = max(2, attention_mask.sum() * 6.5 / 16000)` to bound the decoder
+-- `model.generate(...)` and decode the first sequence (`skip_special_tokens=True`), strip whitespace
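The steps above can be sketched roughly as follows. This is a minimal illustration, not the component's actual source: the names `TARGET_SR`, `max_length_for`, `load_model`, and `transcribe` are hypothetical, truncating the heuristic with `int()` is an assumption, and the heavy dependencies (`numpy`, `torch`, `transformers`, `librosa`) are imported lazily so the helper can be read and tried on its own.

```python
TARGET_SR = 16000  # sample rate the processor is run at


def max_length_for(valid_samples):
    """Decoder cap for `valid_samples` unmasked 16 kHz samples (floor of 2)."""
    # int() truncation is an assumption; the text does not say how it rounds.
    return max(2, int(valid_samples * 6.5 / TARGET_SR))


def load_model(model_name, device="cpu"):
    """Load processor + model, then warm up on 4000 zero samples."""
    import numpy as np
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    dtype = torch.float16 if device.startswith("cuda") else torch.float32
    processor = AutoProcessor.from_pretrained(model_name)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name, torch_dtype=dtype)
    model = model.to(device).eval()
    warm = processor(np.zeros(4000, dtype=np.float32),
                     sampling_rate=TARGET_SR, return_tensors="pt").to(device, dtype)
    with torch.no_grad():
        model.generate(**warm, max_length=2)
    return model, processor


def transcribe(samples, sample_rate, model, processor, device="cpu"):
    """samples: 1-D float32 NumPy array. Returns the stripped transcript."""
    if samples.size == 0:
        return ""  # empty input never reaches the model
    import librosa
    import torch

    if sample_rate != TARGET_SR:
        samples = librosa.resample(samples, orig_sr=sample_rate,
                                   target_sr=TARGET_SR, res_type="kaiser_fast")
    inputs = processor(samples, sampling_rate=TARGET_SR, return_tensors="pt")
    inputs = inputs.to(device, model.dtype)  # casts float tensors only
    valid = int(inputs["attention_mask"].sum().item())
    with torch.no_grad():
        ids = model.generate(**inputs, max_length=max_length_for(valid))
    return processor.decode(ids[0], skip_special_tokens=True).strip()
```

Under this sketch, one second of audio (16000 valid samples) is capped at 6 decoder tokens, and clips shorter than about a third of a second hit the floor of 2.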
Pick this when GPU memory is tight or when short voice commands need low latency (Moonshine is much smaller than Whisper variants). English ONLY. For multilingual transcription use {{component:transcribe_audio_faster_whisper}}.
Typical backends
- Tight-GPU voice command: {{component:input_audio_file}} -> {{component:detect_voice_activity_silero_vad}} -> {{component:transcribe_audio_moonshine}} -> {{component:evaluate_expression}} (intent route).
- Streaming wake-word: {{component:input_browser_webcam}} (audio) -> {{component:transcribe_audio_moonshine}} ({{param:model_name}} = ...streaming-tiny) -> trigger.
Caveats
- ENGLISH ONLY. Non-English audio is decoded as nonsense English, with no warning.
- {{param:model_name}} accepts the UsefulSensors Moonshine checkpoint ids (`moonshine-tiny`, `moonshine-base`, `moonshine-streaming-tiny`, `moonshine-streaming-small`, `moonshine-streaming-medium`). Unknown ids fail in `from_pretrained`.
- {{param:device}} `cuda` (or `cuda:N`) FALLS BACK to `cpu` without raising, logging only a WARNING, when no CUDA runtime is visible; FP16 is also disabled in that path.
- The `max_length` heuristic (`attention_mask.sum() * 6.5 / 16000`) caps decoder length to keep silence hallucinations short. It does NOT eliminate them; pair with {{component:detect_voice_activity_silero_vad}} upstream when silences are common.
- Empty input (`size == 0`) returns the EMPTY STRING without invoking the model.
- The tiny variant trades accuracy for speed. Validate on your specific domain (accent, vocabulary, environment).
- Both config keys ({{param:model_name}}, {{param:device}}) are captured ONCE at module import; runtime changes have NO effect.
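The device fallback described in the caveats can be illustrated with a small pure function. This is a sketch under assumptions: `resolve_device` and its `cuda_available` argument are hypothetical names, the logger name is made up, and in a real component the flag would typically come from `torch.cuda.is_available()`.

```python
import logging

logger = logging.getLogger("moonshine_sketch")  # hypothetical logger name


def resolve_device(requested: str, cuda_available: bool) -> tuple[str, str]:
    """Return (device, dtype name) following the fallback rule in the caveats."""
    if requested.startswith("cuda"):
        if cuda_available:
            return requested, "float16"  # FP16 on CUDA
        # No CUDA runtime visible: warn, do not raise, and drop to FP32 on CPU.
        logger.warning("CUDA requested (%s) but unavailable; using cpu", requested)
        return "cpu", "float32"
    return requested, "float32"  # FP32 on CPU
```

Because the caveats say both config keys are captured once at import, a function like this would be evaluated a single time, and changing the requested device afterwards would have no effect.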
Versions
- 7371e622 default latest linux/amd64
live-test prerelease 2026-05-24T20:08:39Z

