
Extract Text (Surya OCR)
Surya OCR transcriber. Each {{type:Image}} is converted to RGB and run through the Surya detection + recognition backend configured by {{param:task_name}}, {{param:sort_lines}}, and {{param:disable_math}}; the worker emits either a single {{type:String}} (newline-separated, or via {{param:separator}}) or a JSON-encoded record per the {{param:output_format}} setting.
How it fits
{{type:Image}} -> {{component:extract_text_surya_ocr}} -> {{type:String}}
|
+-- build Surya predictors at startup on {{param:device}}; recover onto CPU on a CUDA OOM
+-- run recognition at {{param:task_name}} with {{param:sort_lines}} and math toggled by {{param:disable_math}}
+-- join lines with {{param:separator}}; wrap as JSON when {{param:output_format}} is `json`
Pick this when classical OCR engines garble math, mixed scripts, or multi-column layouts. For lighter-weight plain text prefer {{component:extract_text_doctr_ocr}}, {{component:extract_text_easyocr}}, or {{component:extract_text_paddleocr}}; for layout-preserving markdown prefer {{component:extract_text_got_ocr}}; for the fastest plain text prefer {{component:extract_text_rapidocr}}.
Typical backends
- Document scrape with reading order: {{component:input_image_file}} -> {{component:extract_text_surya_ocr}} -> {{component:generate_text_ollama}} -> {{component:send_http_post_request}}.
- Math-aware paper OCR: {{component:input_image_file}} -> {{component:extract_text_surya_ocr}} -> {{component:split_text_wtpsplit}} -> analytics.
- Already-cropped lines: detection upstream -> {{component:extract_text_surya_ocr}} -> text consumer.
Caveats
- {{param:task_name}} MUST be
ocr_with_boxes,ocr_without_boxes, orblock_without_boxes; any other value aborts startup with a "must be one of" error. - {{param:output_format}} MUST be
textorjson; the JSON form returns{"text": "...", "lines": [...]}. Other values abort startup. - Bounding boxes are NOT on the output even in
ocr_with_boxesmode — the worker only returns text. Use a layout-aware OCR when geometry is needed. - The recogniser runs at half precision (
float16) on GPU and single precision on CPU; on a CUDA OOM the worker logs the fallback, rebuilds the predictors on CPU, and keeps running. - {{param:device}}
cudafalls back to CPU when no CUDA backend is visible; a warning is printed and inference continues at CPU speeds. - {{param:recognition_batch_size}}, {{param:detector_batch_size}}, {{param:foundation_chunk_size}}, and {{param:foundation_max_tokens}} are VRAM knobs;
0means "library default". Lower them to fit constrained GPUs; {{param:foundation_model_quantize}}trueadds an additional quantisation pass. - {{param:models}} is a cache-seeding hint only — actual checkpoints come from the installed Surya library. The worker prints a "WARNING" if the configured ids differ from the library's defaults, but does NOT fail.
- {{param:base_url}} is the Datalab model host base URL used for cache seeding only.
- {{param:disable_math}}
falsekeeps math recognition on; prose pages can pick up stray$...$markup. Set totruefor pure-prose backends. - {{param:sort_lines}} toggles approximate reading-order sort of the output lines.
- {{param:separator}} joins recognised lines into the FLAT text output; the default is a double newline.
- An empty {{type:Image}} short-circuits and returns either an empty string or
{"text":"","lines":[]}depending on {{param:output_format}}. - Per-call hot-swap: {{param:task_name}}, {{param:output_format}}, {{param:sort_lines}}, {{param:disable_math}}, {{param:separator}}, {{param:recognition_batch_size}}, {{param:detector_batch_size}}. Predictors and {{param:device}} are captured ONCE at startup.
Versions
- c332a33edefaultlatestlinux/amd64
live-test prerelease 2026-05-24T10:36:00Z