Extract Text (Surya OCR)

1 versionExtract

Surya OCR transcriber. Each {{type:Image}} is converted to RGB and run through the Surya detection + recognition backend configured by {{param:task_name}}, {{param:sort_lines}}, and {{param:disable_math}}; the worker emits either a single {{type:String}} (newline-separated, or via {{param:separator}}) or a JSON-encoded record per the {{param:output_format}} setting.

How it fits

{{type:Image}} -> {{component:extract_text_surya_ocr}} -> {{type:String}}
                          |
                          +-- build Surya predictors at startup on {{param:device}}; recover onto CPU on a CUDA OOM
                          +-- run recognition at {{param:task_name}} with {{param:sort_lines}} and math toggled by {{param:disable_math}}
                          +-- join lines with {{param:separator}}; wrap as JSON when {{param:output_format}} is `json`

Pick this when classical OCR engines garble math, mixed scripts, or multi-column layouts. For lighter-weight plain text prefer {{component:extract_text_doctr_ocr}}, {{component:extract_text_easyocr}}, or {{component:extract_text_paddleocr}}; for layout-preserving markdown prefer {{component:extract_text_got_ocr}}; for the fastest plain text prefer {{component:extract_text_rapidocr}}.

Typical backends

  • Document scrape with reading order: {{component:input_image_file}} -> {{component:extract_text_surya_ocr}} -> {{component:generate_text_ollama}} -> {{component:send_http_post_request}}.
  • Math-aware paper OCR: {{component:input_image_file}} -> {{component:extract_text_surya_ocr}} -> {{component:split_text_wtpsplit}} -> analytics.
  • Already-cropped lines: detection upstream -> {{component:extract_text_surya_ocr}} -> text consumer.

Caveats

  • {{param:task_name}} MUST be ocr_with_boxes, ocr_without_boxes, or block_without_boxes; any other value aborts startup with a "must be one of" error.
  • {{param:output_format}} MUST be text or json; the JSON form returns {"text": "...", "lines": [...]}. Other values abort startup.
  • Bounding boxes are NOT on the output even in ocr_with_boxes mode — the worker only returns text. Use a layout-aware OCR when geometry is needed.
  • The recogniser runs at half precision (float16) on GPU and single precision on CPU; on a CUDA OOM the worker logs the fallback, rebuilds the predictors on CPU, and keeps running.
  • {{param:device}} cuda falls back to CPU when no CUDA backend is visible; a warning is printed and inference continues at CPU speeds.
  • {{param:recognition_batch_size}}, {{param:detector_batch_size}}, {{param:foundation_chunk_size}}, and {{param:foundation_max_tokens}} are VRAM knobs; 0 means "library default". Lower them to fit constrained GPUs; {{param:foundation_model_quantize}} true adds an additional quantisation pass.
  • {{param:models}} is a cache-seeding hint only — actual checkpoints come from the installed Surya library. The worker prints a "WARNING" if the configured ids differ from the library's defaults, but does NOT fail.
  • {{param:base_url}} is the Datalab model host base URL used for cache seeding only.
  • {{param:disable_math}} false keeps math recognition on; prose pages can pick up stray $...$ markup. Set to true for pure-prose backends.
  • {{param:sort_lines}} toggles approximate reading-order sort of the output lines.
  • {{param:separator}} joins recognised lines into the FLAT text output; the default is a double newline.
  • An empty {{type:Image}} short-circuits and returns either an empty string or {"text":"","lines":[]} depending on {{param:output_format}}.
  • Per-call hot-swap: {{param:task_name}}, {{param:output_format}}, {{param:sort_lines}}, {{param:disable_math}}, {{param:separator}}, {{param:recognition_batch_size}}, {{param:detector_batch_size}}. Predictors and {{param:device}} are captured ONCE at startup.

Versions

  • c332a33edefaultlatestlinux/amd64

    live-test prerelease 2026-05-24T10:36:00Z