Estimate Crowd Count


CLIP-text-conditioned crowd counter. Each {{type:Image}} is converted to RGB, normalised with the per-channel statistics that match the backbone of the checkpoint in {{param:model}}, and passed through the density head. The worker emits three outputs per tick: a {{type:DensityMap}} of estimated head density, a {{type:DensityMap}} holding the underlying per-class probability map, and a {{type:UInt64}} total count obtained by rounding the density-map sum.

How it fits

{{type:Image}} -> {{component:estimate_crowd_count}} -> ({{type:DensityMap}}, {{type:DensityMap}}, {{type:UInt64}})
                          |
                          +-- weights pulled at startup from {{param:model}} (exactly one `.pth` file)
                          +-- RGB input -> backbone-matched per-channel normalise -> CLIP / ImageNet backbone -> density + class-probability heads
                          +-- count = round(sum(density_map)); maps bilinearly upsampled back to input resolution
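
The pre- and post-processing steps in the diagram can be sketched in plain Python. This is a minimal illustration, not the worker's code: the function names are hypothetical, the statistics are the ones quoted under Caveats, and three details are assumptions (pixel values scaled to [0, 1] before normalisation, a case-insensitive prefix test on model_name, and Python's built-in rounding).

```python
import math

# (mean, std) triples quoted in the Caveats section of this page.
CLIP_STATS = ((0.481, 0.458, 0.408), (0.269, 0.261, 0.276))
IMAGENET_STATS = ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))


def select_stats(model_name):
    """Pick (mean, std) from the checkpoint's model_name.

    ASSUMPTION: "CLIP-prefixed" is tested case-insensitively with startswith.
    """
    if model_name.lower().startswith("clip"):
        return CLIP_STATS
    return IMAGENET_STATS


def normalise_pixel(rgb, stats):
    """Normalise one RGB pixel.

    ASSUMPTION: channels are already scaled to the [0, 1] range.
    """
    mean, std = stats
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


def total_count(density_map):
    """Round the density-map sum to a non-negative integer count.

    ASSUMPTION: Python's round() (ties to even) stands in for whatever
    rounding mode the worker uses; the clamp reflects the unsigned output.
    """
    total = math.fsum(v for row in density_map for v in row)
    return max(0, round(total))
```

For example, `total_count([[0.2, 0.7], [1.4, 0.3]])` sums to about 2.6 and returns 3, which shows why the count behaves like quantised regression rather than enumeration.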

Pick this when bounding-box detection breaks down because of occlusion or distance, for example in public squares, stadium stands, or transit halls. For sparse crowds where boxes still work, prefer {{component:detect_objects_triton}}; to convert the density map into discrete head positions, use {{component:estimate_density_map_peaks}} downstream.

Typical pipelines

Live count alarm: {{component:input_camera}} -> {{component:estimate_crowd_count}} -> {{component:send_object_count_mqtt}}.
Heatmap dashboard: {{component:input_camera}} -> {{component:estimate_crowd_count}} -> {{component:visualize_crowd_observation}}.
Spot people: {{component:input_camera}} -> {{component:estimate_crowd_count}} -> {{component:estimate_density_map_peaks}} -> {{component:visualize_object_detections}}.
Capacity audit: {{component:input_video_file}} -> {{component:estimate_crowd_count}} -> {{component:send_http}}.

Caveats

{{param:model}} is resolved as a directory and scanned for *.pth files; exactly ONE checkpoint must be present — zero or more than one aborts startup with a count-mismatch error.
The total count is the ROUNDED sum of the density map. It is fine for thresholds and trends but is NOT an exact head count; downstream code should treat it as integer-quantised regression, not enumeration.
Accuracy degrades in extremely dense crowds where individual heads merge; the density map saturates and the count under-reports. Combine with {{component:estimate_density_map_peaks}} if discrete head positions matter.
Normalisation statistics are selected automatically from the checkpoint's model_name: any CLIP-prefixed backbone uses the CLIP image-encoder triple (0.481, 0.458, 0.408) / (0.269, 0.261, 0.276); every other backbone uses the standard ImageNet triple (0.485, 0.456, 0.406) / (0.229, 0.224, 0.225). The worker logs the chosen pair to stdout at startup.
Both density and probability maps are bilinearly UPSAMPLED back to the input frame's height and width before emission; downstream consumers can read pixel-aligned values.
{{param:device}} is captured ONCE at startup. A value starting with cuda falls back to CPU when CUDA is unavailable; the fallback raises no error and only prints a warning to stderr. CPU inference is impractical for live video.
Input is converted to RGB internally; callers do not need to colour convert.
All config keys are captured ONCE at startup; runtime changes have NO effect and require a redeploy.
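
The startup checks described in the caveats above (exactly one checkpoint, CUDA fallback) can be sketched as follows. This is a hedged illustration under stated assumptions, not the worker's implementation: `resolve_checkpoint` and `resolve_device` are hypothetical names, the error and warning texts are invented, and CUDA availability is passed in as a flag so the sketch runs without a GPU library.

```python
import sys
from pathlib import Path


def resolve_checkpoint(model_dir):
    """Return the single .pth file in model_dir.

    Raises RuntimeError when zero or more than one checkpoint is found.
    The message text is an assumption, not the worker's actual wording.
    """
    candidates = sorted(Path(model_dir).glob("*.pth"))
    if len(candidates) != 1:
        raise RuntimeError(
            f"expected exactly 1 .pth file in {model_dir}, found {len(candidates)}"
        )
    return candidates[0]


def resolve_device(requested, cuda_available):
    """Fall back to CPU, with a stderr warning, when CUDA is requested but missing.

    ASSUMPTION: in a PyTorch-based worker, cuda_available would come from
    torch.cuda.is_available(); it is a plain argument here.
    """
    if requested.startswith("cuda") and not cuda_available:
        print(
            f"warning: {requested} requested but CUDA is unavailable; using cpu",
            file=sys.stderr,
        )
        return "cpu"
    return requested
```

Because both values are resolved once, a deployment script can run the same checks before launch and fail early instead of discovering a CPU fallback on a live stream.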

Versions

194b6a1b (default, latest, linux/amd64)
Downscale to model_width x model_height before CLIP backbone (emit maps at that size; count exact); main speed lever
5/12/2026