
Estimate Crowd Count
CLIP-text-conditioned crowd counter. Each {{type:Image}} is converted to RGB, normalised with the per-channel statistics matching the backbone of {{param:model}}, and fed through the density head; the worker emits three outputs per tick — a {{type:DensityMap}} of estimated head density, a {{type:DensityMap}} of the underlying per-class probability map, and a {{type:UInt64}} total count rounded from the density-map sum.
How it fits
{{type:Image}} -> {{component:estimate_crowd_count}} -> ({{type:DensityMap}}, {{type:DensityMap}}, {{type:UInt64}})
|
+-- weights pulled at startup from {{param:model}} (exactly one `.pth` file)
+-- RGB input -> backbone-matched per-channel normalise -> CLIP / ImageNet backbone -> density + class-probability heads
+-- count = round(sum(density_map)); maps bilinearly upsampled back to input resolution
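The normalise and count steps in the diagram can be sketched in plain Python. This is a minimal illustration of the arithmetic, not the worker's code: the helper names `normalise_pixel` and `count_from_density` are hypothetical, the statistics are the ImageNet triple quoted under Caveats, the clamp to zero is an assumption made because the output is unsigned, and Python's built-in `round` (which rounds halves to even) stands in for whatever rounding rule the worker applies.

```python
# Hypothetical helper names; a sketch of the arithmetic only.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalise_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then standardise each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


def count_from_density(density_map):
    """Sum every cell of a 2-D density map and round to an integer."""
    total = sum(sum(row) for row in density_map)
    # Clamping at zero is an assumption: the emitted count is unsigned.
    return max(0, round(total))


density = [
    [0.0, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.1, 0.5, 0.2],
]
print(count_from_density(density))  # cells sum to about 2.7, so this prints 3
```

The example shows why the count is a quantised regression value: no single cell equals one head, and a map summing to 2.7 reports 3.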
Pick this when bounding-box detection breaks down because of occlusion or distance — public squares, stadium stands, transit halls. For sparse crowds where boxes still work prefer {{component:detect_objects_triton}}; to convert the density map into discrete head positions use {{component:estimate_density_map_peaks}} downstream.
Typical backends
- Live count alarm: {{component:input_camera}} -> {{component:estimate_crowd_count}} -> {{component:send_object_count_mqtt}}.
- Heatmap dashboard: {{component:input_camera}} -> {{component:estimate_crowd_count}} -> {{component:visualize_crowd_observation}}.
- Spot people: {{component:input_camera}} -> {{component:estimate_crowd_count}} -> {{component:estimate_density_map_peaks}} -> {{component:visualize_object_detections}}.
- Capacity audit: {{component:input_video_file}} -> {{component:estimate_crowd_count}} -> {{component:send_http}}.
Caveats
- {{param:model}} is resolved as a directory and scanned for `*.pth` files; exactly ONE checkpoint must be present. Zero or more than one aborts startup with a count-mismatch error.
- The total count is the ROUNDED sum of the density map. It is fine for thresholds and trends but is NOT an exact head count; downstream code should treat it as integer-quantised regression, not enumeration.
- Accuracy degrades in extremely dense crowds where individual heads merge; the density map saturates and the count under-reports. Combine with {{component:estimate_density_map_peaks}} if discrete head positions matter.
- Normalisation statistics are selected automatically from the checkpoint's `model_name`: any CLIP-prefixed backbone uses the CLIP image-encoder mean / std triples `(0.481, 0.458, 0.408)` / `(0.269, 0.261, 0.276)`; every other backbone uses the standard ImageNet mean / std triples `(0.485, 0.456, 0.406)` / `(0.229, 0.224, 0.225)`. The worker logs the chosen pair to stdout at startup.
- Both density and probability maps are bilinearly UPSAMPLED back to the input frame's height and width before emission; downstream consumers can read pixel-aligned values.
- {{param:device}} is captured ONCE at startup. A {{param:device}} value starting with `cuda` falls back to CPU when CUDA is unavailable; the only signal is a stderr warning, not an error. CPU inference is impractical for live video.
- Input is converted to RGB internally; callers do not need to colour convert.
- All config keys are captured ONCE at startup; runtime changes have NO effect and require a redeploy.
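The startup rules in the caveats above (one checkpoint, backbone-matched statistics, CUDA fallback) can be sketched as three small functions. This is a hedged illustration under stated assumptions, not the worker's implementation: the function names, the error text, the warning text, and the case-insensitive prefix match are all hypothetical, and CUDA availability is passed in as a plain boolean so the sketch runs without a GPU library.

```python
import sys
from pathlib import Path

# (mean, std) pairs as quoted in the caveats.
CLIP_STATS = ((0.481, 0.458, 0.408), (0.269, 0.261, 0.276))
IMAGENET_STATS = ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))


def resolve_checkpoint(model_dir):
    """Return the single .pth file in model_dir; raise if the count is not one."""
    candidates = sorted(Path(model_dir).glob("*.pth"))
    if len(candidates) != 1:
        # Hypothetical message; the real count-mismatch wording is not documented.
        raise RuntimeError(
            f"expected exactly 1 .pth checkpoint in {model_dir}, "
            f"found {len(candidates)}"
        )
    return candidates[0]


def select_stats(model_name):
    """Pick (mean, std): CLIP triples for CLIP-prefixed names, ImageNet otherwise."""
    # Case-insensitive matching is an assumption; the text only says "CLIP-prefixed".
    if model_name.lower().startswith("clip"):
        return CLIP_STATS
    return IMAGENET_STATS


def resolve_device(requested, cuda_available):
    """Fall back to CPU when a cuda device is requested but CUDA is missing."""
    if requested.startswith("cuda") and not cuda_available:
        print(
            f"warning: {requested} requested but CUDA is unavailable; using cpu",
            file=sys.stderr,
        )
        return "cpu"
    return requested
```

Because every value is resolved once, a deployment script can run checks like these before rollout and fail early instead of discovering a CPU fallback or a stray second checkpoint in production.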
Versions
- 194b6a1b default latest linux/amd64
Downscale to model_width x model_height before CLIP backbone (emit maps at that size; count exact); main speed lever

