Crowd Counting through Density Map Estimation

Crowd counting via density map estimation is highly sensitive to a model’s receptive field, which determines how much spatial context informs each prediction. We present a controlled study on ShanghaiTech Parts A (dense) and B (sparse) that isolates the effect of receptive field by varying depth in UNet-style encoder–decoder architectures with pretrained VGG19 and ResNet50 backbones. Ground-truth density maps are generated with geometry-adaptive Gaussians, and we evaluate both count- and pixel-level errors (MAE/RMSE). Our modified UNet outputs half-resolution density maps and uses skip connections after max pooling to focus the analysis on receptive-field behavior.
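
As a rough illustration of the geometry-adaptive Gaussians mentioned above, the following dependency-free sketch places one Gaussian per annotated head and sets its width from the mean distance to the nearest neighbouring heads. The function name `density_map` and the values of `k`, `beta`, and `fallback_sigma` are assumptions chosen for illustration, not settings taken from the study. Renormalising each clipped kernel is likewise one possible choice for keeping the map's sum equal to the head count.

```python
import math


def density_map(points, height, width, k=3, beta=0.3, fallback_sigma=4.0):
    """Build a density map whose sum equals the number of in-image heads.

    points: list of (x, y) head positions in pixel coordinates.
    k, beta, fallback_sigma: illustrative defaults, not values from the study.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for i, (px, py) in enumerate(points):
        # Mean distance to the k nearest other heads sets the kernel width.
        dists = sorted(
            math.hypot(px - qx, py - qy)
            for j, (qx, qy) in enumerate(points)
            if j != i
        )[:k]
        sigma = beta * sum(dists) / len(dists) if dists else fallback_sigma
        sigma = max(sigma, 1.0)  # guard against (near-)coincident annotations
        radius = max(1, math.ceil(3 * sigma))

        # Evaluate the Gaussian on a window clipped to the image bounds.
        cx, cy = round(px), round(py)
        x0, x1 = max(0, cx - radius), min(width - 1, cx + radius)
        y0, y1 = max(0, cy - radius), min(height - 1, cy + radius)
        if x0 > x1 or y0 > y1:
            continue  # annotation lies entirely outside the image

        weights = {}
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                d2 = (x - px) ** 2 + (y - py) ** 2
                weights[(y, x)] = math.exp(-d2 / (2 * sigma ** 2))

        # Renormalise so each head contributes exactly one unit of mass.
        total = sum(weights.values())
        for (y, x), w in weights.items():
            dmap[y][x] += w / total
    return dmap


# Hypothetical annotations for a 48 x 64 image.
heads = [(10.0, 10.0), (12.5, 11.0), (30.0, 25.0), (5.0, 40.0)]
dmap = density_map(heads, height=48, width=64)
estimated_count = sum(map(sum, dmap))  # matches len(heads) up to rounding
```

In dense regions the neighbour distances shrink and the kernels become narrow, while isolated heads receive wider kernels. A practical pipeline would likely use an array library and a spatial index rather than nested Python loops.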

Results show a clear data–architecture match: on dense Part A, VGG-D4 attains the best count accuracy (MAE 109.3, RMSE 150.4), benefiting from strong local feature extraction; on sparse Part B, ResNet-D4 performs best (MAE 19.82, RMSE 24.81), leveraging a larger effective receptive field to suppress false positives in empty regions. Deeper variants generally improve density fidelity across both families.
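
For reference, count-level MAE and RMSE of the kind quoted above are conventionally obtained by summing each density map into a per-image count and comparing it with the ground-truth count. The sketch below shows that computation alongside one plausible pixel-level variant. The function names are hypothetical, and pooling all pixels across images for the pixel-level errors is an assumption, since the exact averaging scheme is not specified here.

```python
import math


def count_errors(pred_maps, gt_maps):
    """Return (MAE, RMSE) over per-image counts, where count = sum of the map."""
    if not pred_maps or len(pred_maps) != len(gt_maps):
        raise ValueError("need two non-empty lists of equal length")
    diffs = [
        sum(map(sum, pred)) - sum(map(sum, gt))
        for pred, gt in zip(pred_maps, gt_maps)
    ]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


def pixel_errors(pred_maps, gt_maps):
    """Return (MAE, RMSE) pooled over every pixel of every image (assumed scheme)."""
    diffs = [
        p - g
        for pred, gt in zip(pred_maps, gt_maps)
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
    ]
    if not diffs:
        raise ValueError("no pixels to compare")
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Two toy 2 x 2 maps: predicted counts are 3 and 1, true counts are 2 and 3.
pred = [[[1.0, 2.0], [0.0, 0.0]], [[0.5, 0.5], [0.0, 0.0]]]
gt = [[[1.0, 1.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 0.0]]]
count_mae, count_rmse = count_errors(pred, gt)
pixel_mae, pixel_rmse = pixel_errors(pred, gt)
```

Because RMSE squares each error before averaging, it penalises a few large miscounts more heavily than MAE does, which is why the two are usually reported together.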

We also report that naive patch-based augmentation increases sample count but harms validation generalization due to distribution shift.

Contributions include:

  1. A systematic comparison that isolates receptive-field effects across depths and backbones;
  2. Quantitative evidence linking receptive field to density characteristics;
  3. Practical guidance for selecting architectures in spatially distributed labeling tasks.
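
To make the link between encoder depth and receptive field concrete, the sketch below applies the standard recurrence for the theoretical receptive field of stacked convolution and pooling layers. The layer layout (two 3x3 convolutions followed by a 2x2 max pool per stage) is a simplified VGG-like assumption, not the exact configuration of the backbones studied, and the helper names are hypothetical. It computes the theoretical receptive field only; the effective receptive field is typically smaller and depends on the trained weights.

```python
def receptive_field(layers):
    """Theoretical receptive field, in input pixels, of a layer stack.

    layers: iterable of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent output units in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def encoder_layers(depth, convs_per_stage=2):
    """Illustrative VGG-like encoder: 3x3 convs then a 2x2 max pool per stage."""
    layers = []
    for _ in range(depth):
        layers += [(3, 1)] * convs_per_stage  # 3x3 convolutions, stride 1
        layers.append((2, 2))                 # 2x2 max pooling, stride 2
    return layers


# Receptive field after each encoder depth from 1 to 4.
rf_by_depth = {d: receptive_field(encoder_layers(d)) for d in range(1, 5)}
```

Under these assumptions the receptive field more than doubles with each added stage, because every pooling step doubles the input-space stride of all later layers.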

Flavio Caroli

MSc in Artificial Intelligence at Bocconi University

Luca Colaci

MSc in Artificial Intelligence at Bocconi University

Vittorio Rossi

MSc in Artificial Intelligence at Bocconi University