Crowd Counting through Density Map Estimation
Crowd counting via density map estimation is highly sensitive to a model’s receptive field, which determines how much spatial context informs each prediction. We present a controlled study on ShanghaiTech Parts A (dense) and B (sparse) that isolates the effect of receptive field by varying depth in UNet-style encoder–decoder architectures with pretrained VGG19 and ResNet50 backbones. Ground-truth density maps are generated with geometry-adaptive Gaussians, and we evaluate both count- and pixel-level errors (MAE/RMSE). Our modified UNet outputs half-resolution density maps and uses skip connections after max pooling to focus the analysis on receptive-field behavior.
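To make the link between encoder depth and receptive field concrete, the following minimal sketch computes the theoretical receptive field of a stack of convolution and pooling layers. The helper names (`receptive_field`, `vgg_like_encoder`) and the layout of two 3x3 convolutions followed by one 2x2 max pooling per stage are illustrative assumptions, not the exact configurations evaluated in this study; the effective receptive field of a trained network is typically smaller than this theoretical value.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent output units in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def vgg_like_encoder(depth, convs_per_stage=2):
    """Hypothetical VGG-style encoder: per stage, 3x3 convs then 2x2 max pooling.

    Assumption for illustration only; real VGG19 uses more convs in later stages.
    """
    layers = []
    for _ in range(depth):
        layers += [(3, 1)] * convs_per_stage  # 3x3 convolutions, stride 1
        layers.append((2, 2))                 # 2x2 max pooling, stride 2
    return layers


if __name__ == "__main__":
    for depth in range(1, 5):
        print(depth, receptive_field(vgg_like_encoder(depth)))
```

Under these assumptions the receptive field grows faster than linearly with depth, because each pooling step doubles the input-pixel spacing that later kernels cover.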
Results show a clear data–architecture match: on dense Part A, VGG-D4 attains the best count accuracy (MAE 109.3, RMSE 150.4), benefiting from strong local feature extraction; on sparse Part B, ResNet-D4 performs best (MAE 19.82, RMSE 24.81), leveraging a larger effective receptive field to suppress false positives in empty regions. Deeper variants generally improve density fidelity across both families.
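For reference, count-level MAE and RMSE of the kind reported above can be computed as in the sketch below. It assumes the predicted count of an image is the sum of its density map and that both metrics are averaged over images; the function names and the toy values in the usage note are hypothetical and are not results from this study.

```python
import math


def density_count(density_map):
    """Predicted count: sum over all pixels of a 2D density map (list of rows)."""
    return sum(sum(row) for row in density_map)


def count_errors(pred_counts, true_counts):
    """Return (MAE, RMSE) between predicted and ground-truth per-image counts."""
    if len(pred_counts) != len(true_counts) or not pred_counts:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - t for p, t in zip(pred_counts, true_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

For example, `count_errors([10, 20], [12, 17])` averages the absolute errors 2 and 3 for the MAE and takes the square root of their mean squared value for the RMSE, so RMSE penalizes large per-image errors more heavily than MAE.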
We also find that naive patch-based augmentation increases the number of training samples but degrades validation performance because of distribution shift.
Contributions include:
- A systematic comparison that isolates receptive-field effects across depths and backbones;
- Quantitative evidence linking receptive field to density characteristics;
- Practical guidance for selecting architectures in spatially distributed labeling tasks.