ABACUS

Adapting Unified Foundation Models for Bridging Image Count Understanding and Generation

Anindya Mondal¹, Sauradip Nag², Anjan Dutta¹

¹University of Surrey, ²Simon Fraser University

Fig. 1. ABACUS unifies count understanding (left) — handling sparse, moderate and extremely dense scenes — with count-faithful generation (right), producing images that exactly match a specified count.


One model. Four tasks. Zero benchmark-specific tuning.

ABACUS is a unified vision-language model (VLM) built on a 3B-parameter foundation model. A single set of weights handles object counting, crowd counting, referring-expression counting, and count-faithful image generation, with no benchmark-specific training.

Three complementary innovations drive the model:

  • Density-aware adaptive zooming with objectness maps for spatial grounding
  • Boundary-aware count policy via GRPO (Group Relative Policy Optimization) to eliminate crop-edge errors
  • Cycle-consistent GRPO — the understanding branch self-critiques generated outputs, closing the understanding–generation gap with no external labels

The result: state-of-the-art results across 7 benchmarks, ahead of both task-specific specialists and larger generalist models.

Problems in existing count generation and understanding systems

Fig. 2. (a) Diffusion models produce wrong cardinalities. (b) VLMs default to coarse estimates on dense scenes. (c) Unified models show a synergy gap: correct counting ≠ correct generation.

How ABACUS Works

Three targeted innovations on top of a frozen foundation model.

Density-Aware Adaptive Zooming
Generates an objectness map to locate potential instances, then crops and re-processes high-density sub-regions at higher resolution before aggregating counts.
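
The zoom-then-aggregate idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the ABACUS implementation: the objectness map is a plain grid of scores, the image is split into fixed tiles, `count_fn` is a hypothetical stand-in for the model's counting call, and `density_threshold` and `zoom` are made-up parameters.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]          # objectness score per cell, assumed in [0, 1]
Box = Tuple[int, int, int, int]   # (row0, col0, row1, col1), end-exclusive


def tile_boxes(rows: int, cols: int, tile: int) -> List[Box]:
    """Split a rows x cols grid into non-overlapping tiles (row-major order)."""
    return [
        (r, c, min(r + tile, rows), min(c + tile, cols))
        for r in range(0, rows, tile)
        for c in range(0, cols, tile)
    ]


def mean_objectness(obj_map: Grid, box: Box) -> float:
    """Average objectness inside a tile, used here as a crude density proxy."""
    r0, c0, r1, c1 = box
    cells = [obj_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(cells) / len(cells)


def adaptive_zoom_count(
    obj_map: Grid,
    count_fn: Callable[[Box, int], int],  # hypothetical: count a region at a given scale
    tile: int = 2,
    density_threshold: float = 0.5,       # assumed cut-off for "high density"
    zoom: int = 2,                        # assumed upscaling factor for dense tiles
) -> Tuple[int, List[Box]]:
    """Count every tile, re-processing dense tiles at higher resolution."""
    total, zoomed = 0, []
    for box in tile_boxes(len(obj_map), len(obj_map[0]), tile):
        dense = mean_objectness(obj_map, box) >= density_threshold
        total += count_fn(box, zoom if dense else 1)
        if dense:
            zoomed.append(box)
    return total, zoomed


# Toy data: one dense tile (top right) holding 12 objects.
obj_map = [
    [0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.7, 0.9],
    [0.0, 0.0, 0.1, 0.2],
    [0.1, 0.0, 0.0, 0.1],
]
true_counts = {(0, 0, 2, 2): 1, (0, 2, 2, 4): 12, (2, 0, 4, 2): 0, (2, 2, 4, 4): 2}


def toy_counter(box: Box, scale: int) -> int:
    """Toy counter that saturates at 5 objects unless the region is zoomed."""
    n = true_counts[box]
    return n if scale > 1 else min(n, 5)


total, zoomed = adaptive_zoom_count(obj_map, toy_counter)
```

In this toy setup only the top-right tile crosses the threshold, so it is the only region counted at the higher scale; the sparse tiles are counted once at native resolution.
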
Boundary-Aware Count Policy
The GRPO reward penalises counting the same object more than once when it is split across crop boundaries, eliminating the systematic double-counting artefact that plagues tiled counting approaches.
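
One way such a reward could be shaped is sketched below. This is an illustrative guess, not the reward used by ABACUS: detections are reduced to centre points in full-image coordinates, two detections from different crops that lie within `radius` of each other are treated as fragments of one object, and `radius` and `split_penalty` are hypothetical parameters.

```python
from itertools import combinations
from math import hypot
from typing import List, Tuple

# (crop_id, x, y): detection centre in full-image coordinates (assumed format)
Detection = Tuple[int, float, float]


def cross_crop_duplicates(dets: List[Detection], radius: float) -> int:
    """Count pairs of nearby detections that come from different crops."""
    return sum(
        1
        for (crop_a, xa, ya), (crop_b, xb, yb) in combinations(dets, 2)
        if crop_a != crop_b and hypot(xa - xb, ya - yb) <= radius
    )


def boundary_aware_reward(
    dets: List[Detection],
    true_count: int,
    radius: float = 4.0,         # assumed distance below which fragments match
    split_penalty: float = 0.5,  # assumed weight of the boundary term
) -> float:
    """Relative count error plus an explicit penalty per split object."""
    count_error = abs(len(dets) - true_count) / max(true_count, 1)
    return 0.0 - count_error - split_penalty * cross_crop_duplicates(dets, radius)


# Four objects at x = 10, 30, 50, 70; the crop boundary sits at x = 50.
clean = [(0, 10.0, 10.0), (0, 30.0, 12.0), (1, 50.0, 15.0), (1, 70.0, 15.0)]
# Same scene, but the object on the boundary is reported once by each crop.
split = [(0, 10.0, 10.0), (0, 30.0, 12.0), (0, 49.0, 15.0), (1, 51.0, 15.0), (1, 70.0, 15.0)]

clean_reward = boundary_aware_reward(clean, true_count=4)
split_reward = boundary_aware_reward(split, true_count=4)
```

The split case is penalised twice over: once through the inflated total and once through the explicit duplicate term, so a policy trained on this signal has a direct incentive to merge fragments rather than report them separately.
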
Cycle-Consistent GRPO
The understanding branch verifies whether generated images match the requested count, providing a self-supervised reward signal that closes the understanding–generation gap.
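
The self-reward loop can be sketched as follows, under stated assumptions. GRPO scores each sample relative to the other samples drawn for the same prompt, which is what `group_relative_advantages` does; the reward shape in `count_reward`, and the `generate_fn` and `count_fn` callables standing in for the generation and understanding branches, are hypothetical placeholders.

```python
from statistics import mean, pstdev
from typing import Any, Callable, List, Tuple


def count_reward(predicted: int, target: int) -> float:
    """Assumed reward shape: 1.0 for an exact match, decaying with the error."""
    return 1.0 / (1.0 + abs(predicted - target))


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one sampled group (mean 0, roughly unit scale)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cycle_consistent_advantages(
    target: int,
    generate_fn: Callable[[int], Any],  # hypothetical generation branch
    count_fn: Callable[[Any], int],     # hypothetical understanding branch
    group_size: int,
) -> Tuple[List[float], List[float]]:
    """Generate a group of images, count each one, and score it against the target."""
    images = [generate_fn(target) for _ in range(group_size)]
    rewards = [count_reward(count_fn(image), target) for image in images]
    return rewards, group_relative_advantages(rewards)


# Toy stand-ins: an "image" is just the number of objects it actually contains.
fake_samples = iter([5, 4, 5, 7])
rewards, advantages = cycle_consistent_advantages(
    target=5,
    generate_fn=lambda target: next(fake_samples),
    count_fn=lambda image: image,
    group_size=4,
)
```

No ground-truth label for the generated image is needed: the requested count is the only supervision, and samples whose verified count matches it receive positive advantages while the rest are pushed down.
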

Fig. 3. ABACUS pipeline: objectness-guided adaptive zooming feeds into a boundary-aware GRPO counter, which in turn supervises the generation branch through cycle-consistent self-reward.

Density-aware adaptive zooming module

Density zooming — sub-region selection by objectness.

Objectness maps for spatial grounding

Objectness maps — instance-aware spatial grounding.

State-of-the-Art Across 7 Benchmarks

ABACUS outperforms task-specific specialists and larger generalist models on every benchmark — without benchmark-specific training.

Cite ABACUS

@article{mondal2026abacus,
  title   = {ABACUS: Adapting Unified Foundation Model for
             Bridging Image Count Understanding and Generation},
  author  = {Mondal, Anindya and Nag, Sauradip and Dutta, Anjan},
  journal = {arXiv preprint},
  year    = {2026}
}