ABACUS

Adapting Unified Foundation Models for Bridging Image Count Understanding and Generation

Anindya Mondal¹, Sauradip Nag², Anjan Dutta¹

¹ University of Surrey ² Simon Fraser University

Fig. 1. ABACUS unifies count understanding (left) — handling sparse, moderate and extremely dense scenes — with count-faithful generation (right), producing images that exactly match a specified count.

Abstract

One model. Four tasks. Zero benchmark-specific tuning.

ABACUS is a unified VLM built on a 3B-parameter foundation model that simultaneously handles object counting, crowd counting, referring-expression counting, and count-faithful image generation — with no benchmark-specific training.

Three complementary innovations drive the model:

Density-aware adaptive zooming with objectness maps for spatial grounding
Boundary-aware count policy via GRPO to eliminate crop-edge errors
Cycle-consistent GRPO — the understanding branch self-critiques generated outputs, closing the understanding–generation gap with no external labels

The result: state-of-the-art across 7 benchmarks, outperforming both task-specific specialists and larger generalist models.

Fig. 2. (a) Diffusion models produce wrong cardinalities. (b) VLMs default to coarse estimates on dense scenes. (c) Unified models show a synergy gap: correct counting ≠ correct generation.

Method

How ABACUS Works

Three targeted innovations on top of a frozen foundation model.

Density-Aware Adaptive Zooming

ABACUS generates an objectness map to locate potential instances, crops the high-density sub-regions, re-processes them at higher resolution, and then aggregates the counts.
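The zoom-then-aggregate idea can be illustrated with a minimal sketch. This is not the ABACUS implementation: the fixed grid, the mean-objectness threshold, the nearest-neighbour upsampling, and the names `tile_boxes`, `select_dense_tiles`, `zoom_and_count`, and `count_fn` are all assumptions made for illustration. `count_fn` stands in for whatever model counts objects in a crop.

```python
import numpy as np


def tile_boxes(height, width, grid):
    """Split a height x width image into grid x grid (r0, r1, c0, c1) boxes."""
    rows = np.linspace(0, height, grid + 1).astype(int)
    cols = np.linspace(0, width, grid + 1).astype(int)
    return [
        (int(rows[i]), int(rows[i + 1]), int(cols[j]), int(cols[j + 1]))
        for i in range(grid)
        for j in range(grid)
    ]


def select_dense_tiles(objectness, grid=2, density_threshold=0.3):
    """Return the tiles whose mean objectness exceeds the (assumed) threshold."""
    height, width = objectness.shape
    return [
        (r0, r1, c0, c1)
        for (r0, r1, c0, c1) in tile_boxes(height, width, grid)
        if objectness[r0:r1, c0:c1].mean() > density_threshold
    ]


def zoom_and_count(image, objectness, count_fn, grid=2, density_threshold=0.3, zoom=2):
    """Sum per-tile counts, upsampling dense tiles before they are counted."""
    height, width = objectness.shape
    dense = set(select_dense_tiles(objectness, grid, density_threshold))
    total = 0.0
    for box in tile_boxes(height, width, grid):
        r0, r1, c0, c1 = box
        crop = image[r0:r1, c0:c1]
        if box in dense:
            # Nearest-neighbour upsampling stands in for re-encoding at higher resolution.
            crop = np.repeat(np.repeat(crop, zoom, axis=0), zoom, axis=1)
        total += count_fn(crop)
    return total
```

Summing independent per-tile counts is the simplest aggregation, and it is also where objects cut by a tile edge can be counted twice.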

Boundary-Aware Count Policy

The GRPO reward penalises predictions that split a single object across crop boundaries, eliminating the systematic double-counting artefact that plagues tiled counting approaches.
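One way to picture such a reward is a count-accuracy term minus a penalty for cross-crop duplicates. The sketch below is hypothetical: the point-based prediction format, the `merge_radius` heuristic for detecting a split object, the `dup_weight` value, and the function names are assumptions, not the reward defined by the authors.

```python
import math


def count_duplicates(points_by_crop, merge_radius=4.0):
    """Count pairs of points from different crops that lie closer than merge_radius.

    Points are (x, y) tuples in global image coordinates. A close cross-crop
    pair is treated as one object split by a crop boundary (an assumption).
    """
    duplicates = 0
    for a in range(len(points_by_crop)):
        for b in range(a + 1, len(points_by_crop)):
            for xa, ya in points_by_crop[a]:
                for xb, yb in points_by_crop[b]:
                    if math.hypot(xa - xb, ya - yb) < merge_radius:
                        duplicates += 1
    return duplicates


def boundary_aware_reward(points_by_crop, gt_count, dup_weight=0.5, merge_radius=4.0):
    """Accuracy in [0, 1] minus a penalty per suspected boundary split."""
    predicted = sum(len(points) for points in points_by_crop)
    accuracy = 1.0 - min(1.0, abs(predicted - gt_count) / max(gt_count, 1))
    return accuracy - dup_weight * count_duplicates(points_by_crop, merge_radius)
```

Under these assumptions, a rollout that reports the same object in two neighbouring crops loses reward twice: once through the inflated total and once through the explicit duplicate penalty.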

Cycle-Consistent GRPO

The understanding branch verifies whether generated images match the requested count, providing a self-supervised reward signal that closes the understanding–generation gap.
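The self-reward loop can be sketched as follows, assuming a group of images is sampled for one prompt and the understanding branch returns a count for each. The linear reward shape and the function names are illustrative assumptions. The group-relative normalisation (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage, though the choice of population standard deviation and `eps` here is arbitrary.

```python
import statistics


def cycle_rewards(target_count, counted):
    """Reward each generated image by how closely its counted objects match the request.

    `counted` holds the counts the understanding branch reports for each sample.
    """
    scale = max(target_count, 1)
    return [1.0 - min(1.0, abs(c - target_count) / scale) for c in counted]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one sampled group, as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, with a requested count of 5 and counted values `[5, 4, 7, 5]`, the exact matches receive positive advantages and the sample counted as 7 receives the most negative one, so no external count labels are needed to rank the samples.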

Fig. 3. ABACUS pipeline: objectness-guided adaptive zooming feeds into a boundary-aware GRPO counter, which in turn supervises the generation branch through cycle-consistent self-reward.

Density zooming — sub-region selection by objectness.

Objectness maps — instance-aware spatial grounding.

Results

State-of-the-Art Across 7 Benchmarks

ABACUS outperforms task-specific specialists and larger generalist models on every benchmark — without benchmark-specific training.

Count Understanding

Fig. 4. ABACUS predictions (green) vs. ground truth across FSC-147, CARPK, and ShanghaiTech. Handles sparse pencils (GT: 1) to dense go-stones (GT: 261) from text-only prompts.

Count-Faithful Generation

Fig. 5. Generated images match the exact requested count while maintaining naturalistic spatial arrangements. No grid patterns, no mode collapse.

Citation

Cite ABACUS

@article{mondal2026abacus,
  title   = {ABACUS: Adapting Unified Foundation Model for
             Bridging Image Count Understanding and Generation},
  author  = {Mondal, Anindya and Nag, Sauradip and Dutta, Anjan},
  journal = {arXiv preprint},
  year    = {2026}
}