Anindya Mondal

We imagine what we understand.

We understand what we dare build.

— Anonymous

I am a final-year PhD student at the Surrey Institute for People-Centred AI (CVSSP), University of Surrey, advised by Dr. Anjan Dutta, Dr. Xiatian Zhu, and Dr. Joaquin M. Prada. I also work in close collaboration with Dr. Sauradip Nag.

My research focuses on unified vision-language models that bridge visual understanding and controllable generation. I am interested in language-grounded visual perception, instance-level scene understanding, and count-faithful image synthesis, with the broader goal of building models that can precisely perceive and generate structured visual content from natural language without task-specific architectures or supervision.

News

Jul 2026 [New] Paper on unified understanding & generation accepted as a TOG journal paper at SIGGRAPH Asia 2026

Jun 2026 [New] Starting as a Research Intern at Adobe

Jan 2025 Paper on multi-label object counting accepted at AAAI 2025!

Dec 2024 Awarded AAAI 2025 Conference Travel Grant ($1,200)

Oct 2023 Presented actor-agnostic action recognition work at ICCV Workshop 2023 in Paris

Sep 2022 Started PhD at University of Surrey with full studentship funding

Publications

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

One unified VLM that both counts objects and generates count-faithful images.

TL;DR

A single unified 3B-parameter VLM that jointly solves object counting, crowd counting, referring-expression counting, and count-faithful image generation. It combines density-aware adaptive zooming, MLLM attention-derived objectness maps, and a cycle-consistent GRPO strategy with nested local, boundary, and global rewards, achieving state-of-the-art results across seven benchmarks without any benchmark-specific training.

Anindya Mondal, Sauradip Nag, Anjan Dutta

SIGGRAPH Asia 2026 (TOG)

PDF / Project

OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Training-free open-vocabulary multi-label counting; released the OmniCount-191 benchmark.

TL;DR

Training-free multi-label counter coupling CLIP semantic embeddings for open-vocabulary category grounding with SAM point-prompt geometric priors for precise instance segmentation; introduces OmniCount-191, the first benchmark with point, bounding-box, and VQA multi-label count annotations.

Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

AAAI 2025

DOI / Project

CountLoop: Iterative Agent Guided High Instance Image Generation

Training-free planner-critic loop for accurate high-instance image generation.

TL;DR

Training-free diffusion pipeline with a VLM planner–critic loop: the planner generates structured instance layouts via chain-of-thought reasoning; the critic provides count and spatial feedback; instance-driven cross-attention masking with cumulative attention composition prevents semantic leakage across high-density scenes, reducing counting error by up to 57%.

Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, Anjan Dutta

Under Review

PDF / Project / Surrey

Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Actor-agnostic multi-label action recognition via multi-modal semantic queries.

TL;DR

Transformer decoder that treats each action class as a multi-modal semantic query (CLIP visual + text embeddings), decoupling action classification from actor-specific pose topology for actor-agnostic multi-label recognition; achieves state-of-the-art results across five benchmarks spanning human and animal actions.

Anindya Mondal, Sauradip Nag, Joaquin M. Prada, Xiatian Zhu, Anjan Dutta

ICCVW 2023

DOI / Code

Time-varying Signals Recovery via Graph Neural Networks

Graph neural network for recovering missing time-varying sensor signals.

TL;DR

Encoder-decoder GNN (TimeGNN) trained with a composite loss of MSE and a Sobolev graph smoothness operator, exploiting spatio-temporal correlations across graph topology for robust missing-entry recovery on real sensor and traffic datasets.

John A. Castro-Correa, Jhony H. Giraldo, Anindya Mondal, Mohsen Badiey, Thierry Bouwmans, Fragkiskos D. Malliaros

ICASSP 2023

PDF / DOI

Recovery of Missing Sensor Data by Reconstructing Time-varying Graph Signals

Recovers missing sensor data as smooth time-varying graph signals.

TL;DR

Formulates missing wireless sensor data recovery as time-varying graph signal reconstruction minimising a Sobolev norm combining graph Laplacian smoothness across spatial and temporal dimensions, surpassing state-of-the-art by up to 54% at high missing-data rates.

Anindya Mondal, Mayukhmali Das, Aditi Chatterjee, Palaniandavar Venkateswaran

EUSIPCO 2022

PDF / DOI / Code

Moving Object Detection for Event-based Vision using Graph Spectral Clustering

Unsupervised moving-object detection for event cameras via graph spectral clustering.

TL;DR

Maps asynchronous event streams to a k-NN graph, applies graph Laplacian spectral decomposition for eigenvector-based clustering (GSCEventMOD), and determines the optimal cluster count via spectral gap analysis, enabling fully unsupervised moving object detection for event cameras.

Anindya Mondal, Shashant R, Jhony H. Giraldo, Thierry Bouwmans, Ananda S. Chowdhury

ICCVW 2021

PDF / DOI / Code