Anindya Mondal

We imagine what we understand.

We understand what we dare build.

— Anonymous

I am a final-year PhD student at the Surrey Institute for People-Centred AI (CVSSP), University of Surrey, advised by Dr. Anjan Dutta, Dr. Xiatian Zhu, and Dr. Joaquin M. Prada. I also work in close collaboration with Dr. Sauradip Nag.

My research focuses on unified vision-language models that bridge visual understanding and controllable generation. I am interested in language-grounded visual perception, instance-level scene understanding, and count-faithful image synthesis, with the broader goal of building models that can precisely perceive and generate structured visual content from natural language without task-specific architectures or supervision.

News

Jun 2026 [New] Starting as a Research Intern at Adobe
Jan 2025 [New] Paper on multi-label object counting accepted at AAAI 2025!
Dec 2024 Awarded AAAI 2025 Conference Travel Grant ($1,200)
Oct 2023 Presented actor-agnostic action recognition work at ICCV Workshop 2023 in Paris
Sep 2022 Started PhD at University of Surrey with full studentship funding

Publications

ABACUS
ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation
TL;DR Unified 3B-parameter VLM that jointly solves object counting, crowd counting, referring-expression counting, and count-faithful image generation via density-aware adaptive zooming, MLLM attention-derived objectness maps, and a cycle-consistent GRPO strategy with nested local, boundary, and global rewards. Achieves state-of-the-art results across seven benchmarks without any benchmark-specific training.
Anindya Mondal, Sauradip Nag, Anjan Dutta Under Review
CountLoop
CountLoop: Iterative Agent Guided High Instance Image Generation
TL;DR Training-free diffusion pipeline with a VLM planner–critic loop: the planner generates structured instance layouts via chain-of-thought reasoning; the critic provides count and spatial feedback; instance-driven cross-attention masking with cumulative attention composition prevents semantic leakage across high-density scenes, reducing counting error by up to 57%.
Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, Anjan Dutta Under Review
OmniCount
OmniCount: Multi-label Object Counting with Semantic-Geometric Priors
TL;DR Training-free multi-label counter coupling CLIP semantic embeddings for open-vocabulary category grounding with SAM point-prompt geometric priors for precise instance segmentation; introduces OmniCount-191, the first benchmark with point, bounding-box, and VQA multi-label count annotations.
Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta AAAI 2025
MSQNet
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
TL;DR Transformer decoder treating each action class as a multi-modal semantic query (CLIP visual + text embeddings), decoupling action classification from actor-specific pose topology for actor-agnostic multi-label recognition; achieves state-of-the-art results across five benchmarks spanning human and animal actions.
Anindya Mondal, Sauradip Nag, Joaquin M. Prada, Xiatian Zhu, Anjan Dutta ICCVW 2023
Time-varying GNN
Time-varying Signals Recovery via Graph Neural Networks
TL;DR Encoder-decoder GNN (TimeGNN) trained with a composite loss of MSE and a Sobolev graph smoothness operator, exploiting spatio-temporal correlations across graph topology for robust missing-entry recovery on real sensor and traffic datasets.
John A. Castro-Correa, Jhony H. Giraldo, Anindya Mondal, Mohsen Badiey, Thierry Bouwmans, Fragkiskos D. Malliaros ICASSP 2023
EUSIPCO 2022
Recovery of Missing Sensor Data by Reconstructing Time-varying Graph Signals
TL;DR Formulates recovery of missing wireless sensor data as time-varying graph signal reconstruction, minimising a Sobolev norm that combines graph Laplacian smoothness across spatial and temporal dimensions; surpasses prior state-of-the-art methods by up to 54% at high missing-data rates.
Anindya Mondal, Mayukhmali Das, Aditi Chatterjee, Palaniandavar Venkateswaran EUSIPCO 2022
ICCVW 2021
Moving Object Detection for Event-based Vision using Graph Spectral Clustering
TL;DR Maps asynchronous event streams to a k-NN graph, applies graph Laplacian spectral decomposition for eigenvector-based clustering (GSCEventMOD), and determines the optimal cluster count via spectral gap analysis, enabling fully unsupervised moving object detection for event cameras.
Anindya Mondal, Shashant R, Jhony H. Giraldo, Thierry Bouwmans, Ananda S. Chowdhury ICCVW 2021

Education

Awards & Recognition

Teaching

Academic Service