My research focuses on unified vision-language models that bridge visual understanding and controllable generation.
I am interested in language-grounded visual perception, instance-level scene understanding, and count-faithful image synthesis,
with the broader goal of building models that can precisely perceive and generate structured visual content
from natural language without task-specific architectures or supervision.
News
Jun 2026 · [New] Starting as a Research Intern at Adobe
Jan 2025 · [New] Paper on multi-label object counting accepted at AAAI 2025!
Dec 2024 · Awarded AAAI 2025 Conference Travel Grant ($1,200)
Oct 2023 · Presented actor-agnostic action recognition work at ICCV Workshop 2023 in Paris
Sep 2022 · Started PhD at University of Surrey with full studentship funding
Publications
ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation
TL;DR: Single unified 3B-parameter VLM that jointly solves object counting, crowd counting, referring-expression counting, and count-faithful image generation via density-aware adaptive zooming, MLLM attention-derived objectness maps, and a cycle-consistent GRPO strategy with nested local, boundary, and global rewards, achieving state-of-the-art across seven benchmarks without any benchmark-specific training.
Anindya Mondal, Sauradip Nag, Anjan Dutta
Under Review
CountLoop: Iterative Agent Guided High Instance Image Generation
TL;DR: Training-free diffusion pipeline with a VLM planner–critic loop: the planner generates structured instance layouts via chain-of-thought reasoning; the critic provides count and spatial feedback; instance-driven cross-attention masking with cumulative attention composition prevents semantic leakage across high-density scenes, reducing counting error by up to 57%.
Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, Anjan Dutta
Under Review
OmniCount: Multi-label Object Counting with Semantic-Geometric Priors
TL;DR: Training-free multi-label counter coupling CLIP semantic embeddings for open-vocabulary category grounding with SAM point-prompt geometric priors for precise instance segmentation; introduces OmniCount-191, the first benchmark with point, bounding-box, and VQA multi-label count annotations.
Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta
AAAI 2025
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
TL;DR: Transformer decoder treating each action class as a multi-modal semantic query (CLIP visual + text embeddings), decoupling action classification from actor-specific pose topology for actor-agnostic multi-label recognition; state-of-the-art across five benchmarks spanning human and animal actions.
Anindya Mondal, Sauradip Nag, Joaquin M. Prada, Xiatian Zhu, Anjan Dutta
ICCVW 2023
Time-varying Signals Recovery via Graph Neural Networks
TL;DR: Encoder-decoder GNN (TimeGNN) trained with a composite loss of MSE and a Sobolev graph smoothness operator, exploiting spatio-temporal correlations across graph topology for robust missing-entry recovery on real sensor and traffic datasets.
John A. Castro-Correa, Jhony H. Giraldo, Anindya Mondal, Mohsen Badiey, Thierry Bouwmans, Fragkiskos D. Malliaros
ICASSP 2023
Recovery of Missing Sensor Data by Reconstructing Time-varying Graph Signals
TL;DR: Formulates missing wireless sensor data recovery as time-varying graph signal reconstruction minimising a Sobolev norm combining graph Laplacian smoothness across spatial and temporal dimensions, surpassing state-of-the-art by up to 54% at high missing-data rates.
Anindya Mondal, Mayukhmali Das, Aditi Chatterjee, Palaniandavar Venkateswaran
EUSIPCO 2022
Moving Object Detection for Event-based Vision using Graph Spectral Clustering
TL;DR: Maps asynchronous event streams to a k-NN graph, applies graph Laplacian spectral decomposition for eigenvector-based clustering (GSCEventMOD), and determines the optimal cluster count via spectral gap analysis, enabling fully unsupervised moving object detection for event cameras.
Anindya Mondal, Shashant R, Jhony H. Giraldo, Thierry Bouwmans, Ananda S. Chowdhury
ICCVW 2021
PhD in Computer Vision and AI (2022 – Present)
University of Surrey, UK · CVSSP
Advisors: Dr. Anjan Dutta, Dr. Xiatian Zhu, Dr. Joaquin M. Prada
Awards & Recognition
AAAI 2025 Conference Travel Grant ($1,200) · Philadelphia, USA
ICCV 2023 Conference Grant · Paris, France
University of Surrey Postgraduate Studentship (2022–2025) · Full PhD funding
Uplink Research Internship Award · ACM SIGKDD India Chapter
Teaching
Teaching Assistant (2023–2025) · University of Surrey
Applied Machine Learning (EEEM068), Advanced Topics in Computer Vision and Deep Learning (EEEM071), UKRI Centre for Doctoral Training (CDT)