MIRROR: Multimodal Iterative Reasoning via Reflection On Visual Regions

Haoyu Zhang1,2, Yuwei Wu1,2, Pengxiang Li1, Xintong Zhang1, Zhi Gao1,2,
Rui Gao1,2, Mingyang Gao1, Che Sun2, Yunde Jia2
1Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology 2Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University

Abstract


In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs. We propose MIRROR, a framework for multimodal iterative reasoning via reflection on visual regions. By embedding visual reflection as a core mechanism, MIRROR follows a closed-loop process: draft, critique, region-based verification, and revision. To support this, we construct ReflectV, a dataset for multi-turn supervision. Experiments show MIRROR significantly improves correctness and reduces hallucinations.

MIRROR Framework

Method: Closed-Loop Reasoning


MIRROR upgrades standard VLM inference into a closed loop of drafting, critiquing, region-based verification, and revision, anchoring each reasoning step in concrete visual evidence.
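The closed loop can be sketched as follows. This is a minimal, hypothetical illustration: the callables (`draft`, `critique`, `verify`, `revise`) stand in for the paper's VLM components, and none of these names come from the authors' actual implementation.

```python
# Hypothetical sketch of the draft -> critique -> region-based verify ->
# revise cycle. The four callables are illustrative stand-ins for VLM calls.

def mirror_loop(draft, critique, verify, revise, max_rounds=3):
    """Run the closed-loop cycle until the critic raises no issue."""
    answer = draft()                      # 1. initial draft answer
    for _ in range(max_rounds):
        issue = critique(answer)          # 2. self-critique of the draft
        if issue is None:                 # critic is satisfied: stop
            break
        evidence = verify(issue)          # 3. inspect the suspect image region
        answer = revise(answer, issue, evidence)  # 4. revise using evidence
    return answer

# Toy run with stub components in place of a real model:
result = mirror_loop(
    draft=lambda: "three cats",
    critique=lambda a: "recount left region" if a == "three cats" else None,
    verify=lambda issue: "a fourth cat is occluded by the cushion",
    revise=lambda a, issue, ev: "four cats",
)
```

The loop terminates either when the critic finds nothing to question or after a fixed budget of rounds, mirroring the iterative reflection described above.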

ReflectV Dataset


ReflectV converts static QA pairs into 24k high-quality reflective reasoning trajectories via a multi-agent pipeline and self-reflective conversion, providing multi-turn supervision for training.
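A reflective trajectory might look like the record below. This schema is purely illustrative: the field names, roles, and region encoding are assumptions, not the dataset's actual format.

```python
# Hypothetical single ReflectV record: a multi-turn reflective trajectory.
# All field names ("turns", "role", "region", ...) are illustrative only.
import json

trajectory = {
    "image": "coco/000000123456.jpg",  # placeholder image path
    "question": "How many cats are on the sofa?",
    "turns": [
        {"role": "draft", "content": "There are three cats."},
        {"role": "critique",
         "content": "The left edge is occluded; recount that region."},
        {"role": "verify", "region": [0, 120, 160, 300],  # x1, y1, x2, y2
         "content": "A fourth cat is partially hidden behind the cushion."},
        {"role": "revise", "content": "There are four cats."},
    ],
    "answer": "four",
}

record = json.dumps(trajectory)  # serialize one training example
```

Each turn supervises one stage of the closed loop, so a model trained on such trajectories learns when to question its draft and where to look.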

Experimental Results

🏆 Quantitative Comparison

Performance on General Capabilities and OCR & Document Benchmarks.

| Model | Size | MM-Vet | MMStar | Seed-2-P | TextVQA | OCRBench | ChartQA |
|---|---|---|---|---|---|---|---|
| LLaVA-OneVision | 7B | 48.80 | 61.70 | -- | 76.10 | 62.10 | 80.00 |
| InternVL3 | 2B | 54.95 | 60.70 | 64.95 | 77.00 | 82.20 | 76.08 |
| InternVL3 | 8B | 64.27 | 61.50 | 69.61 | 80.51 | 85.00 | 79.64 |
| Qwen2.5-VL-3B | 3B | 47.39 | 55.87 | 68.81 | 79.12 | 82.60 | 83.20 |
| Qwen2.5-VL-7B | 7B | 56.60 | 61.21 | 70.88 | 84.90 | 83.20 | 86.08 |
| MIRROR (w/o tool) | 7B | 59.91 | 62.80 | 70.36 | 85.37 | 88.30 | 86.56 |
| MIRROR (ours) | 7B | 66.70 | 73.33 | 76.86 | 86.62 | 92.00 | 87.92 |

Performance on Hallucination, Perception, and Math Benchmarks.

| Model | Size | POPE | HalluBench | HRBench-4k | MME-RW | VStarBench | MathVision |
|---|---|---|---|---|---|---|---|
| LLaVA-OneVision | 7B | 78.10 | 31.60 | 63.00 | -- | 72.30 | 18.30 |
| InternVL3 | 2B | 89.60 | 42.50 | 61.75 | 43.88 | 68.59 | 21.71 |
| InternVL3 | 8B | 90.37 | 49.90 | 70.00 | 48.83 | 68.06 | 20.39 |
| Qwen2.5-VL-3B | 3B | 86.21 | 63.09 | 50.25 | 42.15 | 72.77 | 25.66 |
| Qwen2.5-VL-7B | 7B | 86.45 | 68.66 | 68.87 | 44.29 | 75.39 | 23.36 |
| MIRROR (w/o tool) | 7B | 87.95 | 68.24 | 69.13 | 46.01 | 76.44 | 27.30 |
| MIRROR (ours) | 7B | 94.42 | 82.02 | 72.88 | 51.49 | 83.77 | 28.29 |

⚔️ Comparison with Reasoning Paradigms

MIRROR addresses the inherent limitations of Text Reflection and Thinking with Images by incorporating a targeted feedback loop.

| Method | OCRBench | POPE | MME-RW | MM-Vet |
|---|---|---|---|---|
| *Text Reflection* | | | | |
| VL-Rethinker | 85.40 | 84.19 | 47.21 | 56.19 |
| *Thinking with Images* | | | | |
| PixelReasoner | 82.10 | 86.03 | 49.70 | 52.98 |
| DeepEyes | 88.10 | 87.70 | 49.50 | 60.28 |
| Adaptive-CoF | 86.00 | 89.30 | 50.90 | 66.21 |
| MIRROR (ours) | 92.00 | 94.42 | 51.49 | 66.70 |

🔍 Qualitative Case Study


MIRROR demonstrates superior spatial reasoning by catching counting errors, and stronger object identification through active point-based querying of specific image regions.

Citation

@article{zhang2026mirror,
  title={MIRROR: Multimodal Iterative Reasoning via Reflection On Visual Regions},
  author={Zhang, Haoyu and Wu, Yuwei and Li, Pengxiang and Zhang, Xintong and Gao, Zhi and Gao, Rui and Gao, Mingyang and Sun, Che and Jia, Yunde},
  journal={arXiv preprint arXiv:2602.18746},
  year={2026}
}