MIRROR: Multimodal Iterative Reasoning via Reflection On Visual Regions

Haoyu Zhang1,2, Yuwei Wu1,2, Pengxiang Li1, Xintong Zhang1, Zhi Gao1,2,
Rui Gao1,2, Mingyang Gao1, Che Sun2, Yunde Jia2
1Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology 2Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University

Abstract


In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs. We propose MIRROR, a framework for multimodal iterative reasoning via reflection on visual regions. By embedding visual reflection as a core mechanism, MIRROR follows a closed-loop process: draft, critique, region-based verification, and revision. To support this, we construct ReflectV, a dataset for multi-turn supervision. Experiments show MIRROR significantly improves correctness and reduces hallucinations.

MIRROR Framework

Method: Closed-Loop Reasoning


MIRROR upgrades standard VLM inference into a closed loop of drafting, critiquing, region-based verification, and revision, anchoring each reasoning step in concrete visual evidence.
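The closed loop can be sketched as follows. This is a minimal, hypothetical illustration: the callables (`draft`, `critique`, `verify`, `revise`) stand in for the paper's VLM components, and none of these names come from the authors' actual implementation.

```python
# Hypothetical sketch of the draft -> critique -> region-based verify ->
# revise cycle. The four callables are illustrative stand-ins for VLM calls.

def mirror_loop(draft, critique, verify, revise, max_rounds=3):
    """Run the closed-loop cycle until the critic raises no issue."""
    answer = draft()                      # 1. initial draft answer
    for _ in range(max_rounds):
        issue = critique(answer)          # 2. self-critique of the draft
        if issue is None:                 # critic is satisfied: stop
            break
        evidence = verify(issue)          # 3. inspect the suspect image region
        answer = revise(answer, issue, evidence)  # 4. revise using evidence
    return answer

# Toy run with stub components in place of a real model:
result = mirror_loop(
    draft=lambda: "three cats",
    critique=lambda a: "recount left region" if a == "three cats" else None,
    verify=lambda issue: "a fourth cat is occluded by the cushion",
    revise=lambda a, issue, ev: "four cats",
)
```

The loop terminates either when the critic finds nothing to question or after a fixed budget of rounds, mirroring the iterative reflection described above.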

ReflectV Dataset


ReflectV converts static QA pairs into 24k high-quality reflective reasoning trajectories via a multi-agent pipeline and self-reflective conversion, providing multi-turn supervision for training.
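A reflective trajectory might look like the record below. This schema is purely illustrative: the field names, roles, and region encoding are assumptions, not the dataset's actual format.

```python
# Hypothetical single ReflectV record: a multi-turn reflective trajectory.
# All field names ("turns", "role", "region", ...) are illustrative only.
import json

trajectory = {
    "image": "coco/000000123456.jpg",  # placeholder image path
    "question": "How many cats are on the sofa?",
    "turns": [
        {"role": "draft", "content": "There are three cats."},
        {"role": "critique",
         "content": "The left edge is occluded; recount that region."},
        {"role": "verify", "region": [0, 120, 160, 300],  # x1, y1, x2, y2
         "content": "A fourth cat is partially hidden behind the cushion."},
        {"role": "revise", "content": "There are four cats."},
    ],
    "answer": "four",
}

record = json.dumps(trajectory)  # serialize one training example
```

Each turn supervises one stage of the closed loop, so a model trained on such trajectories learns when to question its draft and where to look.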

Experimental Results

🏆 Quantitative Comparison

Performance on General Capabilities and OCR & Document Benchmarks.

| Model | Size | MM-Vet | MMStar | Seed-2-P | TextVQA | OCRBench | ChartQA |
|---|---|---|---|---|---|---|---|
| LLaVA-OneVision | 7B | 48.80 | 61.70 | -- | 76.10 | 62.10 | 80.00 |
| InternVL3 | 2B | 54.95 | 60.70 | 64.95 | 77.00 | 82.20 | 76.08 |
| InternVL3 | 8B | 64.27 | 61.50 | 69.61 | 80.51 | 85.00 | 79.64 |
| Qwen2.5-VL-3B | 3B | 47.39 | 55.87 | 68.81 | 79.12 | 82.60 | 83.20 |
| Qwen2.5-VL-7B | 7B | 56.60 | 61.21 | 70.88 | 84.90 | 83.20 | 86.08 |
| MIRROR (w/o tool) | 7B | 59.91 | 62.80 | 70.36 | 85.37 | 88.30 | 86.56 |
| MIRROR (ours) | 7B | 66.70 | 73.33 | 76.86 | 86.62 | 92.00 | 87.92 |

Performance on Hallucination, Perception, and Math Benchmarks.

| Model | Size | POPE | HalluBench | HRBench-4k | MME-RW | VStarBench | MathVision |
|---|---|---|---|---|---|---|---|
| LLaVA-OneVision | 7B | 78.10 | 31.60 | 63.00 | -- | 72.30 | 18.30 |
| InternVL3 | 2B | 89.60 | 42.50 | 61.75 | 43.88 | 68.59 | 21.71 |
| InternVL3 | 8B | 90.37 | 49.90 | 70.00 | 48.83 | 68.06 | 20.39 |
| Qwen2.5-VL-3B | 3B | 86.21 | 63.09 | 50.25 | 42.15 | 72.77 | 25.66 |
| Qwen2.5-VL-7B | 7B | 86.45 | 68.66 | 68.87 | 44.29 | 75.39 | 23.36 |
| MIRROR (w/o tool) | 7B | 87.95 | 68.24 | 69.13 | 46.01 | 76.44 | 27.30 |
| MIRROR (ours) | 7B | 94.42 | 82.02 | 72.88 | 51.49 | 83.77 | 28.29 |

⚔️ Comparison with Reasoning Paradigms

MIRROR addresses the inherent limitations of Text Reflection and Thinking with Images by incorporating a targeted feedback loop.

| Method | OCRBench | POPE | MME-RW | MM-Vet |
|---|---|---|---|---|
| *Text Reflection* | | | | |
| VL-Rethinker | 85.40 | 84.19 | 47.21 | 56.19 |
| *Thinking with Images* | | | | |
| PixelReasoner | 82.10 | 86.03 | 49.70 | 52.98 |
| DeepEyes | 88.10 | 87.70 | 49.50 | 60.28 |
| Adaptive-CoF | 86.00 | 89.30 | 50.90 | 66.21 |
| MIRROR (ours) | 92.00 | 94.42 | 51.49 | 66.70 |

🔍 Qualitative Case Study


MIRROR demonstrates superior spatial reasoning by catching counting errors, and stronger object identification through active point-based querying of specific image regions.

Citation

@article{zhang2026mirror,
  title={MIRROR: Multimodal Iterative Reasoning via Reflection On Visual Regions},
  author={Zhang, Haoyu and Wu, Yuwei and Li, Pengxiang and Zhang, Xintong and Gao, Zhi and Gao, Rui and Gao, Mingyang and Sun, Che and Jia, Yunde},
  journal={arXiv preprint arXiv:2602.18746},
  year={2026}
}