Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

Abstract

Self-reflection has become a key mechanism for improving reasoning in Vision-Language Models (VLMs), yet this corrective mechanism often fails when resolving complex fine-grained regional ambiguities. This performance degradation stems from the issue of modality disconnect in self-reflection: most existing models execute self-reflection either within textual or latent space, lacking a mechanism to explicitly align textual reasoning with visual evidence.

In this paper, we propose MIRROR, a closed-loop visual reflection framework comprising four steps: initial response generation, error identification, region-based visual verification, and revision. In this cycle, the model first generates an initial response, identifies uncertain logical assertions that require visual verification, then grounds them in relevant image regions, and finally revises based on the visual evidence. We construct a multi-turn visual reflection dataset ReflectV, which empowers the model with such reflective capability.

Extensive experiments across 12 diverse multimodal benchmarks show that MIRROR achieves an average absolute improvement of 7.2 percentage points over the base model, with particularly strong gains in hallucination mitigation (+13.36 on HallusionBench) and general reasoning (+10.10 on MM-Vet), demonstrating the advantage of transforming self-reflection from open-loop textual revision into closed-loop, visually grounded verification.

MIRROR Framework

Method: Closed-Loop Reasoning

MIRROR upgrades standard VLM inference into a cycle of drafting, critiquing, and region-based verification to anchor reasoning in concrete visual evidence.

ReflectV Dataset

ReflectV transforms static QA into 24k high-quality reflective trajectories using a multi-agent pipeline and self-reflective conversion.

Experimental Results

🏆 Quantitative Comparison

Performance on General Capabilities and OCR & Document Benchmarks.

Model	Size	MM-Vet	MMStar	Seed-2-P	TextVQA	OCRBench	ChartQA
LLaVA-OneVision	7B	48.80	61.70	--	76.10	62.10	80.00
InternVL3	2B	54.95	60.70	64.95	77.00	82.20	76.08
InternVL3	8B	64.27	61.50	69.61	80.51	85.00	79.64
Qwen2.5-VL-3B	3B	47.39	55.87	68.81	79.12	82.60	83.20
Qwen2.5-VL-7B	7B	56.60	61.21	70.88	84.90	83.20	86.08
MIRROR (w/o tool)	7B	59.91	62.80	70.36	85.37	88.30	86.56
MIRROR (ours)	7B	66.70	73.33	76.86	86.62	92.00	87.92

Performance on Hallucination, Perception, and Math Benchmarks.

Model	Size	POPE	HalluBench	HRBench-4k	MME-RW	VStarBench	MathVison
LLaVA-OneVision	7B	78.10	31.60	63.00	--	72.30	18.30
InternVL3	2B	89.60	42.50	61.75	43.88	68.59	21.71
InternVL3	8B	90.37	49.90	70.00	48.83	68.06	20.39
Qwen2.5-VL-3B	3B	86.21	63.09	50.25	42.15	72.77	25.66
Qwen2.5-VL-7B	7B	86.45	68.66	68.87	44.29	75.39	23.36
MIRROR (w/o tool)	7B	87.95	68.24	69.13	46.01	76.44	27.30
MIRROR (ours)	7B	94.42	82.02	72.88	51.49	83.77	28.29

⚔️ Comparison with Reasoning Paradigms

MIRROR addresses the inherent limitations of existing reasoning paradigms by incorporating a closed-loop verification process. All methods are fine-tuned on Qwen2.5-VL-7B for fair comparison.

Method	OCRBench	POPE	MME-RW	MM-Vet
Text Reflection
VL-Rethinker	85.40	84.19	47.21	56.19
Visual Reflection
LookBack (Solution)	87.50	88.20	49.80	63.50
LookBack (Semantic)	88.60	89.80	50.40	65.10
Thinking with Images
PixelReasoner-SFT	76.35	80.01	44.73	47.68
PixelReasoner	82.10	86.03	49.70	52.98
DeepEyes	88.10	87.70	49.50	60.28
Adaptive-CoF-SFT	85.62	82.53	50.10	62.73
Adaptive-CoF	86.00	89.30	50.90	66.21
MIRROR (ours)	92.00	94.42	51.49	66.70

🔍 Qualitative Case Study

MIRROR demonstrates superior Spatial Reasoning by identifying counting errors and Object Identification via active point-based querying.