Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework that enables diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and, most notably, (3) reflection-level scaling, which provides explicit, actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on the state-of-the-art diffusion transformer FLUX.1-dev by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling, offering a scalable and compute-efficient path toward higher-quality image synthesis on challenging tasks. All code, checkpoints, and datasets are available.
We introduce GenRef, the first large-scale dataset tailored for text-to-image refinement, comprising over 1 million triplets structured as (flawed image, refined image, textual reflection). Our dataset spans four diverse data domains.
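Each GenRef example pairs a flawed generation with an improved counterpart and the reflection linking the two. A minimal sketch of one such record is shown below; the field names and paths are illustrative, not the released schema:

```python
from dataclasses import dataclass


@dataclass
class GenRefTriplet:
    """One GenRef-style training example (field names are illustrative)."""

    prompt: str         # the original text-to-image prompt
    flawed_image: str   # path to the lower-quality generation
    refined_image: str  # path to the improved generation of the same prompt
    reflection: str     # textual critique / refinement instruction linking the two


example = GenRefTriplet(
    prompt="a red cube to the left of a blue sphere",
    flawed_image="samples/0001_flawed.png",
    refined_image="samples/0001_refined.png",
    reflection="The cube is orange and sits to the right; make it red and move it left of the sphere.",
)
```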
Additionally, we collected 227K chain-of-thought reflection annotations using advanced multimodal LLMs such as GPT-4o. These annotations detail image differences, label image preferences, and provide concise refinement instructions. Building on them, we fine-tuned a dedicated MLLM verifier (Qwen2.5-VL-7B) to generate accurate textual reflections at scale and trained an image reward model to rigorously evaluate quality gaps between image pairs.
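As a sketch of how such a verifier can be queried, the snippet below loads the off-the-shelf Qwen2.5-VL-7B-Instruct via Hugging Face transformers and asks it to critique an image against its prompt. The fine-tuned verifier checkpoint and the exact prompt template are not shown here, so treat the model ID and the instruction wording as assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # base model; the paper fine-tunes its own verifier on GenRef

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)


def generate_reflection(image_path: str, prompt: str) -> str:
    """Ask the MLLM to critique an image against its prompt and suggest a fix."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": (
                f"This image was generated from the prompt: '{prompt}'. "
                "List any mismatches with the prompt and give a concise refinement instruction."
            )},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Strip the echoed prompt tokens and decode only the newly generated reflection.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
```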
We first efficiently fine-tune a pretrained diffusion transformer as our corrector model using the GenRef dataset, employing multimodal attention and targeted training strategies to learn image refinement.
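The snippet below is a toy illustration of the underlying idea, not the FLUX.1-dev implementation (shapes are shrunk and a plain PyTorch attention module stands in for the DiT blocks): prompt-plus-reflection text tokens, the flawed image's latent tokens, and the noisy target latents are concatenated into one sequence so a single attention pass attends jointly across all three streams.

```python
import torch

# Toy shapes: batch, text tokens, image tokens, hidden size (all illustrative).
B, T_txt, T_img, D = 2, 16, 64, 256
text_tokens = torch.randn(B, T_txt, D)    # encoded prompt + reflection
flawed_tokens = torch.randn(B, T_img, D)  # latent tokens of the flawed image (condition)
noisy_tokens = torch.randn(B, T_img, D)   # noisy target latents being denoised

# Concatenate all streams into one sequence and run full bidirectional attention,
# so corrections can flow from the text and the flawed image into the target latents.
seq = torch.cat([text_tokens, flawed_tokens, noisy_tokens], dim=1)
attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(seq, seq, seq)

# Only the noisy-latent stream is predicted; the conditioning streams carry no loss.
prediction = out[:, T_txt + T_img:]
```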
We then introduce ReflectionFlow, a versatile inference-time scaling framework designed to maximize T2I diffusion model performance through iterative refinement along three complementary dimensions: noise-level, prompt-level, and reflection-level scaling.
In each iterative round, ReflectionFlow leverages multimodal evaluations, textual reflections, and refined prompts to enhance generated image quality. The framework flexibly balances computational cost and performance by adjusting both the number of parallel image generations (search width) and iterative refinement rounds (reflection depth).
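A minimal sketch of this loop is shown below. The function names (`generate`, `verify`, `reflect`) are illustrative stand-ins for the corrector model, the verifier, and the reflection generator, and the control flow is our reading of the framework rather than the authors' released code:

```python
from typing import Any, Callable, Tuple


def reflection_flow(
    prompt: str,
    generate: Callable[..., Any],                    # corrector: (prompt, **conditions) -> image
    verify: Callable[[Any], float],                  # verifier: image -> scalar score
    reflect: Callable[[Any, str], Tuple[str, str]],  # (image, prompt) -> (reflection, refined prompt)
    width: int = 4,                                  # search width: parallel generations per round
    depth: int = 8,                                  # reflection depth: iterative refinement rounds
) -> Any:
    # Round 0: plain noise-level scaling (best-of-width over random seeds).
    candidates = [generate(prompt, seed=i) for i in range(width)]
    best = max(candidates, key=verify)

    # Rounds 1..depth: reflect on the current best, then regenerate conditioned
    # on the flawed image, the textual reflection, and the refined prompt.
    for _ in range(depth):
        reflection, refined_prompt = reflect(best, prompt)
        candidates = [
            generate(refined_prompt, seed=i, image=best, reflection=reflection)
            for i in range(width)
        ]
        # Keep the incumbent if no new candidate beats it.
        best = max(candidates + [best], key=verify)
    return best
```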
We evaluate ReflectionFlow on the GenEval benchmark, progressively applying the three scaling dimensions: noise-level, prompt-level, and reflection-level. Starting from the FLUX.1-dev baseline (0.67), noise-level scaling alone boosts the score to 0.85; adding prompt-level scaling raises it to 0.87; and integrating reflection-level scaling via explicit textual reflections yields a further substantial improvement, reaching 0.91.
Our method significantly outperforms state-of-the-art text-to-image models and existing reflection-based approaches (e.g., Reflect-DiT), clearly demonstrating that iterative refinement guided by multimodal textual feedback is highly effective for enhancing image generation quality.
We evaluate ReflectionFlow with three verifiers: GPT-4o, our fine-tuned verifier, and SANA. All verifiers demonstrate strong performance, with GPT-4o quickly reaching its ceiling, our fine-tuned verifier showing steady improvement with increased sampling, and SANA achieving the highest score (0.91) at 32 samples. Compared to the oracle upper bound (0.98), there remains significant room for improvement through enhanced verifiers.
Varying reflection depth while holding search width fixed, we find that ReflectionFlow rapidly improves generation quality as the computational budget grows. Our approach consistently outperforms the Noise Scaling and Prompt Scaling baselines, reaching a GenEval score of 0.91 at 32 samples with headroom for further gains at larger budgets.
Our analysis of different refinement strategies reveals that deeper sequential refinement consistently outperforms wider parallel generation. This confirms ReflectionFlow effectively leverages iterative reasoning to progressively correct complex errors, with optimal results achieved through higher refinement depth.
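As a back-of-the-envelope illustration, assuming the total sample budget factors as search width × reflection depth (our reading of the setup, not a quoted formula), a fixed 32-sample budget can be split several ways, and the analysis above favors the deeper, narrower configurations:

```python
# Enumerate (width, depth) splits of a fixed 32-sample budget.
budget = 32
for width in (1, 2, 4, 8, 16, 32):
    depth = budget // width
    print(f"search width {width:2d} x reflection depth {depth:2d} = {width * depth} samples")
```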
ReflectionFlow shows remarkable effectiveness on challenging tasks. When stratifying prompts by difficulty, we observe the most significant improvements on hard prompts (correctness from 0.10 to 0.81), moderate gains on medium prompts (0.55 to 0.85), and minimal changes on easy prompts (0.95 to 0.97). This suggests potential for dynamically allocating resources based on task difficulty.
Qualitative examples demonstrate that ReflectionFlow iteratively identifies and corrects detailed errors from initial image generations, progressively refining outputs to match complex prompts. This explicit, step-by-step refinement resembles interpretable chain-of-thought reasoning in language models.
@misc{zhuo2025reflectionperfectionscalinginferencetime,
      title={From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning},
      author={Le Zhuo and Liangbing Zhao and Sayak Paul and Yue Liao and Renrui Zhang and Yi Xin and Peng Gao and Mohamed Elhoseiny and Hongsheng Li},
      year={2025},
      eprint={2504.16080},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.16080},
}