Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework that enables diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and, most notably, (3) reflection-level scaling, which provides explicit, actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on the state-of-the-art diffusion transformer FLUX.1-dev by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling, offering a scalable and compute-efficient path toward higher-quality image synthesis on challenging tasks. All code, checkpoints, and datasets are available.
We introduce GenRef, the first large-scale dataset tailored for text-to-image refinement, comprising over 1 million triplets structured as (flawed image, refined image, textual reflection). Our dataset spans four diverse data domains.
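Each GenRef example pairs a flawed generation with an improved counterpart and the reflection linking the two. A minimal sketch of one such record is shown below; the field names and paths are illustrative, not the released schema:

```python
from dataclasses import dataclass


@dataclass
class GenRefTriplet:
    """One GenRef-style training example (field names are illustrative)."""

    prompt: str         # the original text-to-image prompt
    flawed_image: str   # path to the lower-quality generation
    refined_image: str  # path to the improved generation of the same prompt
    reflection: str     # textual critique / refinement instruction linking the two


example = GenRefTriplet(
    prompt="a red cube to the left of a blue sphere",
    flawed_image="samples/0001_flawed.png",
    refined_image="samples/0001_refined.png",
    reflection="The cube is orange and sits to the right; make it red and move it left of the sphere.",
)
```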
Additionally, we collected 227K chain-of-thought reflection annotations using advanced multimodal LLMs such as GPT-4o. These annotations detail image differences, label image preferences, and provide concise refinement instructions. Building on them, we fine-tuned a dedicated MLLM verifier (Qwen2.5-VL-7B) to generate accurate textual reflections at scale and trained an image reward model to rigorously evaluate quality gaps between image pairs.
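As a sketch of how such a verifier can be queried, the snippet below loads the off-the-shelf Qwen2.5-VL-7B-Instruct via Hugging Face transformers and asks it to critique an image against its prompt. The fine-tuned verifier checkpoint and the exact prompt template are not shown here, so treat the model ID and the instruction wording as assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # base model; the paper fine-tunes its own verifier on GenRef

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)


def generate_reflection(image_path: str, prompt: str) -> str:
    """Ask the MLLM to critique an image against its prompt and suggest a fix."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": (
                f"This image was generated from the prompt: '{prompt}'. "
                "List any mismatches with the prompt and give a concise refinement instruction."
            )},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Strip the echoed prompt tokens and decode only the newly generated reflection.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
```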
We first efficiently fine-tune a pretrained diffusion transformer as our corrector model using the GenRef dataset, employing multimodal attention and targeted training strategies to learn image refinement.
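The snippet below is a toy illustration of the underlying idea, not the FLUX.1-dev implementation (shapes are shrunk and a plain PyTorch attention module stands in for the DiT blocks): prompt-plus-reflection text tokens, the flawed image's latent tokens, and the noisy target latents are concatenated into one sequence so a single attention pass attends jointly across all three streams.

```python
import torch

# Toy shapes: batch, text tokens, image tokens, hidden size (all illustrative).
B, T_txt, T_img, D = 2, 16, 64, 256
text_tokens = torch.randn(B, T_txt, D)    # encoded prompt + reflection
flawed_tokens = torch.randn(B, T_img, D)  # latent tokens of the flawed image (condition)
noisy_tokens = torch.randn(B, T_img, D)   # noisy target latents being denoised

# Concatenate all streams into one sequence and run full bidirectional attention,
# so corrections can flow from the text and the flawed image into the target latents.
seq = torch.cat([text_tokens, flawed_tokens, noisy_tokens], dim=1)
attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(seq, seq, seq)

# Only the noisy-latent stream is predicted; the conditioning streams carry no loss.
prediction = out[:, T_txt + T_img:]
```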
We then introduce ReflectionFlow, a versatile inference-time scaling framework designed to maximize T2I diffusion model performance through iterative refinement along three complementary dimensions: noise-level, prompt-level, and reflection-level scaling.
In each iterative round, ReflectionFlow leverages multimodal evaluations, textual reflections, and refined prompts to enhance generated image quality. The framework flexibly balances computational cost and performance by adjusting both the number of parallel image generations (search width) and iterative refinement rounds (reflection depth).
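A minimal sketch of this loop is shown below. The function names (`generate`, `verify`, `reflect`) are illustrative stand-ins for the corrector model, the verifier, and the reflection generator, and the control flow is our reading of the framework rather than the authors' released code:

```python
from typing import Any, Callable, Tuple


def reflection_flow(
    prompt: str,
    generate: Callable[..., Any],                    # corrector: (prompt, **conditions) -> image
    verify: Callable[[Any], float],                  # verifier: image -> scalar score
    reflect: Callable[[Any, str], Tuple[str, str]],  # (image, prompt) -> (reflection, refined prompt)
    width: int = 4,                                  # search width: parallel generations per round
    depth: int = 8,                                  # reflection depth: iterative refinement rounds
) -> Any:
    # Round 0: plain noise-level scaling (best-of-width over random seeds).
    candidates = [generate(prompt, seed=i) for i in range(width)]
    best = max(candidates, key=verify)

    # Rounds 1..depth: reflect on the current best, then regenerate conditioned
    # on the flawed image, the textual reflection, and the refined prompt.
    for _ in range(depth):
        reflection, refined_prompt = reflect(best, prompt)
        candidates = [
            generate(refined_prompt, seed=i, image=best, reflection=reflection)
            for i in range(width)
        ]
        # Keep the incumbent if no new candidate beats it.
        best = max(candidates + [best], key=verify)
    return best
```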
We evaluate ReflectionFlow on the GenEval benchmark, progressively applying the three scaling dimensions: noise-level, prompt-level, and reflection-level. Starting from the FLUX.1-dev baseline (0.67), noise-level scaling alone boosts the score to 0.85; adding prompt-level scaling raises it to 0.87; and integrating reflection-level scaling via explicit textual reflections yields a further substantial improvement, reaching 0.91.
Our method significantly outperforms state-of-the-art text-to-image models and existing reflection-based approaches (e.g., Reflect-DiT), clearly demonstrating that iterative refinement guided by multimodal textual feedback is highly effective for enhancing image generation quality.
We evaluate ReflectionFlow with three verifiers: GPT-4o, our fine-tuned verifier, and SANA. All verifiers demonstrate strong performance, with GPT-4o quickly reaching its ceiling, our fine-tuned verifier showing steady improvement with increased sampling, and SANA achieving the highest score (0.91) at 32 samples. Compared to the oracle upper bound (0.98), there remains significant room for improvement through enhanced verifiers.
Varying reflection depth while holding search width fixed, we find that ReflectionFlow rapidly improves generation quality as the computational budget grows. Our approach consistently outperforms the Noise Scaling and Prompt Scaling baselines, reaching a GenEval score of 0.91 at 32 samples with headroom for further gains at larger budgets.
Our analysis of different refinement strategies reveals that deeper sequential refinement consistently outperforms wider parallel generation. This confirms ReflectionFlow effectively leverages iterative reasoning to progressively correct complex errors, with optimal results achieved through higher refinement depth.
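As a back-of-the-envelope illustration, assuming the total sample budget factors as search width × reflection depth (our reading of the setup, not a quoted formula), a fixed 32-sample budget can be split several ways, and the analysis above favors the deeper, narrower configurations:

```python
# Enumerate (width, depth) splits of a fixed 32-sample budget.
budget = 32
for width in (1, 2, 4, 8, 16, 32):
    depth = budget // width
    print(f"search width {width:2d} x reflection depth {depth:2d} = {width * depth} samples")
```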
ReflectionFlow shows remarkable effectiveness on challenging tasks. When stratifying prompts by difficulty, we observe the most significant improvements on hard prompts (correctness from 0.10 to 0.81), moderate gains on medium prompts (0.55 to 0.85), and minimal changes on easy prompts (0.95 to 0.97). This suggests potential for dynamically allocating resources based on task difficulty.
Qualitative examples demonstrate that ReflectionFlow iteratively identifies and corrects detailed errors from initial image generations, progressively refining outputs to match complex prompts. This explicit, step-by-step refinement resembles interpretable chain-of-thought reasoning in language models.
@misc{zhuo2025reflectionperfectionscalinginferencetime,
      title={From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning},
      author={Le Zhuo and Liangbing Zhao and Sayak Paul and Yue Liao and Renrui Zhang and Yi Xin and Peng Gao and Mohamed Elhoseiny and Hongsheng Li},
      year={2025},
      eprint={2504.16080},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.16080},
}