NoisyCLIP

Conditional diffusion models frequently suffer from language-image misalignment. Because intermediate noise-corrupted latents are ambiguous, assessing prompt adherence currently requires completing the entire sampling trajectory. This late-stage evaluation is especially costly under test-time scaling strategies such as Best-of-N (BoN) sampling, where every misaligned trajectory must finish generation before it can be discarded. To address this, we propose NoisyCLIP, a noise-aware twin-tower model that enables early language-to-latent alignment estimation. By training a vision encoder on noise-corrupted latents, we allow the model to "see" through the ambiguity of intermediate diffusion steps. To facilitate this training, we investigate noise-data augmentation sampling strategies and introduce two new benchmark datasets: Noisy-Conceptual-Captions and Noisy-GenAI-Bench. When applied as an early-stopping criterion for BoN, NoisyCLIP at half cost matches or beats frozen CLIP at full cost. Ultimately, this turns alignment assessment from an expensive final check into a continuous monitoring tool, reducing compute cost without sacrificing semantic fidelity.

Given a prompt, the diffusion model begins generating an image; at an intermediate step, the current latent and the original prompt are encoded and their similarity is measured. At these intermediate stages, NoisyCLIP already produces a similarity score that closely tracks the score assessed on the final image, allowing misalignments to be identified early. We fine-tune only the visual encoder with a contrastive (InfoNCE) objective over noise-corrupted latents, keeping the text encoder frozen and shared with the diffusion model's conditioning space.
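The contrastive objective named above can be illustrated with a minimal sketch. It is not the released training code: it uses plain Python lists instead of tensors, shows only the latent-to-text direction of InfoNCE, and the function names and the temperature value of 0.07 are assumptions made for illustration. The text embeddings are treated as constants, mirroring the frozen text encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def info_nce_loss(latent_embs, text_embs, temperature=0.07):
    """Latent-to-text InfoNCE; index i in both lists is a matched pair.

    latent_embs: embeddings of noise-corrupted latents (trainable encoder).
    text_embs: embeddings of the prompts (frozen encoder, constants here).
    temperature: illustrative value, an assumption of this sketch.
    """
    losses = []
    for i, z in enumerate(latent_embs):
        logits = [cosine_similarity(z, t) / temperature for t in text_embs]
        # Numerically stable log-sum-exp over all prompts in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching prompt.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch with made-up 2-D embeddings.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.2, 0.8]]   # each latent is close to its own prompt
swapped = [[0.2, 0.8], [0.9, 0.1]]   # each latent is close to the other prompt
print(info_nce_loss(aligned, texts) < info_nce_loss(swapped, texts))
```

The loss is low when each noisy latent embeds closest to its own prompt and high otherwise, which is what pushes the visual encoder toward the text encoder's space. The same cosine similarity is the score one would read off at an intermediate step.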

Images generated using a Best-of-6 strategy. We compare NoisyCLIP with CLIP and LatentCLIP-4-plus under two computation budgets: 150 and 300 diffusion steps. Our approach surpasses the baseline at the 150-step budget and matches or exceeds the alignment of the full 300-step baselines, preserving fine-grained compositional attributes at half the computation.
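To make the budget arithmetic concrete, here is a hedged sketch of early-stopping Best-of-N. The schedule is an assumption: the page does not state the steps per image or the step at which candidates are scored. The values below (50 steps per image, a check at step 20, one survivor) are one hypothetical split that totals 150 steps for six candidates, against 300 steps when all six run to completion. The scores and the `score_at` callable are made up and stand in for a noise-aware scorer.

```python
def early_stopping_best_of_n(candidates, score_at, check_step, total_steps, keep=1):
    """Pick a candidate under an early-stopping Best-of-N schedule.

    candidates: candidate identifiers such as seeds (hypothetical).
    score_at: callable (candidate, step) -> alignment score; a stand-in
        for scoring a latent at an intermediate or final step.
    Returns (best_candidate, diffusion_steps_spent).
    """
    if not 0 < check_step <= total_steps:
        raise ValueError("check_step must be in (0, total_steps]")
    # Phase 1: every candidate is denoised up to the check step.
    steps_spent = len(candidates) * check_step
    ranked = sorted(candidates, key=lambda c: score_at(c, check_step), reverse=True)
    survivors = ranked[:keep]
    # Phase 2: only the survivors run the remaining steps.
    steps_spent += len(survivors) * (total_steps - check_step)
    best = max(survivors, key=lambda c: score_at(c, total_steps))
    return best, steps_spent

# Made-up alignment scores per seed; a real scorer would depend on the step.
toy_scores = {0: 0.21, 1: 0.34, 2: 0.28, 3: 0.19, 4: 0.30, 5: 0.25}

def toy_scorer(seed, step):
    return toy_scores[seed]

seeds = list(toy_scores)
early_best, early_cost = early_stopping_best_of_n(seeds, toy_scorer, check_step=20, total_steps=50)
full_best, full_cost = early_stopping_best_of_n(seeds, toy_scorer, check_step=50, total_steps=50, keep=6)
print(early_best, early_cost, full_best, full_cost)
```

Under these assumed numbers the early-stopping run costs 6 × 20 + 1 × 30 = 150 steps and the full run costs 6 × 50 = 300. The saving holds only if the intermediate score ranks candidates the way the final score would, which is the property the noise-aware encoder is trained for.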

BibTeX

TO BE UPDATED

Early Estimation of Language to Latent Alignment in Diffusion Models

ECCV 2026

Abstract

Method

Language-to-Latent Alignment

Early-Stopping Best-of-N

Impact of Training Latent Ranges

Generalization to DiT Architectures

Qualitative Results
