Diffusion Mental Averages

Phonphrm Thawatdamrongkit, Sukit Seripanitkarn, Supasorn Suwajanakorn

VISTEC, Thailand

CVPR 2026

TL;DR

We unveil what a diffusion model 'thinks' is typical of a given concept by aligning multiple denoising trajectories into a single, sharp mental average.

Abstract

Can a diffusion model produce its own “mental average” of a concept—one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model’s semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.

"Strength"

"Denver"

"Wizard"

"Marriage"

"Breakfast"

"Clock"

"Aurora"

"Alien"

"Fear"

"Jellyfish"

"CEO"

"Childhood"

"Santorini"

"Cat"

"Dance"

"Intelligence"

"Ferris Wheel"

"Peace"

"Giraffe"

"Kungfu Master"

Approach

Given a concept prompt, we aim to synthesize an average image that captures the concept’s shared semantics under a probe diffusion model while preserving visual realism.

In diffusion models, semantic information is distributed along the denoising trajectory, evolving from coarse structure to fine details. This motivates performing averaging not in image space or within a single latent layer, but along the model’s denoising process itself. We therefore recast the problem as aligning multiple denoising trajectories so that they converge toward a shared semantic consensus.

Specifically, at each timestep, the h-space activations of all latents are averaged to form a semantic target, and each latent is optimized to match this target before being denoised to the next step. Repeating this process across timesteps aligns coarse-to-fine semantics, yielding a single "mental average" of the concept.
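The loop described above can be sketched on a toy problem. This is only an illustration under strong assumptions, not the authors' implementation: the h-space encoder is replaced by a fixed random linear map `W`, the denoiser by a simple shrinkage update, and the names `encode_h`, `align_step`, `toy_denoise`, and `final_h_spread` are all hypothetical. What it keeps is the order of operations: average the activations, optimize every latent toward that fixed target, then step to the next timestep.

```python
import math
import random


def encode_h(W, x):
    """Toy stand-in for an h-space activation: a fixed linear map W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def align_step(W, x, target, lr):
    """One gradient-descent step on ||W @ x - target||^2 with respect to x."""
    residual = [h - t for h, t in zip(encode_h(W, x), target)]
    grad = [2.0 * sum(W[k][d] * residual[k] for k in range(len(W)))
            for d in range(len(x))]
    return [xi - lr * g for xi, g in zip(x, grad)]


def toy_denoise(x, num_steps):
    """Toy stand-in for one denoising update (assumption: NOT a real sampler)."""
    keep = 1.0 - 0.5 / num_steps
    return [keep * xi for xi in x]


def h_spread(W, latents):
    """Mean distance of each latent's activation from the shared average."""
    activations = [encode_h(W, x) for x in latents]
    centre = mean_vector(activations)
    return sum(math.dist(h, centre) for h in activations) / len(activations)


def final_h_spread(inner_steps, num_latents=4, dim=6, h_dim=3,
                   num_steps=10, lr=0.05, seed=0):
    """Run the toy trajectories and report how far apart they end up."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
         for _ in range(h_dim)]
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(num_latents)]
    for _step in range(num_steps):
        # Average the activations of all latents to form the semantic target.
        target = mean_vector([encode_h(W, x) for x in latents])
        # Optimize each latent toward the target, held fixed for this timestep.
        for _ in range(inner_steps):
            latents = [align_step(W, x, target, lr) for x in latents]
        # Only then move every latent to the next timestep.
        latents = [toy_denoise(x, num_steps) for x in latents]
    return h_spread(W, latents)


aligned = final_h_spread(inner_steps=5)
unaligned = final_h_spread(inner_steps=0)
print(f"h-spread with alignment: {aligned:.4f}, without: {unaligned:.4f}")
```

Setting `inner_steps=0` disables the alignment and leaves plain independent trajectories, which gives a baseline for how much the averaging step pulls the activations together in this toy setting.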

Mental Averages as Visual Summaries

Our method generates concrete visual summaries for direct concept comparison, revealing the specific features that define their differences and offering insights into questions such as:

What distinguishes French from English gardens?

"French garden"

"English garden"

What makes cheap and luxurious cars different?

"Cheap car"

"Luxurious car"

What makes a chair comfortable?

"Chair"

"Comfortable chair"

What makes one dog faster than another?

"Slow dog"

"Fast dog"

How does the model imagine Impressionism and Expressionism?

"Impressionist artwork"

"Expressionist artwork"

What makes a bedroom feel cozy?

"A bedroom"

"A cozy bedroom"

What features define healthy vs unhealthy meals?

"A healthy meal"

"An unhealthy meal"

Same Prompt—Different Minds: Visualizing Variant Differences

Visualizing averages across model variants helps diagnose how fine-tuning modifies or preserves concept semantics. Comparing averages from different fine-tuned variants of Stable Diffusion, we observe a "Disneyfication" of castles, and find that some variants shift the 'soldier' prototype from male to female, indicating a change in gender bias.

SD1.5	Realistic Vision V5.1	Dreamshaper PixelArt	Flat-2D-Animerge

"A photo of a castle"

"A photo of a stained-glass window"

"Photo portrait of a soldier"

"A photo of a dolphin"

"A photo of Italy"

Birds and Beams: Discovering Multimodal Prototypes

A single average may oversimplify concepts like "crane," which spans distinct meanings such as a bird and a construction machine. To address this, we first cluster samples in a semantic space, such as CLIP (unsupervised) or BLIP-VQA (grounded), and then bridge the resulting clusters to the diffusion space using Textual Inversion or LoRA. This steers the denoising process toward each cluster's subregion, allowing us to run DMA on each cluster to generate sharp, mode-specific prototypes.
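The clustering step can be illustrated with a small sketch. It rests on assumptions the text does not confirm: k-means is used as the clustering algorithm (the text only says that samples are clustered), and the 2-D vectors are made-up placeholders for real CLIP image embeddings. The helper `kmeans` is hypothetical, and the bridging step with Textual Inversion or LoRA is not shown.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Placeholder embeddings for samples of "crane" (illustrative values only).
bird_like = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0)]
machine_like = [(0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
labels, centroids = kmeans(bird_like + machine_like, k=2)
print(labels)
```

Each resulting group of samples would then be used to fit a per-cluster Textual Inversion token or LoRA, so that the averaging procedure can be run separately within each mode.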