Forward and reverse processes
DDPMs define a forward corruption process that gradually adds Gaussian noise, together with a learned reverse process that removes it. In a simplified form:

\[ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \]

where \(\beta_t\) is the noise schedule and \(\mu_\theta, \Sigma_\theta\) are parameterized by the network.
Training is usually implemented through a noise-prediction objective, which avoids modelling the full reverse density directly.
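The noise-prediction objective can be sketched in a few lines. The snippet below is a minimal, illustrative training step in numpy (the schedule values, the `q_sample` helper, and the zero-returning toy model are assumptions for the sketch, not any particular implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (illustrative)
alphas_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Diffuse a clean sample x0 to timestep t in closed form."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def noise_prediction_loss(model, x0):
    """One training step of the simplified noise-prediction objective."""
    t = rng.integers(0, T)               # uniform random timestep
    eps = rng.standard_normal(x0.shape)  # target noise
    x_t = q_sample(x0, t, eps)           # noisy input at timestep t
    eps_hat = model(x_t, t)              # network predicts the injected noise
    return np.mean((eps - eps_hat) ** 2) # simple MSE between true and predicted noise

# Toy "model" that returns zeros, just to show the plumbing end to end.
loss = noise_prediction_loss(lambda x_t, t: np.zeros_like(x_t),
                             rng.standard_normal(8))
```

Because the target is the injected noise itself, the network never has to represent the full reverse density, only a regression target with a closed-form input.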
Why latent diffusion matters
Rombach et al. showed that the denoising process can be run in the latent space of an autoencoder rather than on raw pixels. Let \(z = \mathcal{E}(x)\) be a compressed representation and \(x \approx \mathcal{D}(z)\) its decoder reconstruction. Then the diffusion model learns over \(z\), reducing cost while preserving high-level semantics.
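The plumbing can be illustrated with toy stand-ins for \(\mathcal{E}\) and \(\mathcal{D}\). The average-pooling encoder and nearest-neighbour decoder below are illustrative assumptions, not the learned autoencoder of Rombach et al.; the point is only that the diffused variable shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Toy encoder E: 4x average-pooling as a stand-in for a learned encoder."""
    return x.reshape(-1, 4).mean(axis=1)

def decode(z):
    """Toy decoder D: nearest-neighbour upsampling back to pixel resolution."""
    return np.repeat(z, 4)

x = rng.standard_normal(64)       # "image" with 64 pixels
z = encode(x)                     # diffusion now operates on 16 latents
x_rec = decode(z)                 # approximate reconstruction x ~ D(E(x))
```

Every denoising step now touches a quarter of the dimensions, which is where the cost reduction comes from; a real autoencoder additionally makes the latents semantically meaningful.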
Guidance and conditioning
Text conditioning and classifier-free guidance greatly improved the practical controllability of diffusion systems. In practice, the model is trained both with and without the conditioning signal (for example, by randomly dropping the text embedding during training), and the two predictions are combined during sampling to strengthen alignment with the prompt.
eps_uncond = model(z_t, t, cond=None)               # unconditional noise prediction
eps_cond = model(z_t, t, cond=text_embedding)       # text-conditioned noise prediction
eps = eps_uncond + scale * (eps_cond - eps_uncond)  # scale > 1 pushes toward the prompt
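The combination step above can be run end to end with a toy predictor. Everything in this sketch is an assumption for illustration: the model is a stand-in whose conditional branch simply adds the embedding, and the guidance scale of 7.5 is a commonly used value, not a requirement:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(z_t, t, cond=None):
    """Toy noise predictor: conditioning adds a deterministic offset."""
    base = 0.1 * z_t                 # stand-in for eps_theta(z_t, t)
    return base if cond is None else base + cond

z_t = rng.standard_normal(4)
text_embedding = np.ones(4)          # hypothetical prompt embedding
scale = 7.5                          # guidance scale; > 1 amplifies the prompt direction

eps_uncond = model(z_t, 0)                          # unconditional prediction
eps_cond = model(z_t, 0, cond=text_embedding)       # conditioned prediction
eps = eps_uncond + scale * (eps_cond - eps_uncond)  # classifier-free guidance combination
```

Note that the guided estimate moves along the direction `eps_cond - eps_uncond`, so with this toy model the correction is exactly `scale * text_embedding`; a larger `scale` trades sample diversity for prompt adherence.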
This is one reason diffusion models remain scientifically attractive: the mechanisms are modular and mathematically legible even when the surrounding systems become large and multimodal.