Forward and reverse processes
DDPMs define a forward corruption process that gradually adds Gaussian noise, together with a learned reverse process that removes it. In a simplified form:

\[ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \]

where \(\beta_t\) is the noise schedule and \(\mu_\theta, \Sigma_\theta\) are parameterized by the network.
Training is usually implemented through a noise-prediction objective, which avoids modelling the full reverse density directly.
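The noise-prediction objective can be sketched in a few lines. The snippet below is a minimal, illustrative training step in numpy (the schedule values, the `q_sample` helper, and the zero-returning toy model are assumptions for the sketch, not any particular implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (illustrative)
alphas_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Diffuse a clean sample x0 to timestep t in closed form."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def noise_prediction_loss(model, x0):
    """One training step of the simplified noise-prediction objective."""
    t = rng.integers(0, T)               # uniform random timestep
    eps = rng.standard_normal(x0.shape)  # target noise
    x_t = q_sample(x0, t, eps)           # noisy input at timestep t
    eps_hat = model(x_t, t)              # network predicts the injected noise
    return np.mean((eps - eps_hat) ** 2) # simple MSE between true and predicted noise

# Toy "model" that returns zeros, just to show the plumbing end to end.
loss = noise_prediction_loss(lambda x_t, t: np.zeros_like(x_t),
                             rng.standard_normal(8))
```

Because the target is the injected noise itself, the network never has to represent the full reverse density, only a regression target with a closed-form input.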
Why latent diffusion matters
Rombach et al. showed that the denoising process can be run in the latent space of an autoencoder rather than on raw pixels. Let \(z = \mathcal{E}(x)\) be a compressed representation and \(x \approx \mathcal{D}(z)\) its decoder reconstruction. Then the diffusion model learns over \(z\), reducing cost while preserving high-level semantics.
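The plumbing can be illustrated with toy stand-ins for \(\mathcal{E}\) and \(\mathcal{D}\). The average-pooling encoder and nearest-neighbour decoder below are illustrative assumptions, not the learned autoencoder of Rombach et al.; the point is only that the diffused variable shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Toy encoder E: 4x average-pooling as a stand-in for a learned encoder."""
    return x.reshape(-1, 4).mean(axis=1)

def decode(z):
    """Toy decoder D: nearest-neighbour upsampling back to pixel resolution."""
    return np.repeat(z, 4)

x = rng.standard_normal(64)       # "image" with 64 pixels
z = encode(x)                     # diffusion now operates on 16 latents
x_rec = decode(z)                 # approximate reconstruction x ~ D(E(x))
```

Every denoising step now touches a quarter of the dimensions, which is where the cost reduction comes from; a real autoencoder additionally makes the latents semantically meaningful.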
Guidance and conditioning
Text conditioning and classifier-free guidance greatly improved the practical controllability of diffusion systems. In practice, the model is trained both with and without the conditioning signal (for example, by randomly dropping the text embedding during training), and the two predictions are combined during sampling to strengthen alignment with the prompt.
eps_uncond = model(z_t, t, cond=None)               # unconditional noise prediction
eps_cond = model(z_t, t, cond=text_embedding)       # text-conditioned noise prediction
eps = eps_uncond + scale * (eps_cond - eps_uncond)  # scale > 1 pushes toward the prompt
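The combination step above can be run end to end with a toy predictor. Everything in this sketch is an assumption for illustration: the model is a stand-in whose conditional branch simply adds the embedding, and the guidance scale of 7.5 is a commonly used value, not a requirement:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(z_t, t, cond=None):
    """Toy noise predictor: conditioning adds a deterministic offset."""
    base = 0.1 * z_t                 # stand-in for eps_theta(z_t, t)
    return base if cond is None else base + cond

z_t = rng.standard_normal(4)
text_embedding = np.ones(4)          # hypothetical prompt embedding
scale = 7.5                          # guidance scale; > 1 amplifies the prompt direction

eps_uncond = model(z_t, 0)                          # unconditional prediction
eps_cond = model(z_t, 0, cond=text_embedding)       # conditioned prediction
eps = eps_uncond + scale * (eps_cond - eps_uncond)  # classifier-free guidance combination
```

Note that the guided estimate moves along the direction `eps_cond - eps_uncond`, so with this toy model the correction is exactly `scale * text_embedding`; a larger `scale` trades sample diversity for prompt adherence.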
This is one reason diffusion models remain scientifically attractive: the mechanisms are modular and mathematically legible even when the surrounding systems become large and multimodal.