From log evidence to the ELBO

Starting from \(\log p_\theta(x)\), insert the variational posterior \(q_\phi(z \mid x)\) and rearrange:

$$\log p_\theta(x) = \mathcal{L}(x; \theta, \phi) + D_{\mathrm{KL}}\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right),$$

where

$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$

Because the KL term is nonnegative, \(\mathcal{L}\) is a lower bound on the log evidence, hence the name evidence lower bound (ELBO).
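For completeness, the identity above follows from one expansion of the log evidence under \(q_\phi\):

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right],$$

where the first term equals \(\mathcal{L}\) (split \(p_\theta(x, z) = p_\theta(x \mid z)\,p(z)\) and regroup) and the second term is the KL divergence in the decomposition above.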

Why amortization matters

Classical variational inference solves a separate optimization problem for each datapoint. Amortized inference replaces that with a shared inference network \(q_\phi(z \mid x)\) whose weights are reused across the dataset. This is computationally efficient, but it introduces an amortization gap: a single encoder must approximate many local posteriors, and cannot fit each one as well as per-datapoint optimization would.
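To make "shared inference network" concrete, here is a minimal sketch of an amortized Gaussian encoder. The linear maps, dimensions, and names (`W_mu`, `W_logvar`, `encode`) are illustrative assumptions, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights serves every datapoint: this is the amortization.
x_dim, z_dim = 4, 2
W_mu = rng.normal(size=(x_dim, z_dim))
W_logvar = rng.normal(size=(x_dim, z_dim))

def encode(x):
    # A single shared map x -> (mu, logvar). Classical VI would instead
    # optimize a separate (mu, logvar) pair for each datapoint.
    return x @ W_mu, x @ W_logvar

batch = rng.normal(size=(8, x_dim))
mu, logvar = encode(batch)
print(mu.shape, logvar.shape)  # each datapoint gets its own Gaussian parameters
```

Each row of `mu` and `logvar` parameterizes one local posterior, but all of them come from the same two weight matrices, which is what makes the inference amortized.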

Minimal implementation sketch

import torch
import torch.nn.functional as F

# Encode, then sample z with the reparameterization trick
mu, logvar = encoder(x)            # encoder maps x to Gaussian parameters
std = torch.exp(0.5 * logvar)      # parameterize variance via logvar for numerical stability
eps = torch.randn_like(std)
z = mu + eps * std                 # differentiable sample from q(z|x)
recon = decoder(z)                 # decoder output must lie in [0, 1], e.g. via a sigmoid

# Negative ELBO: reconstruction term plus closed-form Gaussian-vs-standard-normal KL
recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl
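The `kl` line above is the closed-form KL between a diagonal Gaussian \(q\) and a standard normal prior. As a sanity check, it can be compared against direct numerical integration of \(\int q(z) \log\frac{q(z)}{p(z)}\,dz\) in one dimension (the helpers `kl_closed_form` and `kl_numeric` below are illustrative, not from the text):

```python
import math

def kl_closed_form(mu, logvar):
    # Per-dimension term from the loss above: KL( N(mu, exp(logvar)) || N(0, 1) )
    return -0.5 * (1 + logvar - mu**2 - math.exp(logvar))

def kl_numeric(mu, logvar, lo=-12.0, hi=12.0, n=100001):
    # Riemann-sum approximation of the KL integral over a wide grid
    var = math.exp(logvar)
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        q = math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
        p = math.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
        if q > 0:
            total += q * math.log(q / p) * h
    return total

print(kl_closed_form(0.5, math.log(0.8)))  # ≈ 0.1366, matches the numerical integral
```

The two agree to high precision, which confirms that the one-line `kl` expression really is the analytic KL term of the ELBO.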

The value of this framework extends well beyond VAEs: ELBO-style reasoning appears across latent-variable modelling, probabilistic sequence models, and Bayesian deep learning.