From log evidence to the ELBO

Starting from \(\log p_\theta(x)\), insert the variational posterior \(q_\phi(z \mid x)\) and rearrange:

$$\log p_\theta(x) = \mathcal{L}(x; \theta, \phi) + D_{\mathrm{KL}}\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right),$$

where

$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$

Because the KL term is nonnegative, \(\mathcal{L}\) is a lower bound on the log evidence, hence the name evidence lower bound (ELBO).
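For completeness, the identity above follows from one expansion of the log evidence under \(q_\phi\):

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right],$$

where the first term equals \(\mathcal{L}\) (split \(p_\theta(x, z) = p_\theta(x \mid z)\,p(z)\) and regroup) and the second term is the KL divergence in the decomposition above.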

Why amortization matters

Classical variational inference solves a separate optimization problem for each datapoint. Amortized inference replaces that with a shared inference network \(q_\phi(z \mid x)\) whose weights are reused across the dataset. This is computationally efficient, but it introduces an amortization gap: a single encoder must approximate many local posteriors, and cannot fit each one as well as per-datapoint optimization would.
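To make "shared inference network" concrete, here is a minimal sketch of an amortized Gaussian encoder. The linear maps, dimensions, and names (`W_mu`, `W_logvar`, `encode`) are illustrative assumptions, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights serves every datapoint: this is the amortization.
x_dim, z_dim = 4, 2
W_mu = rng.normal(size=(x_dim, z_dim))
W_logvar = rng.normal(size=(x_dim, z_dim))

def encode(x):
    # A single shared map x -> (mu, logvar). Classical VI would instead
    # optimize a separate (mu, logvar) pair for each datapoint.
    return x @ W_mu, x @ W_logvar

batch = rng.normal(size=(8, x_dim))
mu, logvar = encode(batch)
print(mu.shape, logvar.shape)  # each datapoint gets its own Gaussian parameters
```

Each row of `mu` and `logvar` parameterizes one local posterior, but all of them come from the same two weight matrices, which is what makes the inference amortized.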

Minimal implementation sketch

import torch
import torch.nn.functional as F

# Encode, then sample z with the reparameterization trick
mu, logvar = encoder(x)            # encoder maps x to Gaussian parameters
std = torch.exp(0.5 * logvar)      # parameterize variance via logvar for numerical stability
eps = torch.randn_like(std)
z = mu + eps * std                 # differentiable sample from q(z|x)
recon = decoder(z)                 # decoder output must lie in [0, 1], e.g. via a sigmoid

# Negative ELBO: reconstruction term plus closed-form Gaussian-vs-standard-normal KL
recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl
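The `kl` line above is the closed-form KL between a diagonal Gaussian \(q\) and a standard normal prior. As a sanity check, it can be compared against direct numerical integration of \(\int q(z) \log\frac{q(z)}{p(z)}\,dz\) in one dimension (the helpers `kl_closed_form` and `kl_numeric` below are illustrative, not from the text):

```python
import math

def kl_closed_form(mu, logvar):
    # Per-dimension term from the loss above: KL( N(mu, exp(logvar)) || N(0, 1) )
    return -0.5 * (1 + logvar - mu**2 - math.exp(logvar))

def kl_numeric(mu, logvar, lo=-12.0, hi=12.0, n=100001):
    # Riemann-sum approximation of the KL integral over a wide grid
    var = math.exp(logvar)
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        q = math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
        p = math.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
        if q > 0:
            total += q * math.log(q / p) * h
    return total

print(kl_closed_form(0.5, math.log(0.8)))  # ≈ 0.1366, matches the numerical integral
```

The two agree to high precision, which confirms that the one-line `kl` expression really is the analytic KL term of the ELBO.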

The value of this framework extends well beyond VAEs: ELBO-style reasoning appears across latent-variable modelling, probabilistic sequence models, and Bayesian deep learning.