InfoNCE objective

Given an anchor \(i\) and its positive \(j\) (the other augmented view of the same example), with negatives drawn from the rest of the batch, SimCLR minimizes an InfoNCE-style objective:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}.$$

Here \(\mathrm{sim}(u, v) = u^\top v / (\|u\|\,\|v\|)\) is cosine similarity. The temperature \(\tau\) controls how sharply the softmax separates positives from negatives: a smaller \(\tau\) sharpens the distribution and upweights hard negatives. In practice, the choice of augmentations is often as important as the loss itself, because the augmentations define which invariances the representation should learn.
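The effect of the temperature can be seen with a toy computation. The sketch below (illustrative numbers only; the function name is mine) evaluates the loss for one anchor whose positive has cosine similarity 0.9 and whose two negatives sit at 0.1:

```python
import math

def info_nce(sims, pos_idx, tau):
    """InfoNCE loss for one anchor, given its similarities to all candidates.

    sims: sim(z_i, z_k) for every candidate k != i
    pos_idx: index of the positive within sims
    tau: temperature
    """
    logits = [s / tau for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[pos_idx] - m - math.log(denom))

sims = [0.9, 0.1, 0.1]                 # one positive, two negatives
sharp = info_nce(sims, 0, tau=0.1)     # low temperature: near-one-hot softmax
soft = info_nce(sims, 0, tau=1.0)      # high temperature: flatter softmax
print(sharp < soft)                    # prints True
```

When the positive already dominates, a lower temperature drives the loss toward zero; a higher temperature keeps a non-trivial gradient flowing from the negatives.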

Alignment and uniformity

A useful geometric interpretation is that the objective balances two tendencies: alignment, which pulls matched views together, and uniformity, which spreads embeddings broadly over the hypersphere. Good representations are not only close for matched views; they are also globally spread enough to avoid collapse.
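Both tendencies can be measured directly on a batch of L2-normalized embeddings. A minimal sketch (function names and defaults are my own, following the common alignment/uniformity formulation):

```python
import torch

def alignment(z_a, z_b, alpha=2):
    # Mean distance between positive pairs; rows of z_a and z_b are matched views.
    # Lower is better: 0 means every pair of views maps to the same point.
    return (z_a - z_b).norm(dim=1).pow(alpha).mean()

def uniformity(z, t=2):
    # Log of the mean Gaussian potential over all pairs of embeddings.
    # Lower (more negative) is better: it means points are spread over the sphere.
    return torch.pdist(z).pow(2).mul(-t).exp().mean().log()
```

A collapsed encoder scores perfectly on alignment (0) but poorly on uniformity (0, the maximum), which is why the loss must trade the two off rather than optimize either alone.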

Implementation sketch

import torch
import torch.nn.functional as F

# augment, encoder, projector, and temperature are assumed defined elsewhere.
views_a = augment(batch)   # first random view of each example
views_b = augment(batch)   # second, independently augmented view
z_a = F.normalize(projector(encoder(views_a)), dim=-1)  # L2-normalize embeddings
z_b = F.normalize(projector(encoder(views_b)), dim=-1)
logits = z_a @ z_b.T / temperature  # pairwise cosine similarities, scaled by tau
labels = torch.arange(batch.size(0), device=batch.device)  # positives on the diagonal
loss = F.cross_entropy(logits, labels)
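The sketch above only treats view a as the anchor. SimCLR's NT-Xent loss uses every augmented example as an anchor, and a common simplified way to capture that is to average the two directions. A hedged sketch (the function name is mine, and unlike the full SimCLR loss it omits same-view negatives):

```python
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, temperature=0.5):
    """Symmetrized contrastive loss over two batches of L2-normalized embeddings.

    Simplification: each view is contrasted only against the other view's batch;
    the full SimCLR loss also includes same-view negatives.
    """
    logits = z_a @ z_b.T / temperature            # cosine similarities (z normalized)
    labels = torch.arange(z_a.size(0), device=z_a.device)
    loss_ab = F.cross_entropy(logits, labels)     # anchors in view a, positives in b
    loss_ba = F.cross_entropy(logits.T, labels)   # and the reverse direction
    return (loss_ab + loss_ba) / 2
```

Averaging the two directions keeps the loss symmetric in the views, so neither augmentation branch is privileged.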

Contrastive learning matters beyond vision. The same logic appears in language, audio, time series, and biosignal pretraining, especially when labels are expensive and augmentations reflect domain knowledge.