Fusion as a representation problem

For multimodal physiological streams such as ECG, PPG, respiration, or polysomnography (PSG) channels, a useful abstraction is to encode each modality separately and fuse the results in a shared latent space:

$$z_m = f_m(x_m), \qquad z_{\mathrm{fused}} = g(z_1, z_2, \dots, z_M).$$

The fusion operator \(g\) may be concatenation, cross-attention, or a mixture-of-experts mechanism. The scientific challenge is ensuring that the shared embedding preserves physiological meaning rather than exploiting dataset-specific shortcuts.
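As a concrete illustration, here is a minimal PyTorch sketch of two of these fusion operators: concatenation and cross-attention via a learned query token. All class names and the choice of a single learned query are illustrative assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """g as concatenation of modality embeddings plus a linear projection."""
    def __init__(self, d, n_modalities):
        super().__init__()
        self.proj = nn.Linear(d * n_modalities, d)

    def forward(self, zs):  # zs: (batch, M, d)
        b, m, d = zs.shape
        return self.proj(zs.reshape(b, m * d))

class CrossAttentionFusion(nn.Module):
    """g as a single learned query attending over the M modality tokens."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d))
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, zs):  # zs: (batch, M, d)
        q = self.query.expand(zs.size(0), -1, -1)
        fused, _ = self.attn(q, zs, zs)  # query attends to modality tokens
        return fused.squeeze(1)

zs = torch.randn(8, 3, 64)  # batch of 8, M = 3 modalities, d = 64
print(ConcatFusion(64, 3)(zs).shape)       # torch.Size([8, 64])
print(CrossAttentionFusion(64)(zs).shape)  # torch.Size([8, 64])
```

Both variants map \((z_1, \dots, z_M)\) to a single fused vector of dimension \(d\); a mixture-of-experts operator would follow the same interface but route tokens through expert subnetworks.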

What recent work suggests

Recent foundation-model work on PPG and multimodal sleep data suggests that scale helps, but only when the training setup respects signal morphology, temporal structure, and cross-cohort variability. In practice, model value depends on transfer across datasets and devices, not on one benchmark win.

Minimal fusion sketch

import torch

z_ecg = ecg_encoder(ecg)    # (batch, d) per-modality embeddings
z_ppg = ppg_encoder(ppg)
z_resp = resp_encoder(resp)
# Stack as a length-3 modality-token sequence: (batch, 3, d)
z = fusion_block(torch.stack([z_ecg, z_ppg, z_resp], dim=1))
prediction = head(z.mean(dim=1))  # mean-pool over modality tokens
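The sketch above leaves the encoders, fusion block, and head abstract. A self-contained version can be assembled under stated assumptions: small MLP encoders over fixed-length windows, a single Transformer layer as the fusion block, and a two-class head. The window lengths, sampling rates, and dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

d = 64  # shared embedding dimension (assumption)

def make_encoder(in_len, d=d):
    # Hypothetical per-modality encoder: an MLP over a fixed-length window
    return nn.Sequential(nn.Linear(in_len, 128), nn.ReLU(), nn.Linear(128, d))

ecg_encoder = make_encoder(1000)  # e.g. 10 s of ECG at 100 Hz
ppg_encoder = make_encoder(640)   # e.g. 10 s of PPG at 64 Hz
resp_encoder = make_encoder(250)  # e.g. 10 s of respiration at 25 Hz

# Self-attention over the 3 modality tokens acts as the fusion operator g
fusion_block = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
head = nn.Linear(d, 2)  # hypothetical binary task head

ecg = torch.randn(8, 1000)
ppg = torch.randn(8, 640)
resp = torch.randn(8, 250)

z = fusion_block(torch.stack(
    [ecg_encoder(ecg), ppg_encoder(ppg), resp_encoder(resp)], dim=1))
prediction = head(z.mean(dim=1))
print(prediction.shape)  # torch.Size([8, 2])
```

Note that each modality can keep its native sampling rate and window length; only the embedding dimension \(d\) must be shared.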

The long-term interest of this area is clear: multimodal biosignal models could support disease risk prediction, continuous monitoring, and adaptation across wearable and clinical settings. But scientific caution matters here, because robustness across cohorts, fairness across subgroups, and probability calibration all remain open problems.
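Calibration, at least, is straightforward to quantify. A minimal expected-calibration-error (ECE) sketch follows; the equal-width binning scheme and bin count are conventional assumptions, not the only choice.

```python
import torch

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap weighted by bin size."""
    conf, pred = probs.max(dim=1)          # top-class confidence and label
    correct = (pred == labels).float()
    edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = (correct[mask].mean() - conf[mask].mean()).abs()
            ece += mask.float().mean() * gap
    return ece.item()

probs = torch.softmax(torch.randn(256, 2), dim=1)  # dummy predictions
labels = torch.randint(0, 2, (256,))
print(expected_calibration_error(probs, labels))
```

A well-calibrated model has ECE near zero; tracking this metric per cohort and per device is one concrete way to make the robustness concerns above measurable.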