Abstract

Multimodal biosignal foundation models are becoming a serious research direction because modern physiological monitoring is no longer organized around a single waveform. Clinical and ambulatory systems increasingly acquire combinations of electrocardiography, photoplethysmography, electroencephalography, electromyography, respiration, airflow, oxygen saturation, phonocardiography, and structured clinical text. The resulting problem is not only one of signal classification, but one of representation learning under heterogeneity: different sensors operate at different sampling rates, channels are missing unpredictably, temporal alignment is imperfect, labels are sparse, and the target of inference is often a latent physiological state rather than a local pattern. This article develops the field from first principles to current systems. It argues that multimodal biosignal foundation models should be understood as latent-state learners with modality-robust interfaces, not merely as larger encoders with more channels. The article formalizes the problem, reviews the architectural and objective families now shaping the literature, summarizes the most important recent systems, and provides practical guidance for implementation, evaluation, and research design. The central claim is that the true frontier is not scale alone, but physiologically valid invariance: learning representations that remain useful when sensors, subjects, devices, and care settings change.

Introduction

The idea of a foundation model, in the modern machine-learning sense, is a model trained on broad data at scale and adapted to many downstream tasks [1]. In biosignal processing, this concept is especially attractive because raw physiological data are abundant while reliable labels are expensive, narrow, and institution-specific. Electrocardiograms, polysomnography, wearable photoplethysmography, intensive-care waveforms, and neurophysiological recordings all produce long, information-rich streams whose clinically useful structure extends beyond any single supervised endpoint.

Yet a unimodal view is no longer sufficient. In real systems, physiology is observed through interacting sensors. Cardiac electrical activity co-varies with vascular pulse transit; sleep state is reflected jointly in EEG, EOG, EMG, airflow, respiratory effort, oxygen saturation, and ECG; synchronized ECG and phonocardiography reveal the timing relation between electrical and mechanical cardiac events; bedside monitoring combines waveforms with text, alarms, and metadata. For this reason, the scientifically relevant object is increasingly a multimodal representation of latent physiological state.

Multimodal biosignal foundation models attempt to learn that representation through large-scale pretraining, usually with self-supervised or weakly supervised objectives. Their promise is not merely better classification accuracy. It is a shift in problem formulation:

  • from hand-built, task-specific pipelines to reusable pretrained representations;
  • from fixed channel assumptions to variable sensor configurations;
  • from single-task optimization to broad transfer across diagnosis, staging, risk, monitoring, and question answering;
  • from narrow waveform analysis to cross-modal physiological inference.

As of March 28, 2026, this shift is no longer speculative. Published systems now include cross-modal cardiovascular autoencoders [4], multimodal masked-autoencoding models for physiological data [5], synchronous PCG-ECG cardiac foundation models [6], large-scale multimodal sleep foundation models [7], multimodal cardiac sensing models spanning ECG, PPG, and text from 1.7 million individuals [8], and robustness-oriented multimodal physiological pretraining methods explicitly designed for arbitrary missing modalities [9]. The field is early, but its direction is now clear.

Figure 1. Overview of multimodal biosignal foundation models: from heterogeneous sensor streams to reusable physiological representations.

Why multimodality is the correct formulation

Classical biosignal analysis often begins by treating each signal as an independent object. This is reasonable when the task itself is local, such as QRS detection in ECG or artifact suppression in a single EEG channel. However, many clinically consequential questions are not local in that sense. They concern state, coupling, coordination, and failure across physiological subsystems.

Let

\[ \mathcal{M} = \{1, \dots, M\} \]

denote the set of sensing modalities and let \(x^{(m)}_{1:T_m}\) denote the time series from modality \(m\). A latent physiological-state formulation writes

\[ z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, u_t), \qquad x_t^{(m)} \sim p_\theta^{(m)}(x_t^{(m)} \mid z_t, \epsilon_t^{(m)}), \]

where \(z_t\) is the latent state, \(u_t\) is an exogenous drive or intervention, and \(\epsilon_t^{(m)}\) captures sensor-specific nuisance structure. Under this view, each modality is a partial, noisy observation of the same evolving physiological process. The task of a multimodal foundation model is therefore to learn representations \(\phi_\theta(x^{(1)}, \dots, x^{(M)})\) that preserve the state-relevant structure while tolerating heterogeneity in sensor availability and acquisition.
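As a concrete toy instance of this generative view, the following sketch simulates a single linear-Gaussian latent trajectory observed through two modalities with different observation maps and noise. All dimensions, dynamics, and names here are illustrative assumptions, not taken from any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_z = 200, 4                     # time steps and latent-state dimension
A = 0.95 * np.eye(d_z)              # stable dynamics for p(z_{t+1} | z_t)
C_ecg = rng.normal(size=(2, d_z))   # observation map for a toy 2-channel "ECG"
C_ppg = rng.normal(size=(1, d_z))   # observation map for a toy 1-channel "PPG"

z = np.zeros((T, d_z))
for t in range(1, T):
    # latent transition with process noise (the exogenous drive u_t is omitted)
    z[t] = A @ z[t - 1] + 0.1 * rng.normal(size=d_z)

# Each modality is a partial, noisy view of the same latent trajectory,
# with its own sensor noise epsilon^{(m)}.
x_ecg = z @ C_ecg.T + 0.05 * rng.normal(size=(T, 2))
x_ppg = z @ C_ppg.T + 0.05 * rng.normal(size=(T, 1))
```

Because both observation streams are driven by the same z, a learner that exploits cross-modal consistency can, in principle, recover more of the latent state than either stream reveals alone.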

This framing explains why multimodality matters scientifically:

  • it constrains representation learning by cross-system consistency rather than within-signal regularity alone;
  • it supports imputation, alignment, and transfer when one modality is scarce or missing;
  • it encourages latent variables that correspond more closely to physiology than to acquisition artifacts;
  • and it enables deployment across environments where the available sensors differ.

The field's strongest systems now exploit exactly these advantages. Radhakrishnan et al. showed that cross-modal autoencoding between ECG and cardiac MRI can produce a shared representation of cardiovascular state that improves phenotype prediction and modality translation [4]. Apple's 2024 multimodal physiological work argues that cross-modal reconstruction objectives and modality dropout are critical because multimodal health data are heterogeneously informative and often incomplete [5]. SleepFM operationalizes the same principle in polysomnography by aligning multiple sleep-related modalities through leave-one-out contrastive learning while remaining resilient to heterogeneous channel configurations [7].

From unimodal pretraining to multimodal foundation models

The development of multimodal biosignal foundation models can be understood as a progression of representational ambition.

The first phase centered on unimodal scaling. EEG and ECG foundation models such as Neuro-GPT and ECGFM showed that large-scale pretraining can improve downstream performance under label scarcity and cross-dataset heterogeneity [2, 3]. These systems established that biosignal pretraining is viable, but they remained tied to one signal family at a time.

The second phase introduced cross-modal representation learning. Cross-modal autoencoders for cardiovascular state [4] and synchronous ECG-PCG masked-autoencoding systems [6] showed that integrating paired physiological modalities can yield richer representations than isolated training.

The third phase, which is now emerging, seeks true multimodal biosignal foundation models. These systems are designed from the start to support variable sensor subsets, large-scale heterogeneous pretraining, and transfer across multiple downstream settings. SleepFM [7], CSFM [8], and PhysioOmni [9] are emblematic because they move beyond narrow paired-modality experiments toward scalable, reusable physiological pretraining.

This transition mirrors earlier developments in vision and language, where masked modeling [10] and cross-modal alignment [11] established practical pretraining regimes. The difference in physiology is that the invariances are more delicate. A crop in computer vision usually preserves object identity. A temporal distortion in ECG may alter conduction intervals; a spectral perturbation in EEG may erase clinically meaningful oscillatory structure; dropping airflow from a PSG study may remove the decisive signal for one sleep-related condition but not another. Physiological multimodality therefore requires more explicit scientific discipline than generic multimodal AI.

Core formulation and objective design

The most important design choice in a multimodal biosignal foundation model is not the backbone alone. It is the objective: the mathematical statement of what structure should be preserved, aligned, reconstructed, or predicted.

Let \(h^{(m)} = f_\theta^{(m)}(x^{(m)})\) be a modality-specific embedding and let \(g_\theta\) be a shared fusion backbone producing a latent representation \(z\). A general pretraining objective can be written as

\[ \mathcal{L} = \lambda_{\mathrm{mask}} \mathcal{L}_{\mathrm{mask}} + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}} + \lambda_{\mathrm{cross}} \mathcal{L}_{\mathrm{cross}} + \lambda_{\mathrm{miss}} \mathcal{L}_{\mathrm{miss}}. \]

Each term corresponds to a different scientific assumption.

Masked modeling

Masked modeling assumes that missing signal content is inferable from context. In multimodal form,

\[ \mathcal{L}_{\mathrm{mask}} = \sum_{m \in \mathcal{M}} \ell\!\left( \hat{x}^{(m)}_{\Omega_m}, x^{(m)}_{\Omega_m} \right), \]

where \(\Omega_m\) denotes masked indices for modality \(m\). The intuition is that if a model can reconstruct masked ECG segments from the available context, or infer masked respiratory content from surrounding signals, then the learned representation must retain structured information. Masked autoencoding has been central to recent physiological foundation-model work, including multimodal pretraining on PhysioNet-style data [5], synchronous PCG-ECG pretraining [6], and multimodal cardiac sensing at scale [8].
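A minimal sketch of this objective, independent of any particular backbone, patchifies a signal, samples a random mask \(\Omega_m\), and scores reconstruction only on masked patches. Shapes, patch lengths, and helper names are illustrative assumptions; the zero tensor stands in for a decoder's predictions.

```python
import torch


def patchify(x, patch_len):
    """Split a (batch, time) signal into (batch, n_patches, patch_len) tokens."""
    b, t = x.shape
    n = t // patch_len
    return x[:, : n * patch_len].reshape(b, n, patch_len)


def sample_mask(b, n, mask_ratio, generator=None):
    """Boolean mask over patches: True marks positions in Omega_m (hidden)."""
    k = int(mask_ratio * n)
    idx = torch.rand(b, n, generator=generator).argsort(dim=1)[:, :k]
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask


def masked_loss(pred, target, mask):
    """Per-patch MSE restricted to masked positions, matching L_mask."""
    return ((pred - target) ** 2).mean(dim=-1)[mask].mean()


x = torch.randn(8, 1000)                 # e.g. 4 s of 250 Hz single-lead ECG
tok = patchify(x, patch_len=50)          # (8, 20, 50) reconstruction targets
mask = sample_mask(*tok.shape[:2], mask_ratio=0.6)
loss = masked_loss(torch.zeros_like(tok), tok, mask)   # placeholder predictions
```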

Cross-modal alignment

Alignment objectives assume that synchronized modalities share a latent physiological cause. In contrastive form, one may optimize

\[ \mathcal{L}_{\mathrm{align}} = - \log \frac{ \exp(\mathrm{sim}(h^{(i)}, h^{(j)})/\tau) }{ \sum_{k} \exp(\mathrm{sim}(h^{(i)}, h_k^{(j)})/\tau) }, \]

for paired modalities \(i\) and \(j\). SleepFM refines this idea with leave-one-out contrastive learning: one modality is aligned against an aggregate embedding of the remaining modalities, improving robustness when channel configurations vary [7]. In biosignals, this objective is appealing because it encourages shared physiological structure rather than modality-specific noise.
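Both variants can be sketched compactly: a standard InfoNCE loss for one modality pair, and a leave-one-out variant in the spirit of SleepFM that aligns each modality against the mean embedding of the others. Batch sizes, dimensions, and modality names are illustrative assumptions, not SleepFM's actual implementation.

```python
import torch
import torch.nn.functional as F


def info_nce(h_i, h_j, tau=0.07):
    """Contrastive alignment of paired embeddings; positives on the diagonal."""
    h_i = F.normalize(h_i, dim=-1)
    h_j = F.normalize(h_j, dim=-1)
    logits = h_i @ h_j.T / tau                 # (batch, batch) similarities
    targets = torch.arange(h_i.shape[0])
    return F.cross_entropy(logits, targets)


def leave_one_out_nce(embeddings, tau=0.07):
    """Align each modality against the mean embedding of the remaining ones."""
    names = list(embeddings)
    loss = 0.0
    for name in names:
        rest = torch.stack([embeddings[n] for n in names if n != name]).mean(0)
        loss = loss + info_nce(embeddings[name], rest, tau)
    return loss / len(names)


torch.manual_seed(0)
emb = {m: torch.randn(16, 32) for m in ("eeg", "ecg", "emg")}
pair_loss = info_nce(emb["eeg"], emb["ecg"])
loo_loss = leave_one_out_nce(emb)
```

The leave-one-out form never demands agreement between any single modality pair, which is what makes it tolerant of variable channel configurations.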

Cross-modal reconstruction

Cross-modal reconstruction assumes that one modality can partially explain another. This is not always valid, but when it is valid it is powerful. Synchronously acquired ECG and PCG are a canonical example because electrical activation and heart sounds are coupled in time [6]. Cross-modal autoencoder frameworks further show that accessible modalities can support inference about expensive or sparse modalities, as in ECG-conditioned representation learning for cardiac MRI-related phenotypes [4].
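The idea can be sketched as a small prediction head mapping a pooled latent from one modality to a synchronized segment of another. The class name, dimensions, and toy data below are assumptions for illustration, not the architecture of any cited system.

```python
import torch
import torch.nn as nn


class CrossModalHead(nn.Module):
    """Predict a segment of one modality from another modality's pooled latent.
    Dimensions are illustrative, not from any cited system."""

    def __init__(self, d_latent=64, t_out=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent, 128), nn.GELU(), nn.Linear(128, t_out)
        )

    def forward(self, z_src):
        return self.net(z_src)


z_ecg = torch.randn(8, 64)         # pooled ECG latents from some encoder (toy)
pcg_target = torch.randn(8, 500)   # synchronized PCG segments (toy data)
head = CrossModalHead()
recon = head(z_ecg)
loss = ((recon - pcg_target) ** 2).mean()
```

In a real system the latent would come from a trained encoder, and the loss would be applied only where synchronous acquisition justifies the coupling.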

Missing-modality robustness

A serious multimodal biosignal foundation model cannot assume all sensors are always present. The objective must therefore make missingness part of training. One direct formulation is

\[ \mathcal{L}_{\mathrm{miss}} = \mathbb{E}_{\tilde{\mathcal{M}} \subseteq \mathcal{M}} \left[ d\!\left( r_\theta(x^{(\tilde{\mathcal{M}})}), r_\theta(x^{(\mathcal{M})}) \right) \right], \]

where \(\tilde{\mathcal{M}}\) is a sampled subset of available modalities and \(r_\theta\) denotes the representation induced by a modality subset. The idea is that the latent representation inferred from incomplete input should remain close to the representation inferred from the fuller sensor set whenever that comparison is physiologically justified. Apple's multimodal physiological work shows that modality dropout improves downstream performance [5], and PhysioOmni makes missing-modality compatibility a central design goal rather than a late-stage patch [9].
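A minimal sketch of \(\mathcal{L}_{\mathrm{miss}}\) follows: sample proper subsets of the available sensors and penalize divergence between each subset representation and the full-set representation. The `encode` interface and the toy averaging encoder below are assumptions for illustration.

```python
import random

import torch
import torch.nn.functional as F


def subset_consistency_loss(encode, signals, n_draws=2, rng=None):
    """L_miss sketch: representations from sampled sensor subsets should stay
    close to the full-set representation. `encode` is an assumed interface
    mapping a modality dict to a pooled (batch, dim) latent."""
    rng = rng or random.Random(0)
    z_full = encode(signals).detach()          # full-set target, no gradient
    names = list(signals)
    loss = 0.0
    for _ in range(n_draws):
        k = rng.randint(1, len(names) - 1)     # proper, non-empty subset
        subset = {n: signals[n] for n in rng.sample(names, k)}
        z_sub = encode(subset)
        loss = loss + (1.0 - F.cosine_similarity(z_sub, z_full, dim=-1).mean())
    return loss / n_draws


# Toy encoder: average whatever per-modality features are present.
toy_encode = lambda d: torch.stack(list(d.values())).mean(dim=0)
signals = {m: torch.randn(4, 16) for m in ("eeg", "ecg", "eog")}
miss_loss = subset_consistency_loss(toy_encode, signals)
```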

Architectural patterns that matter

Several architectural motifs now recur across the strongest systems.

Modality-specific front ends with a shared latent backbone

Raw biosignals are not commensurate at the sensor level. ECG, EOG, and airflow do not share a natural patch embedding. The common pattern is therefore modality-specific tokenization followed by a shared temporal backbone. This allows the model to respect low-level modality physics while still learning a fused latent space. Apple's multimodal physiological work uses modality-specific tokenizers before masked autoencoding [5]; the multi-signal digital-stethoscope model likewise uses separate projection paths for PCG and ECG before joint transformer processing [6].
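This front-end-plus-backbone pattern can be sketched as strided per-modality convolutions feeding a shared transformer. Channel counts, patch lengths, and dimensions below are illustrative assumptions, not the configuration of any cited system.

```python
import torch
import torch.nn as nn


class MultimodalEncoder(nn.Module):
    """Modality-specific tokenizers feeding one shared temporal backbone.
    Channel counts and dimensions are illustrative assumptions."""

    def __init__(self, channels=None, d_model=64):
        super().__init__()
        channels = channels or {"ecg": 12, "ppg": 1}
        # Strided Conv1d per modality turns raw samples into patch tokens,
        # respecting each sensor's native channel count.
        self.tokenizers = nn.ModuleDict({
            name: nn.Conv1d(c, d_model, kernel_size=50, stride=50)
            for name, c in channels.items()
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, signals):
        # Tokenize each modality separately, then fuse along the token axis.
        tokens = [self.tokenizers[name](x).transpose(1, 2)  # (B, n_tok, d)
                  for name, x in signals.items()]
        return self.backbone(torch.cat(tokens, dim=1))


enc = MultimodalEncoder()
out = enc({"ecg": torch.randn(2, 12, 1000),    # 20 tokens
           "ppg": torch.randn(2, 1, 500)})     # 10 tokens
```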

Channel-agnostic or configuration-agnostic design

In physiological practice, sensor sets change. SleepFM is explicitly channel-agnostic across multiple PSG configurations [7]. CSFM is designed to adapt across 12-lead ECG, reduced-lead ECG, PPG-only, or combined ECG-PPG settings [8]. These are not implementation details. They are the difference between a publishable model and a deployable one.

Temporal alignment without naive synchronization assumptions

Some modalities are truly synchronous; others are only partially aligned. The architecture must therefore distinguish between strict simultaneity and looser temporal coupling. Overly rigid alignment can penalize physiologically meaningful lag. Overly weak alignment collapses multimodality into late fusion. This is one reason leave-one-out objectives and cross-modal aggregation are attractive: they encourage shared state without requiring identity at the waveform level [7].

Support for multimodal plus textual context

The most ambitious systems no longer stop at waveform fusion. CSFM jointly pretrains on cardiac biosignals and associated clinical or machine-generated text reports, then transfers to question answering and downstream cardiovascular tasks [8]. This matters because text often carries semantic abstractions not explicit in the waveform. The long-term implication is clear: future biosignal foundation models may become waveform-language systems rather than signal-only encoders.

Current systems and what they contributed

Table 1. Representative milestones in multimodal biosignal foundation modeling

  • Cross-modal cardiovascular autoencoder, 2023 [4]. Modality scope: ECG + cardiac MRI. Core pretraining idea: reconstruction plus cross-modal latent alignment. Main technical contribution: holistic cardiovascular representation from paired modalities. Why it matters: established that cross-modal latent spaces can improve phenotype prediction and modality translation.
  • Multisignal digital-stethoscope foundation model, 2024 [6]. Modality scope: PCG + ECG. Core pretraining idea: masked autoencoding on synchronously captured signals. Main technical contribution: extension of MAE to synchronized mechanical and electrical cardiac signals. Why it matters: showed that paired biosignals can encode timing relations not available in a single modality.
  • Apple multimodal physiological FM study, 2024 [5]. Modality scope: diverse physiological channels in PhysioNet 2018. Core pretraining idea: multimodal masked autoencoding with modality dropout. Main technical contribution: emphasized cross-modal reconstruction and robustness to incomplete input sets. Why it matters: clarified that naive late-fusion contrastive baselines are insufficient in physiological multimodal learning.
  • PhysioOmni, 2025 [9]. Modality scope: EEG, ECG, EOG, EMG. Core pretraining idea: decoupled tokenizer, masked signal modeling, resilient fine-tuning. Main technical contribution: explicit treatment of homogeneous versus heterogeneous features and arbitrary missing modalities. Why it matters: pushed the field toward modality-robust physiological pretraining.
  • SleepFM, 2026 [7]. Modality scope: multimodal PSG including EEG, ECG, EMG, respiration, and related channels. Core pretraining idea: leave-one-out contrastive learning. Main technical contribution: channel-agnostic sleep pretraining over 585,000 hours from approximately 65,000 participants. Why it matters: demonstrated that multimodal physiological pretraining can support broad disease-risk prediction at scale.
  • CSFM, 2026 [8]. Modality scope: ECG, PPG, and clinical text. Core pretraining idea: generative masked pretraining on heterogeneous records. Main technical contribution: unified representations from 1.7 million individuals across devices and care settings. Why it matters: established strong evidence for broad transfer across diagnosis, monitoring, risk prediction, and ECG QA.

The common thread across these systems is that multimodality is being used not just for accuracy gains, but to reframe representation learning around physiological consistency, transfer, and sensor flexibility.

Practical implementation: what an actual research stack should look like

The practical difficulty of this field is underestimated. Most failures do not come from transformer depth. They come from dataset assembly, synchronization, missingness, evaluation leakage, and physically implausible preprocessing.

At a minimum, a serious research stack should include the following elements:

  • waveform I/O and metadata handling with tools such as wfdb, mne, h5py, or zarr;
  • explicit patient-level and study-level indexing, not only file-level iteration;
  • modality-specific preprocessing using scipy.signal, mne, or carefully audited Torch operators;
  • a PyTorch training stack with deterministic data loaders, mixed precision, and logging;
  • experiment tracking that records modality subsets, missingness rates, channel maps, and preprocessing versions;
  • evaluation scripts that report both complete-modality and missing-modality performance.

The decisive engineering choice is to keep the modality interface explicit. In code, the batch should be a dictionary keyed by modality, plus a mask that records which sensors are present. Treating missing sensors as if they were zero-valued observations is usually a modeling mistake.
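A minimal collate function illustrating this interface follows: the batch is keyed by modality, and a presence mask records which records contributed to each stacked tensor rather than silently zero-filling absent sensors. Modality names and shapes are illustrative assumptions.

```python
import torch


def collate_multimodal(records, modalities=("ecg", "ppg")):
    """Collate variable-sensor records into a modality-keyed batch plus a
    presence mask, instead of zero-filling absent sensors. Modality names
    and shapes are illustrative assumptions."""
    batch, present = {}, {}
    for m in modalities:
        present[m] = torch.tensor([m in r for r in records])
        rows = [r[m] for r in records if m in r]
        batch[m] = torch.stack(rows) if rows else None
    return batch, present


records = [
    {"ecg": torch.randn(12, 1000), "ppg": torch.randn(1, 500)},
    {"ecg": torch.randn(12, 1000)},                 # PPG sensor absent
    {"ecg": torch.randn(12, 1000), "ppg": torch.randn(1, 500)},
]
batch, present = collate_multimodal(records)
```

The presence mask, not a zero-filled tensor, tells the model which sensors were actually recorded; downstream code can use it to route only real observations through each tokenizer.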

A compact implementation sketch

import torch
import torch.nn.functional as F


def random_keep(modalities, keep_prob=0.75):
    # Modality dropout: keep each sensor with probability keep_prob,
    # but never return an empty sensor set.
    kept = {}
    for name, x in modalities.items():
        if torch.rand(()) < keep_prob:
            kept[name] = x
    if not kept:
        name = next(iter(modalities))
        kept[name] = modalities[name]
    return kept


def masked_mse(pred, target, mask):
    # Reconstruction error restricted to masked token positions (Omega_m).
    diff = (pred - target) ** 2
    return diff[mask].mean()


def pretrain_step(model, batch, mask_ratio=0.6, keep_prob=0.75):
    # Sample a sensor subset so that missingness is part of pretraining.
    signals = random_keep(batch["signals"], keep_prob=keep_prob)

    encoded = {}
    recon_loss = 0.0
    for name, x in signals.items():
        # patchify_and_mask is assumed to return the encoder's input tokens,
        # the patchified reconstruction targets, and a boolean mask over token
        # positions, so that predictions and targets share a shape.
        tokens, targets, mask = model.patchify_and_mask(
            name, x, mask_ratio=mask_ratio
        )
        z = model.encode_modality(name, tokens)
        x_hat = model.decode_modality(name, z, mask)
        recon_loss = recon_loss + masked_mse(x_hat, targets, mask)
        encoded[name] = model.pool(z)  # pooled latent for cross-modal alignment

    # Latent-space alignment: pull pooled embeddings of co-recorded modalities
    # together with a cosine penalty (a no-op when only one modality survives).
    align_loss = 0.0
    names = list(encoded.keys())
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            zi = F.normalize(encoded[names[i]], dim=-1)
            zj = F.normalize(encoded[names[j]], dim=-1)
            align_loss = align_loss + (1.0 - (zi * zj).sum(dim=-1).mean())

    loss = recon_loss + 0.2 * align_loss
    return loss

This sketch exposes three design commitments.

First, modality dropout is part of pretraining rather than a downstream afterthought. Second, reconstruction is modality-specific because each sensor has its own observational physics. Third, alignment is applied in latent space, where cross-modal agreement should reflect shared physiology rather than raw waveform identity.

Engineering pressures unique to this field

Table 2. Recurring design pressures in multimodal biosignal foundation models

  • Sampling-rate mismatch. Why it is hard: ECG may be hundreds of Hz while SpO2 or airflow can be far slower. Typical response in current systems: modality-specific tokenizers and patch lengths [5, 7, 8]. Residual risk: temporal fusion may hide clinically meaningful lag.
  • Variable sensor configurations. Why it is hard: devices and studies rarely share the same channels. Typical response in current systems: channel-agnostic architectures, modality dropout, subset training [5, 7, 9]. Residual risk: robustness can conceal heavy dependence on one dominant modality.
  • Structured missingness. Why it is hard: sensor absence is often systematic, not random. Typical response in current systems: explicit missing-modality objectives and resilient fine-tuning [5, 9]. Residual risk: the model may still fail under deployment missingness not seen in pretraining.
  • Cross-cohort heterogeneity. Why it is hard: devices, protocols, populations, and care settings differ materially. Typical response in current systems: large-scale heterogeneous pretraining [7, 8]. Residual risk: gains may be dataset-composition artifacts if evaluation is weak.
  • Semantic sparsity of labels. Why it is hard: downstream labels are expensive and narrow. Typical response in current systems: self-supervised or weakly supervised pretraining. Residual risk: the latent space may optimize transfer while ignoring physiological interpretability.
  • Modality dominance. Why it is hard: one modality can overwhelm the shared latent space. Typical response in current systems: cross-modal reconstruction, balanced masking, alignment losses [5, 6]. Residual risk: fused representations can become pseudo-unimodal.

Evaluation discipline: where many papers still underperform

The evaluation of multimodal biosignal foundation models must be stricter than the evaluation of ordinary deep classifiers. The model class is more flexible, the datasets are more entangled, and the apparent gains are easier to inflate.

At minimum, strong evaluation should include:

  • patient-level splits rather than random window splits;
  • cross-device or cross-cohort transfer when possible;
  • ablations over modality subsets at inference time;
  • calibration and uncertainty assessment for downstream clinical tasks;
  • and reporting that separates gains from pretraining, scale, and architecture.
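The first requirement, patient-level splitting, is easy to get wrong and easy to sketch. The helper below groups window indices by patient identifier so that no patient contributes to both partitions; names and the split fraction are illustrative.

```python
import numpy as np


def patient_level_split(patient_ids, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) over windows so that no patient appears
    in both partitions, avoiding window-level leakage."""
    rng = np.random.default_rng(seed)
    patients = np.array(sorted(set(patient_ids)))
    rng.shuffle(patients)
    n_test = max(1, int(test_frac * len(patients)))
    test_patients = set(patients[:n_test].tolist())
    is_test = np.array([p in test_patients for p in patient_ids])
    idx = np.arange(len(patient_ids))
    return idx[~is_test], idx[is_test]


# One id per window; several windows per patient, as in typical segmentation.
ids = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
train_idx, test_idx = patient_level_split(ids, test_frac=0.25)
```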

Recent strong papers increasingly take this seriously. Apple's multimodal physiological study explicitly splits by patient identity [5]. CSFM evaluates across multiple scenarios, devices, and sensing configurations [8]. SleepFM emphasizes generalization across several cohorts and PSG configurations [7]. These are good signs, but the broader field still needs more disciplined reporting of leakage, modality dependence, and failure under sensor dropout.

What the field still does not know

Despite rapid progress, several core questions remain unresolved.

Are the learned invariances physiologically correct?

A representation may transfer well while still erasing clinically important distinctions. This is a central danger in biosignals because nuisance variation and pathology can overlap in frequency content, temporal morphology, or cross-modal timing. The field needs more work on invariance auditing, not only downstream benchmarking.

What is the right latent unit?

Some systems learn sequence-level embeddings, others patch-level tokens, others state-like temporally pooled latents. It is still unclear whether the most transferable unit is beat-level, epoch-level, subject-level, modality-level, or hierarchical across all of them.

How much scale is enough?

In vision and language, scale laws became a defining principle. In physiology, data quality, sensor diversity, and paired-modality coverage may matter as much as raw volume. SleepFM and CSFM strongly suggest that scale helps [7, 8], but the relevant axis may be physiologically diverse coverage rather than only sample count.

How should text be incorporated?

CSFM shows that text-conditioned cardiac modeling is feasible [8]. But biosignal-text alignment raises additional concerns: report noise, institution-specific vocabulary, and shortcut learning from machine-generated annotations. The future of biosignal foundation models may well be both multimodal and language-integrated, but the scientific criteria for success are not yet settled.

Can these models support causal and mechanistic reasoning?

Most current systems are still correlational. They learn transferable representations, but not necessarily mechanisms. The next level of the field may require hybrid models that combine foundation-model pretraining with explicit physiological structure, state-space constraints, or differentiable simulators.

Conclusion

Multimodal biosignal foundation models represent a real advance in biosignal processing, but their significance is not exhausted by model scale. They matter because they force a better formulation of the scientific problem. Human physiology is not unimodal, and many practical sensing environments are not fixed. The right model must therefore learn under heterogeneity, support variable sensor availability, preserve cross-system physiological structure, and transfer across tasks, devices, and care settings.

The best recent work already points in this direction. Cross-modal cardiovascular representation learning [4], synchronous ECG-PCG pretraining [6], multimodal masked autoencoding with modality dropout [5], channel-agnostic sleep pretraining [7], large-scale cardiac sensing with waveform-text integration [8], and missing-modality-robust physiological pretraining [9] are not isolated ideas. They are pieces of an emerging architecture for biosignal AI.

The technical challenge ahead is clear. The field must move from "multimodal because more inputs help" to "multimodal because physiology is a coupled dynamical system." When that shift is taken seriously, multimodal biosignal foundation models become more than a fashionable extension of self-supervised learning. They become a plausible foundation for robust, transferable, and scientifically meaningful health inference.