Abstract
Self-supervised representation learning has become one of the most consequential developments in modern machine learning because it offers a route to scalable pretraining without dependence on dense annotation. In the domain of human physiological data, this promise is particularly compelling. Biosignal datasets are often large in raw duration but poor in labels, expensive to annotate, heterogeneous across devices and institutions, and structured by latent physiological dynamics that cannot be reduced to simple class labels. Under these conditions, self-supervised learning is not merely a workaround for annotation scarcity. It is a research program for learning signal representations that preserve temporal structure, cross-channel dependencies, and subject-specific or state-specific regularities while remaining useful for downstream diagnosis, monitoring, and scientific inference. This article develops that perspective from first principles to current advances, with emphasis on objective design, modality-specific inductive bias, evaluation discipline, and open research challenges in electrocardiography, electroencephalography, and multimodal human physiological sensing.
Introduction
Human physiological data occupy a difficult position in machine learning. They are abundant in raw form yet scarce in curated labels; richly structured yet heavily contaminated by artifact; informative at multiple time scales yet often nonstationary across state, subject, device, and context. Traditional supervised pipelines are therefore constrained not only by label availability but by the mismatch between a narrow task label and the broader structure present in the signal.
Self-supervised representation learning changes the problem formulation. Instead of asking only how to map a signal directly to a target label, it asks how to learn a representation that captures what is stable, predictive, and transferable in the signal before a specific label is imposed. In domains such as vision and language, this shift has already transformed the scale and quality of learned representations. In physiology, the stakes are arguably even higher because labels are expensive, biomedical datasets are fragmented, and much of the relevant structure is latent, longitudinal, and multimodal.
The central claim of this article is that self-supervised learning is especially well aligned with physiological data because physiology itself is structured. Rhythms recur. Signals co-vary across channels. Dynamics unfold over time. Artifacts have their own structure. Clinical outcomes often reflect latent state rather than single local patterns. The right self-supervised objective can therefore act as a compressed scientific hypothesis about what aspects of the signal should remain invariant, predictable, or reconstructible.
Why self-supervision is a natural fit for physiological data
At a high level, self-supervised learning seeks representations \(z = f_\theta(x)\) from unlabeled input \(x\) by defining supervision from the data itself. The learned encoder \(f_\theta\) is then transferred to downstream tasks via linear evaluation, finetuning, or other protocols.
For physiological data, this strategy is compelling for several structural reasons.
First, annotation is intrinsically expensive. ECG rhythm labels, sleep-stage labels, seizure annotations, artifact masks, and clinically validated outcome labels all require expert time and often disagreement resolution. Second, raw data volume is usually much larger than annotated volume: hours or days of ECG, EEG, respiration, or wearable data may be available long before reliable labels are. Third, labels are often task-local while the data contain broader physiological structure. A label such as arrhythmia presence or sleep stage may not exhaust the information encoded in morphology, temporal coupling, patient-specific dynamics, or state transitions.
From a signal-processing perspective, this means that the classical supervised pipeline throws away too much structure too early. Self-supervised learning offers a way to retain that structure in a learned representation before downstream compression into a task-specific decision rule.
Representation learning as objective design
The critical question in self-supervised learning is not whether labels are absent. It is what surrogate objective replaces them. In practice, modern self-supervised representation learning can be organized into several broad families:
- predictive objectives, which learn by forecasting or inferring future latent structure;
- contrastive objectives, which bring positive pairs together and push negatives apart;
- non-contrastive or self-distillation objectives, which align multiple views without explicit negatives;
- reconstruction and masked modeling objectives, which infer missing content from context;
- multimodal alignment objectives, which learn shared structure across different physiological or contextual streams.
These are not merely algorithmic variants. They encode different assumptions about the signal class.
Predictive coding
Contrastive Predictive Coding (CPC) is one of the most influential early formulations in modern self-supervised learning [1]. It combines latent encoding with autoregressive prediction and uses a contrastive objective to favor representations that are informative about future latent states. In compact form, if \(c_t\) is a context representation summarizing the past and \(z_{t+k}\) is a future latent representation, CPC optimizes a contrastive objective that rewards correct future prediction among distractors.
This formulation is particularly natural for physiology because many signals are driven by temporally coherent latent processes: cardiac cycles, respiratory modulation, neural rhythms, autonomic fluctuations, and behavioral state transitions. A representation that is useful for predicting near-future latent structure may be more physiologically faithful than one trained only on local classification labels.
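As a toy numerical illustration of the CPC objective, the sketch below scores one true future latent against random distractors with a log-bilinear critic. The latent dimension, the bilinear predictor `W_k`, and the noise level are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                                          # latent dimension (assumption)
c_t = rng.normal(size=d)                       # context summarizing the past
W_k = rng.normal(size=(d, d)) / np.sqrt(d)     # predictor for horizon k (assumption)
z_future = W_k @ c_t + 0.1 * rng.normal(size=d)  # true future latent, near the prediction
distractors = rng.normal(size=(15, d))         # negatives drawn from unrelated windows

candidates = np.vstack([z_future, distractors])
scores = candidates @ (W_k @ c_t)              # log-bilinear score f_k(z, c) = z^T W_k c

# InfoNCE: classify the true future (index 0) among all candidates.
m = scores.max()
log_prob_true = scores[0] - (m + np.log(np.exp(scores - m).sum()))
loss = -log_prob_true
print(loss > 0)
```

Because the true future is constructed near the predictor's output, its score dominates the distractors and the contrastive loss stays small; the loss is strictly positive whenever more than one candidate competes.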
Contrastive learning
The most familiar modern contrastive objective takes the form of the InfoNCE loss:

\[
\mathcal{L} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right) + \sum_{k} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
\]
where \(z_i\) and \(z_j\) form a positive pair, \(z_k\) are candidate negatives, \(\mathrm{sim}\) is typically cosine similarity, and \(\tau\) is a temperature parameter. SimCLR made this formulation widely influential by showing how augmentations and representation heads interact to produce strong transfer representations [2].
In physiological data, however, the meaning of a positive pair is more subtle than in vision. Two augmentations of the same signal window are not necessarily the most meaningful positive pair. Positives may instead be defined across:
- time: nearby windows from the same recording,
- space: different leads or sensors observing related physiology,
- subject: windows from the same patient,
- modality: cross-signal views such as ECG and PPG,
- state: matched views under the same physiological regime.
CLOCS is a particularly important example because it explicitly defined contrastive objectives across space, time, and patients for cardiac signals, showing that cardiac representation learning benefits from physiological rather than purely generic view design [3].
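A minimal sketch of such physiology-aware positive-pair construction might look as follows; the metadata layout, the thresholds, and the pairing rules are illustrative assumptions rather than the exact definitions used by CLOCS:

```python
# Hypothetical metadata for four ECG windows: (patient_id, lead, start_time_s).
meta = [("p1", "II", 0.0), ("p1", "II", 5.0), ("p1", "V5", 0.0), ("p2", "II", 0.0)]

def positive_mask(meta, max_gap_s=10.0):
    """Mark pairs of windows as positives using physiological structure:
    time positives (same patient and lead, temporally close) and space
    positives (same patient and instant, different leads)."""
    n = len(meta)
    pos = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pi, li, ti = meta[i]
            pj, lj, tj = meta[j]
            time_pos = pi == pj and li == lj and abs(ti - tj) <= max_gap_s
            space_pos = pi == pj and li != lj and ti == tj
            pos[i][j] = time_pos or space_pos
    return pos

mask = positive_mask(meta)
print(mask[0])  # window 0 pairs with 1 (time) and 2 (space), never with patient p2
```

The point of the sketch is that each `True` entry is a physiological claim: that the two windows should map to nearby representations despite differing in lead, time, or acquisition noise.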
Non-contrastive self-distillation
BYOL demonstrated that useful representations can be learned by aligning multiple views even without explicit negative examples [4]. This matters for physiology because negative sampling can be problematic: two recordings from different patients may be more physiologically similar than two states from the same patient, and aggressive negative construction may inadvertently penalize clinically relevant invariances.
For biosignals, non-contrastive methods raise a productive question: which invariances should be enforced, and at what granularity? The answer cannot be borrowed mechanically from computer vision.
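The mechanics of a BYOL-style objective can be sketched with linear stand-ins for the online network, the predictor, and the EMA target; every component here is an illustrative assumption, not the published architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

# Toy linear "networks"; real BYOL uses deep encoders plus projector/predictor heads.
W_online = rng.normal(size=(4, 4))
W_target = W_online.copy()   # target branch starts as a copy of the online branch
W_pred = np.eye(4)           # online-side predictor (assumption: identity-initialized)

def byol_loss(x1, x2):
    # Predict the target's embedding of view 2 from the online embedding of
    # view 1; the target branch receives no gradient from this loss.
    p = l2_normalize(W_pred @ (W_online @ x1))
    z = l2_normalize(W_target @ x2)
    return float(np.sum((p - z) ** 2))   # equals 2 - 2 * cosine similarity

def ema_update(tau=0.99):
    # Target weights track the online weights via an exponential moving average.
    global W_target
    W_target = tau * W_target + (1 - tau) * W_online

x = rng.normal(size=4)
loss = byol_loss(x + 0.05 * rng.normal(size=4), x + 0.05 * rng.normal(size=4))
ema_update()
print(0.0 <= loss <= 4.0)  # normalized MSE between unit vectors is bounded in [0, 4]
```

Collapse is avoided by the asymmetry (predictor plus slowly moving target), which is exactly the machinery that must be rethought when the views encode physiological rather than visual invariances.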
Masked modeling and reconstruction
Masked autoencoders and related masked modeling approaches treat the signal as partially observed and train the model to infer missing portions from context [5]. This family is attractive for physiological data because it aligns naturally with structured temporal continuity, lead redundancy, and multimodal coupling.
If a sequence \(x\) is corrupted by a masking operator \(M\), one may train an encoder-decoder pair \((f_\theta, g_\phi)\) to minimize a reconstruction objective such as

\[
\mathcal{L}_{\mathrm{rec}} = \left\| (1 - M) \odot \left( x - g_\phi\!\left(f_\theta(M \odot x)\right) \right) \right\|_2^2
\]
or a task-specific variant. In physiology, masked modeling can be interpreted not only as signal completion but as a test of whether the representation captures local morphology, long-range temporal context, or cross-channel dependency.
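A toy numerical sketch of a masked reconstruction loss on a quasi-periodic signal follows; linear interpolation stands in for a learned decoder (an illustrative assumption, since a masked autoencoder learns its reconstruction rather than interpolating), and the masking ratio is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# A 2-second toy quasi-periodic signal sampled at 100 Hz.
T = 200
t = np.arange(T) / 100.0
x = np.sin(2 * np.pi * 1.2 * t)

mask = rng.random(T) > 0.4        # True = visible sample; roughly 40% masked (assumption)
x_vis = np.where(mask, x, 0.0)    # corrupted input M * x

# Stand-in "decoder": linear interpolation from the visible samples.
x_hat = np.interp(t, t[mask], x[mask])

# Loss evaluated on masked positions only, as in masked autoencoding.
masked_mse = float(np.mean((x - x_hat)[~mask] ** 2))
print(masked_mse < 0.05)
```

Even this trivial decoder succeeds on smooth periodic structure, which illustrates the failure mode listed in Table 1: low reconstruction error does not by itself certify that clinically important latent structure has been captured.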
Recent work in EEG representation learning, such as graph-masked autoencoder approaches, extends this idea by respecting non-Euclidean spatial relations between electrodes rather than treating the signal as a flat array [6].
Table 1. Taxonomy of objective families, assumptions, and physiological failure modes
| Objective family | Core assumption | Useful invariance or structure | Typical physiological strength | Typical physiological failure mode |
|---|---|---|---|---|
| Predictive coding | future latent structure is informative and locally predictable | temporal continuity, short-horizon dynamics | strong for rhythm, morphology progression, state continuity | fails when abrupt interventions, arrhythmias, or artifact bursts break predictability |
| Contrastive learning | positive pairs encode semantic similarity better than negatives | augmentation-defined invariance, lead/time/subject consistency | strong when positive-pair design reflects physiology | fails when augmentations erase pathology or negatives penalize clinically meaningful similarity |
| Non-contrastive self-distillation | two views should align without explicit negatives | stability across benign transformations | avoids some false-negative problems in biosignals | can collapse toward nuisance invariances if view design is weak |
| Masked modeling | missing signal content is inferable from context | local morphology, long-range context, channel redundancy | strong for structured multilead or multichannel signals | reconstructs easy low-level statistics while missing clinically important latent structure |
| Multimodal alignment | synchronized modalities share latent physiological state | cross-system agreement, network physiology | promising for ECG-PPG-respiration or neurophysiological fusion | fails when modalities are only weakly aligned or one modality is artifact-dominated |
| Patient-aware objectives | subject identity stabilizes some signal structure | intra-subject consistency | useful for personalized baselines and longitudinal monitoring | risks overfitting to identity rather than transferable physiology |
What makes physiological self-supervision difficult
The transfer of self-supervised learning from vision or speech into physiology is not trivial. Several domain-specific issues matter.
Positive-pair construction is a scientific decision
In images, common augmentations such as cropping or color jitter often preserve semantic identity. In physiological data, an augmentation may destroy the very phenomenon one wishes to preserve. Time warping can alter interval structure. Aggressive filtering can erase pathology. Channel dropout may remove clinically decisive evidence. Window shuffling may break physiological causality.
Thus, view construction is not a technical nuisance. It is a scientific statement about what should remain invariant. Poorly chosen invariances create shortcut learning rather than robust representation learning.
Artifact is structured, not random
Motion contamination, electrode displacement, line interference, myogenic activity, and sensor saturation often have regular structure. A self-supervised objective can accidentally learn artifact persistence instead of physiology if training data are not curated or if augmentations preserve nuisance structure too effectively. This is especially relevant in wearable sensing and ambulatory monitoring.
Subject leakage can inflate apparent success
Representation learning is particularly vulnerable to evaluation mistakes when windows from the same subject appear across pretraining, validation, and downstream splits. A representation may appear highly transferable while mainly encoding subject identity or acquisition context. In physiology, evaluation design therefore matters as much as objective design.
ECG as a leading case study
ECG is one of the most fertile domains for physiological self-supervision because it combines high data availability, meaningful multi-lead structure, strong morphology, and clear downstream tasks. Mehari and Strodthoff provided one of the first comprehensive assessments of self-supervised learning on 12-lead clinical ECG, showing that self-supervised pretraining can improve label efficiency and robustness, and finding contrastive predictive coding especially effective in that setting [7].
CLOCS then pushed the field further by showing that physiology-specific contrastive structure matters. Rather than treating ECG as a generic one-dimensional sequence, it used space (lead relationships), time (temporal neighborhoods), and patient structure as positive-pair design cues [3]. This was an important conceptual advance because it demonstrated that self-supervised learning for biosignals benefits from domain-aware view semantics rather than direct transplantation of image recipes.
ECG-specific design opportunities
| Signal property | Self-supervised opportunity | Failure mode if ignored |
|---|---|---|
| multi-lead spatial redundancy | cross-lead alignment or prediction | representation collapses to lead-specific nuisance structure |
| recurring morphology | beat-level or segment-level temporal consistency | model learns only coarse rhythm, not morphology |
| patient-specific baselines | patient-aware positive definitions | model overfits identity rather than physiology |
| sparse labeled pathology | pretraining for finetuning and rare-disease transfer | supervised model remains label-starved |
EEG and the problem of mixtures
EEG presents a different challenge. Channels are mixtures of neural and non-neural sources, electrode geometry matters, and the signal is highly nonstationary across task, subject, recording hardware, and artifact conditions. This makes EEG a strong test case for whether self-supervised learning can move from channel-level memorization toward transferable latent representation.
BENDR was influential because it brought transformer-based contrastive pretraining to large-scale EEG and argued for a foundation-model-like approach to encephalographic data [8]. Its significance was not only performance, but the proposal that EEG pretraining can be organized analogously to large-scale sequence modeling in other domains while still respecting biosignal structure.
More recent work such as GMAEEG highlights another important direction: masked modeling that explicitly incorporates graph structure over electrodes rather than flattening spatial relationships [6]. This is a critical step for physiology because EEG is not simply a 1D sequence; it is a spatiotemporal measurement of coupled field activity.
Multimodal physiological self-supervision
The next major advance lies beyond single-modality pretraining. Human physiology is intrinsically multimodal. Cardiac, respiratory, vascular, electrodermal, and neural signals interact, and these interactions are often more informative than isolated channels.
This suggests several multimodal self-supervised objectives:
- cross-modal agreement: align ECG and PPG representations over matched time windows;
- cross-modal prediction: use one signal stream to predict latent structure in another;
- multimodal masked modeling: reconstruct masked segments using complementary channels;
- state-consistency objectives: learn representations stable across multiple synchronized physiological measurements.
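The first of these objectives can be sketched as a cross-modal InfoNCE over matched time windows. The encoders are replaced here by toy embeddings in which the PPG view is a noisy copy of the ECG view for the same window, mimicking shared latent cardiovascular state; all names, dimensions, and noise levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def cross_modal_alignment_loss(z_ecg, z_ppg, tau=0.1):
    """Cross-modal InfoNCE: row i of z_ecg should match row i of z_ppg
    (same time window); the other rows act as in-batch negatives."""
    a = z_ecg / np.linalg.norm(z_ecg, axis=1, keepdims=True)
    b = z_ppg / np.linalg.norm(z_ppg, axis=1, keepdims=True)
    logits = a @ b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

z_ecg = rng.normal(size=(8, 16))
z_ppg = z_ecg + 0.1 * rng.normal(size=(8, 16))   # shared state plus modality noise

loss_matched = cross_modal_alignment_loss(z_ecg, z_ppg)
loss_shuffled = cross_modal_alignment_loss(z_ecg, z_ppg[::-1])  # deliberately misaligned
print(loss_matched < loss_shuffled)
```

Breaking the temporal correspondence (reversing the PPG rows) raises the loss sharply, which is the sense in which the objective encodes synchronized cross-system agreement rather than generic similarity.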
The conceptual bridge here is strong: multimodal self-supervision is not only a data-fusion technique. It is a way to encode network physiology into the representation objective.
From pretext tasks to physiological foundation models
The field is now moving from isolated pretext-task studies toward the idea of reusable biosignal foundation models. This does not simply mean scaling model size. It means pretraining on large heterogeneous physiological corpora so that downstream tasks inherit richer priors about dynamics, morphology, coupling, and subject variation.
TS2Vec, though not physiology-specific, is relevant because it framed self-supervised representation learning for time series at arbitrary semantic levels and showed that hierarchical contrastive structure can produce broadly useful temporal embeddings [9]. Physiological research can be interpreted as the specialization of this broader time-series agenda under stronger domain constraints.
The frontier question is therefore not only whether self-supervised learning works on biosignals. It is what class of physiological foundation model is possible:
- modality-specific or multimodal,
- sequence-based or graph-aware,
- contrastive or masked-reconstructive,
- patient-agnostic or patient-adaptive,
- and general-purpose or clinically specialized.
A compact implementation sketch
```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # Row i of z1 is positive with row i of z2; the remaining rows of z2
    # serve as in-batch negatives.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def augment_ecg(x):
    # Conservative views: mild additive noise plus per-example amplitude
    # scaling, chosen so clinically relevant morphology survives.
    noise = 0.01 * torch.randn_like(x)
    scale = 0.9 + 0.2 * torch.rand(x.size(0), 1, 1, device=x.device)
    return scale * x + noise

def ssl_step(encoder, batch):
    view1 = augment_ecg(batch)
    view2 = augment_ecg(batch)
    z1 = encoder(view1)
    z2 = encoder(view2)
    return info_nce(z1, z2)
```
What matters here is not the simplicity of the loss, but the difficulty of the augmentation. In physiology, augment_ecg is the scientific bottleneck. If the augmentation destroys clinically meaningful morphology, the learned invariance is wrong. If it leaves nuisance variation untouched, the representation may become brittle or shortcut-driven.
Evaluation: the part that determines whether the field matures
A major portion of the scientific value of self-supervised learning for physiological data now depends on evaluation rigor. Several issues should be regarded as non-negotiable:
- patient-level splits rather than random window-level leakage,
- cross-device and cross-cohort testing where possible,
- linear evaluation plus finetuning rather than one protocol only,
- robustness analysis under artifact and domain shift,
- careful distinction between representation quality and downstream classifier strength.
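The first requirement can be enforced with a deterministic patient-level split, for example by hashing patient identifiers; the function name and hashing scheme below are illustrative assumptions:

```python
import hashlib

def patient_split(window_ids, patient_of, test_frac=0.2):
    """Deterministic patient-level split: every window from a given patient
    lands entirely in train or entirely in test, preventing subject leakage."""
    def bucket(pid):
        # Map a patient id to a reproducible value in [0, 1).
        h = int(hashlib.sha256(pid.encode()).hexdigest(), 16)
        return (h % 1000) / 1000.0
    train, test = [], []
    for w in window_ids:
        (test if bucket(patient_of[w]) < test_frac else train).append(w)
    return train, test

patient_of = {"w1": "p1", "w2": "p1", "w3": "p2", "w4": "p3"}
train, test = patient_split(list(patient_of), patient_of)

# Invariant: no patient contributes windows to both splits.
tr_p = {patient_of[w] for w in train}
te_p = {patient_of[w] for w in test}
print(tr_p.isdisjoint(te_p))
```

Hashing identifiers rather than shuffling windows makes the split stable across reruns and dataset growth, which matters when pretraining and downstream evaluation are performed at different times.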
This is where many apparently strong results can become fragile. In physiology, the representation is useful only if it transfers across the shifts that matter clinically or scientifically.
Progression of the field
Table 2. Progression of self-supervised representation learning for physiological data
| Stage | Main idea | Representative advance | Research implication |
|---|---|---|---|
| generic transfer of SSL | adapt vision/speech SSL objectives to biosignals | CPC- and contrastive-based ECG pretraining [1,7] | tests whether unlabeled physiological structure is exploitable at all |
| physiology-aware objective design | define positives using time, leads, patient structure | CLOCS [3] | objective design becomes domain science, not only ML engineering |
| sequence foundation modeling | large-scale sequence pretraining for EEG and related data | BENDR [8] | supports transfer across tasks and recording conditions |
| structure-aware masked modeling | incorporate electrode or lead geometry into reconstruction | GMAEEG [6] | graph and spatial priors become central |
| multimodal physiological SSL | align or predict across synchronized biosignals | emerging 2024 direction [10,11] | representation learning begins to encode organ-system interaction |
Open research problems
What should representations be invariant to?
The central design question is not only how to learn a representation, but which equivalence classes matter physiologically. Invariance to sensor gain may be desirable; invariance to ST-segment morphology may be disastrous. The theory of augmentations in biosignals remains underdeveloped.
How should multimodal coupling be represented?
Current methods are still largely modality-specific. Yet physiology is relational. Future progress likely depends on objectives that capture cross-system dependencies rather than only within-channel continuity.
Can self-supervised learning support clinically trustworthy transfer?
The field has shown label-efficiency gains, but clinically meaningful robustness under device shift, demographic heterogeneity, comorbidity, and artifact remains a harder benchmark. Representation quality must ultimately be evaluated under these shifts.
What is the right scale?
Foundation-model thinking is attractive, but physiological datasets are fragmented by institution, protocol, device, and privacy constraints. This makes scale a sociotechnical problem as much as a modeling problem.
Conclusion
Self-supervised representation learning is one of the most promising research directions in human physiological data because it addresses a real structural bottleneck: biosignals are information-rich but label-poor. The field now has enough evidence to say that self-supervision is not merely portable to physiology. It is well matched to physiology, provided its objectives are designed with respect to temporal continuity, morphology, spatial structure, multimodal coupling, and the realities of biomedical evaluation.
The deeper lesson is that representation learning in physiology should not be treated as a generic transfer of machine-learning fashion. It is a signal-science problem. The success of the next generation of methods will depend less on adopting new acronyms than on designing self-supervised objectives that are faithful to how human physiological signals are generated, measured, and interpreted.