Semler Scientific

Our client’s device captures photoplethysmography (PPG) signals using a standard pulse oximeter during a brief Valsalva maneuver. RGT was brought in to build the AI system that turns those signals into a clinically useful prediction: whether a patient’s myocardial strain falls below a threshold associated with early cardiac dysfunction. That meant building everything from scratch: data pipelines, signal preprocessing, deep learning models, evaluation infrastructure, and a web application for clinical analysis, all within a regulated medical device context where reproducibility is not optional.

THE PROBLEM

What problem was the client facing?

Measuring myocardial strain with any precision normally requires echocardiography, which means specialist equipment, trained operators, and a clinic visit that many patients never make. The client had already deployed their pulse oximetry device widely for Peripheral Arterial Disease (PAD) screening, and the clinical question on the table was whether those same PPG signals, captured during a Valsalva maneuver, contained enough cardiac signal to predict Global Longitudinal Strain (GLS). If they did, it would mean a low-cost, non-invasive cardiac screening test that could work in a community clinic or, eventually, at home.

The problem was that no one had built the AI system to test that hypothesis rigorously. The raw data from four clinic sites was sitting in inconsistent formats with mismatched metadata, no quality filtering, and no agreed preprocessing standard. Early modelling work had been done in scattered notebooks with no reproducibility. There was no evaluation framework, no experiment tracking, and no cross-clinic validation protocol.

Why did it matter?

Heart failure with preserved ejection fraction (HFpEF) is one of the more frustrating diagnoses in cardiology because it often goes undetected until it is advanced. Reduced myocardial strain is an early marker that precedes visible symptoms, but standard echocardiography is out of reach for a large portion of the at-risk population. A validated AI model on top of the client’s device would put early strain screening within reach of that population.

For the client specifically, this was also about building a defensible product on new scientific ground, requiring the kind of documented, reproducible development process that could support a future FDA premarket submission.

Who was affected?

At the patient level: people at risk of undetected cardiac dysfunction who lack access to affordable early screening. At the clinical level: practitioners in community settings who cannot offer echo-equivalent diagnostics. Internally: the client’s product and regulatory teams needed a pipeline that could be documented for FDA review, and their technical partner needed a team who could actually deliver it.

Business, technical, and operational challenges

The regulatory context was the constant overhead. Everything had to be reproducible, documented, and defensible, which ruled out the usual exploratory shortcuts. The dataset was small for the task (a few hundred patients across four sites), which made generalisation hard and class imbalance a persistent problem. Signal quality varied substantially between clinics. And the team was distributed internationally, spanning Ghana and the US, which made handoff documentation and reproducible configs a requirement rather than a convenience.

PROJECT GOALS & OBJECTIVES

Main objective

Build and validate a machine learning model that classifies patients as healthy or at-risk based on predicted myocardial strain from PPG signals, using a binary threshold (GLS ≥ 16 = healthy, GLS < 16 = at-risk), with performance that holds across multiple independent clinic cohorts.

What success looked like

A model that generalises across clinic sites, not just one. Balanced Accuracy above 0.75 with Sensitivity above 0.85 (the priority metric clinically, since missing a sick patient is the worse error) and Specificity above 0.80. A full reproducible training and inference pipeline ready for integration into the client’s product infrastructure. And a working web application giving the clinical and product teams direct access to their data.
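The targets above can be made concrete with a small sketch. It labels patients using the GLS threshold of 16 from the main objective, treats at-risk as the positive class, and checks a set of predictions against the three success criteria. The function names and the example strain values are illustrative assumptions, not the project's code or data.

```python
from typing import Sequence

GLS_THRESHOLD = 16.0  # from the objective: GLS >= 16 healthy, GLS < 16 at-risk


def is_at_risk(gls: float, threshold: float = GLS_THRESHOLD) -> bool:
    """Return True when strain falls below the threshold (the positive class)."""
    return gls < threshold


def screening_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> dict:
    """Sensitivity, specificity and balanced accuracy, with at-risk as positive."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def meets_targets(m: dict) -> bool:
    """Success criteria as stated: BA > 0.75, sensitivity > 0.85, specificity > 0.80."""
    return (
        m["balanced_accuracy"] > 0.75
        and m["sensitivity"] > 0.85
        and m["specificity"] > 0.80
    )


# Made-up strain values, for illustration only.
measured_gls = [14.0, 15.5, 12.0, 18.0, 17.2, 19.0, 16.0, 20.1]
predicted_gls = [13.0, 16.4, 11.0, 18.5, 15.0, 19.3, 17.0, 21.0]

y_true = [is_at_risk(g) for g in measured_gls]
y_pred = [is_at_risk(g) for g in predicted_gls]
metrics = screening_metrics(y_true, y_pred)
passed = meets_targets(metrics)
```

In this toy example two of the three at-risk patients are caught and one healthy patient is flagged, so the sensitivity target is missed and `passed` is `False`, even though most individual predictions are correct.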

Secondary goals

Build a signal quality framework that could catch bad recordings automatically rather than requiring manual review. Develop interpretability tooling (occlusion sensitivity, Grad-CAM) so clinicians could have some visibility into what the model was responding to. Create a patient clustering system to understand phenotypic sub-populations in the data. And establish an MLOps foundation with experiment tracking, automated leaderboards, and reporting that could sustain ongoing model development.

KPIs and measurable targets

Primary: Balanced Accuracy, Sensitivity, Specificity, and Mean Absolute Error on strain regression, all tracked continuously with automated weekly and monthly leaderboard updates. Secondary: per-clinic performance breakdown (no single site should be carrying the overall result), signal rejection rate through the quality filter, and model inference latency.
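The per-clinic breakdown is worth illustrating, because a pooled number can look acceptable while one site performs at chance. The sketch below groups predictions by clinic and computes Balanced Accuracy for each. The clinic names, record layout, and data are hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(pairs):
    """pairs: iterable of (true_at_risk, predicted_at_risk) booleans."""
    pairs = list(pairs)
    pos = [p for t, p in pairs if t]
    neg = [p for t, p in pairs if not t]
    if not pos or not neg:
        return None  # undefined when only one class is present
    sensitivity = sum(pos) / len(pos)
    specificity = sum(not p for p in neg) / len(neg)
    return (sensitivity + specificity) / 2


def per_clinic_breakdown(records):
    """records: iterable of (clinic, true_at_risk, predicted_at_risk)."""
    by_clinic = defaultdict(list)
    for clinic, truth, pred in records:
        by_clinic[clinic].append((truth, pred))
    return {
        clinic: balanced_accuracy(pairs)
        for clinic, pairs in sorted(by_clinic.items())
    }


# Hypothetical predictions from two sites.
records = [
    ("clinic_a", True, True), ("clinic_a", True, True),
    ("clinic_a", False, False), ("clinic_a", False, False),
    ("clinic_b", True, False), ("clinic_b", True, True),
    ("clinic_b", False, False), ("clinic_b", False, True),
]

breakdown = per_clinic_breakdown(records)
pooled = balanced_accuracy((t, p) for _, t, p in records)
```

Here the pooled Balanced Accuracy is 0.75, but it is carried entirely by `clinic_a` (1.0) while `clinic_b` sits at 0.5, which is the situation the secondary KPI is meant to expose.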

OUR APPROACH & STRATEGY

The core bet was on InceptionTimePlus, a 1D convolutional architecture that has performed well on time series classification tasks, combined with a pre-trained checkpoint (PTCP) fine-tuning protocol. The idea: rather than training from scratch each time, which the dataset size makes risky, load a strong prior checkpoint, then fine-tune on the new data with a lower learning rate. That gave the experiments a better starting point and helped control overfitting.
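A minimal PyTorch sketch of that load-then-fine-tune pattern follows. It is not the project's training code: a small stand-in 1D CNN replaces InceptionTimePlus, the "prior checkpoint" is just freshly initialised weights saved in memory, the data is random, and both learning rates are hypothetical values chosen only to show the lower fine-tuning rate.

```python
import io

import torch
from torch import nn


def make_model() -> nn.Module:
    # Stand-in 1D CNN; the project itself used InceptionTimePlus from tsai.
    return nn.Sequential(
        nn.Conv1d(1, 8, kernel_size=7, padding=3),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),
        nn.Flatten(),
        nn.Linear(8, 2),
    )


torch.manual_seed(0)

# 1. A "prior best" checkpoint (here: initial weights saved to an in-memory buffer).
prior = make_model()
checkpoint = io.BytesIO()
torch.save(prior.state_dict(), checkpoint)
checkpoint.seek(0)

# 2. Load the checkpoint into a new model instead of starting from random weights.
model = make_model()
model.load_state_dict(torch.load(checkpoint))

# 3. Fine-tune with a lower learning rate than a from-scratch run would use.
BASE_LR = 1e-3      # hypothetical from-scratch rate
FINETUNE_LR = 1e-4  # hypothetical fine-tuning rate, 10x lower
optimizer = torch.optim.AdamW(model.parameters(), lr=FINETUNE_LR)
loss_fn = nn.CrossEntropyLoss()

signals = torch.randn(16, 1, 256)    # synthetic batch: 16 signals of 256 samples
labels = torch.randint(0, 2, (16,))  # synthetic healthy / at-risk labels

model.train()
for _ in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(signals), labels)
    loss.backward()
    optimizer.step()
```

The design point is in steps 2 and 3: the new run inherits whatever the earlier checkpoint learned, and the smaller step size limits how far a few hundred patients can pull the weights away from it.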

In parallel, we ran a second track using a PPG foundation model to generate 512-dimensional signal embeddings per patient phase (baseline, whistle, recovery), then applied UMAP, HDBSCAN, and Gaussian Mixture Models to characterise the signal space in an unsupervised way. That analysis did not depend on strain labels, which made it valuable for understanding the data independent of model performance.

Phase 1

Data Infrastructure & Quality Framework.

Multi-clinic data ingestion, signal pipeline construction, metadata alignment, and automated quality filtering: noise detection, cyclic range validation, Valsalva window detection.

Phase 2

Baseline Modelling & Evaluation Framework.

5-fold cross-validation protocol, classification and regression metrics, confusion matrix reporting, and the first baseline model.

Phase 3

Model Development & Optimisation.

PTCP fine-tuning, multi-input model variants (signal + tabular features), hyperparameter sweeps, learning curve analysis, and per-clinic drill-down evaluation.

Phase 4

Foundation Model Integration.

PPG embedding pipeline, UMAP/HDBSCAN clustering, intrinsic dimensionality estimation, and cohort characterisation.

Phase 5

Web Application.

Full-stack analysis application: population overview, cluster analysis, signal visualisation, scalar feature comparison across health status and clinic, and signal reconstruction.

Phase 6

Clinical Validation & Documentation.

Cross-clinic performance analysis, threshold optimisation, and technical documentation prepared for regulatory review.
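The 5-fold protocol from Phase 2 can be sketched in plain Python. The version below splits at the patient level and stratifies by class so each fold keeps roughly the same healthy to at-risk ratio. Stratification, the function name, and the synthetic labels are assumptions for illustration, since the exact fold assignment used in the project is not described here.

```python
import random
from collections import defaultdict


def stratified_folds(labels: dict, n_folds: int = 5, seed: int = 0) -> list:
    """Split patient IDs into n_folds lists with similar class proportions.

    labels maps patient_id -> class label (e.g. 0 healthy, 1 at-risk).
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_class = defaultdict(list)
    for patient_id, label in sorted(labels.items()):
        by_class[label].append(patient_id)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue round-robin across classes so fold sizes stay even
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        for i, patient_id in enumerate(ids):
            folds[(i + offset) % n_folds].append(patient_id)
        offset += len(ids)
    return folds


# Synthetic cohort: 40 patients, 10 of them at-risk.
labels = {f"p{i:03d}": int(i % 4 == 0) for i in range(40)}
folds = stratified_folds(labels)

# Each fold serves once as the validation set; the rest form the training set.
splits = []
for k, val_ids in enumerate(folds):
    train_ids = [p for j, fold in enumerate(folds) if j != k for p in fold]
    splits.append((train_ids, val_ids))
```

Splitting on patient IDs rather than on individual recordings keeps every patient's data on one side of the train/validation boundary within a fold.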

What informed our decisions

Clinical domain expertise from the client’s team shaped signal selection, the Valsalva protocol design, and the logic behind the strain thresholds. Experiment analytics drove architecture and hyperparameter choices. Signal quality analysis, including reconstruction studies and fiducial point analysis, drove preprocessing decisions. FDA feedback on the first premarket submission, which specifically asked for detailed model descriptions, training paradigms, regularisation techniques, and quality control criteria, shaped how the pipeline was documented.

Constraints

Dataset size was the binding constraint throughout. Hundreds of patients spread across four clinics is not a lot when you’re trying to generalise across clinic environments with different hardware and patient demographics. The regulatory context added documentation overhead. Coordinating a distributed team across Ghana and the US meant that session handoff documentation and reproducible configs were not nice-to-haves.

EXECUTION AND DELIVERABLES

What we delivered

ML Training Library: end-to-end training pipeline with 5-fold CV, InceptionTimePlus with PTCP fine-tuning, multi-input tabular fusion, evaluation, and experiment tracking.

Signal Processing & Feature Extraction Library: shared module with 12+ feature categories including scalar PPG metrics, HRV, spectral, entropy, and autoregressive features.

Clinical Analysis Web Application: 7-page app covering population overview, PPG embedding cluster analysis, signal visualisation, scalar feature comparison, and signal reconstruction.

Automated Experiment Reporting: HTML dashboards, per-run metric reports, learning curve analysis, and automated performance leaderboards.

Data Quality Framework: automated rejection pipeline covering noise detection, low cyclic range filtering, Valsalva window detection, and clinic-level exclusion logic.

Technical Documentation: full documentation covering the pipeline, model architectures, evaluation metrics, and operating procedures.

Technologies and tools

Python, PyTorch, PyTorch Lightning, tsai (InceptionTimePlus), Hydra, Weights & Biases, Streamlit, UMAP, HDBSCAN, scikit-learn, pyPPG, pandas, scipy, statsmodels, Poetry.

Standout technical decisions

PTCP Fine-Tuning Protocol

Loading a prior best checkpoint and fine-tuning with conservative learning rates let us improve incrementally without the overfitting risk of full retraining on a small dataset. Each experiment built on evidence from the last.

Foundation Model Embedding Track

Embedding PPG signals into a 512-dimensional space gave us a label-independent way to characterise the patient population. The resulting clusters revealed phenotypic structure in the data that the supervised models could not surface on their own.

Automated Valsalva Detection

Rather than relying on manual annotation of the whistle phase window, we built a signal processing algorithm to detect it automatically. At the dataset scale we were working with, that was the difference between a reproducible pipeline and one that required constant human intervention.

Leave-One-Out Permutation Analysis

For the embedding analysis, leave-one-out permutation produced mean clustering performance very close to full permutation at a fraction of the compute cost, an insight that came directly from the team’s own experimentation.

CHALLENGES & SOLUTIONS

Signal quality heterogeneity

PPG signals from four clinic sites looked meaningfully different from each other: different noise profiles, different baseline wanders, different Valsalva compliance. Pooling them naively hurt performance and obscured where the model was actually struggling. The fix was a clinic-aware quality framework with per-subject rejection logging and per-clinic performance drill-downs in every evaluation report, so signal quality issues showed up as data problems rather than model problems.
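A rough sketch of rejection with per-subject logging is shown below. It covers only one check and rests on an assumption: that "low cyclic range" can be approximated by the median peak-to-trough amplitude over windows at least one cardiac cycle long. The window length, threshold, field names, and synthetic signals are all hypothetical.

```python
import math
from statistics import median


def windowed_range(signal, window):
    """Median of (max - min) over consecutive non-overlapping windows."""
    spans = [
        max(signal[i:i + window]) - min(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, window)
    ]
    return median(spans) if spans else 0.0


def screen_recordings(recordings, window, min_range):
    """recordings: iterable of (subject_id, clinic, samples).

    Returns the kept subject IDs and a log with one entry per rejection.
    """
    kept, rejected = [], []
    for subject_id, clinic, samples in recordings:
        value = windowed_range(samples, window)
        if value < min_range:
            rejected.append({
                "subject": subject_id,
                "clinic": clinic,
                "reason": "low_cyclic_range",
                "value": round(value, 3),
            })
        else:
            kept.append(subject_id)
    return kept, rejected


FS = 50  # assumed sampling rate in Hz
strong = [math.sin(2 * math.pi * 1.2 * n / FS) for n in range(500)]  # ~72 bpm pulse
weak = [0.02 * x for x in strong]  # same waveform, almost no pulsatile amplitude

kept, rejected = screen_recordings(
    [("s1", "clinic_a", strong), ("s2", "clinic_b", weak)],
    window=FS,      # 1 s windows, longer than one cycle at 1.2 Hz
    min_range=0.5,  # hypothetical threshold
)
```

Because each rejection carries the subject, the clinic, and a reason, the log can be aggregated per site, which is what lets a quality problem at one clinic show up as a data issue rather than as unexplained model error.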

Class imbalance

The healthy and at-risk classes were not evenly distributed, which made standard accuracy a misleading metric. We standardised on Balanced Accuracy and weighted cross-entropy loss across all experiments from the beginning, which kept the model honest about both classes.
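To show why the weighting matters, here is a small pure-Python sketch. It assumes inverse-frequency class weights, which is one common choice and not necessarily the scheme used in the project, and it normalises by the sum of weights, which is how PyTorch's `CrossEntropyLoss` averages when class weights are supplied. The labels and probabilities are made up.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Imbalanced toy set: 8 healthy (0), 2 at-risk (1).
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)

# A lazy model that says "healthy" with 90% confidence for everyone.
probs = [[0.9, 0.1]] * len(labels)

unweighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
```

With these weights each class contributes half of the loss regardless of its size, so the always-healthy model is penalised much more heavily under the weighted loss than under the unweighted one.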

Small dataset, large model capacity

InceptionTimePlus has enough capacity to memorise a small dataset. The PTCP protocol was partly an answer to this: starting from a stronger prior and fine-tuning carefully reduced the window for overfitting. The learning curve analysis also gave us an evidence-based view of where data saturation was setting in, which informed the client’s decisions about data collection priorities.

Metadata alignment across clinics

Patient identifiers, clinic codes, and strain measurement sources varied by site in ways that were not always documented. We built a multi-fallback metadata lookup that tried several resolution strategies in sequence and logged every failure for manual review. Not elegant, but it was the only approach that actually caught everything.
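The shape of such a lookup can be sketched as a list of resolution strategies tried in order, with anything unresolved recorded for review. The identifier format, the three strategies, and the metadata table below are hypothetical and stand in for the site-specific rules the real pipeline needed.

```python
import logging
from typing import Optional

logger = logging.getLogger("metadata_lookup")

# Hypothetical metadata table keyed by a canonical "CLINIC-NNNN" identifier.
METADATA = {
    "A-0012": {"clinic": "A", "gls": 17.4},
    "B-0007": {"clinic": "B", "gls": 14.1},
}


def exact(raw: str) -> Optional[str]:
    return raw if raw in METADATA else None


def normalised(raw: str) -> Optional[str]:
    # Tolerate case, whitespace and underscore differences: " a_0012 " -> "A-0012"
    key = raw.strip().upper().replace("_", "-")
    return key if key in METADATA else None


def zero_padded(raw: str) -> Optional[str]:
    # Tolerate missing zero padding: "b-7" -> "B-0007"
    clinic, sep, number = raw.strip().upper().replace("_", "-").partition("-")
    if not sep or not number.isdigit():
        return None
    key = f"{clinic}-{int(number):04d}"
    return key if key in METADATA else None


STRATEGIES = [exact, normalised, zero_padded]  # tried in this order


def resolve(raw_id: str, failures: list) -> Optional[dict]:
    """Return the metadata row for raw_id, or None after logging the failure."""
    for strategy in STRATEGIES:
        key = strategy(raw_id)
        if key is not None:
            return METADATA[key]
    failures.append(raw_id)
    logger.warning("unresolved patient id: %r", raw_id)
    return None


failures = []
rows = [resolve(raw, failures) for raw in ["A-0012", " a_0012 ", "b-7", "C-0001"]]
```

The first three identifiers resolve through successive strategies, and `"C-0001"` ends up in `failures`, so nothing is dropped silently.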

Major pivots

The original scope was primarily supervised regression and classification from raw signals. After learning curve experiments showed model saturation at current dataset sizes, the foundation model embedding track was added as a parallel strategy. The unsupervised structure in the embeddings revealed phenotypic groupings that the supervised metrics alone did not show. The web application also grew well beyond its initial prototype scope into a full clinical analysis platform the client now uses for research and partner demonstrations.

RESULTS & IMPACT

What changed

The client now has a reproducible ML pipeline, and the analysis web application gives their clinical and product teams direct access to patient cohort characterisation, signal quality review, and model performance analysis. Before RGT’s engagement, none of that existed in any operational form.

Model performance

Best single model (InceptionTimePlus, signal-only): Balanced Accuracy 0.793, MAE 2.19. This is an 80th-percentile result across replicated runs, meaning it is a conservative and reproducible number, not a cherry-picked outlier.

Best ensemble configuration (Standard ensemble, 3 operating cutpoints): Balanced Accuracy 0.839, MAE 1.81. The ensemble approach materially improves both metrics and reflects how the system is designed to operate in production.

Over 50 experimental runs were tracked across multiple architectures (InceptionTimePlus, Transformer, ResNet), feature configurations, and preprocessing variants, including regression-only, classification-only, and multi-input variants.

Learning curve analysis quantified the relationship between dataset size and model performance, giving the client a data-driven basis for their data collection strategy.

Regulatory impact

FDA feedback on the first premarket submission specifically asked for detailed documentation of model architecture, training paradigms, feature selection, loss functions, regularisation, and quality control criteria. The pipeline RGT built, with reproducible configs, run artifacts, per-fold evaluation reports, and the signal quality framework, is designed to generate exactly that documentation. The next submission will have the technical substance the FDA is asking for.

Client feedback

“This is excellent, easy to follow work. Thank you.” — Client domain consultant, on the signal preprocessing and reconstruction work

“This is an outstanding study and shows the potential of different approaches of finding the optimal combinations that we have discussed historically, plus many more... your work is yielding excellent results and the full permutation, or leave one out, is performing better than our top performers, which suggests that this mining process was extremely fruitful.” — Client domain consultant, on the embedding cluster analysis

“This is excellent! Nice work. I think I may have started drooling while looking at the 3D HTML.” — Client domain consultant, on the 3D cluster visualisation