Subject Invariance Demonstrated. The Verification Network Is Live.
Every biosignal AI company makes the same pitch: "Our model detects disease X with Y% accuracy." Then they ask you to believe them.
We are not going to do that. Instead, we are going to show you a competition result you can verify yourself, an architecture that explains why it works, and a live network where you can test it. If our claims are wrong, you will know within minutes.
This post covers two things we shipped this week: proof that subject invariance works at competition scale, and the production infrastructure that makes verification the default for every model in the health economy.
Research disclosure: The models described in this post identify statistical patterns in biosignal data. They are not FDA-cleared diagnostic devices. Any Software-as-a-Medical-Device (SaMD) application requires independent FDA clearance. ParagonDAO certifies model performance against benchmarks, not clinical safety or diagnostic accuracy.
The Problem That Kills Health AI
Every person's biology is different. Skull thickness changes how EEG electrodes read brain waves. Saliva composition varies by genetics, diet, hydration, and time of day. Metabolomic profiles shift with age, medication, and microbiome. Voice characteristics depend on anatomy, language, emotional state, and whether you have a cold.
Most machine learning models memorize the individual. They learn that Patient #47 has a certain skull shape and produce great predictions for Patient #47. They fail on Patient #48.
This is not a minor technical limitation. This is the reason over 80% of health AI never leaves the laboratory. A model that works only on the people it trained on cannot be deployed to the general population. It cannot be sold to a hospital. It cannot be trusted by a regulator. It cannot save a life.
The field calls this the generalization problem. The specific version that matters for health is called subject invariance: can the model predict health outcomes for subjects it has never seen, without any calibration or fine-tuning?
For decades, the answer has been: not reliably enough. Our recent results demonstrate a path forward.
What We Proved at NeurIPS 2025
The NeurIPS 2025 EEG Foundation Model Challenge asked a specific question: given brain recordings from a set of subjects, can a model predict behavioral measures for completely unseen subjects?
The rules were strict:
- 14 subjects for training
- 3 subjects for validation
- 3 subjects for testing — completely held out, zero data overlap
- Over a thousand teams competing worldwide
- The metric: normalized error (model MSE divided by baseline MSE, lower is better — 1.0 means no improvement over baseline)
Our GLE encoder achieved 0.70879 normalized error on the official challenge evaluation protocol and held-out test set. The top-ranked team (JLShen) scored 0.97843. The second place team (MBZUAI, Mohamed bin Zayed University of AI) scored 0.98519. The third place team (MIN~C²) scored 0.98817.
Here is what those numbers actually mean.
The baseline score is 1.0 — that is the "no improvement" line. Every fraction below 1.0 represents how much better a model performs than the baseline. The top-ranked team improved by 0.02157 below baseline. We improved by 0.29121 below baseline.
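The arithmetic behind that comparison is simple enough to check by hand, using only the scores quoted above:

```python
# Normalized error: model MSE / baseline MSE, so 1.0 means "no improvement".
baseline = 1.0
ours = 0.70879       # GLE encoder, official held-out test set
top_team = 0.97843   # top-ranked team (JLShen)

our_improvement = baseline - ours        # 0.29121 below baseline
their_improvement = baseline - top_team  # 0.02157 below baseline

ratio = our_improvement / their_improvement
print(round(ratio, 1))  # -> 13.5
```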
We improved over 13x more than the top-ranked team.
Not 13% more. Not 1.3x more. Over thirteen times more improvement on the metric that defines whether a model works on new humans.
To put this in perspective: in competition machine learning, beating the leader by 2-3% is a strong result. Beating them by 10% usually means a novel architecture. Improving over 13x more than the leader — across over a thousand teams from institutions worldwide — means the approach is fundamentally different from everything else that was tried.
It is.
Why the Architecture Matters More Than the Score
The score itself is important for the leaderboard. What matters more for the health economy is how we achieved it, because the method transfers to every other biosignal.
The GLE encoding pipeline follows the same architectural pattern across all modalities:
1. Raw biosignal — from any sensor (EEG headband, LC-MS spectrometer, saliva Raman device, phone microphone, ECG patch)
2. Band powers — extract frequency-band energy (5 bands × 4 channels = 20 features for EEG; parameters tuned per modality)
3. Normalize — zero-mean, unit-variance standardization
4. DCT-II — Discrete Cosine Transform converts to 128 frequency coefficients (the same compression math used in JPEG, applied to biosignals)
5. Transformer — learned attention over frequency patterns
6. Prediction — task-specific output (disease score, health classification, risk level)
Steps 1-5 are the encoder. Step 6 is the head. What changes per disease model is the sensor at step 1 and the head at step 6. The encoder architecture is universal. This is the GLE patent (US Provisional 63/985,936).
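The front half of the encoder (band powers, normalization, DCT-II) can be sketched in a few lines. This is a rough illustration, not the GLE code: the sampling rate, band edges, and zero-padding to 128 coefficients are assumptions, and the transformer and prediction head are omitted.

```python
import numpy as np
from scipy.fft import dct

FS = 256  # assumed sampling rate, Hz
# Assumed EEG band edges (delta through gamma), 5 bands
BANDS = [(0.5, 4), (4, 8), (8, 13), (13, 30), (30, 45)]

def band_powers(signal):
    """Step 2: energy per frequency band for one channel."""
    freqs = np.fft.rfftfreq(len(signal), d=1 / FS)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in BANDS])

def encode(channels):
    """Steps 2-4: 4 channels -> 20 features -> 128 DCT-II coefficients."""
    feats = np.concatenate([band_powers(c) for c in channels])  # 5 x 4 = 20
    feats = (feats - feats.mean()) / (feats.std() + 1e-8)       # step 3
    padded = np.zeros(128)           # pad before the transform (an assumption)
    padded[:len(feats)] = feats
    return dct(padded, type=2, norm="ortho")                    # step 4

coeffs = encode([np.random.randn(FS) for _ in range(4)])
```

The output of this sketch would feed the transformer in step 5; the point is that nothing up to here depends on which disease the head predicts.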
On top of this universal pipeline, we apply domain adversarial training: a gradient reversal layer that forces the network to learn features that are useful for the health prediction but useless for identifying the individual. During training, a secondary network tries to guess which subject produced each sample. The reversal layer flips the gradient flowing back from that network, so every encoder update makes subject identification harder — literally training the model to forget who you are while remembering how healthy you are.
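In an autograd framework the gradient reversal layer is a single custom op, as in the original domain-adversarial training work (Ganin and Lempitsky): identity on the forward pass, sign-flipped gradient on the backward pass. A minimal numpy sketch of just that mechanism, with an illustrative reversal strength:

```python
import numpy as np

LAMBDA = 1.0  # reversal strength; the schedule used in training is not published

def grl_forward(x):
    # Forward pass: identity. Features reach the subject classifier unchanged.
    return x

def grl_backward(grad_from_subject_classifier):
    # Backward pass: flip and scale the gradient before it reaches the
    # encoder, so encoder updates make subject identification harder.
    return -LAMBDA * grad_from_subject_classifier

g = grl_backward(np.array([0.5, -0.25]))
```

Because the health-prediction head's gradient passes through unchanged while the subject classifier's gradient arrives negated, the encoder is pulled toward features that predict health and away from features that identify people.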
The result is an encoder that captures the health-relevant frequency patterns that are universal across humans and discards the individual-specific noise. This is what "subject-invariant" means in practice: the features the model learns cannot tell people apart, but they can tell health states apart.
The Verification Network
Proving a model works is only half of the problem. The other half is proving it to everyone else.
Today's health AI operates on trust. A company publishes a paper, claims a certain accuracy, and asks regulators to believe them. There is no independent, real-time, permissionless way for a doctor, a researcher, a regulator, or an investor to verify that a model does what it claims.
This is why health AI takes 17 years on average to reach clinical practice. Seventeen years of patients who could benefit from screening that a proven model could provide, because the infrastructure for establishing trust does not exist.
Today we are launching the ParagonDAO Verification Network — and it changes this dynamic.
Every model in the network gets a live verification page. On that page, anyone can:
See the benchmarks. Competition scores, test set metrics, per-subject breakdowns. Every claim linked to a verifiable source. Click through to the NeurIPS competition page and confirm it yourself.
Run live benchmark verifications. Send raw sensor data to the API. Get a model output back. Compare it against known ground truth to verify model performance. You do not need to understand DCT-II transforms or frequency coefficients — the GLE encoding happens on our infrastructure. You send a raw signal, you get an output.
Re-verify the full test set. Request a complete re-run of the model against the held-out test data. See updated metrics. Confirm that nothing has degraded.
Audit the methodology. Subject-level splits, zero overlap guarantees, adversarial training descriptions. Enough detail for a scientist to evaluate the approach without enough detail to reproduce the proprietary encoding.
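On the client side, the "run live benchmark verifications" step reduces to packaging a raw signal and posting it. The field names and helper below are hypothetical, not the documented API; the actual contract is at paragondao.org/verify. The point is what the client does not need: no band-power extraction, no DCT, just raw samples and minimal metadata.

```python
import json

def build_verification_request(raw_signal, sample_rate_hz, modality="eeg"):
    # Hypothetical payload shape. Encoding (band powers, DCT-II) happens
    # server-side, so the client sends only raw samples plus metadata.
    return json.dumps({
        "modality": modality,
        "sample_rate_hz": sample_rate_hz,
        "signal": list(raw_signal),
    })

payload = build_verification_request([0.12, -0.08, 0.03], 256)
```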
This is the first production-ready core feature of the ParagonDAO network. It is what makes ParagonDAO a benchmark verification authority — it certifies that a model achieves its claimed performance on held-out test data, not that the model is safe or effective for clinical use. Clinical safety and diagnostic effectiveness remain FDA's exclusive domain for SaMD applications.
The verification page is live at paragondao.org/verify. Send raw EEG data. Get an output. Check the competition scoreboard. Run the full test set verification. Read the methodology.
We do not ask you to trust us. We ask you to verify.
Why Verification Saves Lives
The argument is simple and it has two parts.
Part one: subject invariance means the model generalizes to unseen subjects. This is a prerequisite for any model that will eventually serve new patients, and historically the hardest technical barrier to clear. Without it, health AI is stuck in the lab — impressive on training data, useless in deployment. With it, a model trained on subjects from one site can be evaluated against subjects from any other site — and the verification network makes that evaluation immediate and self-service.
Part two: verification means doctors and regulators can confirm this themselves. The 17-year gap between research and practice exists because trust is expensive and slow to build. A published paper helps. A peer review helps more. But a live API where you can send data and compare model outputs against your own ground truth in real time — that is a fundamentally different level of evidence.
When a researcher at the University of Utah can hit an endpoint with their own EEG data and see the model predict accurately for their own subjects, that is not a claim in a slide deck. That is an experience. When a hospital administrator can run the full test set verification and see updated metrics, that is not a marketing page. That is an audit.
The verification network does not replace clinical trials, regulatory clearance, or any step in the FDA approval process. It accelerates the evaluation phase by making model evidence immediately accessible to researchers, regulators, and developers who need to assess it before initiating formal regulatory pathways. Models that can be evaluated faster can be validated sooner. Models validated sooner reach patients who need them.
What This Proves for Every Other Disease Model
Subject invariance is not a per-model achievement. It is a property of the encoding architecture.
Brain signals vary more from person to person than most other biosignals — skull geometry, cortical folding, electrode impedance, and neural anatomy all differ dramatically between people. If adversarial training produces subject-invariant features from brain waves, the same technique is expected to produce subject-invariant features from less-variable signals. Each new modality requires its own empirical validation, but the architectural proof-of-concept on the hardest signal type is a strong foundation.
The following models are at research validation stages and are not cleared for clinical diagnostic use:
Type 2 Diabetes (serum metabolomics, LC-MS) — The T2D model is trained on blood serum metabolomics data (Ma et al. 2025, Metabolomics Workbench PR002101), using clinical markers including HbA1c, fasting blood glucose, creatinine, and liver enzymes — markers that are only available from blood serum, not saliva. A GLE encoder trained with adversarial subject normalization on LC-MS serum profiles is expected to generalize across new subjects, pending clinical validation. The metabolomic frequency patterns in serum are more consistent across individuals than neural frequency patterns.
Parkinson's, Alzheimer's, and cancer biomarker detection (saliva Raman) — These models use saliva-based Raman spectroscopy — a different modality from the serum-based T2D model. Same encoding pipeline architecture, different prediction heads. The subject-invariance architecture applies identically. Each requires independent clinical validation for its specific intended use.
Breathing health (audio) — Respiratory frequency patterns are highly consistent across individuals — everyone breathes with the same physics. Subject invariance was easier to achieve here than for EEG because the inter-subject variability is lower.
Mental health pattern detection (voice + EEG fusion) — Our mission partner Promise2Live is exploring how verified models could support — not replace — human crisis counselors in the future. Each modality uses subject-invariant encoding independently, and the fusion layer combines them. The hardest modality (EEG) is already proven. Any deployment in crisis contexts will require independent clinical validation, IRB oversight, and compliance with SAMHSA guidelines.
What This Means for the Network
In Why Not Just Train All the Models Ourselves?, we argued that the health economy must be a builder network — thousands of domain experts training models on their own data, certified by an independent governance layer.
Subject invariance and the verification network are what make that architecture production-ready.
For builders: Train your model on your data. The GLE encoder handles the frequency-domain encoding. Domain adversarial training handles subject invariance. The verification network certifies your results. You go from data to deployed, verified model without building any of the infrastructure yourself.
For partners: You can evaluate any model in the network by hitting the live API with your own data. No NDAs, no months of pilot negotiations, no "trust our results." The verification is self-service.
For investors: The verification network is the protocol layer that turns individual models into a credible ecosystem. The moat is not one model — it is a network of verified models that compounds with every addition. Every builder who ships makes the network harder to displace. Every verification makes the next model more credible.
For mission partners: Promise2Live and other crisis organizations get models that are not only accurate but independently verified — a foundation for the clinical validation and regulatory compliance that responsible deployment requires.
The Numbers
| Metric | Value |
|---|---|
| NeurIPS normalized error | 0.70879 |
| Improvement vs. baseline | 0.29121 |
| Top-ranked team's improvement | 0.02157 |
| Our improvement vs. theirs | Over 13x |
| Competing teams | Over 1,000 |
| Test samples verified | 10,717 |
| Test subjects (unseen) | 3 |
| Warm prediction latency | 7ms |
| Benchmark research models | 6 |
| Verification endpoints | Live |
The Bigger Picture
This is a genuine milestone for biosignal AI. Subject invariance has been the unsolved problem — the wall that keeps the vast majority of health models trapped in laboratories. An encoder that generalizes across unseen humans with over 13x more improvement than the next best approach is the kind of result that changes what is architecturally possible.
But we built it for a reason bigger than a competition score.
The real challenge is not making a model that works on three unseen test subjects. The real challenge is building infrastructure so that every person on the planet has access to the same quality of health intelligence — regardless of where they were born, what language they speak, what they can afford, or which hospital is nearest. The health economy is how we fund that mission at scale. Every verified model deployed, every builder who ships, every person screened — that is the infrastructure growing, and that is the economy growing. They are the same thing.
Is this hard? The hardest ultramarathons on earth push a single human to run 100 miles — and that is considered one of the most difficult individual feats a person can achieve. What we are building is not one person running 100 miles. It is 8 billion people running in synchrony. The obstacles are not physical. They are institutional inertia, mistrust, misaligned incentives, and time. No one controls 100% of the variables.
But it is not impossible. And if it is possible, and it saves lives at planetary scale, then it is worth building toward — together.
Subject invariance is what makes it technically feasible. The verification network is what makes it trustworthy. Today, for the first time, both exist in production.
Note: Disease models referenced in this post identify statistical patterns in biosignal data and have been evaluated against published benchmarks. They are not FDA-cleared medical devices. Software-as-a-Medical-Device (SaMD) applications built on GLE require independent FDA clearance by the builder or deploying organization. ParagonDAO certifies model performance against benchmarks, not clinical safety or diagnostic accuracy. Competition results were obtained using the official evaluation protocol and held-out test set from the NeurIPS 2025 EEG Foundation Model Challenge, independently administered. The over 13x improvement figure represents the ratio of our improvement below baseline (0.29121) to the top-ranked team's improvement below baseline (0.02157), yielding approximately 13.5x.
Philip Phuong Tran is the founder of Univault Technologies, the research lab behind the GLE encoder and the ParagonDAO network. Visit paragondao.org/verify to see the live verification network, and read The Health Economy whitepaper for the full architecture.
