Regulatory Readiness with Synthetic Data — Part 3: A Playbook for AI Developers
Part 3 of a 3-part series on regulatory readiness with synthetic data.
Principles regulators reward
- Transparency: Every artifact is reproducible from logged seeds, prompts/parameters, and versions.
- Coverage: You can show why cohorts were chosen and how they close gaps in real data.
- Robustness: Stress tests and subgroup audits are first-class, not an appendix.
- Comparability: Synthetic results are always paired with real external validation.
Provenance & traceability
Create a minimal “data card” for every synthetic batch and store it in Git or your registry (a code sketch follows the lists below). At minimum, record:
Generation
- Engine/version & commit
- Seed range & sampler
- Physics/protocol params (e.g., vendor, field strength, dose)
- Pathology recipe & prevalence
Outputs
- Count by class, modality, anatomy
- Masks/labels schema version
- QC results (fidelity score, clinician pass rate)
Governance
- Access controls & hashes
- Non-memorization report ID
- Reviewer rerun script path
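A minimal sketch of how such a card could be captured in code, assuming a Python dataclass serialized to JSON; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass
class SyntheticDataCard:
    """Minimal per-batch data card; field names are illustrative, not a standard."""
    engine_version: str           # generator release, e.g. "synthgen 2.4.1"
    engine_commit: str            # git commit of the generation code
    seed_range: tuple             # (first_seed, last_seed) used for sampling
    sampler: str                  # e.g. "ddim", "ancestral"
    protocol_params: dict         # vendor, field strength, dose, ...
    pathology_recipe: dict        # lesion types and target prevalence
    counts: dict                  # images per class / modality / anatomy
    label_schema_version: str     # masks/labels schema version
    qc: dict                      # fidelity score, clinician pass rate
    non_memorization_report: str  # ID of the near-dup / leak-test report
    rerun_script: str             # path reviewers can execute to reproduce

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def content_hash(self) -> str:
        # Hash of the card itself, stored alongside the batch as a tamper check.
        return hashlib.sha256(self.to_json().encode()).hexdigest()
```

Storing the card's own hash next to the batch gives reviewers a quick tamper check and a stable reference for the change log.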
Coverage matrix & cohort design
Define a simple matrix that maps Patient × Acquisition × Disease. Mark the cells where real data is missing or sparse, then design synthetic cohorts to fill them (see the sketch after the table).
Axis | Examples | Real gaps | Synthetic cohort target |
---|---|---|---|
Patient | Pediatrics, pregnancy, BMI>35 | Pediatrics, very high BMI | 2k pediatrics; 1.5k BMI 35–45 |
Acquisition | Vendor A/B, 1.5T/3T, low-dose CT | Low-dose CT, Vendor B 3T | 3k low-dose CT; 1k Vendor B 3T |
Disease | Stage I–IV, multifocal | Early stage, subtle lesions | 1k Stage I subtle; 800 multifocal |
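A minimal sketch of the gap-finding step, assuming cohort metadata lives in a pandas DataFrame; the column names and axis levels below are illustrative placeholders:

```python
import pandas as pd

# Real-data metadata: one row per study, with the three coverage axes.
real = pd.DataFrame({
    "patient_group": ["adult", "adult", "pediatric"],
    "acquisition":   ["VendorA_1.5T", "VendorA_3T", "VendorA_1.5T"],
    "disease":       ["stage_II", "stage_III", "stage_I"],
})

# Every cell the evaluation plan says must be covered.
required = pd.MultiIndex.from_product(
    [["adult", "pediatric"],
     ["VendorA_1.5T", "VendorA_3T", "VendorB_3T", "low_dose_CT"],
     ["stage_I", "stage_II", "stage_III", "stage_IV"]],
    names=["patient_group", "acquisition", "disease"],
)

# Count real studies per cell, then list empty cells to target with synthetic cohorts.
counts = real.value_counts(["patient_group", "acquisition", "disease"])
counts = counts.reindex(required, fill_value=0)
gaps = counts[counts == 0].reset_index(name="real_count")
print(gaps)  # each row is a cell to fill with a synthetic cohort
```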
Bias audits & stress tests
Ship a fixed battery of tests so results stay comparable across releases (a subgroup-metrics sketch follows this list):
- Subgroup metrics: AUROC/F1 by age band, sex, site, vendor, BMI.
- Protocol shift: Train on Vendor A, test on Vendor B (synthetic + real).
- Edge conditions: Motion, low dose, borderline lesions.
- MRMC (if you run a reader study): Multi-reader, multi-case comparison of reader variability vs. the model, including synthetic hard cases.
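A minimal sketch of the subgroup-metrics step, assuming predictions live in a pandas DataFrame with hypothetical y_true, y_score, and subgroup columns:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df: pd.DataFrame, group_cols=("age_band", "sex", "site", "vendor")):
    """AUROC per subgroup; column names are assumptions, adapt to your schema."""
    rows = []
    for col in group_cols:
        for level, sub in df.groupby(col):
            if sub["y_true"].nunique() < 2:
                auc = np.nan  # AUROC is undefined when only one class is present
            else:
                auc = roc_auc_score(sub["y_true"], sub["y_score"])
            rows.append({"axis": col, "subgroup": level, "n": len(sub), "auroc": auc})
    return pd.DataFrame(rows)

# Example: results = subgroup_auroc(predictions_df)
# Emit the same table every release so reviewers can diff subgroup performance.
```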
Reproducibility & non-memorization checks
- Near-dup search: Perceptual hash + deep features against all source real datasets; report nearest-neighbor distances & thumbnails (a hash-based sketch follows this list).
- Leak tests: Train on synthetic data only; verify there are no suspicious performance spikes for specific IDs in the real test set.
- Determinism: Rerun with same seeds to prove identical outputs; bump engine version to show controlled drift.
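A minimal sketch of the perceptual-hash half of the near-dup search, assuming the imagehash and Pillow libraries and placeholder directory paths; deep-feature distances would be computed separately:

```python
from pathlib import Path

from PIL import Image
import imagehash

# Precompute perceptual hashes for every source real image (paths are placeholders).
real_hashes = {p: imagehash.phash(Image.open(p))
               for p in sorted(Path("data/real").glob("*.png"))}

def nearest_real_neighbor(synth_path: Path):
    """Closest real image to one synthetic image, by perceptual-hash Hamming distance."""
    synth_hash = imagehash.phash(Image.open(synth_path))
    best_path, best_hash = min(real_hashes.items(), key=lambda kv: kv[1] - synth_hash)
    return best_path, best_hash - synth_hash

for synth in sorted(Path("data/synthetic").glob("*.png")):
    neighbor, distance = nearest_real_neighbor(synth)
    # Small distances are candidates for the non-memorization report
    # (attach thumbnails and send for manual review).
    print(synth.name, neighbor.name, distance)
```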
Documentation: model card & data card
Bundle concise documents reviewers can skim quickly:
Model Card (4–6 pages)
- Intended use & contraindications
- Training datasets (real + synthetic) with counts
- Evaluation plan & success thresholds
- Subgroup/bias results & mitigations
- Monitoring & update policy
Data Card (per cohort)
- Provenance & parameters (see above)
- Coverage matrix deltas
- QC/clinical review notes
- Non-memorization evidence
Submission bundle checklist
Item | Contents | Where |
---|---|---|
Executive memo | Intended use, risk profile, summary of evidence | /docs/memo.pdf |
Model card | Training data, methods, results, monitoring | /docs/model-card.pdf |
Data cards | Real + synthetic cohorts with provenance | /docs/data-cards/*.pdf |
Audit pack | Subgroup tables, stress tests, MRMC (if any) | /evidence/audits/* |
Non-memorization | NN search, hash stats, leak tests | /evidence/privacy/* |
Repro scripts | One-click rerun with pinned seeds (see the sketch below) | /repro/run.sh |
External validation | Real-world site(s) results, CIs | /evidence/external/* |
Change log | Version diff, dataset deltas | /CHANGELOG.md |
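A minimal sketch of what the repro script might pin before regeneration, assuming Python with optional PyTorch; generate_cohort is a hypothetical placeholder for your actual generation entry point:

```python
import argparse
import random

import numpy as np

def pin_seeds(seed: int) -> None:
    """Seed every RNG the pipeline touches so a rerun is bit-identical."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        # Seed torch only if the generator actually uses it.
        import torch
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass

def main() -> None:
    parser = argparse.ArgumentParser(description="One-click rerun with pinned seeds.")
    parser.add_argument("--seed", type=int, default=1234)
    parser.add_argument("--data-card", default="docs/data-cards/batch-001.json")
    args = parser.parse_args()

    pin_seeds(args.seed)
    # generate_cohort() is a hypothetical entry point; every generation parameter
    # should come from the data card, not from code defaults, so the card alone
    # lets a reviewer reproduce the batch.
    # generate_cohort(args.data_card)

if __name__ == "__main__":
    main()
```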
Suggested timeline (6–8 weeks)
- Week 1: Lock coverage matrix, draft evaluation plan, freeze protocol settings.
- Weeks 2–3: Generate cohorts + QC; run the first subgroup/stress battery.
- Week 4: Iterate cohorts to close remaining gaps; finalize non-memorization report.
- Week 5: External real-world validation; compute CIs & delta vs baseline.
- Week 6: Package bundle (model/data cards, audits, repro scripts); internal review.
- Weeks 7–8 (buffer): Address reviewer questions; prepare addenda.
FAQ
How much synthetic data is “too much”?
There’s no single ratio. Use synthetic data to close documented gaps, then prove generalization on real external datasets. Reviewers care about the evidence chain, not a quota.
Do we need clinician review of synthetic images?
Yes, at least a spot-check with predefined criteria (artifact rate, anatomical plausibility, lesion realism). Log pass/fail and show examples.
What if results regress when we add synthetic?
Down-weight or isolate problematic cohorts; use your coverage matrix to identify the cause (e.g., protocol mismatch). Document the decision trail in your change log.