Foraker, R. E. et al. Spot the difference: Comparing results of analyses from real patient data and synthetic derivatives. JAMIA Open https://doi.org/10.1093/jamiaopen/ooaa060 (2020).
Tucker, A. et al. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 1–13. https://doi.org/10.1038/s41746-020-00353-9 (2020).
Wang, Z., Myles, P. & Tucker, A. Generating and evaluating synthetic UK primary care data: Preserving data utility and patient privacy. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba. 126–131. https://doi.org/10.1109/CBMS.2019.00036 (2019).
Wang, Z., Myles, P. & Tucker, A. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy. Comput. Intell. 37, 819–851 (2021).
Reiner Benaim, A. et al. Analyzing medical research results based on synthetic data and their relation to real data results: Systematic comparison from five observational studies. JMIR Med. Inform. 8, e16492 (2020).
Mendelevitch, O. & Lesh, M.D. Fidelity and Privacy of Synthetic Medical Data. arXiv:2101.08658 [cs] (2021).
Muniz-Terrera, G. et al. Virtual cohorts and synthetic data in dementia: An illustration of their potential to advance research. Front. Artif. Intell. 4, 613956 (2021).
Foraker, R. et al. Analyses of original and computationally-derived electronic health record data: The National COVID Cohort Collaborative. J. Med. Internet Res. https://doi.org/10.2196/30697 (2021).
Azizi, Z. et al. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11, e043497 (2021).
El Emam, K. et al. Evaluating the utility of synthetic COVID-19 case data. JAMIA Open 4, ooab012 (2021).
Beaulieu-Jones, B. K. et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ. Cardiovasc. Qual. Outcomes 12, e005122 (2019).
Polonetsky, J. & Renieris, E. 10 Privacy Risks and 10 Privacy Technologies to Watch in the Next Decade. Future of Privacy Forum (2020).
Guo, A. et al. The use of synthetic electronic health record data and deep learning to improve timing of high-risk heart failure surgical intervention by predicting proximity to catastrophic decompensation. Front. Digit. Health https://doi.org/10.3389/fdgth.2020.576945 (2020).
Haendel, M. A. et al. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. J. Am. Med. Inform. Assoc. 28, 427–443 (2021).
CMS. CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF). https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF. Accessed 17 July 2022 (2022).
Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy-IEEE Conference Publication. https://ieeexplore-ieee-org.proxy.bib.uottawa.ca/abstract/document/8787436. Accessed 31 Aug 2019 (2019).
Synthetic data at CPRD. Medicines & Healthcare products Regulatory Agency. https://www.cprd.com/content/synthetic-data. Accessed 24 Sep 2020 (2020).
NHS England. A&E Synthetic Data. https://data.england.nhs.uk/dataset/a-e-synthetic-data. Accessed 16 July 2022 (2022).
Synthetic dataset. Integraal Kankercentrum Nederland. https://iknl.nl/en/ncr/synthetic-dataset. Accessed 20 Nov 2021 (2021).
The Simulacrum. The Simulacrum. https://simulacrum.healthdatainsight.org.uk/. Accessed 27 Nov 2021 (2021).
SNDS synthétiques. Systeme National des Donnees de Sante. https://documentation-snds.health-data-hub.fr/formation_snds/donnees_synthetiques/. Accessed 20 Jan 2022 (2021).
#opendata4covid19 Website User Manual. https://rtrod-assets.s3.ap-northeast-2.amazonaws.com/static/tools/manual/COVID-19+website+manual_v2.1.pdf. Accessed 8 Apr 2020 (2020).
Lun, R. et al. Synthetic data in cancer and cerebrovascular disease research: A novel approach to big data. PLOS ONE 19, e0295921 (2024).
Karr, A. et al. A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60, 224–232 (2006).
El Emam, K. et al. Utility metrics for evaluating synthetic health data generation methods: Validation study. JMIR Med. Inform. 10, e35734 (2022).
Goncalves, A. et al. Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol. https://doi.org/10.1186/s12874-020-00977-1 (2020).
Platzer, M. & Reutterer, T. Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data. arXiv:2104.00635 [cs, stat] (2021).
El Emam, K., Mosquera, L. & Zheng, C. Optimizing the synthesis of clinical trial data using sequential trees. J. Am. Med. Inform. Assoc. https://doi.org/10.1093/jamia/ocaa249 (2020).
National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. http://www.ncbi.nlm.nih.gov/books/NBK547537/. Accessed 28 July 2023 (National Academies Press (US), 2019).
Grund, S., Lüdtke, O. & Robitzsch, A. Using synthetic data to improve the reproducibility of statistical results in psychological research. Psychol. Methods (2022).
Morris, T. P., White, I. R. & Crowther, M. J. Using simulation studies to evaluate statistical methods. Stat. Med. 38, 2074–2102 (2019).
Rubin, D. Discussion: Statistical disclosure limitation. J. Off. Stat. 9, 462–468 (1993).
Raghunathan, T., Reiter, J. & Rubin, D. Multiple imputation for statistical disclosure control. J. Off. Stat. 19, 1–16 (2003).
Reiter, J. P. Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18, 531–543 (2002).
Raab, G. M., Nowok, B. & Dibben, C. Practical data synthesis for large samples. J. Priv. Confident. 7, 67–97 (2016).
Reiter, J. P. New approaches to data dissemination: A glimpse into the future (?). Chance 17, 11–15 (2004).
Park, N. et al. Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11, 1071–1083 (2018).
Hu, J. Bayesian Estimation of Attribute and Identification Disclosure Risks in Synthetic Data. arXiv:1804.02784 [stat] (2018).
Taub, J. et al. Differential correct attribution probability for synthetic data: An exploration. In Privacy in Statistical Databases (eds Domingo-Ferrer, J. & Montes, F.) 122–137 (Springer, 2018).
Hu, J., Reiter, J. P. & Wang, Q. Disclosure risk evaluation for fully synthetic categorical data. In Privacy in Statistical Databases (ed. Domingo-Ferrer, J.) 185–199 (Springer, 2014).
Wei, L. & Reiter, J. P. Releasing synthetic magnitude microdata constrained to fixed marginal totals. Stat. J. IAOS 32, 93–108 (2016).
Ruiz, N., Muralidhar, K. & Domingo-Ferrer, J. On the privacy guarantees of synthetic data: A reassessment from the maximum-knowledge attacker perspective. In Privacy in Statistical Databases (eds Domingo-Ferrer, J. & Montes, F.) 59–74 (Springer, 2018).
Reiter, J. P. Releasing multiply imputed, synthetic public use microdata: An illustration and empirical study. J. R. Stat. Soc. Ser. A (Statistics in Society) 168, 185–205 (2005).
Zhang, Z. et al. Ensuring electronic medical record simulation through better training, modeling, and evaluation. J. Am. Med. Inform. Assoc. https://doi.org/10.1093/jamia/ocz161 (2021).
Zhang, Z. et al. SynTEG: A framework for temporal structured electronic health data simulation. J. Am. Med. Inform. Assoc. https://doi.org/10.1093/jamia/ocaa262 (2020).
Goncalves, A. et al. Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol. 20, 108 (2020).
Hilprecht, B., Härterich, M. & Bernau, D. Monte Carlo and reconstruction membership inference attacks against generative models. Proc. Priv. Enhanc. Technol. 2019, 232–249 (2019).
Taub, J., Elliot, M. & Sakshaug, W. The impact of synthetic data generation on data utility with application to the 1991 UK samples of anonymised records. Trans. Data Priv. 13, 1–23 (2020).
Drechsler, J. et al. A new approach for disclosure control in the IAB establishment panel—Multiple imputation for a better data access. AStA Adv. Stat. Anal. 92, 439–458 (2008).
Loong, B. & Rubin, D. B. Multiply-imputed synthetic data: Advice to the imputer. J. Off. Stat. 33, 1005–1019 (2017).
Loong, B. et al. Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS. Stat. Med. 32, 4139–4161 (2013).
Reiter, J. Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29, 181–188 (2003).
van der Ploeg, T., Austin, P. C. & Steyerberg, E. W. Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. BMC Med. Res. Methodol. 14, 137 (2014).
CEO Life Sciences Consortium. Share, Integrate & Analyze Cancer Research Data. Project Data Sphere. https://projectdatasphere.org/projectdatasphere/html/home. Accessed 11 July 2019 (2019).
Alberts, S. R. et al. Effect of oxaliplatin, fluorouracil, and leucovorin with or without cetuximab on survival among patients with resected stage III colon cancer: A randomized trial. JAMA 307, 1383–1393 (2012).
El-Hussuna, A. et al. Extended right-sided colon resection does not reduce the risk of colon cancer local-regional recurrence: Nationwide population-based study from Danish Colorectal Cancer Group Database. Dis. Colon Rectum 6, 10–1097 (2022).
Chen, H., Cohen, P. & Chen, S. How big is a big odds ratio? Interpreting the magnitudes of odds ratios in epidemiological studies. Commun. Stat.-Simul. Comput. 39, 860–864 (2010).
Schäfer, T. & Schwarz, M. A. The meaningfulness of effect sizes in psychological research: Differences between sub-disciplines and the impact of potential biases. Front. Psychol. 10, 113 (2019).
Song, F. et al. Dissemination and publication of research findings: An updated review of related biases. Health Technol. Assess. 14, 1–220 (2010).
Demidenko, E. Sample size determination for logistic regression revisited. Stat. Med. 26, 3385–3397 (2007).
Hsieh, F. Y., Bloch, D. A. & Larsen, M. D. A simple method of sample size calculation for linear and logistic regression. Stat. Med. 17, 1623–1634 (1998).
Collins, G. S. et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMJ 350, g7594 (2015).
Christodoulou, E. et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12–22 (2019).
Dankar, F. K. & Ibrahim, M. Fake it till you make it: Guidelines for effective synthetic data generation. Appl. Sci. 11, 2158. https://doi.org/10.3390/app11052158 (2021).
Dahdaleh, F. S. et al. Obstruction predicts worse long-term outcomes in stage III colon cancer: A secondary analysis of the N0147 trial. Surgery 164, 1223–1229 (2018).
Maclagan, L. C. et al. The CANHEART health index: A tool for monitoring the cardiovascular health of the Canadian population. CMAJ 186, 180–187 (2014).
Azizi, Z. et al. A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health. Sci. Rep. 13, 11540. https://doi.org/10.1038/s41598-023-38457-3 (2023).
European Society of Coloproctology Collaborating Group. Predictors for anastomotic leak, postoperative complications, and mortality after right colectomy for cancer: Results from an International Snapshot Audit. Dis. Colon Rectum 63, 606–618 (2020).
2017 and 2015 European Society of Coloproctology (ESCP) collaborating groups. The impact of conversion on the risk of major complication following laparoscopic colonic surgery: An international, multicentre prospective audit. Colorectal Dis. 20 (Suppl 6), 69–89 (2018).
Reiter, J. Using CART to generate partially synthetic, public use microdata. J. Off. Stat. 21, 441–462 (2005).
Drechsler, J. & Reiter, J. P. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55, 3232–3243 (2011).
Arslan, R. C. et al. Using 26,000 diary entries to show ovulatory changes in sexual desire and behavior. J. Pers. Soc. Psychol. 121, 410–431 (2021).
Bonnéry, D. et al. The promise and limitations of synthetic data as a strategy to expand access to state-level multi-agency longitudinal data. J. Res. Educ. Effect. 12, 616–647 (2019).
Sabay, A. et al. Overcoming small data limitations in heart disease prediction by using surrogate data. SMU Data Sci. Rev. 1, 12 (2018).
Freiman, M., Lauger, A. & Reiter, J. Data Synthesis and Perturbation for the American Community Survey at the U.S. Census Bureau. US Census Bureau. https://www.census.gov/library/working-papers/2018/adrm/formal-privacy-synthetic-data-acs.html. Accessed 24 Feb 2020 (2017).
Nowok, B. Utility of Synthetic Microdata Generated Using Tree-Based Methods. https://unece.org/statistics/events/SDC2015 (Helsinki, 2015).
Nowok, B., Raab, G. M. & Dibben, C. Providing bespoke synthetic data for the UK longitudinal studies and other sensitive data with the synthpop package for R. Stat. J. IAOS 33, 785–796 (2017).
Quintana, D. S. A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis generation. eLife 9, e53275 (2020).
Little, C., Elliot, M., Allmendinger, R. et al. Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study. Vol. 17. https://unece.org/statistics/documents/2021/12/working-documents/generative-adversarial-networks-synthetic-data. (United Nations Economic Commission for Europe, 2021).
Hernandez, M. et al. Synthetic data generation for tabular health records: A systematic review. Neurocomputing 493, 28–45 (2022).
Jacobs, F. et al. Opportunities and challenges of synthetic data generation in oncology. JCO Clin. Cancer Inform. 3, e2300045 (2023).
Ghosheh, G. O., Li, J. & Zhu, T. A survey of generative adversarial networks for synthesizing structured electronic health records. ACM Comput. Surv. 56, 1471–14734 (2024).
Chin-Cheong, K., Sutter, T. & Vogt, J.E. Generation of Heterogeneous Synthetic Electronic Health Records using GANs. https://doi.org/10.3929/ethz-b-000392473 (2019).
Choi, E., Biswal, S., Malin, B. et al. Generating Multi-Label Discrete Patient Records Using Generative Adversarial Networks. arXiv:1703.06490 [cs] (2017).
Yan, C., Zhang, Z., Nyemba, S. et al. Generating Electronic Health Records with Multiple Data Types and Constraints. arXiv:2003.07904 [cs, stat] (2020).
Bühlmann, P. & Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. Stat. Sci. 22, 477–505 (2007).
Ke, G., Meng, Q., Finley, T. et al. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems (Guyon, I., Luxburg, U.V., Bengio, S. et al. eds.). Vol. 30. 3146–3154. http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf. Accessed 15 Oct 2020 (Curran Associates, Inc., 2017).
Snoek, J., Larochelle, H. & Adams, R.P. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems. Vol. 2. 2951–2959. https://papers.nips.cc/paper_files/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html (Curran Associates Inc., 2012).
Jones, M. C. Simple boundary correction for kernel density estimation. Stat. Comput. 3, 135–146 (1993).
Xu, L., Skoularidou, M., Cuesta-Infante, A. et al. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems (Wallach, H., Larochelle, H., d’Alche-Buc, F. et al. eds.). 7335–7345. https://papers.nips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html. Accessed 2 Oct 2021 (Curran Associates, Inc., 2019).
Bourou, S. et al. A review of tabular data synthesis using GANs on an IDS dataset. Information 12, 375 (2021).
Mirza, M. & Osindero, S. Conditional Generative Adversarial Nets. https://doi.org/10.48550/arXiv.1411.1784 (2014).
Xu, L., Skoularidou, M., Cuesta-Infante, A. et al. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems. https://papers.nips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html (2019).
El Kababji, S., Mitsakakis, N., Fang, X. et al. Evaluating the utility and privacy of synthetic breast cancer clinical trial datasets. JCO CCI (accepted).
El Emam, K., Mosquera, L. & Fang, X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 5, ooac083 (2022).
Cancer of the Colon and Rectum-Cancer Stat Facts. SEER. https://seer.cancer.gov/statfacts/html/colorect.html. Accessed 9 Oct 2021 (2021).
Iversen, L. H. et al. Improved survival of colorectal cancer in Denmark during 2001–2012—The efforts of several national initiatives. Acta Oncol. 55(Suppl 2), 10–23 (2016).
Burton, A. et al. The design of simulation studies in medical statistics. Stat. Med. 25, 4279–4292 (2006).
Boulesteix, A.-L., Lauer, S. & Eugster, M. J. A. A plea for neutral comparison studies in computational sciences. PLOS ONE 8, e61562 (2013).
Patki, N., Wedge, R. & Veeramachaneni, K. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). 399–410. https://doi.org/10.1109/DSAA.2016.49 (IEEE, 2016).
Yan, C., Yan, Y., Wan, Z. et al. A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models. https://doi.org/10.48550/arXiv.2208.01230 (2022).
De Cristofaro, E. A critical overview of privacy in machine learning. IEEE Secur. Privacy 19, 19–27 (2021).
Shafee, A. & Awaad, T. A. Privacy attacks against deep learning models and their countermeasures. J. Syst. Architect. 114, 101940 (2021).
Veale, M., Binns, R. & Edwards, L. Algorithms that remember: Model inversion attacks and data protection law. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 376, 20180083 (2018).
Klein, R. A. et al. Investigating variation in replicability: A “many labs” replication project. Soc. Psychol. 45, 142–152 (2014).
Camerer, C. F. et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644. https://doi.org/10.1038/s41562-018-0399-z (2018).
Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
Franklin, J. M. et al. Nonrandomized real-world evidence to support regulatory decision making: Process for a randomized trial replication project. Clin. Pharmacol. Ther. 107, 817–826 (2020).
Crown, W. et al. Can observational analyses of routinely collected data emulate randomized trials? Design and feasibility of the observational patient evidence for regulatory approval science and understanding disease project. Value Health 26, 176–184 (2023).
Yoon, D. et al. Real-world data emulating randomized controlled trials of non-vitamin K antagonist oral anticoagulants in patients with venous thromboembolism. BMC Med. 21, 375 (2023).
Wang, S. V., Schneeweiss, S. & RCT-DUPLICATE Initiative. Emulation of randomized clinical trials with nonrandomized database analyses: Results of 32 clinical trials. JAMA 329, 1376–1385 (2023).
Franklin, J. M. et al. Emulating randomized clinical trials with nonrandomized real-world evidence studies. Circulation 143, 1002–1013 (2021).
Patil, P., Peng, R. D. & Leek, J. T. What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspect. Psychol. Sci. 11, 539–544 (2016).