In our previous blog post, we explored the concept of synthetic (or external) control arms, a technique revolutionizing clinical trial design. This innovative approach leverages pre-existing data or data not initially intended for a specific study—such as data found in electronic healthcare records (EHR)—opening up new possibilities in the field. As we ventured deeper into this topic, important concerns arose, such as how to handle incomplete data and how to ensure patient privacy. Synthetic data generation is one of the most powerful tools we have at our disposal to bridge these gaps.
Synthetic data
In a world where machine learning, deep learning, generative AI, and neural networks dominate the conversation, synthetic data generation is taking center stage. As artificial intelligence continues its rapid growth, synthetic (or artificial) data is capturing the attention of tech enthusiasts and data experts alike.
Synthetic data refers to artificially generated data that mimics the statistical properties and characteristics of real-world data (RWD), and its primary purpose is to provide a substitute for RWD when privacy concerns or data limitations hinder the use of original information. It enables us to perform testing, analysis, and modeling without risking the exposure of sensitive information or violating privacy regulations.
Methods for Synthetic Data Generation
The generation process employs various techniques such as statistical or mathematical models, machine learning, or deep learning.
Mathematical models aim to produce a statistical model of the dataset or of the underlying process that explains the variation in the data. These include Gaussian process models, Monte Carlo simulations, sampling from probabilistic models, and kernel density smoothing. Advantages of this approach include the explainability of the results, computational efficiency, and the ease with which it can be combined with subject-matter knowledge. However, it may rest on incorrect assumptions or model choices and thus produce unrealistic data, finding the correct parametrization can be difficult, and complex patterns and relationships are hard to represent.
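To make this concrete, below is a minimal sketch of statistical synthesis using kernel density estimation with SciPy. The two-feature "blood pressure" dataset is entirely made up for illustration; any numeric real-world dataset arranged as (features x samples) could take its place.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy "real-world" data: two correlated clinical measurements
# (illustrative only; any numeric array of shape (n_features, n_samples) works).
rng = np.random.default_rng(seed=42)
real_data = rng.multivariate_normal(
    mean=[120.0, 80.0],                  # e.g., systolic / diastolic blood pressure
    cov=[[90.0, 40.0], [40.0, 60.0]],
    size=500,
).T                                      # gaussian_kde expects (n_dims, n_points)

# Fit a kernel density estimate of the joint distribution ...
kde = gaussian_kde(real_data)

# ... and sample new, synthetic records from it.
synthetic_data = kde.resample(size=500, seed=7).T   # shape (500, 2)

print(synthetic_data[:5])
```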
Another technique for generating synthetic datasets is machine learning, where a model ingests a real-world dataset to learn its patterns and then leverages what it has learned to create new data. Machine learning models make fewer assumptions than mathematical models. Tools in this category include decision tree models, clustering-based synthesis models, and naïve Bayesian models. Favorable aspects of machine learning are that it captures many patterns and relationships, the modeling is data-driven, and it is easy to automate and scale to vast datasets. On the other hand, real, pre-processed data is needed to train the model, the methods are computationally intensive, and they offer little explainability.
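As an illustration of the data-driven flavor of this approach, here is a hedged sketch of a simple clustering-based synthesizer built from a Gaussian mixture model in scikit-learn. The input matrix and the number of mixture components are placeholders, not recommendations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder "real" dataset: 1,000 patients x 4 numeric features
# (in practice this would be the pre-processed real-world dataset).
rng = np.random.default_rng(0)
X_real = rng.normal(size=(1000, 4))

# Learn the structure of the data as a mixture of Gaussian clusters ...
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(X_real)

# ... then draw brand-new synthetic records from the fitted model.
X_synth, _cluster_labels = gmm.sample(n_samples=1000)
print(X_synth.shape)  # (1000, 4)
```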
Deep learning, a more complex form of artificial intelligence, relies on several layers of neural networks working together iteratively to learn from large input datasets, with each layer building on the representations that emerge from the layers below it to capture how patterns arise in the data. Examples of such architectures include large language models (LLMs), generative adversarial networks (GANs), transformers, and variational autoencoders (VAEs). These techniques can learn and synthesize very complex relationships, work with the varied data types typically found in health datasets, make fewer assumptions, and can be trained to optimize for data utility and privacy simultaneously. But they come with disadvantages: they require a very large initial (and often pre-processed) dataset, may be prone to overfitting, are very computationally demanding, and have lower explainability.
Recently, GANs have stood out as one of the most favored models. They excel at generating robust synthetic data, capturing trends from real-world data without overfitting to the training samples. Overfitting occurs when the generated data closely resembles, or becomes nearly identical to, the real-world data, which poses challenges for privacy preservation because some synthetic examples might closely mimic RWD.
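To show the basic mechanics, the sketch below trains a deliberately tiny GAN on a toy standardized tabular dataset in PyTorch. Layer sizes, learning rates, and the number of training steps are arbitrary illustrative choices; a production tabular GAN (e.g., a CTGAN-style model) would need considerably more care.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy standardized "real" tabular data: 2,000 records x 6 features.
real_data = torch.randn(2000, 6)
latent_dim, n_features = 16, real_data.shape[1]

# Generator: maps random noise to a synthetic record.
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)

# Discriminator: scores how "real" a record looks.
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

batch_size = 128
for step in range(2000):
    # --- Train the discriminator on a real batch and a fake batch ---
    idx = torch.randint(0, real_data.shape[0], (batch_size,))
    real_batch = real_data[idx]
    fake_batch = generator(torch.randn(batch_size, latent_dim)).detach()

    d_loss = loss_fn(discriminator(real_batch), torch.ones(batch_size, 1)) + \
             loss_fn(discriminator(fake_batch), torch.zeros(batch_size, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- Train the generator to fool the discriminator ---
    fake_batch = generator(torch.randn(batch_size, latent_dim))
    g_loss = loss_fn(discriminator(fake_batch), torch.ones(batch_size, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, synthetic records are simply generator outputs on fresh noise.
synthetic_data = generator(torch.randn(500, latent_dim)).detach()
print(synthetic_data.shape)  # torch.Size([500, 6])
```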
In computer vision, which allows computers to interpret visual information from the world and perform tasks such as image recognition, object detection, and image segmentation, diffusion models (and particularly latent diffusion models) are the current state of the art for generating synthetic data. These are a class of generative models that capture the underlying distribution of the data and sample from it to create new, realistic examples, typically through a process of iterative refinement that yields high-quality samples.
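For a sense of what this looks like in practice, here is a hedged sketch of sampling synthetic images from a pretrained unconditional diffusion model using the Hugging Face diffusers library. The checkpoint name is illustrative; in a healthcare setting the model would typically be trained or fine-tuned on the imaging data of interest.

```python
# Hedged sketch: sampling from a pretrained diffusion model with `diffusers`.
# The model id below is an illustrative public checkpoint, not a recommendation.
from diffusers import DDPMPipeline

pipeline = DDPMPipeline.from_pretrained("google/ddpm-cat-256")

# Each call runs the iterative denoising (refinement) process from pure noise
# down to realistic image samples.
result = pipeline(batch_size=4, num_inference_steps=1000)
for i, image in enumerate(result.images):
    image.save(f"synthetic_sample_{i}.png")
```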
Applications of Synthetic Data in Healthcare
As we mentioned before, creating an external control arm directly from RWD has its advantages but comes with concerns about the use of such data. Synthetic data finds diverse applications in the healthcare domain and can help mitigate these challenges. Among the key use cases, training machine learning models and protecting privacy stand out. Several groups have used synthetically generated data to augment real data, upsampling rare events or patterns and enhancing the accuracy and diversity of AI models. Synthetic data is also valuable for testing software before accessing RWD, as it allows scientists to refine their code without compromising privacy or wasting time.
- Protecting privacy: Patient information is highly sensitive, and traditional de-identification methods may not provide foolproof protection against privacy leaks. One of the solutions is generating synthetic data that reproduces populations without direct links to individuals in real samples. Synthetic data can significantly reduce identity disclosure risk when implemented correctly, offering greater protection than real population datasets. This safeguarding of privacy can enhance patient trust and confidence in data-sharing practices.
- Promoting data sharing: Regulatory and ethical concerns can hinder data sharing in healthcare, leading to dataset access and approval delays. Synthetic data presents an attractive alternative, mimicking real datasets while preserving valuable information such as feature correlations and parameter distributions. This data can be leveraged for statistical modeling, hypothesis-generating studies, and educational purposes.
- Data augmentation: In medical applications, limited data size is a common challenge due to the involvement of highly trained experts in data collection and annotation. Synthetic data generation is a powerful data augmentation technique, increasing the size of datasets without additional real data collection. Combining synthetic data with RWD during ML model training allows healthcare professionals to optimize statistical information extraction and improve diagnostic accuracy, ultimately benefiting patient care.
- Increasing representation: ML algorithms may exhibit bias when trained on datasets with imbalanced classes, leading to poor performance for underrepresented populations. By incorporating synthetic data from underrepresented groups, ML models can improve performance for each subgroup, ultimately leading to more equitable and effective healthcare solutions (a minimal sketch of this idea follows this list).
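The sketch below illustrates the augmentation and representation ideas together: a simple generative model is fitted to an underrepresented class in a toy dataset, synthetic records are added to balance the training set, and a classifier is compared with and without the synthetic data. All data, model choices, and numbers are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Toy imbalanced dataset: 950 "majority" vs. 50 "minority" patients.
X_major = rng.normal(loc=0.0, size=(950, 5))
X_minor = rng.normal(loc=1.5, size=(50, 5))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 950 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit a simple generative model on the minority class only ...
gmm = GaussianMixture(n_components=1, random_state=0).fit(X_train[y_train == 1])

# ... and generate synthetic minority records to balance the training set.
n_needed = (y_train == 0).sum() - (y_train == 1).sum()
X_synth, _ = gmm.sample(n_samples=n_needed)

X_aug = np.vstack([X_train, X_synth])
y_aug = np.concatenate([y_train, np.ones(n_needed, dtype=int)])

for name, (Xt, yt) in {"real only": (X_train, y_train),
                       "real + synthetic": (X_aug, y_aug)}.items():
    clf = LogisticRegression(max_iter=1000).fit(Xt, yt)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: balanced accuracy = {score:.3f}")
```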
Some organizations already offer synthetic datasets, such as Simulacrum. This project, in particular, offers synthetic cancer data that imitates some of the data held securely by the National Cancer Registration and Analysis Service (NCRAS) within NHS Digital in the United Kingdom. The Simulacrum looks and feels like the real cancer data held within NCRAS but does not contain any real patient information. Anyone can use it to learn more about cancer in England without compromising patient privacy.
Pros and Cons of Synthetic Data in Healthcare
Synthetic data offers several significant benefits. It minimizes constraints associated with regulated or sensitive data, facilitates customization to match conditions that RWD may not allow, and enables the generation of large training datasets without manual labeling. Moreover, synthetic data helps address privacy concerns and reduces bias compared to RWD. However, it is important to note that the quality of synthetic data is highly dependent on the quality and amount of the original data and the data generation model. Additionally, synthetic data may not capture outliers present in the real world and can reflect biases inherent in the original data.
Another important consideration is the potential for mode collapse in generative models used to create synthetic data. These models are designed to capture the underlying distribution of the original data and generate new samples from it. However, mode collapse can occur when the model focuses on only a few modes, resulting in a lack of diversity in the synthetic samples. Ensuring a diverse and representative training dataset is crucial to address this issue, as are regularization techniques and approaches such as mixing data sources. Combining these strategies helps mitigate the risk of mode collapse and supports a richer and more realistic synthetic data generation process. In other words, the quality of the initial training dataset directly determines the quality of the outcome.
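One hedged way to monitor this in practice is to compare how real and synthetic samples populate the modes of the real data, for example via cluster occupancy. The sketch below uses deliberately "collapsed" placeholder data to show what the warning sign looks like.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder real data with three clear modes.
real = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(300, 2)) for c in (-4.0, 0.0, 4.0)
])
# Placeholder "collapsed" synthetic data that reproduces only one mode.
synthetic = rng.normal(loc=0.0, scale=0.5, size=(900, 2))

# Define the modes from the real data ...
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(real)

# ... then check how each dataset spreads across them.
real_share = np.bincount(kmeans.predict(real), minlength=3) / len(real)
synth_share = np.bincount(kmeans.predict(synthetic), minlength=3) / len(synthetic)

print("real mode occupancy:     ", np.round(real_share, 2))
print("synthetic mode occupancy:", np.round(synth_share, 2))
# A large mismatch (e.g., one mode holding nearly all synthetic samples)
# is a warning sign of mode collapse.
```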
Addressing the challenges
Evaluating the quality of synthetic medical data is vital. The focus should be on three key aspects: fidelity, diversity, and generalization. Fidelity examines the resemblance between synthetic data and RWD, assessing whether the two can be distinguished and whether the same population-level inferences can be drawn from either. Diversity explores how well the synthetic data covers the entire real-world population. Generalization relates to privacy, determining whether the synthetic samples are mere replicas of RWD.
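As an illustration, the sketch below implements two common, simple checks on placeholder data: a classifier two-sample test for fidelity (an AUC near 0.5 means real and synthetic records are hard to tell apart) and a nearest-neighbor distance check related to generalization (synthetic records that sit almost on top of real ones may be memorized copies). The data and thresholds are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 6))       # placeholder real data
synthetic = rng.normal(size=(1000, 6))  # placeholder synthetic data

# --- Fidelity: can a classifier distinguish real from synthetic? ---
X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))
auc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="roc_auc",
).mean()
print(f"real-vs-synthetic AUC: {auc:.2f}  (close to 0.5 = high fidelity)")

# --- Generalization / privacy: are synthetic records near-copies of real ones? ---
nn = NearestNeighbors(n_neighbors=1).fit(real)
distances, _ = nn.kneighbors(synthetic)
print(f"median distance to closest real record: {np.median(distances):.3f}")
# Distances near zero for many synthetic records suggest memorization of RWD.
```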
Privacy protection is crucial, and various metrics can be employed to assess the privacy risk of synthetic datasets. To strike a balance between privacy and transparency, decisions must be made on what aspects of the generation process to share publicly, as releasing fully trained models may increase privacy risks. One proposed alternative is federated learning, which allows synthetic data creation from multiple sites while keeping sensitive RWD local. Differential privacy is another approach that provides a predictable degree of privacy protection, but its implementation can be challenging, and the resulting reduction in data utility can vary.
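To give a flavor of differential privacy, here is a minimal sketch of the Laplace mechanism applied to a single count query. The sensitivity argument holds only for this one query, the epsilon value is illustrative, and a real deployment would normally rely on a vetted DP library rather than hand-rolled noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, threshold, epsilon):
    """Release a differentially private count of values above a threshold.

    Adding or removing one patient changes the count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = int(np.sum(np.asarray(values) > threshold))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Placeholder cohort: ages of 500 patients.
ages = rng.integers(18, 90, size=500)

print("true count of patients over 65:", int(np.sum(ages > 65)))
print("privatized count (epsilon=1.0):", round(dp_count(ages, 65, epsilon=1.0), 1))
```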
Avoiding bias magnification from RWD is another big concern. Synthetic data may inherit biases from the underlying real-world dataset and potentially amplify them. Evaluating bias and fairness in the dataset before release is essential, ensuring that underrepresented groups are not ignored and that correlations are not mistaken for causation.
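A simple, hedged starting point is to compare subgroup shares between the real and synthetic datasets, as in the sketch below; the group labels and proportions are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder datasets with a sensitive attribute (e.g., self-reported ethnicity).
groups = ["A", "B", "C"]
real = pd.DataFrame({"group": rng.choice(groups, p=[0.70, 0.20, 0.10], size=2000)})
synthetic = pd.DataFrame({"group": rng.choice(groups, p=[0.85, 0.13, 0.02], size=2000)})

# Compare subgroup shares: a shrinking minority share in the synthetic data
# is a sign that the generator is amplifying the original imbalance.
comparison = pd.DataFrame({
    "real_share": real["group"].value_counts(normalize=True),
    "synthetic_share": synthetic["group"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```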
Balancing these aspects is crucial in creating high-quality and privacy-protected synthetic data, ensuring its potential is harnessed responsibly in healthcare.
The Promise of Synthetic Data in Healthcare
While the use of synthetic data has yet to be adopted broadly in healthcare and clinical research, its successful implementation in other industries, such as the finance sector, indicates its potential. The quick uptake in finance can be attributed to the less severe implications of errors, whereas healthcare requires a more cautious approach due to the possible impact on patient health. Nevertheless, as technological progress addresses these challenges, the future is rich with opportunities. Synthetic data can revolutionize healthcare research, strengthen privacy measures, enhance model training, and pave the way for many more advancements.