Reflections on the State of AI in Drug Discovery Today
Some analyses of last year’s progress in AI-powered drug discovery indicate that none of the molecules fully generated by artificial intelligence (AI) were taken from discovery through clinical development. Although AI has expedited the process of identifying drug targets and discovering novel molecules to interact with them, the outcomes of some clinical trials have been underwhelming. In 2023, molecules discovered through AI did not meet expectations in clinical trials for conditions such as atopic dermatitis, schizophrenia, and cancer.
AI remained the headline at this year’s J.P. Morgan Healthcare Conference and its satellite events. While the growing sentiment is that AI will make drug development faster, not cheaper, primarily because of the significant expense associated with software development and computational power, hopes are high for AI’s role in research and development in life sciences.
During a BCG-hosted breakfast panel on generative AI (genAI) in healthcare, Parminder Bhatia, the Chief AI Officer at GE Healthcare, highlighted the similarities between the pivotal role genAI is playing right now and the role cloud computing played in the last decade. He and other panelists – including Microsoft’s CTO for Healthcare & Life Sciences, John Doyle – expressed enthusiasm for the potential coming from multimodal models. At the same time, they acknowledged the challenge of keeping up with an onslaught of information. Jean-Philippe Vert, Chief R&D Officer at Owkin, pointed out that most publications are not peer-reviewed; anyone can publish their results online, and it takes deep expertise to separate signal from noise when evaluating what’s real in AI development in healthcare and life sciences.
As the tensions between hype and skepticism continue to play out, organizations coalesce to develop frameworks for responsible AI, especially in light of discussion papers on the use of AI in the pharmaceutical product lifecycle coming from regulators in Europe and the US.
In search of a pragmatic point of view, we sat down with Juan Manuel Domínguez Correa, the Head of Drug Discovery and Biostatistics at Topazium, to zoom in on target identification as an opportunity for practical AI use cases in drug discovery.
Basia Coulter (BC): Juan Manuel, before we talk about the application of AI in target identification, it might be useful to define what we mean by target identification.
Juan Manuel Domínguez (JMD): In drug discovery, target identification is often used to describe two distinct processes. The first interpretation of target identification refers to whether a naturally occurring biomolecule plays a role in a particular condition or disease. Such a biomolecule, or “target,” can then be used to develop new drugs interacting with it. This is the traditional understanding of target identification.
However, there is another process often called target identification, which I prefer to call target deconvolution. Target deconvolution involves identifying a target for a drug previously discovered through a process known as black box screening. This includes all types of phenotypic screening, from the most rudimentary ones looking at cell survival or death to the most sophisticated ones based on confocal microscopy to observe the cell’s response. The challenge here is to identify which specific protein, DNA, or other biomolecule within the cell is affected by the discovered drug molecule, leading to the observed phenotypic changes. This identification process is distinct from the traditional target identification concept. It is especially relevant for drugs that are already on the market but whose exact targets are still unknown.
BC: When we talk about drug discovery, most people probably immediately think about finding new molecules that can be used as drugs. Yet here, we are not talking about discovering drugs per se. Rather, we’re talking about discovering molecules that a drug could target. Why is that so important?
JMD: There are over 20,000 genes in the human genome, which can give rise to over a million potential targets through gene expression regulation processes such as transcription, alternative splicing, post-translational modifications, and other mechanisms. When considering these potential targets’ three-dimensional (3D) structure, only about 5,000 display “pockets” or grooves where small molecules can bind. Such “pockets” can be leveraged to bind small drug molecules, making the targets that possess them “druggable.” Then, out of those 5,000 potentially druggable targets, only about 800 (under 20%) are targeted by drugs currently on the market, which means that over 80% of druggable targets have yet to be matched with molecules to bind. Any protein* among those druggable targets that plays a key role in a disease pathway could, therefore, be accessible to drugs yet to be discovered.
[*Footnote: For simplicity, we use the term “target” to refer to a protein, although DNA and/or RNA can also be targets for drug molecules. Ultimately, it’s the function of a protein that is altered by the binding of drug molecules (whether directly or by way of DNA/RNA), and it’s a protein that’s a functional target for a drug.]
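The arithmetic behind those proportions can be checked with a minimal Python sketch. It uses only the rounded figures quoted above (about 5,000 druggable targets, about 800 already targeted); the variable names are illustrative, not taken from any tool mentioned in the interview.

```python
# Approximate figures as quoted in the interview; both are rounded estimates.
druggable_targets = 5_000   # targets with a pocket where a small molecule can bind
targeted_by_drugs = 800     # targets addressed by drugs currently on the market

# Druggable targets that no marketed drug addresses yet.
untargeted = druggable_targets - targeted_by_drugs

share_targeted = targeted_by_drugs / druggable_targets
share_untargeted = untargeted / druggable_targets

print(f"Targeted:   {targeted_by_drugs} ({share_targeted:.0%})")
print(f"Untargeted: {untargeted} ({share_untargeted:.0%})")
```

With these inputs, roughly 16% of druggable targets are already addressed and about 4,200 (84%) remain open, which is consistent with the "under 20%" and "over 80%" figures in the answer.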
BC: So you’re saying that we know, through a combination of inference and empirical evidence, that there must exist over 4,000 potentially druggable targets for which we have yet to discover drugs. That takes me straight to discovering drug molecules, then. Why do we need to spend more time on target identification?
JMD: Earlier, we spoke about the traditional understanding of target identification as finding out whether a naturally occurring biomolecule exists that plays a role in a particular condition. In other words, target identification in this context is about understanding the connection between a biomolecule (for example, a protein) and a disease, because a biomolecule can only be a meaningful target for a drug if it contributes to the development of the disease. When such a connection exists, targeting that protein with a drug molecule can disrupt the disease pathway.
Target identification, the first step in drug discovery, therefore begins with looking for clues that might link a given protein with a particular disease pathway.
BC: MalaCards, a human disease database, includes over 22,000 disease entries. That number, combined with a total of 5,000 potentially druggable targets and over 4,000 still not targeted by a drug molecule, adds up to a large number of possible combinations to go hunting for clues. How do we do that?
JMD: Clues are initially found through so-called “dry” studies that may include reviewing foundational science or biomedical literature and conducting bioinformatics or epidemiology studies.
Given the number of potential targets combined with the number of diseases and possible ways in which altered protein function may lead to diseases, finding clues in dry studies is time-consuming and effort-intensive. Traditionally, dry studies are conducted by researchers who read many scientific papers and/or conduct analyses of available biological or healthcare data through bioinformatics or epidemiology studies. Those clues, or hypotheses, about the involvement of a given protein in developing a particular disease must subsequently be tested in the laboratory through so-called “wet studies.”
BC: As a former academic, I am very familiar with the level of effort that goes into reviewing scientific literature. How can AI support researchers in that effort?
JMD: That’s the first opportunity to leverage AI, particularly generative AI (genAI) and large language models (LLMs). LLMs are advanced artificial intelligence systems designed to understand, generate, and interact with human language. They are “large” because they are trained on vast amounts of text data, enabling them to “grasp” a wide range of linguistic patterns, styles, and concepts. They are tools perfectly suited for combing through scientific literature or other large datasets in search of clues linking proteins to disease pathways.
Imagine you’re looking for proteins potentially involved in a disease pathway for hereditary ataxia. Rather than have a team of people read papers connecting ataxia to genes, gene expression, and protein malfunction, you could deploy an LLM to comb through literature and retrieve key findings, starting with simple information such as the number of times a given gene is mentioned in association with hereditary ataxia. Not only can LLMs read and analyze thousands of papers quickly, but they also don’t tire, so they can deliver reliable output regardless of the amount of input they analyze.
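To make the simplest version of that idea concrete, here is a minimal Python sketch of co-mention counting. It is a plain keyword baseline, not an LLM pipeline and not a method described by either speaker. The abstracts, the placeholder gene names (GENE_A to GENE_D), and the function name are all hypothetical and were written only for this illustration.

```python
import re
from collections import Counter

# Hypothetical toy abstracts written for this sketch; they are not real papers.
ABSTRACTS = [
    "Variants in GENE_A were found in families with hereditary ataxia.",
    "Loss of GENE_B function is reported in a form of hereditary ataxia.",
    "GENE_A interacts with GENE_D in cultured neurons.",
    "We review GENE_A and GENE_C in the context of hereditary ataxia.",
]
GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # placeholder gene symbols
DISEASE = "hereditary ataxia"


def count_co_mentions(abstracts, genes, disease):
    """Count the abstracts that mention each gene together with the disease term."""
    counts = Counter()
    disease_re = re.compile(re.escape(disease), re.IGNORECASE)
    for text in abstracts:
        if not disease_re.search(text):
            continue  # skip abstracts that never mention the disease
        for gene in genes:
            # Word boundaries keep GENE_A from matching a longer symbol.
            if re.search(rf"\b{re.escape(gene)}\b", text):
                counts[gene] += 1
    return counts


counts = count_co_mentions(ABSTRACTS, GENES, DISEASE)
for gene in GENES:
    print(gene, counts[gene])
```

The sketch counts abstracts rather than individual mentions, and it treats any co-occurrence as an association. A real literature-mining workflow would also need to handle gene synonyms, negation, and the difference between a passing mention and an actual finding, which is where language models are expected to add value over keyword matching.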
BC: You just mentioned reliable output. However, LLMs are known to be creative in their output or to hallucinate, and we need to acknowledge that this characteristic of LLMs runs counter to the rigor of scientific inquiry. Let me interject briefly to say that progress is being made in reducing the likelihood of LLM hallucination episodes. One such method involves complementing LLMs with symbolic models that are a source of predictability. At Globant, we have also been leveraging advanced retrieval-augmented generation (RAG), including combining LLMs with knowledge graphs. We also leverage open-source models.
But you also mentioned bioinformatics and epidemiology studies before getting back to target identification and opportunities to leverage AI. Can we talk a little more about those?
JMD: Bioinformatics and epidemiology studies involve analyzing large amounts of biological or healthcare data, typically applying computational, mathematical, and statistical methods. Through those studies, we can find clues connecting different biomolecule types to particular disease pathways. These types of studies play a role both in target identification in the traditional sense (finding out whether there exists a naturally occurring biomolecule that plays a role in a particular condition) and in target deconvolution (identifying targets for drugs that are already on the market but whose exact targets are still unknown). And since bioinformatics and epidemiology studies involve large volumes of data, they also benefit from applying AI tools.
At Topazium, for example, we have developed several such algorithms. One of them is a machine learning framework (MLF) that can analyze large amounts of clinical and genetic information to identify genetic biomarkers associated with poor survival rates. It can also find potential new therapeutic targets. Another AI tool we developed is based on graph neural networks that can traverse a large map of molecular interactions within the cell to explore interactions of known drugs and their natural receptors. This may unveil novel, previously unknown points of intervention that could become suitable targets.
BC: We’ve mentioned two important categories of use cases where AI can aid researchers in meaningful ways in the context of target identification: first, analyzing, retrieving, and summarizing insights from large amounts of text-based data such as scientific literature and second, making predictions about relationships between biomolecules and human physiology, including disease pathways.
I want to mention another use case – leveraging LLMs as copilots in data analysis, for example, in biostatistical analysis such as that conducted in epidemiology studies. At Globant, we have long been using AI to build tools and platforms that assist software engineers in writing code faster. With the rise of LLMs, such tools have become even more powerful. For example, copilots that can assist in writing Python code are available on GitHub. We can further enhance them to make them domain-specific; we can build and deploy copilots for CDISC integrations or for the generation of ADaM datasets used in clinical trial data analysis. These tools will help process and analyze biomedical data faster and more efficiently.
JMD: Yes. And this last application you mentioned is exciting because it enables decoupling access to insights from the ability or skill level of the user. Ultimately, human researchers’ expertise and judgment are paramount in deciding which lines of inquiry to pursue, and nothing can replace the creative power of the human brain that is the bedrock of research and sciences. However, there is a lot of groundwork in the research process that less trained, less senior researchers could do. I think that the initial clue-finding falls under that category and that tools such as LLM-powered copilots could make participation in research more accessible to people with less training.
BC: Let’s talk briefly about what happens once a clue is found in the target identification process.
JMD: Clues found in dry studies must be confirmed in “wet” laboratory experiments. The first line of confirmatory evidence may come from in vitro experiments that compare the expression patterns of the target protein between healthy and sick cells and tissues. Once evidence linking a target and a disease is found, the second line of investigation is conducted in cellular models or even in vivo through forward or reverse genetics.
In the forward approach, random mutations are induced in an organism (a cell line or simple animals like flies), followed by screening for phenotypes of interest (for example, a specific disease or condition). Once the phenotype of interest is identified, researchers use techniques (like genetic mapping) to identify the gene or genes responsible for the phenotype. In the reverse approach, researchers start with a gene of interest and then try to determine what phenotype, if any, results from mutating or deleting that gene. Another way of confirming a link between a target and a disease is through wide genomic and proteomic studies that look at differences in the expression patterns of the target protein between healthy and sick tissues.
BC: It sounds like there are more opportunities for leveraging AI in data analysis, especially when discussing those broad genomic and proteomic studies.
JMD: Exactly. As I mentioned earlier, at Topazium, we have developed a machine learning framework that analyzes genome sequencing data to create a synthetic representation of patients in a latent space by capturing their most relevant genetic features. Combining insights from such a synthetic representation with clinical information from the same patient population points to conclusions that can be used to identify novel therapeutic targets. Other applications of this approach include finding unexplored biomarkers, identifying patterns that may help select the most appropriate treatment for each patient in personalized medicine, or stratifying patients for clinical trials.
BC: Pre-processing of data comes to mind as a practical opportunity for the use of AI in the analysis of large volumes of data. Biomedical data collected in the real world is notoriously noisy and messy, and data scientists spend more time making data usable for analysis than actually analyzing the data. LLMs can be effectively used to preprocess and clean data. They can be trained to understand and identify data inconsistencies, errors, or anomalies, such as missing values, outliers, or incorrect formats. They can be used to standardize and normalize data, ensuring that it is in a suitable format for analysis; for instance, they can convert text data into numerical data or categorize data into predefined classes. So, there is a lot of opportunity to use AI, especially genAI, to automate preprocessing and cleaning tasks. That brings us back to the practical role AI can play today in bringing efficiencies into research processes.
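The kinds of checks named above (missing values, implausible outliers, inconsistent formats) can be illustrated with a minimal, rule-based Python sketch. It does not use an LLM; it only shows the sort of deterministic checks an LLM-assisted cleaning step could be paired with or validated against. The records, the field names, the `clean` helper, and the age threshold of 120 are all assumptions made for this example.

```python
# Hypothetical patient records invented for this sketch.
records = [
    {"patient_id": "P01", "age": "54", "sex": "F"},
    {"patient_id": "P02", "age": "", "sex": "male"},
    {"patient_id": "P03", "age": "61", "sex": "M"},
    {"patient_id": "P04", "age": "530", "sex": "f"},
    {"patient_id": "P05", "age": "47", "sex": "Female"},
]

# Map the free-text variants seen above onto a predefined set of classes.
SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean(record, max_age=120):
    """Return a cleaned copy of the record and a list of issues found in it."""
    issues = []
    cleaned = dict(record)

    raw_age = record["age"].strip()
    if not raw_age:
        cleaned["age"] = None
        issues.append("missing age")
    elif not raw_age.isdigit():
        cleaned["age"] = None
        issues.append("age not numeric")
    elif int(raw_age) > max_age:
        cleaned["age"] = None  # flag the outlier rather than guess a correction
        issues.append("implausible age")
    else:
        cleaned["age"] = int(raw_age)  # convert text to a number

    sex = SEX_CODES.get(record["sex"].strip().lower())
    if sex is None:
        issues.append("unknown sex code")
    cleaned["sex"] = sex
    return cleaned, issues


results = [clean(r) for r in records]
for cleaned, issues in results:
    print(cleaned["patient_id"], cleaned["age"], cleaned["sex"], issues)
```

Flagged values are set to None instead of being corrected automatically, so that a person (or a downstream tool) decides how to handle them.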
JMD: That’s a good point. I think that while our eyes are on the future potential of AI, including genAI, and the role it will play in modeling proteins or designing drugs, and while there are benefits resulting from AI’s predictive power that we can realize today, the greatest immediate opportunity at scale comes from the efficiencies that tools such as LLMs can inject into the traditionally slow and effort-intensive processes in biomedical research.
BC: Thank you, Juan Manuel. It’s been a pleasure speaking with you. Personally, while I recognize there is a lot more work to be done from the regulatory perspective and to address ethical concerns, I am very excited about the immediate opportunities we have today to make biomedical research easier, faster, and more accessible. Patients are waiting, and we have no time to lose.