Beyond Probabilities of Word Sequence: LLMs as Foundations for Analytical Tools in Pharma R&D and RWE

January 30, 2025

The explosion of large language models (LLMs) has fueled transformative applications across industries, most visibly intelligent chatbots and the ability to “chat” with data and documents. Beneath these widely discussed use cases lie deeper capabilities. Feature selection, zero-shot and few-shot learning, and relationship extraction with semantic reasoning make LLMs invaluable for building advanced analytical and predictive applications, particularly in pharmaceutical R&D and real-world evidence (RWE) research. These capabilities make LLMs more than models that predict the probability of the next word in a sequence: they allow LLMs to uncover patterns, infer relationships, and support complex decision-making, ultimately enabling the development of analytical and reasoning tools.

Let’s dive into these capabilities, breaking down what they mean, why they matter, and how they can be harnessed to solve critical challenges in pharma.

Feature Selection: Prioritizing the Right Variables

Feature selection refers to identifying the most relevant variables (or features) in a dataset for building predictive models. LLMs excel at parsing and analyzing vast datasets, leveraging their ability to detect patterns and associations within textual or structured data to highlight the variables most critical for analysis.

Feature selection is crucial when building disease prediction models. Consider, for instance, detecting rare diseases in under-diagnosed or misdiagnosed populations. Traditional predictive models often require labor-intensive manual curation of potential predictors from structured datasets, such as electronic health records (EHRs). By applying LLMs to unstructured clinical notes, researchers can automatically surface features like symptom patterns, biomarkers, comorbidities, etc., to build more accurate and interpretable models for disease prediction. 
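To make this concrete, here is a minimal Python sketch of prompting an LLM to surface candidate predictors from de-identified note excerpts. It assumes an OpenAI-compatible chat API; the model name and the note excerpts are illustrative placeholders, not real patient data.

```python
# Minimal sketch: asking an LLM to surface candidate predictors from clinical notes.
# Assumes the `openai` Python package and an OpenAI-compatible endpoint; the model
# name and the note excerpts below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

note_excerpts = [
    "45 y/o female, recurrent sinopulmonary infections, low serum IgG, fatigue.",
    "Chronic diarrhea, weight loss, prior IBS diagnosis, bronchiectasis on CT.",
]

prompt = (
    "You are assisting with feature selection for a rare-disease prediction model.\n"
    "From the clinical note excerpts below, list up to 10 candidate predictor "
    "variables (symptoms, biomarkers, comorbidities), one per line, most "
    "informative first.\n\n" + "\n".join(f"- {note}" for note in note_excerpts)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

candidate_features = [
    line.lstrip("- ").strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
print(candidate_features)
```

In practice, the surfaced candidates would still be cross-checked against structured EHR fields and assessed with conventional statistical feature-selection methods before entering a production model.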

Similarly, LLMs can analyze omics datasets, such as genomics or proteomics, to identify key molecular markers correlated with disease progression, streamlining the process of identifying therapeutic targets. Moreover, LLMs can assist in patient stratification by analyzing multidimensional datasets to identify subpopulations based on genetic, clinical, or demographic factors, enabling more targeted and effective clinical trials.

A recent study suggests that LLMs perform on par with traditional feature selection techniques, indicating that LLM-powered feature selection is a viable way to accelerate hypothesis generation, reduce manual effort, and improve the performance of predictive algorithms.

Zero-Shot and Few-Shot Learning: Doing More With Less Data

Zero-shot learning enables LLMs to perform tasks they have not been explicitly trained for, while few-shot learning allows models to generalize to new tasks from only a handful of examples. These capabilities arise from the models’ pretraining on vast amounts of text, equipping them with broad contextual understanding.

Protocol feasibility studies are a compelling example of zero-shot learning. Zero-shot reasoning enables LLMs to assess site-level protocol feasibility by analyzing historical site data against trial requirements, providing actionable insights even without prior task-specific training. 
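In code, the zero-shot pattern is simply instructions plus data, with no worked examples in the prompt. A minimal sketch, assuming an OpenAI-compatible chat API; the site summary, protocol requirements, and model name are illustrative placeholders:

```python
# Zero-shot sketch: the prompt carries only instructions and data, no worked examples.
# Assumes the `openai` package and an OpenAI-compatible endpoint; all inputs and the
# model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

site_summary = (
    "Site 042: academic oncology center, 3 investigators, 120 NSCLC patients "
    "enrolled across 4 trials in the last 24 months, median screen-failure rate 35%."
)
protocol_requirements = (
    "Phase II NSCLC trial; on-site EGFR testing required; 15 patients in 12 months; "
    "weekly visits for the first 8 weeks."
)

prompt = (
    "Assess whether this site can feasibly deliver on the protocol requirements. "
    "Answer FEASIBLE, AT RISK, or NOT FEASIBLE, followed by a two-sentence rationale.\n\n"
    f"Site data:\n{site_summary}\n\nProtocol requirements:\n{protocol_requirements}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```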

Few-shot learning can be instrumental in generating regulated documents such as clinical trial protocols or investigator brochures. Traditionally, drafting these documents requires significant effort from medical writers and domain experts, followed by iterative revisions. With a few examples of study designs and requirements, an LLM can generate draft protocols, pre-populating sections with contextualized content while adhering to regulatory standards. 
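The few-shot pattern differs only in that a handful of prior examples are placed in the prompt before the new request. A compressed sketch under the same API assumption; the study designs and drafted sections below are placeholder text, not real protocol language:

```python
# Few-shot sketch: prior study designs and drafted sections are supplied as in-context
# examples before the new request. All content below is placeholder text, not real
# protocol language.
from openai import OpenAI

client = OpenAI()

few_shot_examples = [
    ("Phase III, randomized, double-blind, drug X vs placebo in type 2 diabetes",
     "Primary objective: demonstrate superiority of drug X over placebo on HbA1c ..."),
    ("Phase II, open-label, drug Y in moderate-to-severe psoriasis",
     "Primary objective: evaluate PASI-75 response at week 16 ..."),
]

messages = [{
    "role": "system",
    "content": "You draft clinical trial protocol sections in a regulatory-compliant style.",
}]
for design, objectives_section in few_shot_examples:
    messages.append({"role": "user",
                     "content": f"Study design: {design}\nDraft the Objectives section."})
    messages.append({"role": "assistant", "content": objectives_section})

# The new study for which a first draft is requested.
messages.append({"role": "user",
                 "content": "Study design: Phase II, randomized, drug Z in NASH\n"
                            "Draft the Objectives section."})

draft = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=messages,
    temperature=0.2,
)
print(draft.choices[0].message.content)
```

Supplying the examples as prior user/assistant turns and keeping the temperature low steers the draft toward the house style while leaving medical writers in control of the final content.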

Similarly, in the context of adverse event reporting, LLMs can analyze a small set of labeled adverse event cases to generate templates for reporting, ensuring compliance with regulatory frameworks while capturing critical details efficiently. Researchers can then refine these outputs, significantly reducing time-to-finalization.

LLMs can accelerate document creation, reduce operational bottlenecks, and free expert time for higher-value activities. Many pharmaceutical companies have been implementing LLMs for study protocol authoring, with some reports indicating significant time and cost savings. When integrated with compliant systems of record, such as Salesforce Life Sciences Cloud, LLMs become key enablers of capabilities such as a digital study protocol builder and intelligent study design.

Relationship Extraction with Semantic Reasoning: Mapping Complex Knowledge

Relationship extraction involves identifying and mapping connections between entities, while semantic reasoning allows LLMs to infer meaning and relationships based on context. Together, these capabilities enable the extraction of information from data organized into knowledge graphs (structured maps of interconnected concepts and their relationships) and the construction of knowledge graphs themselves.

Combining LLMs with knowledge graphs presents the potential to advance data integration and analysis in life sciences, supporting applications such as drug discovery and RWE research. LLMs can automate the construction of knowledge graphs by extracting and organizing entities and relationships from unstructured and structured data sources like scientific publications, clinical trial databases, or EHR systems, ensuring comprehensive mapping of biomedical concepts. A knowledge graph that captures the relationships between genes, pathways, diseases, and treatments can be a source of data for an LLM to uncover novel drug-repurposing opportunities by identifying subtle connections between an existing drug’s mechanism of action and emerging data on different diseases.
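A minimal sketch of this kind of pipeline, assuming an OpenAI-compatible chat API and the networkx library; the abstract text, extracted triples, and model name are illustrative:

```python
# Sketch: extract (subject, relation, object) triples from text with an LLM and load
# them into a small knowledge graph. Assumes the `openai` and `networkx` packages;
# the abstract text and model name are illustrative placeholders.
import json

import networkx as nx
from openai import OpenAI

client = OpenAI()

abstract = (
    "Metformin activates AMPK, which inhibits mTOR signaling; mTOR hyperactivation "
    "has been implicated in the progression of hepatocellular carcinoma."
)

prompt = (
    "Extract biomedical relationships from the text below as a JSON list of "
    "[subject, relation, object] triples. Return only the JSON.\n\n" + abstract
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# A production pipeline would parse and validate this output defensively.
triples = json.loads(response.choices[0].message.content)

graph = nx.DiGraph()
for subject, relation, obj in triples:
    graph.add_edge(subject, obj, relation=relation)

print(list(graph.edges(data=True)))
```

In production, the extracted triples would be validated and normalized against standard ontologies (for example, MeSH or UMLS) before being merged into the graph.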

Another example application of combining LLMs with knowledge graphs is the generation of insights from heterogeneous data sources, such as molecules’ toxicity profiles, preclinical data, clinical trial results, and scientific literature, to predict the probability of success for an asset in clinical development. These capabilities empower pharmaceutical companies to make more informed and timely decisions, driving innovation and efficiency.
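One way to sketch this pattern is to retrieve a relevant slice of the knowledge graph and pass it to the LLM as grounding context. Everything below (the asset, facts, and model name) is an illustrative placeholder:

```python
# Sketch: ground an LLM assessment in an explicit slice of a knowledge graph.
# Assumes the `openai` and `networkx` packages; the asset, facts, and model name
# are illustrative placeholders.
import networkx as nx
from openai import OpenAI

client = OpenAI()

# A toy knowledge-graph slice around a hypothetical asset.
graph = nx.DiGraph()
graph.add_edge("DrugZ", "FXR", relation="agonist_of")
graph.add_edge("FXR", "hepatic fibrosis", relation="modulates")
graph.add_edge("DrugZ", "pruritus", relation="associated_adverse_event")
graph.add_edge("DrugZ", "Phase IIb NASH trial", relation="met_primary_endpoint_in")

facts = [f"{u} --{d['relation']}--> {v}" for u, v, d in graph.edges(data=True)]

prompt = (
    "Using only the facts below, give a short qualitative assessment of this asset's "
    "probability of success in clinical development and list the main evidence gaps.\n\n"
    "Facts:\n" + "\n".join(facts)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```

Grounding the prompt in explicit graph facts also keeps the assessment auditable: every statement can be traced back to an edge in the graph.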

Building a Future Beyond Chatbots

While the allure of conversational AI dominates mainstream discourse, the less-heralded properties of LLMs offer transformative potential for pharma R&D and RWE. By leveraging feature selection, zero-shot and few-shot learning, and relationship extraction, these models can redefine how data is analyzed, decisions are made, and therapies are developed.

To fully realize this potential, organizations must:

  1. Invest in infrastructure: High-performance computing and data governance are critical for deploying LLM-based solutions.
  2. Foster cross-disciplinary collaboration: Bridging data science, clinical expertise, and regulatory knowledge is essential for meaningful innovation.
  3. Adopt an iterative approach: Continuous learning and validation ensure that models meet the rigorous requirements of pharma applications.

By embracing these strategies, pharma organizations can move beyond chat-based applications to harness LLMs’ true analytical and predictive power, driving innovation and better patient outcomes. Learn more about the future of Healthcare & Life Sciences here.


The Healthcare & Life Sciences Studio aims to reinvent the life sciences industry ecosystem through tangible technology-driven solutions. Globant helps life sciences and healthcare organizations achieve their mission of delivering innovation and services faster and more efficiently, enhancing patient value and improving outcomes.