Recent Advances in Large Language Models for Healthcare
Khalid Nassiri and Moulay A. Akhloufi *
Perception, Robotics and Intelligent Machines (PRIME), Department of Computer Science,
Université de Moncton, Moncton, NB E1A 3E9, Canada;
* Correspondence:
Abstract: Recent advances in the field of large language models (LLMs) underline their high potential
for applications in a variety of sectors. Their use in healthcare, in particular, holds out promising
prospects for improving medical practices. As we highlight in this paper, LLMs have demonstrated
remarkable capabilities in language understanding and generation that could indeed be put to good
use in the medical field. We also present the main architectures of these models, such as GPT, Bloom,
or LLaMA, composed of billions of parameters. We then examine recent trends in the medical datasets
used to train these models. We classify them according to different criteria, such as size, source, or
subject (patient records, scientific articles, etc.). We discuss how LLMs could help improve patient
care (for example, through assisted diagnosis), accelerate medical research, and optimize the
efficiency of healthcare systems. We also highlight several technical and ethical issues that need
to be resolved before LLMs can be used extensively in the medical field. Finally, we discuss the
capabilities offered by new generations of language models and their limitations when deployed in a
domain such as healthcare.
Keywords: large language models (LLMs); transformers; medical datasets; foundation models
Citation: Nassiri, K.; Akhloufi, M.A. Recent Advances in Large Language Models for Healthcare.
BioMedInformatics 2024, 4, 1097–1143. https://doi.org/10.3390/biomedinformatics4020062
Academic Editors: Ognjen Arandjelović and Alexandre G. De Brevern
Received: 25 January 2024
Revised: 24 February 2024
Accepted: 25 March 2024
Published: 16 April 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Recent advances in the field of artificial intelligence (AI) have enabled the development of
increasingly powerful linguistic models capable of generating text fluently and coherently. Among
these models, “large language models” (LLMs) stand out for their imposing size and their ability to
learn enormous amounts of textual data. Models like GPT-3.5 [1] and GPT-4 [2] developed by
OpenAI [3], or Bard, created by Google [4], have billions of parameters and have demonstrated
impressive comprehension skills and language generation.
These fascinating advancements in natural language processing (NLP) have promising implications in
many fields, including healthcare [5–7]. Indeed, they offer new perspectives for improving patient
care [8,9], accelerating medical research [10,11], supporting decision-making [12,13], accelerating
diagnosis [14,15], and making health systems more efficient [16,17].
These models could also provide valuable assistance to healthcare professionals by helping them
interpret complex patient records and develop personalized treatment plans [18] as well as manage
the increasing amount of medical literature.
In the clinical domain, these models could help make more accurate diagnoses by analyzing patient
medical records [19]. They could also serve as virtual assistants to provide personalized health
information or even simulate therapeutic conversations [20–22]. The automatic generation of medical
record summaries or examination reports is another promising application [23–25].
For biomedical research, the use of extensive linguistic models paves the way for rapid information
extraction from huge databases of scientific publications [26,27]. They can also generate new
research hypotheses by making new connections in the literature. These applications would
significantly accelerate the discovery process in biomedicine [28,29].
Additionally, these advanced language models could improve the administrative
efficiency of health systems [30]. They would be able to extract key information from
massive medical databases, automate the production of certain documents, or even help in
decision-making for the optimal allocation of resources [31–33].
LLMs such as GPT-3.5, GPT-4, Bard, LLaMA, and Bloom have shown impressive
results in various activities related to clinical language comprehension. Nevertheless, it
is essential to evaluate them in detail in the medical context, which is characterized by its
distinct nuances and complexities compared to common texts.
It is therefore essential to continue studies in order to verify the capacity of these inno-
vative linguistic models in real clinical situations, whether for decision-making, diagnostic
assistance, or the personalization of treatments. Rigorous evaluation is the key to taking
full advantage of this technology and transforming medicine through better mastery and
use of clinical language.
Although promising, the use of these models in healthcare also raises ethical and technological
challenges that must be addressed. Nevertheless, their potential for accelerating medical
progress and improving the quality of care seems immense. The coming years will tell
us to what extent these models are capable of transforming medicine and
health research.
The main contributions of this paper are the following:
• We analyze major large language model (LLM) architectures such as ChatGPT, Bloom,
and LLaMA, which are composed of billions of parameters and have demonstrated
impressive capabilities in natural language understanding and generation.
• We present recent trends in the medical datasets used to train such models. We classify
these datasets according to different criteria, such as their size, source (e.g., patient
files, scientific articles), and subject matter.
• We highlight the potential of LLMs to improve patient care through applications like
assisted diagnosis, accelerate medical research by analyzing literature at scale, and
optimize the efficiency of health systems through automation.
• We discuss key challenges for practically applying LLMs in medicine, particularly im-
portant ethical issues around privacy, confidentiality, and the risk of algorithmic biases
negatively impacting patient outcomes or exacerbating health inequities. Addressing
these challenges will be critical to ensuring that LLMs can safely and equitably benefit
public health.
As depicted in Figure 1, this paper is organized to guide the reader through the different facets
of LLMs and their relevance to medical diagnosis. Section 2 gives a brief history of LLMs.
Section 3 introduces the transformer architecture, laying the foundation for the following
sections. Building on this foundation, Section 4 examines the specific architecture of LLMs.
Section 5 illustrates the practical applications of LLMs, with an emphasis on their use in
medical diagnosis. Section 6 presents a comprehensive review of medical datasets, segmenting
them into three key categories. Section 7 focuses on the advent of foundation models in the AI
landscape, highlighting their relevance in the clinical domain. Section 8 offers critical
reflections on both the benefits of LLMs and their inherent challenges, and Section 9 concludes
the article.
Figure 1. Structural diagram presenting the topics covered in our paper.
2. Brief History of Large Language Models
The first language models were n-gram models [34], which estimate the probability of
a word based on the previous n − 1 words. They began to be used in the 1980s and are still
used today. However, they do not capture the semantics of the language well.
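As a minimal sketch of the n-gram idea described above, the following Python snippet estimates bigram (n = 2) probabilities by maximum likelihood, that is, by dividing the count of a word pair by the count of its first word. The function names and the toy corpus are assumptions made purely for illustration; they are not taken from [34].

```python
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams in a list of tokens."""
    # The last token never serves as a context, so it is excluded here.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts

def bigram_prob(context_counts, bigram_counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

# Toy corpus invented for illustration only (not real clinical data).
corpus = "the patient has fever the patient has cough the doctor has notes".split()
context_counts, bigram_counts = train_bigram(corpus)

p_has = bigram_prob(context_counts, bigram_counts, "patient", "has")
p_fever = bigram_prob(context_counts, bigram_counts, "has", "fever")
p_unseen = bigram_prob(context_counts, bigram_counts, "has", "patient")
```

In this toy corpus, "has" is followed by three different words, so each continuation receives one third of the probability mass, while a pair that never occurs receives zero. The model only counts surface co-occurrences, which is one way to see why such models capture little of the semantics.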
During the 2000s, researchers introduced topic models like latent Dirichlet allocation
(LDA) [35]. These models have the ability to detect themes within large collections of text
and are particularly useful for analyzing large amounts of health-related textual data.
Language models based on neural networks, such as word2vec [36] and GloVe [37],
began to emerge in the 2010s. They learn vector representations of words that capture
semantic and syntactic relationships. They have been used for tasks such as extracting
information from clinical texts.
The introduction of the transformer architecture in 2017 brought a revolution in the
field of NLP, paving the way for the emergence of large-scale pre-trained models such
as BERT [4], RoBERTa [39], and GPT-3 [3], alongside the LSTM-based ELMo [38]. These models are trained on
massive amounts of general domain text and can then be fine-tuned for specific healthcare
applications. They have been used for tasks such as named entity recognition, sentiment
analysis, and question answering (QA) in clinical text. Indeed, domain-specific versions of
BERT such as BioBERT [40] and ClinicalBERT [41] have been developed to address clinical
language comprehension tasks.
More recently, LLMs have continued to evolve, demonstrating state-of-the-art performance
in many fields, including healthcare. They are being applied in new ways in healthcare,
for example to facilitate clinical documentation, identify adverse drug reactions, and predict
health outcomes from patient notes. However, the specialized clinical vocabularies,
acronyms, and abbreviations present in clinical text remain a challenge.
Over time, LLMs such as GPT-3.5, GPT-4, and Bard have steadily increased in size and
performance. This has opened the way to new use cases: more comprehensive
virtual assistants [42], integration with patient files [43], diagnostic/therapeutic
recommendations [44], etc.
Today, research is exploring many avenues, including personalized medicine with omics
data [45], medical image analysis [46], clinical decision support [47,48], comprehensive
healthcare assistants [49,50], and biomedical knowledge bases [51,52].
While LLMs have tremendous potential, their development raises major ethical challenges:
guaranteeing patient safety, combating bias, verifying and explaining recommendations,
respecting privacy, and ensuring complementarity with caregivers. Researchers are
working actively on these issues so that AI benefits the healthcare system responsibly.
3. Transformers and Their Architecture
Transformers, a category of deep learning (DL) models, are predominantly employed
for tasks related to NLP. These models were first introduced by Google in 2017 through a pa-
per titled “Attention is All You Need” [53]. The fundamental element of transformers is the
attention mechanism (see Section 3.1). This mechanism enables the model to comprehend
the contextual associations between words (or other elements) within a sentence.
Transformers use multi-head self-attention to analyze the relationships between words
in a sentence. Self-attention means that each word attends to every other word in the same
sequence. The operation itself takes no account of relative or absolute position, which is
why the transformer adds positional encodings to its input embeddings [53].
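The self-attention operation described above can be sketched in a few lines of plain Python. This is a simplified, single-head version written under stated assumptions: the queries, keys, and values are all taken to be the input vectors themselves (no learned projection matrices), and the input vectors are invented toy values. The sketch also checks that reordering the input rows only reorders the output rows, which is the sense in which the operation ignores position.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head scaled dot-product self-attention with Q = K = V = X (assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this word to every word in the sequence, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # The output is the attention-weighted average of all input vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy "word" vectors invented for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
X_perm = [X[i] for i in perm]

out = self_attention(X)
out_perm = self_attention(X_perm)

# Permuting the input permutes the output in exactly the same way.
order_insensitive = all(
    abs(a - b) < 1e-9
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
```

A full multi-head layer would apply separate learned projections for each head and concatenate the results; those parts are omitted here to keep the core computation visible.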
The basic architecture of transformers consists of an encoder and a decoder. The
encoder helps process the input sequence, while the decoder generates the output sequence.
Common transformer architectures include BERT [4], GPT [54], T5 [55], etc. BERT (Bidi-
rectional Encoder Representations from Transformers) introduced bidirectional training,
which looks at the context from both left and right. GPT (Generative Pre-trained Trans-
former) models like GPT-2 [56], GPT-3 [3], and GPT-4 [2] are able to generate new text.
T5 (Text-To-Text Transfer Transformer) can perform a wide range of text-based tasks like
summarization, QA, and translation [57].
The original transformer follows a basic structure of embedding, encoding, and
decoding. Variants keep only part of it (BERT uses the encoder stack, GPT the decoder stack)
and also differ in pre-training objectives, model size, number of encoder/decoder layers,
attention types, etc. Later models like T5, GPT-3, and GPT-4 have
billions of parameters to handle more complex tasks through transfer learning from massive
text corpora.
3.1. The Attention Mechanism
The attention mechanism aims to mitigate the loss of information passed to the
decoder. In a classic encoder-decoder model, only the final hidden state produced by the
encoder is provided as input to the decoder.
The work of Larochelle and Hinton [58] introduced this approach in the field
of computer vision. By analyzing several regions of an image separately, a learning
algorithm can gradually accumulate knowledge about the shapes and objects present.
By examining each region in turn, the model builds up a global understanding of the
image, which ultimately enables it to assign a relevant category to it.
The attention mechanism proposed in [59,60], given in Equation (1), has
played a crucial role in improving the performance of machine translation systems. This
approach gives the model the ability to focus on the most relevant segments of the input sequence.
The main idea behind attention is to evaluate the relationship between parts of two
different sequences. In NLP, especially in a sequence-to-sequence framework, the attention
mechanism aims to signal to the model which word in sequence “B” should be privileged
in relation to a specific word in sequence “A”.
A model with attention differs from a classic seq2seq model in two major respects.
Firstly, instead of transmitting only the final hidden state from the encoder to the
decoder, it transmits all the hidden states, thus enriching the information available to
the decoder. Secondly, before producing its output, the decoder performs an additional
step: it assigns a score to each hidden state received from the encoder, normalizes the
scores with a softmax, and multiplies each hidden state by its softmax weight. The weighted
hidden states are then summed into a context vector that emphasizes the crucial elements of the input.
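The extra decoder step described above can be illustrated with a short Python sketch. It assumes a simple dot product as the scoring function, which is only one possible choice and is not necessarily the one used in [59,60]; the function name `attend` and the numeric values are likewise hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One decoder step: score, normalize, and combine all encoder hidden states.

    Scoring by dot product is an assumption made for this sketch.
    """
    # 1. Score every encoder hidden state against the current decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    # 2. Turn the scores into weights that sum to 1.
    weights = softmax(scores)
    # 3. Multiply each hidden state by its weight and sum them into a context vector.
    dim = len(encoder_states[0])
    context = [
        sum(w * hs[j] for w, hs in zip(weights, encoder_states)) for j in range(dim)
    ]
    return weights, context

# Toy hidden states invented for illustration only.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

With these toy values, the second and third encoder states obtain equal scores and therefore equal weights, while the first one, which is orthogonal to the decoder state, contributes far less to the context vector.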
For each word in the input sequence, a contextualized attention layer (self-attention)
generates a vector representative of its importance (attention vector). Although expla-
nations are presented here as vectors for simplicity’s sake, calculations actually involve