Generative AI and Large Language Models for Healthcare: Comparison

Generative artificial intelligence (AI) and large language models (LLMs), exemplified by ChatGPT, hold promise for revolutionizing data and information management in healthcare and medicine.

  • generative artificial intelligence
  • generative AI
  • large language models
  • LLM

1. Introduction

Large language models (LLMs), e.g., the generative pre-trained transformer (GPT) family, are a prominent class of generative AI models. For example, GPT-3 has 175 billion parameters, while GPT-4 is reported to have on the order of one trillion. An intermediate version, GPT-3.5, was trained to predict the next word in a sequence using a large dataset of Internet text; it is the model that underpins the current version of ChatGPT [1]. After being pretrained on huge amounts of data to learn intricate patterns and relationships, these LLMs have developed capabilities that imitate human language processing [2]. Upon receiving a query or request in a prompt, ChatGPT can generate relevant and meaningful responses and answer questions by drawing on its learned language patterns and representations [3]. These LLMs are often referred to as the "foundation model" or "base model" for generative AI, as they are the starting point for the development of more advanced and complex models.
Distinct from traditional AI systems, which are typically rule-based or rely on predefined datasets, generative AI models possess the unique ability to create new content that is original and not explicitly programmed. This can result in outputs that are similar in style, tone, or structure to the prompt instruction. Therefore, if designed thoughtfully and developed responsibly, generative AI has the potential to amplify human capabilities in various domains of information management. These may include support for decision-making, knowledge retrieval, question answering, language translation, and automatic report or computer code generation [2].
It is not surprising that healthcare and medicine form a significant area for generative AI and LLMs to revolutionize: it is a human domain in which language is key to effective interactions for and between clinicians and patients [4]. It is also an information-rich field where every assessment, diagnosis, treatment, care plan, and outcome evaluation must be documented in specific terms or natural language in electronic health records (EHRs). Once an LLM is exposed to the relevant EHR dataset in a specific healthcare field, it will learn the relationships between the terms and extend its model to represent the knowledge of that field. With further advances in generative AI technologies, including video and audio technologies, the day is not far away when healthcare providers can dictate rather than simply type data into the EHR. Clinicians may verbally request computers to write prescriptions or order laboratory tests, and ask generative AI models integrated with EHR systems to automatically retrieve data, generate shift hand-over reports and discharge summaries, and support diagnostic and prescription decision-making. Therefore, generative AI can be 'a powerful tool in the medical field' [5].

2. Generative AI and Large Language Models for Healthcare

2.1. Technological Approaches to the Application of Generative AI and LLMs

Generative AI and LLMs are powered by a suite of deep learning technologies. For example, ChatGPT is a series of deep learning models built on the transformer architecture, which relies on self-attention mechanisms to process large human-generated text datasets (GPT-4 response, 23 August 2023). These technologies work in harmony to power ChatGPT, enabling it to handle a wide range of tasks, including natural language understanding, language generation, text completion, translation, summarization, and much more.
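
To make the self-attention mechanism concrete, below is a minimal sketch of scaled dot-product self-attention in Python with NumPy; the dimensions and variable names are illustrative and do not correspond to any specific GPT model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence.

    X: (seq_len, d_model) input token embeddings.
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise token-token affinities
    weights = softmax(scores, axis=-1)        # each token attends over all tokens
    return weights @ V                        # weighted mixture of value vectors

# Illustrative sizes: 4 tokens, 8-dimensional embeddings, 4-dimensional head.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 4)
```

In a full transformer, many such attention heads run in parallel and are stacked with feed-forward layers; the sketch shows only the core attention computation.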

2.1.1. Models

Based on model training strategies, architectures, and use cases, LLMs are classified into two types [6]: (1) encoder–decoder or encoder-only language models and (2) decoder-only models. The encoder-only models, represented by the BERT family, have started to phase out since the debut of ChatGPT. Encoder–decoder models, e.g., Meta’s BART, remain promising, as most of them are open-sourced, providing opportunities for the global software community to continuously explore and develop them. Decoder-only models, represented by the GPT family, the Pathways Language Model (PaLM) introduced by Google [7], and the LLaMA models from Meta, have dominated and will continue to dominate the LLM space because they are the foundation models for generative AI technologies. On the other hand, based on the training dataset, LLMs are classified into foundation (or base) LLMs [8] and instruction fine-tuned LLMs [6]. Foundation LLMs, e.g., GPT-3, are trained to predict the next most likely word based on the text training data; thus, the direction of their output can be unpredictable. An instruction fine-tuned LLM is a base LLM that has been further trained to follow instructions, using techniques including Reinforcement Learning from Human Feedback (RLHF) [9]. Because instruction fine-tuned LLMs are better tuned to the context, inputs, and outputs of a specific application domain, they align more closely with their intended purpose, overcoming limitations of the base model and producing safer, less biased, and less harmful output. Therefore, instruction fine-tuned LLMs are the recommended LLMs for specific AI applications in healthcare and medicine [6].

2.1.2. Data

The impact of data on a model’s effectiveness starts with the pre-training data and continues through the training, test, and inference data [4]. The quality, quantity, and diversity of pre-training data significantly influence the performance of LLMs [6]. Therefore, pre-training base models on data from a specific healthcare or medical field to produce instruction fine-tuned models is the recommended development method for downstream machine learning tasks in these fields [10]. With abundant annotated data, both base LLMs and instruction fine-tuned models can achieve satisfactory performance on a particular task while meeting the important privacy constraints on healthcare and medical data [11].

2.1.3. Task

LLMs can be applied to four types of tasks: natural language understanding (NLU), natural language generation, knowledge-intensive tasks, and reasoning [6]. Traditional natural language understanding tasks include text classification, concept extraction or named entity recognition (NER), relationship extraction, dependency parsing, and entailment prediction. Many of these tasks are intermediate steps in larger AI systems, such as NER for knowledge graph construction. Decoder-only LLMs can often complete the end task directly, removing the need for these intermediate steps (see the sketch below). Natural language generation includes two major types of tasks: (1) converting input texts into new symbol sequences, as in text summarization and machine translation, and (2) “open-ended” generation, which aims to produce new text or symbols in response to the input prompt, e.g., question answering, crafting emails, composing news articles, and writing computer code [6]. Knowledge-intensive NLP tasks are tasks that require a substantial amount of background knowledge, whether specific knowledge of a particular domain, general real-world knowledge, or expertise gained over time [6]. These tasks not only require pattern recognition or syntax analysis but also depend heavily on the memorization and proper utilization of knowledge about specific entities, events, and real-world common sense. Healthcare and medicine tasks fit into this category. After exposure to billions of tokens, LLMs excel at knowledge-intensive tasks. However, in situations where LLMs have not learned the contextual knowledge, or face tasks requiring knowledge they lack, they struggle and may “hallucinate” [8].
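
As an illustration of replacing an intermediate NER pipeline step with a direct generative call, the following is a minimal sketch; the `complete` callable is a hypothetical placeholder for any LLM completion interface, not a real API.

```python
# A sketch: disease NER framed as a text-generation task.
def extract_diseases(note: str, complete) -> list[str]:
    """Ask a decoder-only LLM to list disease mentions in a clinical note.

    `complete` is any callable mapping a prompt string to a completion string
    (a remote API wrapper or a local model); it is an assumed interface.
    """
    prompt = (
        "Extract all disease mentions from the clinical note below.\n"
        "Return them as a comma-separated list.\n\n"
        f"Clinical note: {note}\n"
        "Diseases:"
    )
    return [d.strip() for d in complete(prompt).split(",")]

# Illustrative usage with a stubbed model call:
stub = lambda prompt: "type 2 diabetes, hypertension"
print(extract_diseases(
    "Patient presents with type 2 diabetes and hypertension.", complete=stub))
# -> ['type 2 diabetes', 'hypertension']
```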

2.2. Methods to Train LLMs

2.2.1. Fine-Tuning LLMs

LLMs can be fine-tuned and scaled by various strategies, e.g., modifying the number of parameters [12], the size of the training dataset, or the amount of compute used for training [6]. Scaling up pretrained LLMs in this way can significantly improve their reasoning performance beyond what the power-law rule predicts, unlocking unprecedented emergent abilities [4][13]. Emergent abilities refer to specific competencies that do not exist in smaller models but become salient as the model scales. These include, but are not limited to, nuanced concept understanding, sophisticated word manipulation, advanced logical reasoning, and complex coding tasks [4]. Furthermore, the scaling of LLMs has led to advances that closely approximate human performance in both arithmetic reasoning and linguistic common-sense reasoning [6], competencies that are both important for healthcare and medicine. These enhanced capabilities allow LLMs to serve as innovative tools for medical education and help medical students gain novel clinical insights [14].
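
For reference, the power-law rule mentioned above can be written compactly. The following is a minimal sketch following the form of the scaling laws in Kaplan et al. [13]; the exponents and constants are setup-specific fitted values, so only the functional form is shown:

```latex
% Test loss falls as a power law in model size N, dataset size D,
% and training compute C (each varied with the others unconstrained):
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
% Emergent abilities are observed where capability improves faster than
% these smooth power laws predict as N, D, and C grow together.
```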

2.2.2. Reinforcement Learning from Human Feedback (RLHF)

RLHF refers to methods that combine three interconnected model training processes: feedback collection, reward modeling, and policy optimization [15]. Combined with instruction prompts, RLHF has been used to train LLMs to remarkable performance across many NLP tasks [4][12][16]. It not only improves model accuracy, factuality, consistency, and safety, and mitigates harm and bias, within medical question-answering tasks [4], but also bridges the gap between LLM-generated answers and human responses. Therefore, RLHF brings LLMs considerably closer to practical application in real-world clinical settings.
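
To illustrate the reward-modeling step, here is a minimal PyTorch sketch of the pairwise preference loss commonly used to fit a reward model from human feedback; the tensor names and values are illustrative assumptions, not any specific published implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) preference loss for reward modeling.

    r_chosen / r_rejected: scalar reward scores, shape (batch,), produced by
    the reward model for the human-preferred and the rejected responses to
    the same prompt. Minimizing this loss pushes r_chosen above r_rejected.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage: reward scores for three prompt/response pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(r_chosen, r_rejected))  # small when chosen >> rejected
```

The fitted reward model then supplies the training signal for the third process, policy optimization (e.g., with PPO), which updates the LLM to produce responses that score highly under the learned reward.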

2.2.3. Prompt Engineering

Prompt engineering refines the prompts used to make generative AI produce text or images, typically through an iterative refinement process. Common instruction-prompting strategies include zero-shot learning, few-shot learning, chain-of-thought, and self-consistency.

Zero-shot learning enables LLMs to perform specific NLP tasks from a single prompt instruction, eliminating the need for annotated data [17]; for example, a user simply types an instruction into the prompt to seek an answer from ChatGPT. This approach avoids the catastrophic forgetting often encountered in fine-tuned neural networks, as it requires no model parameter updates [18].

Few-shot learning steers LLMs toward specific NLP tasks by providing a limited set of examples, usually as input–output pairs included in the prompt [16][19]. This technique facilitates in-context learning beyond what zero-shot prompting achieves, thereby producing more generalized, task-specific performance [4]. Umapathi et al. found that the performance improvement in hallucination control plateaued after three examples in a few-shot learning experiment [12]. They also found that the framing of prompts is crucial; concise and explicit prompts yield higher task execution accuracy than ambiguous or verbose ones.

Chain-of-thought prompting imitates the human multi-step reasoning process in problem-solving tasks. It augments the few-shot examples in the prompt with a series of intermediate reasoning steps, articulated in concise sentences, that lead to the final answer [4][16][20]. This method can effectively draw out the reasoning capabilities of LLMs [20] and shows substantial performance improvements on math problem-solving tasks [21].

Self-consistency prompting samples a diverse set of reasoning paths instead of taking only the greedy one [22]. Its logic follows the common wisdom that a complex problem usually admits multiple reasoning paths to the correct solution; the most consistent answer is then selected from among the sampled reasoning paths.
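
The sketch below contrasts these prompting styles with illustrative prompt strings; the clinical questions and the `ask` callable are hypothetical placeholders for any LLM completion interface.

```python
from collections import Counter

# Zero-shot: a single instruction, no examples.
zero_shot = ("Classify the chief complaint 'crushing chest pain radiating "
             "to the left arm' as emergent, urgent, or routine.")

# Few-shot: the same task, preceded by labeled input-output examples.
few_shot = """Classify each chief complaint as emergent, urgent, or routine.
Complaint: 'sore throat for two days' -> routine
Complaint: 'sudden slurred speech and facial droop' -> emergent
Complaint: 'crushing chest pain radiating to the left arm' ->"""

# Chain-of-thought: the example answer spells out intermediate reasoning steps.
chain_of_thought = """Q: A patient takes 2 tablets of 250 mg amoxicillin three times daily. How many mg per day?
A: Each dose is 2 x 250 mg = 500 mg. Three doses daily gives 3 x 500 mg = 1500 mg. The answer is 1500.
Q: A patient takes 1 tablet of 500 mg metformin twice daily. How many mg per day?
A:"""

# Self-consistency: sample several reasoning paths (temperature > 0) and
# return the most frequent final answer.
def self_consistent_answer(ask, prompt, n=5):
    answers = [ask(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```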

2.3. Model Evaluation

Three challenges impede the application of LLMs to modeling real-world tasks [6]: (1) noisy, unstructured real-world input data that are often messy, e.g., containing typos, colloquialisms, and mixed languages; (2) ill-defined practical tasks that are difficult to map onto predefined NLP task categories; and (3) ambiguous instructions that may contain multiple implicit intents, which make predictive modeling difficult without follow-up probing questions. Although foundation models address these three challenges better than fine-tuned models do, their effectiveness in handling real-world input data has yet to be evaluated [4][6]; accordingly, Bommasani et al. call for a holistic evaluation of LLMs [23]. Singhal et al. developed and piloted a seven-axis framework for physician and lay-user evaluation of LLM performance beyond accuracy on multiple-choice datasets [4]. The seven axes assess AI model answers for (1) agreement with the scientific and clinical consensus; (2) reading comprehension, retrieval, and reasoning capabilities; (3) incorrect or missing content; (4) the possible extent and likelihood of harm; (5) bias across medical demographics; (6) lay assessment of the helpfulness of the answer; and (7) how well the answer addresses the intent of the question. In a follow-up study, Singhal et al. added two further human evaluations: (8) a pairwise ranking of model and physician answers to consumer medical questions along these clinically relevant axes and (9) a physician assessment of model responses on two newly introduced adversarial testing datasets designed to probe the limits of LLMs [16].

2.4. Current Applications of Generative AI and LLMs in Healthcare and Medicine

There is tremendous potential for LLMs to innovate information management, education, and communication in healthcare and medicine [5]. Li et al. proposed a taxonomy to classify ChatGPT’s utility in healthcare and medicine based on two criteria: (1) the nature of the medical tasks that LLMs address and (2) the targeted end users [24]. By the first criterion, seven types of ChatGPT applications were outlined: triage, translation, medical research, clinical workflow, medical education, consultation, and multimodal applications. The second criterion delineates categories of end users: patients/relatives, healthcare professionals/clinical centers, payers, researchers, students/teachers/exam agencies, and lawyers/regulators. One use case of LLMs in supporting the medical task of triage [24] is assisting healthcare professionals in condensing a patient’s hospital stay into succinct summaries based on their medical records and then generating discharge letters [25], benefiting from these models’ strong ability to summarize data from heterogeneous sources [26]. A useful application of LLMs to improving clinical workflow is significantly reducing the documentation burden that has long plagued doctors and nurses, a problem that persisted even after the transition from paper to electronic health records [27].

2.5. The Benefits of Generative AI and LLMs for Healthcare and Medicine

2.5.1. Creating Synthetic Patient Health Records to Improve Downstream Clinical Text Mining

To date, many LLMs, e.g., ChatGPT, are available only through their APIs [11]. This raises privacy concerns about directly uploading patients’ data to an LLM API for data mining. To tackle this challenge, Tang et al. proposed a new training paradigm that first uses a small number of human-labeled examples for zero-shot prompting of ChatGPT to generate a large volume of high-quality synthetic data with labels [11]. Using these synthetic data, they fine-tuned a local model for the downstream tasks of biomedical named entity recognition and relation extraction using three public datasets: NCBI Disease, BC5CDR Disease, and BC5CDR Chemical. Their training paradigm provides a useful way to apply LLMs to clinical text mining with privacy protection.
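
A minimal sketch of this train-on-synthetic-data paradigm follows; the prompt format and the `complete` callable are illustrative assumptions, not the authors' actual code.

```python
# Step 1: prompt a remote LLM with a few labeled seed examples to produce
# synthetic labeled sentences; no real patient text is ever uploaded.
def generate_synthetic(seed_examples, complete, n=100):
    demo = "\n".join(f"Sentence: {s}\nDisease entities: {e}"
                     for s, e in seed_examples)
    prompt = (f"{demo}\n\nWrite {n} new, realistic clinical sentences in the "
              "same format, each followed by its disease entities.")
    return complete(prompt)  # `complete` is a placeholder LLM call

# Step 2: parse the synthetic text into (sentence, entities) training pairs
# for fine-tuning a small local NER model, so inference never leaves the site.
def parse_synthetic(text):
    lines = [l for l in text.splitlines() if l.strip()]
    pairs = []
    for sent_line, ent_line in zip(lines[::2], lines[1::2]):
        sentence = sent_line.removeprefix("Sentence: ")
        entities = [e.strip() for e in
                    ent_line.removeprefix("Disease entities: ").split(",")]
        pairs.append((sentence, entities))
    return pairs

seeds = [("Patient denies chest pain but reports dyspnea.", "dyspnea")]
stub = lambda p: "Sentence: History of asthma and eczema.\nDisease entities: asthma, eczema"
print(parse_synthetic(generate_synthetic(seeds, complete=stub)))
# -> [('History of asthma and eczema.', ['asthma', 'eczema'])]
```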

2.5.2. Using Chatbot Underpinned by LLMs to Assist Health Communication

Ayers et al. evaluated the ability of ChatGPT to provide quality, empathetic responses to patient questions [28]. They conducted a cross-sectional study comparing responses from ChatGPT and certified physicians to 195 patient questions posted on a public social media forum, Reddit. A team of licensed healthcare professionals carried out the evaluation, preferring chatbot responses over physician responses in 78.6% of the 585 evaluations and rating the chatbot responses significantly higher in both quality and empathy.

2.5.3. Potential to Address Patient Queries following Routine Surgery

Chowdhury et al. tested ChatGPT’s capability to safely address patient questions following cataract surgery. They obtained ChatGPT’s answers to 131 unique symptom-based questions posed by 120 patients and had two ophthalmologists assess the responses [29]. Although 21% of the questions were too unclear to answer, 59.9% of ChatGPT’s responses were rated ‘helpful’ and 36.3% ‘somewhat helpful’. A total of 92.7% of responses were rated as having a ‘low’ likelihood of harm, while 24.4% carried a possibility of ‘moderate or mild harm’. Only 9.5% of answers opposed clinical or scientific consensus. Even without fine-tuning and with minimal prompt engineering, LLMs such as ChatGPT thus show potential to helpfully address real-world patient queries following routine surgery, provided model safety is further controlled.

2.5.4. Improving Accuracy in Medical Image Analysis

A three-step approach employing a Generative Adversarial Network (GAN) has been proposed to improve the resolution of medical images, a critical component of accurate medical diagnosis [30]. The proposed architecture was evaluated on four medical image modalities, using four test samples drawn from four public datasets. The authors reported superior accuracy of the model’s output and image resolution. By producing high-resolution medical images, this method has the potential to help medical professionals interpret data more precisely, improving diagnostic accuracy and patient care.
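
As a rough illustration of the GAN idea behind such super-resolution work, here is a minimal PyTorch sketch (not the authors' architecture; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Generator: upsample a low-resolution grayscale patch 2x.
generator = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.Sigmoid(),
)

# Discriminator: score whether a high-resolution patch looks real.
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.LazyLinear(1),
)

low_res = torch.rand(4, 1, 32, 32)    # batch of four low-res patches
fake_hr = generator(low_res)          # -> (4, 1, 64, 64) super-resolved patches
score = discriminator(fake_hr)        # -> (4, 1) real/fake logits

# Training alternates: the discriminator learns to separate real high-res
# patches from generated ones, while the generator learns to fool it,
# sharpening the super-resolved output.
```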

2.5.5. Potential to Provide Ongoing Clinical Decision Support throughout the Entire Clinical Workflow

Rao et al. tested an LLM’s ability to provide ongoing clinical decision support [31]. They presented ChatGPT with a series of hypothetical patients, varied by age, gender, and Emergency Severity Index (ESI), and asked it to recommend diagnoses based on their initial clinical presentations. The test followed 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual. The results were noteworthy: ChatGPT achieved 71.7% overall accuracy across all 36 vignettes, including 60.3% accuracy in generating an initial differential diagnosis and a peak accuracy of 76.9% in making a final diagnosis. These findings provide evidence supporting the integration of LLMs into the clinical workflow, highlighting their potential to support clinical decision-making.

2.5.6. Fine-Tuning Local Large Language Models for Pathology Data Extraction and Classification

Bumgardner et al. introduced an innovative approach that uses local LLMs to extract structured International Classification of Diseases (ICD) codes from complicated, unstructured clinical data, including clinical notes, pathology reports, and laboratory findings sourced directly from clinical workflows at the University of Kentucky [32]. Through fine-tuning, the researchers optimized a decoder model, LLaMA, along with two encoder models, BERT and Longformer. These models were then used to extract structured ICD codes in response to specific generative instructions. The dataset, consisting of 150,000 entries, included detailed pathology reports describing tissue specimen attributes as well as final reports summarizing diagnoses, informed by microscopic tissue reviews, laboratory results, and clinical notes. An added complexity was that individual cases might contain many tissue specimens.
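
The following is a minimal sketch of the kind of instruction-formatted training record such fine-tuning typically consumes; the field names, report text, and code below are invented for illustration and are not drawn from the study's dataset.

```python
import json

# One hypothetical instruction-tuning record pairing a pathology report
# excerpt with its target ICD-10-CM code.
record = {
    "instruction": "Extract the ICD-10-CM diagnosis codes from this pathology report.",
    "input": ("Specimen A: sigmoid colon polyp, tubular adenoma. "
              "Specimen B: cecum biopsy, benign colonic mucosa."),
    "output": "D12.5",  # benign neoplasm of sigmoid colon (illustrative label)
}

# Fine-tuning corpora are commonly serialized one JSON object per line (JSONL).
print(json.dumps(record))
```

Multi-specimen cases like the example above are exactly where such structured extraction becomes difficult: the model must associate each code with the correct specimen rather than with the report as a whole.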

2.5.7. Medical Education

Language is the key means of communication in healthcare and medicine [16]. It underpins interactions between people and care providers. A key application area of LLMs is medical communication and education for healthcare and medical students, staff, and consumers alike [4][16][30][33]. The Med-PaLM 2 model reached 86.5% accuracy in answering medical exam questions drawn from the United States Medical Licensing Examination (USMLE) dataset [16]. These results suggest that generative AI and LLMs can produce trustworthy and explainable outputs. They can serve as exemplary guides for human learners, particularly in drafting papers with high internal concordance, logical structure, and clear articulation of the relationships between concepts [34]. They can also exemplify the deductive reasoning process [16].

2.6. Ethical and Regulatory Consideration for Generative AI and LLMs

The current mainstream view of the healthcare and medicine community towards LLMs is one of caution, balancing regulatory and technical challenges while the generative AI technology is still at an early, experimental stage. For example, the well-known ChatGPT model is fine-tuned on Internet data rather than healthcare data. As model output is shaped by the training dataset, experts do not recommend using ChatGPT directly, without further specialization, in healthcare or medicine [35]. It is well known that LLMs can generate outputs that are untruthful, toxic, hallucinated, or simply not helpful to users [36].

2.6.1. Ethical Concerns

The large-scale use of ChatGPT has raised several social and ethical questions, e.g., the production of false, offensive, or irrelevant content that can cause harm, or even pose threats, to humanity, politics, warfare, and knowledge bases [5]. Training data patterns and algorithmic choices may reflect existing health inequalities [4]. The frameworks currently used to evaluate LLM applications in healthcare are relatively subjective, limited by current human knowledge and expert opinion, and lack coverage of the full spectrum of the population. Another potential source of bias is the limited number and diversity of the human raters, i.e., clinicians and laypeople, who participate in evaluations [4][16]. Harrer summarizes six ethical concerns for the use of generative AI technologies: accountability, fairness, data privacy and selection, transparency, explainability, and value and purpose alignment [5]. However, making significant progress on these problems can be challenging, since detoxification methods can have side effects [36].

2.6.2. Ensuring Patient Privacy and Data Security

The public, healthcare, and technology communities have called for regulations and policies on data governance and privacy for AI technologies [5][37][38]. Currently, any data entered as prompts in ChatGPT are transferred to the servers of the company OpenAI without binding legal safeguards, raising data privacy concerns and conflicting with personal data privacy legislation in many countries. Safeguarding AI systems and data is another critical concern for generative AI applications in healthcare and medicine, requiring adequate data protection mechanisms to prevent unauthorized access as well as protection against adversarial cyber-attacks [39].

2.6.3. Addressing Biases in AI Algorithms

It is well recognized that foundation LLMs can generate outputs that are untruthful, biased, toxic, hallucinated, or simply not helpful to users [36]. This is because their training objective is to predict the next token in a text, not to follow the user’s instructions helpfully and safely [19].

2.6.4. Implications of AI Model “Hallucination” for Healthcare

An obvious limitation of LLMs is hallucination, the generation of information that, although plausible, may be unverified, incorrect, or false [36]. This impediment can have serious consequences in healthcare applications [12], leading to inappropriate medical decisions that may compromise patient safety [40]; such faults may also have profound legal and liability ramifications. Umapathi et al. conducted hallucination tests on common LLMs, including Text-Davinci, GPT-3.5, Llama 2, MPT, and Falcon [12]. They assembled a new hallucination benchmark dataset, Med-HALT (Medical Domain Hallucination Test), by amalgamating multiple-choice questions and answers from medical examination tests across the USA, Spain, India, and Taiwan. They ran two types of tests on Med-HALT, reasoning tests and memory-based hallucination tests, using accuracy and pointwise scores as metrics. The latter computes the score as the sum of positive scores for correct answers minus penalties for incorrect ones.
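
As a small illustration of such a pointwise metric (a sketch; the +1.0 credit and -0.25 penalty below are illustrative assumptions, not necessarily Med-HALT's exact weights):

```python
def pointwise_score(results, correct_points=1.0, incorrect_penalty=-0.25):
    """Sum positive credit for correct answers and a negative penalty for
    incorrect ones; `results` is an iterable of booleans (True = correct)."""
    return sum(correct_points if ok else incorrect_penalty for ok in results)

# Seven correct and three incorrect answers out of ten questions:
print(pointwise_score([True] * 7 + [False] * 3))  # 7*1.0 + 3*(-0.25) = 6.25
```

Unlike plain accuracy, this metric penalizes confident wrong answers, which is the failure mode hallucination tests are designed to surface.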

3. Conclusion

Distinct from traditional AI, generative AI necessitates active engagement from experts and collaborative efforts between clinicians and consumers as it is integrated into healthcare. While challenges remain around ethics, transparency, legal considerations, safety, and bias, the capacity of these technologies to lift healthcare quality and efficiency is profound. Healthcare institutions should adopt these technologies to augment care quality and safety while ensuring cost-effectiveness and upholding the highest ethical standards.

References

  1. OpenAI. Aligning Language Models to Follow Instructions. 2022. Available online: https://openai.com/research/instruction-following (accessed on 30 June 2023).
  2. Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR 139, pp. 12697–12706.
  3. Cascella, M.; Montomoli, J.; Bellini, V.; Bignami, E. Evaluating the feasibility of ChatGPT in healthcare: An analysis of multiple clinical and research scenarios. J. Med. Syst. 2023, 47, 33.
  4. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
  5. Harrer, S. Attention is not all you need: The complicated case of ethically using large language models in healthcare and medicine. eBioMedicine 2023, 90, 104512.
  6. Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Yin, B.; Hu, X. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. arXiv 2023, arXiv:2304.13712.
  7. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S. Palm: Scaling language modeling with pathways. arXiv 2022, arXiv:2204.02311.
  8. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258.
  9. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S. Scaling instruction-finetuned language models. arXiv 2022, arXiv:2210.11416.
  10. Wang, B.; Xie, Q.; Pei, J.; Chen, Z.; Tiwari, P.; Li, Z.; Fu, J. Pre-trained language models in biomedical domain: A systematic survey. ACM Comput. Surv. 2021, 56, 1–52.
  11. Tang, R.; Han, X.; Jiang, X.; Hu, X. Does synthetic data generation of LLMs help clinical text mining? arXiv 2023, arXiv:2303.04360.
  12. Umapathi, L.K.; Pal, A.; Sankarasubbu, M. Med-HALT: Medical domain hallucination test for large language models. arXiv 2023, arXiv:2307.15343.
  13. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
  14. Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 2023, 9, e45312.
  15. Casper, S.; Davies, X.; Shi, C.; Gilbert, T.K.; Scheurer, J.; Rando, J.; Freedman, R.; Korbak, T.; Lindner, D.; Freire, P. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv 2023, arXiv:2307.15217.
  16. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D. Towards expert-level medical question answering with large language models. arXiv 2023, arXiv:2305.09617.
  17. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35.
  18. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  19. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  20. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  21. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168.
  22. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171.
  23. Bommasani, R.; Liang, P.; Lee, T. Language Models are Changing AI: The Need for Holistic Evaluation. Available online: https://crfm.stanford.edu/2022/11/17/helm.html (accessed on 30 June 2023).
  24. Li, J.; Dada, A.; Kleesiek, J.; Egger, J. ChatGPT in healthcare: A taxonomy and systematic review. medRxiv 2023.
  25. Arora, A.; Arora, A. The promise of large language models in health care. Lancet 2023, 401, 641.
  26. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv 2023, arXiv:2303.16634.
  27. Moy, A.J.; Schwartz, J.M.; Chen, R.; Sadri, S.; Lucas, E.; Cato, K.D.; Rossetti, S.C. Measurement of clinical documentation burden among physicians and nurses using electronic health records: A scoping review. J. Am. Med. Inform. Assoc. 2021, 28, 998–1008.
  28. Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, D.J.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing physician and artificial intelligence Chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 2023, 183, 589–596.
  29. Chowdhury, M.; Lim, E.; Higham, A.; McKinnon, R.; Ventoura, N.; He, Y.; De Pennington, N. Can Large Language Models Safely Address Patient Questions Following Cataract Surgery; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 131–137.
  30. Ahmad, W.; Ali, H.; Shah, Z.; Azmat, S. A new generative adversarial network for medical images super resolution. Sci. Rep. 2022, 12, 9533.
  31. Rao, A.; Pang, M.; Kim, J.; Kamineni, M.; Lie, W.; Prasad, A.K.; Landman, A.; Dreyer, K.J.; Succi, M.D. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv 2023.
  32. Bumgardner, V.; Mullen, A.; Armstrong, S.; Hickey, C.; Talbert, J. Local large language models for complex structured medical tasks. arXiv 2023, arXiv:2308.01727.
  33. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198.
  34. Lahat, A.; Shachar, E.; Avidan, B.; Shatz, Z.; Glicksberg, B.S.; Klang, E. Evaluating the use of large language model in identifying top research questions in gastroenterology. Sci. Rep. 2023, 13, 4164.
  35. Rao, A.; Kim, J.; Kamineni, M.; Pang, M.; Lie, W.; Succi, M.D. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv 2023.
  36. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
  37. Larsen, B.; Narayan, J. Generative AI: A Game-Changer That Society and Industry Need to Be Ready for. 2023. Available online: https://www.weforum.org/agenda/2023/01/davos23-generative-ai-a-game-changer-industries-and-society-code-developers/ (accessed on 30 June 2023).
  38. Heikkilä, M. Inside a Radical New Project to Democratize AI. In MIT Technology Review; MIT: Cambridge, MA, USA, 2022; Available online: https://www.technologyreview.com/2022/07/12/1055817/inside-a-radical-new-project-to-democratize-ai/ (accessed on 30 June 2023).
  39. Finlayson, S.G.; Bowers, J.D.; Ito, J.; Zittrain, J.L.; Beam, A.L.; Kohane, I.S. Adversarial attacks on medical machine learning. Science 2019, 363, 1287–1289.
  40. Sorin, V.; Klang, E.; Sklair-Levy, M.; Cohen, I.; Zippel, D.B.; Balint Lahat, N.; Konen, E.; Barash, Y. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer 2023, 9, 44.