Cynthia Zhao - Student Research and Creativity Forum - Hofstra University


Using Generative Large Language Models for Assertion Detection in Medical Notes

MS Computer Science Thesis Project

The unstructured nature of electronic medical records and clinical notes makes them difficult for Natural Language Processing techniques to analyze automatically, yet it is important for healthcare professionals to accurately and efficiently understand clinical notes and determine the presence, absence, or possibility of certain medical problems that may exist in their patients. To address this issue, the task of categorizing different assertions (i.e., statements expressing facts) can be employed. This is also known as clinical assertion detection, and it has been explored using non-generative models to identify assertion classes of medical entities in unstructured clinical notes. Non-generative models do not generate text but are good at sorting data and making predictions based on patterns. This project differs in that we employ a generative LLM for assertion detection. Generative models have the ability to generate text, enabling them to produce explanations alongside the assertion classification and potentially offering more support for medical professionals. There are many applications for this in the healthcare realm. Primarily, it is crucial for healthcare professionals to efficiently comprehend the information contained in clinical notes. Assertion detection can also support the automation of clinical decision support, medical risk evaluation, and much more.

Our primary aim is to utilize a clinical note along with a medical entity and its context, as provided by the 2010 i2b2 Assertion Task dataset, to enable the model to determine whether the entity is present, absent, or possible in a patient. The generative LLM that we used is called Mistral, and it has 7 billion parameters. Through a process called fine-tuning, we can adjust or "fine-tune" these parameters so that the model is better suited to our specific task of clinical assertion detection. Using the i2b2 dataset and the comprehensive training dataset that we generated through an iterative process, we observed tremendous progress in the predictions generated after fine-tuning, improving from a pre-fine-tuned micro F1-score of 0.50 to a post-fine-tuned micro F1-score of 0.88 and underscoring the capability of our fine-tuned model to learn essential patterns and excel in assertion detection.

This project also demonstrates the importance of prompting formats and their impact on the model's generated responses. We observed that one-shot prompting, or prompting that provides an example of the desired response, improved the model's predictions by a considerable amount over zero-shot prompting, or prompting without any examples, even without any additional fine-tuning. Combining one-shot prompting with other methods like multiple choice prompting and chain-of-thought prompting enabled us to develop a prompting format that enhanced response generation, surpassing the performance of a basic zero-shot prompt.

While our results do not outperform those of the state-of-the-art BERT model, they are promising and show the potential of applying generative models in the healthcare field. Areas of further research include exploring medical question-and-answer prompt engineering to generate optimal prompts, fine-tuning Mistral with other clinical datasets, and fine-tuning other generative LLMs on clinical notes.

Using Large Language Models for Assertion Detection in Medical Notes

Cynthia Zhao

Dr. Simona Doboli

Abstract

It is important for healthcare professionals to accurately and efficiently understand clinical notes and determine the presence, absence, or possibility of certain medical problems that may exist in their patients. To address this issue, the task of categorizing different assertions (statements expressing facts) can be employed. This is also known as clinical assertion detection, and it has been explored using non-generative transformer models to identify assertion classes of medical entities in unstructured clinical notes. The main objective of this project is to test the use of generative large language models (LLMs) on the problem of clinical assertion detection and to compare the performance of generative versus non-generative models. While the generative model's outcome may not surpass that of the non-generative model, it nonetheless demonstrates promising potential for application within the healthcare field.

Intro/Background

Clinical assertion detection is known to be a challenging task due to the unstructured nature and free-text format of clinical notes (Chen, 2019). Understanding the valuable information hidden in these notes can support various applications such as clinical decision support, disease modeling, medical risk evaluation, medication reconciliation, and quality measurements (Chen, 2019). Previous methods have used non-generative transformer models such as BERT (van Aken et al., 2021) to label assertions of medical terms in text.

Methods

We used the pretrained model Mistral-7B (Jiang et al., 2023) and first tested different prompts to see which prompt type would elicit the desired response. We then fine-tuned Mistral using LoRA (Hu et al., 2021) and evaluated its performance on an unseen i2b2 testing dataset.
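The poster does not reproduce the training code; the following is a minimal sketch, assuming the Hugging Face transformers and peft libraries, of how a LoRA adapter can be attached to Mistral-7B. The checkpoint name, target modules, and LoRA hyperparameters here are illustrative assumptions rather than the project's actual configuration.

# Minimal LoRA setup for Mistral-7B (sketch; hyperparameters are assumed, not the project's).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains small low-rank adapter matrices instead of all 7B weights,
# which is what makes fine-tuning the model on modest hardware feasible.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the adapter weights are trainable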

Dataset Description

The 2010 i2b2 Assertion Task dataset provides a corpus of assertions, concepts, and relations for clinical discharge summaries (Uzuner et al., 2011). We use only the present, absent, and possible assertion categories.

Format:

c="concept text" offset||t="concept type"||a="assertion value"

Example:

c="diabetes" 2:14 2:14||t="problem"||a="absent"

Figure 1: i2b2 Clinical Assertion Annotated File Format and Examples
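To illustrate how these annotation lines can be read programmatically, here is a small parsing sketch in Python; the regular expression and function name are our own illustration, not part of the i2b2 distribution or the project's code.

import re

# Parses one i2b2 assertion annotation line of the form shown in Figure 1:
# c="concept text" offset||t="concept type"||a="assertion value"
LINE_RE = re.compile(
    r'c="(?P<concept>.*?)"\s+(?P<offset>\S+\s+\S+)'
    r'\|\|t="(?P<type>.*?)"\|\|a="(?P<assertion>.*?)"'
)

def parse_assertion_line(line: str) -> dict:
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized annotation line: {line!r}")
    return match.groupdict()

print(parse_assertion_line('c="diabetes" 2:14 2:14||t="problem"||a="absent"'))
# {'concept': 'diabetes', 'offset': '2:14 2:14', 'type': 'problem', 'assertion': 'absent'}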

Fine-tuning Methods

Table 3: Number of Each Case for i2b2 Training and Testing Dataset

                  Training Data    Testing Data
Present Cases     4622             8622
Absent Cases      1596             2594
Possible Cases    309              652
Total Cases       6527             11,868

The i2b2 dataset lacks answers containing both assertions and justifications, so we needed to create the desired response format necessary for fine-tuning our model. We adopted an iterative strategy that involved generating answers with the pre-fine-tuned model on the i2b2 training dataset and using the correctly generated responses to incrementally improve the model's performance on the dataset. This iterative process continued until we compiled a comprehensive training dataset containing accurately formatted responses. Once this training dataset was established, we could then apply the model to make predictions on the provided i2b2 testing dataset.
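In outline, that loop looks like the sketch below; the generate_answer, extract_assertion, and finetune callables are hypothetical stand-ins for the project's actual generation, answer-parsing, and LoRA training steps, passed in as parameters so the sketch stays self-contained.

# Sketch of the iterative strategy: keep only responses whose predicted assertion
# matches the gold i2b2 label, then fine-tune on the accumulated correct responses.
def build_training_data(model, examples, generate_answer, extract_assertion, finetune, num_rounds=3):
    training_data = []
    for _ in range(num_rounds):
        for ex in examples:
            response = generate_answer(model, ex["prompt"])  # model's full answer with justification
            predicted = extract_assertion(response)          # "present" / "absent" / "possible"
            if predicted == ex["gold_assertion"]:
                # Correctly labeled and correctly formatted -> usable for fine-tuning.
                training_data.append({"prompt": ex["prompt"], "response": response})
        model = finetune(model, training_data)               # fine-tune on the growing dataset
    return model, training_data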

Prompt Engineering

We found that the generated answers depended greatly on which prompt format was used. We used the prompting styles in Table 4 to create our prompt format, shown in Figure 3. The pre-fine-tuned model generated the best responses when using one-shot prompting (providing the prompt with an example of each case: present, absent, and possible). We used zero-shot prompting for our fine-tuned model.

Table 4: Prompting Styles and Definitions

Multiple Choice Prompting (MCP): Provide multiple choices, with each answer associated with "A", "B", "C", etc. (Robinson et al., 2023).

Chain-of-Thought Prompting (CoT): Encourage the model to generate solutions with step-by-step reasoning by adding phrases like "Let's think step by step" before the answer (Kojima et al., 2022).

One-shot Prompting: Ask the model a question and provide an example (or "shot") of question-explanation-answer triplets in the prompt (Liévin et al., 2023).

Zero-shot Prompting: Ask the model a question without providing any examples.

Figure 3: Our Desired Prompt and Model Generated Response Format

<s> [INST]
Patient Note: "lungs : breath sounds clear and equal bilaterally , no retractions"

Considering the information in the patient note, which of the following is the most suitable classification for determining the patient's association with retractions?

A. Present  B. Absent  C. Possible

Let's derive the differential diagnosis step by step.

Answer: [/INST]

Based on the patient note, the most suitable classification for determining the patient's association with retractions would be "Absent". The patient's breath sounds are clear and equal bilaterally, and there are no retractions present. Therefore, among A through C, the answer is B. Absent. </s>
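The prompt in Figure 3 combines the multiple choice options, a chain-of-thought cue, and Mistral's [INST] instruction delimiters. A minimal sketch of assembling that string for the zero-shot case follows; the wording mirrors Figure 3, but the function itself is our illustration rather than the project's exact code.

def build_prompt(note: str, entity: str) -> str:
    """Zero-shot MCP + CoT prompt in Mistral's [INST] format (illustrative template)."""
    return (
        "<s> [INST]\n"
        f'Patient Note: "{note}"\n\n'
        "Considering the information in the patient note, which of the following is the most "
        f"suitable classification for determining the patient's association with {entity}?\n\n"
        "A. Present  B. Absent  C. Possible\n\n"
        "Let's derive the differential diagnosis step by step.\n\n"
        "Answer: [/INST]"
    )

print(build_prompt("lungs : breath sounds clear and equal bilaterally , no retractions", "retractions"))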

Results

Training and Validation Loss over Epochs

We trained for 6 epochs with a learning rate of 2e-5. Our comprehensive training dataset was generated using the checkpoint from epoch 3 because it had the lowest validation loss (a measure of how well the model generalizes to unseen data).
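Expressed with the Hugging Face Trainer API, that configuration could look roughly like the sketch below. Only the epoch count and learning rate come from the poster; the remaining fields (output directory, per-epoch evaluation, best-checkpoint selection) are assumptions about a typical setup, and older transformers versions spell the evaluation argument evaluation_strategy.

from transformers import TrainingArguments

# Sketch of the training configuration; epochs and learning rate match the poster,
# everything else is an assumed, typical setup.
training_args = TrainingArguments(
    output_dir="mistral-assertion-lora",  # assumed checkpoint directory
    num_train_epochs=6,
    learning_rate=2e-5,
    eval_strategy="epoch",                # evaluate once per epoch
    save_strategy="epoch",                # keep one checkpoint per epoch
    load_best_model_at_end=True,          # reload the lowest-validation-loss checkpoint (epoch 3 in our run)
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)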

Precision: the accuracy of positive predictions. Recall (sensitivity): the model's ability to identify all relevant instances.

F1-score (micro): a single value that balances the precision and recall metrics. The micro-average gives equal weight to each prediction.
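These metrics can be computed with scikit-learn once the gold and predicted assertion labels have been extracted from the model's responses; the label lists below are placeholders, not project data.

from sklearn.metrics import classification_report, f1_score

labels = ["present", "absent", "possible"]
y_true = ["present", "absent", "possible", "present"]  # placeholder gold labels
y_pred = ["present", "absent", "present", "present"]   # placeholder model predictions

# Per-class precision/recall plus the micro-averaged F1 reported below.
print(classification_report(y_true, y_pred, labels=labels))
print("micro F1:", f1_score(y_true, y_pred, labels=labels, average="micro"))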

Mistral Post-finetuned Results on i2b2 Training Data

F1-score (micro): 0.89
Present (4622 cases): precision: 0.97, recall: 0.87
Absent (1596 cases): precision: 0.93, recall: 0.94
Possible (309 cases): precision: 0.33, recall: 0.86

Mistral Pre-finetuned Results on i2b2 Testing Data (one-shot prompt)

F1-score (micro): 0.50
Present (8622 cases): precision: 0.97, recall: 0.37
Absent (2594 cases): precision: 0.95, recall: 0.82
Possible (652 cases): precision: 0.10, recall: 0.91

Mistral Post-finetuned Results on i2b2 Testing Data

F1-score (micro): 0.88
Present (8622 cases): precision: 0.98, recall: 0.87
Absent (2594 cases): precision: 0.92, recall: 0.94
Possible (652 cases): precision: 0.35, recall: 0.83

BioClinicalBERT Results on i2b2 Testing Data

F1-score (micro): 0.97

Present (8622 cases): precision: 0.98, recall: 0.98

Absent (2594 cases): precision: 0.97, recall: 0.98

Possible (652 cases): precision: 0.84, recall: 0.73

Conclusions

Although there were tremendous improvements in the predictions generated after fine-tuning, the model continues to struggle with accuracy in identifying Possible cases. It mistakenly classifies Present and Absent cases as Possible (low precision); however, it rarely mistakes actual Possible cases for Present or Absent (high recall). Our results do not outperform those of the BERT model, primarily because of the increased complexity involved in generating justifications for the predicted assertions. Unlike our approach, the BERT model can utilize the entire provided i2b2 training dataset, giving it a large set of data points for fine-tuning; we were restricted to using only correct model-generated responses for fine-tuning, resulting in fewer data points. Despite these challenges, our results are promising and show the potential of generative models in the healthcare field.

Future Work

Further research into medical question-and-answer prompt engineering and automatic prompt generation to find the optimal prompt format could result in better performance. Additionally, fine-tuning Mistral with other clinical datasets, as well as fine-tuning other generative models on clinical notes, could be explored further. It would be insightful to compare how various datasets influence the performance of Mistral, and to compare the performance of different generative LLMs on the task of clinical assertion detection.

References

Chapman, W. W., Bridewell, W., Hanbury, P., Cooper, G. F., & Buchanan, B. G. (2001). A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5), 301–310. https://doi.org/10.1006/jbin.2001.1029
Chen, L. (2019). Attention-based deep learning system for negation and assertion detection in clinical notes. International Journal of Artificial Intelligence & Applications, 10(01), 1–9. https://doi.org/10.5121/ijaia.2019.10101
Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://doi.org/10.48550/arXiv.2106.09685
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. de las, Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B. https://doi.org/10.48550/arXiv.2310.06825
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems, 35.
Liévin, V., Hother, C. E., Motzfeldt, A. G., & Winther, O. (2023). Can large language models reason about medical questions? Patterns, 5(3). https://doi.org/10.1016/j.patter.2024.100943
Perera, S., Sheth, A., Thirunarayan, K., Nair, S., & Shah, N. (2013). Challenges in understanding clinical notes. Proceedings of the 2013 International Workshop on Data Management & Analytics for Healthcare (DARE '13). https://doi.org/10.1145/2512410.2512427
Qian, Z., Li, P., Zhu, Q., Zhou, G., Luo, Z., & Luo, W. (2016). Speculation and negation scope detection via Convolutional Neural Networks. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/d16-1078
Robinson, J., Rytting, C. M., & Wingate, D. (2023). Leveraging Large Language Models for Multiple Choice Question Answering. https://doi.org/10.48550/arXiv.2210.12353
Sergeeva, E., Zhu, H., Prinsen, P., & Tahmasebi, A. (2019). Negation Scope Detection in Clinical Notes and Scientific Abstracts: A Feature-enriched LSTM-based Approach. AMIA Joint Summits on Translational Science Proceedings, 2019, 212–221.
Uzuner, Ö., South, B. R., Shen, S., & DuVall, S. L. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association (JAMIA), 18(5), 552–556. https://doi.org/10.1136/amiajnl-2011-000203
van Aken, B., Trajanovska, I., Siu, A., Mayrdorfer, M., Budde, K., & Loeser, A. (2021). Assertion detection in clinical notes: Medical Language Models to the rescue? Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations. https://doi.org/10.18653/v1/2021.nlpmc-1.5
Wang, S., Tang, L., Majety, A., Rousseau, J. F., Shih, G., Ding, Y., & Peng, Y. (2022). Trustworthy assertion classification through prompting. Journal of Biomedical Informatics, 132, 104139. https://doi.org/10.1016/j.jbi.2022.104139
Zhong, R., Lee, K., Zhang, Z., & Klein, D. (2021). Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. Findings of the Association for Computational Linguistics: EMNLP 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.244

Acknowledgement

This work could not have been possible without the guidance and support of my advisor, Dr. Simona Doboli.
Table 1: i2b2 Clinical Assertion Definitions and Examples

Present
Definition: Problems associated with the patient can be present. This is the default category for medical problems, and it contains mentions that do not fit the definition of any of the other assertion categories.
Medical entity: "increasing weight gain"
Clinical note context: "the patient has had increasing weight gain"

Absent
Definition: The note asserts that the problem does not exist in the patient. This category also includes mentions where it is stated that the patient HAD a problem, but no longer does.
Medical entity: "pain"
Clinical note context: "patient denies pain"

Possible
Definition: The note asserts that the patient may have a problem, but there is uncertainty expressed in the note. Possible takes precedence over absent, so terms like "probably not" or "unlikely" categorize problems as being possible, just as "probably" and "likely" do.
Medical entity: "asthma exacerbation"
Clinical note context: "This is very likely to be an asthma exacerbation"
Figure 2: Iterative Strategy to Generate Training Dataset
Table 2: Data Frame with Extracted Medical Entity, Assertion, and Context
