Language Models as Automated Scenario Generators for Medical Education

Paul Chung¹, Michael Lee¹, Matthew Giangola¹, Andrew Y. Boodoo², Simona Doboli²
¹Department of Surgery, Zucker School of Medicine at Hofstra/Northwell
²Computer Science Department, Hofstra University, NY
ABSTRACT
• This work investigates language models as generators of novel scenarios for assessing medical students' surgical judgment skills.
• Using 1.16 million real-life trauma surgical encounters from the National Trauma Data Bank (NTDB), 2007-2013, we trained a GPT-2 (Generative Pre-trained Transformer) model.
• The trained model automatically generates plausible new scenarios.
PROBLEM
• Automate the training of surgical judgment skills for medical students.
• Currently this is done with manually written scenarios, a process that is both expensive and prone to author bias.
• We propose using NLP machine-learning tools to train a language model on the large National Trauma Data Bank (NTDB) dataset.
• The trained model is then used to generate new, plausible scenarios.
METHODS
• GPT (Generative Pre-trained Transformer) models are auto-regressive language models [1,2] based on the decoder module of the transformer architecture [3].
• GPT models are pre-trained on a very large web corpus [1,2].
• The model used here is the smallest GPT-2 model, with 12 layers, 12 masked self-attention heads, and 118 million parameters.
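For orientation, a minimal sketch of loading that smallest GPT-2 checkpoint, assuming the Hugging Face transformers library (the poster does not name its tooling):

```python
# Minimal sketch: load the smallest GPT-2 checkpoint and inspect its
# depth and head count. Assumes Hugging Face `transformers`.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(model.config.n_layer, model.config.n_head)  # 12 layers, 12 heads
print(f"{n_params / 1e6:.0f}M parameters")
```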
IMPLEMENTATION
• The stock GPT vocabulary uses byte-level byte-pair encoding (BPE), which encodes sub-word chunks.
• Format of an NTDB scenario: <START> E-code <DSTART> diagnosis ICD-9 codes <PSTART> procedure ICD-9 codes <END>
• Sample NTDB scenario: <START> E965.0 <DSTART> 864.10 863.1 862.1 <PSTART> 38.93 34.04 34.82 44.61 96.71 96.07 96.04 50.61 96.07 57.94 <END>
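A minimal sketch of this serialization step; the helper name and argument layout are illustrative, not the actual NTDB schema:

```python
# Minimal sketch: serialize one NTDB encounter into the scenario format above.
def encode_scenario(ecode, dcodes, pcodes):
    """Return one scenario string: <START> E-code <DSTART> ... <PSTART> ... <END>."""
    parts = ["<START>", ecode, "<DSTART>", *dcodes, "<PSTART>", *pcodes, "<END>"]
    return " ".join(parts)

print(encode_scenario("E965.0", ["864.10", "863.1", "862.1"],
                      ["38.93", "34.04", "34.82"]))
# <START> E965.0 <DSTART> 864.10 863.1 862.1 <PSTART> 38.93 34.04 34.82 <END>
```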
Training Procedure:
• Replaced the BPE vocabulary with ICD-9 codes (vocabulary size = 11,606)
• Trained two GPT models:
• Word-FT: fine-tuned from the pretrained model
• Word-T: trained from randomly initialized weights
• 90/10 training/validation split
• Training hyperparameters: see Table 1 (a fine-tuning sketch follows the table)
Model      Training Epochs   Learn Rate   Optimizer   LR Scheduler
Word-T     10                4e-5         AdamW       Linear
Word-FT    5                 5e-5         AdamW       Linear

Table 1: Training parameters
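The fine-tuning sketch referenced above, assuming PyTorch and Hugging Face transformers. The vocabulary swap is shown via resize_token_embeddings; STEPS_PER_EPOCH is a placeholder (the poster does not report batch size), and rebuilding the tokenizer itself is omitted:

```python
# Hedged sketch of the Word-FT row of Table 1: load pretrained GPT-2,
# resize the embeddings to the ICD-9 vocabulary, and set up AdamW with
# a linear LR schedule.
import torch
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

VOCAB_SIZE = 11_606        # ICD-9 codes plus the scenario marker tokens
EPOCHS, LR = 5, 5e-5       # Word-FT (Word-T: 10 epochs, 4e-5, random init)
STEPS_PER_EPOCH = 1_000    # placeholder; depends on corpus and batch size

model = GPT2LMHeadModel.from_pretrained("gpt2")  # Word-T would start from scratch
model.resize_token_embeddings(VOCAB_SIZE)        # swap BPE vocabulary for ICD-9 codes

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=EPOCHS * STEPS_PER_EPOCH
)
```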
Generation Procedure:
• top-p (nucleus) sampling: the next code is drawn from the smallest set of codes whose summed probability reaches p
• Starting sequence for a generated scenario: <START> E-code <DSTART>
• Generation settings: see Table 2 (a generation sketch follows the table)
Model        p     Max Length   Repetition Penalty   Length Penalty
Word-T/FT    0.9   20           3.0                  0.5

Table 2: Top-p generation parameters
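The generation sketch referenced above, assuming the Hugging Face generate API. The prompt ids and pad id are placeholders standing in for the rebuilt ICD-9 vocabulary, and the stock checkpoint stands in for the trained Word-T/Word-FT models:

```python
# Hedged sketch of top-p generation with the Table 2 settings.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for Word-T/Word-FT
model.eval()

prompt = torch.tensor([[0, 42, 1]])  # hypothetical ids for "<START> E965.0 <DSTART>"
out = model.generate(
    prompt,
    do_sample=True,
    top_p=0.9,               # nucleus sampling threshold (Table 2)
    max_length=20,           # Table 2
    repetition_penalty=3.0,  # Table 2
    length_penalty=0.5,      # Table 2; in this API it only affects beam search
    pad_token_id=0,          # placeholder pad id
)
print(out[0].tolist())
```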
Evaluation Procedures:
1. Perplexity over the validation data
2. Bigram statistics:
• Generated the same number of sequences as in the training data
• Computed conditional bigram statistics P(c_i | c_{i-1}) for the training and the generated data
• Pearson correlation between training and generated bigram statistics (a sketch follows this list)
3. Clinician evaluation:
• 100 scenarios generated by the Word-FT model were rated for plausibility by two clinicians
• Starting E-code: E965.0 (assault by handgun), unspecified injury of the liver
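A minimal sketch of the bigram evaluation (step 2), assuming SciPy for the Pearson correlation; the two corpora below are toy data, not the NTDB:

```python
# Estimate conditional bigram probabilities P(c_i | c_{i-1}) per corpus,
# then correlate the probabilities of the bigrams the corpora share.
from collections import Counter
from scipy.stats import pearsonr

def bigram_probs(sequences):
    """Map each bigram (prev, cur) to the estimate P(cur | prev)."""
    pair_counts, prev_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {bg: n / prev_counts[bg[0]] for bg, n in pair_counts.items()}

train_p = bigram_probs([["E965.0", "864.10", "38.93"], ["E965.0", "38.93", "864.10"]])
gen_p = bigram_probs([["E965.0", "864.10", "38.93"],
                      ["E965.0", "864.10", "38.93"],
                      ["E965.0", "38.93", "864.10"]])

shared = sorted(set(train_p) & set(gen_p))  # the "intersection bigrams" of Table 4
r, _ = pearsonr([train_p[b] for b in shared], [gen_p[b] for b in shared])
print(len(shared), round(r, 2))
```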
RESULTS

Model      Train Loss   Valid Loss   Perplexity
Word-T     2.679        2.815        16.71
Word-FT    2.929        2.912        18.40

Table 3: Training results
Sample Scenario Generated by Word-FT:
<START> E965.0 <DSTART> 864.10 862.1 861.32 861.13 860.3 <PSTART> <UNK> <END>

E-code: E965.0 = Assault by handgun
D-codes:
864.10 = Injury to liver, with open wound into cavity
862.1 = Injury to diaphragm, with open wound into cavity
861.32 = Laceration of lung with open wound into thorax
861.13 = Laceration of heart with penetration of heart chambers with open wound into thorax
860.3 = Traumatic hemothorax with open wound into thorax
Figure 1: Bigram statistics (intersection bigrams), training vs. generated data: Word-T (r = 0.89), Word-FT (r = 0.92)

Bigrams    Total # bigrams   Intersection of generated & training bigrams
Training   573,845           n/a
Word-T     190,098           142,227
Word-FT    189,479           149,227

Table 4: Bigrams in generated vs. training data
Clinician Evaluation Results:
• Clinician 1: 84/100 scenarios rated plausible
• Clinician 2: 65/100 scenarios rated plausible
• Cohen's kappa = 0.11
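For reference, a hedged sketch of the agreement statistic, assuming scikit-learn. The two label vectors are placeholders: the per-scenario ratings and their alignment are not on the poster, so this will not reproduce kappa = 0.11:

```python
# Cohen's kappa between two raters over 100 binary plausibility labels.
from sklearn.metrics import cohen_kappa_score

clinician_1 = [1] * 84 + [0] * 16   # 84/100 rated plausible (illustrative order)
clinician_2 = [1] * 65 + [0] * 35   # 65/100 rated plausible (illustrative order)
print(cohen_kappa_score(clinician_1, clinician_2))
```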
CONCLUSIONS
• Developed an automated scenario generator based on the GPT-2 model
• Transfer learning remains possible after replacing the model's vocabulary with ICD-9 codes
• Bigram statistics show closer agreement with the training data for the fine-tuned (Word-FT) model
• Clinician plausibility ratings showed only slight inter-rater agreement (Cohen's kappa = 0.11)
REFERENCES
[1] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
ACKNOWLEDGEMENT
This project was funded by the NBME Stemmler Grant.