Language Models as Automated Scenario Generators for Medical Education

Paul Chung¹, Michael Lee¹, Matthew Giangola¹, Andrew Y. Boodoo², Simona Doboli²
¹Department of Surgery, Zucker School of Medicine at Hofstra/Northwell
²Computer Science Department, Hofstra University, NY
ABSTRACT
• This work investigates language models as generators of novel scenarios for assessing medical students' surgical judgment skills.
• Using 1.16 million real-life trauma surgical encounters from the National Trauma Data Bank (NTDB), 2007-2013, we trained a GPT-2 (Generative Pre-trained Transformer) model.
• The trained model automatically generates plausible new scenarios.
PROBLEM
• Automate the training of surgical judgment skills for medical students.
• Currently this is done with manually written scenarios, a process that is both expensive and prone to author bias.
• We propose using NLP machine-learning tools to train a language model on the large National Trauma Data Bank (NTDB) dataset.
• The trained model is then used to generate new, plausible scenarios.
METHODS
• GPT (Generative Pre-trained Transformer) models are auto-regressive language models [1,2] based on the decoder module of the transformer architecture [3].
• GPT models are pre-trained on a very large web corpus [1,2].
• The model used here is the smallest GPT-2 model, with 12 layers, 12 masked self-attention heads, and 118 million parameters.
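For orientation, a minimal sketch of loading that smallest GPT-2 checkpoint, assuming the Hugging Face transformers library (the poster does not name its tooling):

```python
# Minimal sketch: load the smallest GPT-2 checkpoint and inspect its
# depth and head count. Assumes Hugging Face `transformers`.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(model.config.n_layer, model.config.n_head)  # 12 layers, 12 heads
print(f"{n_params / 1e6:.0f}M parameters")
```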
IMPLEMENTATION
• The stock GPT vocabulary uses byte-level byte-pair encoding (BPE), which encodes sub-word chunks.
• Format of an NTDB scenario: <START> E-code <DSTART> diagnosis ICD-9 codes <PSTART> procedure ICD-9 codes <END>
• Sample NTDB scenario: <START> E965.0 <DSTART> 864.10 863.1 862.1 <PSTART> 38.93 34.04 34.82 44.61 96.71 96.07 96.04 50.61 96.07 57.94 <END>
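A minimal sketch of this serialization step; the helper name and argument layout are illustrative, not the actual NTDB schema:

```python
# Minimal sketch: serialize one NTDB encounter into the scenario format above.
def encode_scenario(ecode, dcodes, pcodes):
    """Return one scenario string: <START> E-code <DSTART> ... <PSTART> ... <END>."""
    parts = ["<START>", ecode, "<DSTART>", *dcodes, "<PSTART>", *pcodes, "<END>"]
    return " ".join(parts)

print(encode_scenario("E965.0", ["864.10", "863.1", "862.1"],
                      ["38.93", "34.04", "34.82"]))
# <START> E965.0 <DSTART> 864.10 863.1 862.1 <PSTART> 38.93 34.04 34.82 <END>
```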
Training Procedure:
• Replaced the BPE vocabulary with ICD-9 codes (vocabulary size = 11,606)
• Trained two GPT models:
• Word-FT: fine-tuned from the pretrained model
• Word-T: trained from randomly initialized weights
• 90/10 training/validation split
• Training hyperparameters: see Table 1 (a fine-tuning sketch follows the table)
Model      Training Epochs   Learn Rate   Optimizer   LR Scheduler
Word-T     10                4e-5         AdamW       Linear
Word-FT    5                 5e-5         AdamW       Linear

Table 1: Training parameters
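The fine-tuning sketch referenced above, assuming PyTorch and Hugging Face transformers. The vocabulary swap is shown via resize_token_embeddings; STEPS_PER_EPOCH is a placeholder (the poster does not report batch size), and rebuilding the tokenizer itself is omitted:

```python
# Hedged sketch of the Word-FT row of Table 1: load pretrained GPT-2,
# resize the embeddings to the ICD-9 vocabulary, and set up AdamW with
# a linear LR schedule.
import torch
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

VOCAB_SIZE = 11_606        # ICD-9 codes plus the scenario marker tokens
EPOCHS, LR = 5, 5e-5       # Word-FT (Word-T: 10 epochs, 4e-5, random init)
STEPS_PER_EPOCH = 1_000    # placeholder; depends on corpus and batch size

model = GPT2LMHeadModel.from_pretrained("gpt2")  # Word-T would start from scratch
model.resize_token_embeddings(VOCAB_SIZE)        # swap BPE vocabulary for ICD-9 codes

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=EPOCHS * STEPS_PER_EPOCH
)
```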
Generation Procedure:
• top-p (nucleus) sampling: the next code is drawn from the smallest set of codes whose summed probability reaches p
• Starting sequence for a generated scenario: <START> E-code <DSTART>
• Generation settings: see Table 2 (a generation sketch follows the table)
Model        p     Max Length   Repetition Penalty   Length Penalty
Word-T/FT    0.9   20           3.0                  0.5

Table 2: Top-p generation parameters
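The generation sketch referenced above, assuming the Hugging Face generate API. The prompt ids and pad id are placeholders standing in for the rebuilt ICD-9 vocabulary, and the stock checkpoint stands in for the trained Word-T/Word-FT models:

```python
# Hedged sketch of top-p generation with the Table 2 settings.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for Word-T/Word-FT
model.eval()

prompt = torch.tensor([[0, 42, 1]])  # hypothetical ids for "<START> E965.0 <DSTART>"
out = model.generate(
    prompt,
    do_sample=True,
    top_p=0.9,               # nucleus sampling threshold (Table 2)
    max_length=20,           # Table 2
    repetition_penalty=3.0,  # Table 2
    length_penalty=0.5,      # Table 2; in this API it only affects beam search
    pad_token_id=0,          # placeholder pad id
)
print(out[0].tolist())
```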
Evaluation Procedures:
1. Perplexity over the validation data
2. Bigram statistics:
• Generated the same number of sequences as in the training data
• Computed conditional bigram statistics P(c_i | c_{i-1}) for the training and the generated data
• Pearson correlation between training and generated bigram statistics (a sketch follows this list)
3. Clinician evaluation:
• 100 scenarios generated by the Word-FT model were rated for plausibility by two clinicians
• Starting E-code: E965.0 (assault by handgun), unspecified injury of the liver
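A minimal sketch of the bigram evaluation (step 2), assuming SciPy for the Pearson correlation; the two corpora below are toy data, not the NTDB:

```python
# Estimate conditional bigram probabilities P(c_i | c_{i-1}) per corpus,
# then correlate the probabilities of the bigrams the corpora share.
from collections import Counter
from scipy.stats import pearsonr

def bigram_probs(sequences):
    """Map each bigram (prev, cur) to the estimate P(cur | prev)."""
    pair_counts, prev_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {bg: n / prev_counts[bg[0]] for bg, n in pair_counts.items()}

train_p = bigram_probs([["E965.0", "864.10", "38.93"], ["E965.0", "38.93", "864.10"]])
gen_p = bigram_probs([["E965.0", "864.10", "38.93"],
                      ["E965.0", "864.10", "38.93"],
                      ["E965.0", "38.93", "864.10"]])

shared = sorted(set(train_p) & set(gen_p))  # the "intersection bigrams" of Table 4
r, _ = pearsonr([train_p[b] for b in shared], [gen_p[b] for b in shared])
print(len(shared), round(r, 2))
```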
RESULTS

Model      Train Loss   Valid Loss   Perplexity
Word-T     2.679        2.815        16.71
Word-FT    2.929        2.912        18.40

Table 3: Training results
Sample Scenario Generated by Word-FT:
<START> E965.0 <DSTART> 864.10 862.1 861.32 861.13 860.3 <PSTART> <UNK> <END>

E-code: E965.0 = Assault by handgun
D-codes:
864.10 = Injury to liver, with open wound into cavity
862.1 = Injury to diaphragm, with open wound into cavity
861.32 = Laceration of lung with open wound into thorax
861.13 = Laceration of heart with penetration of heart chambers with open wound into thorax
860.3 = Traumatic hemothorax with open wound into thorax
Figure 1: Bigram statistics (intersection bigrams), training vs. generated data: Word-T (r = 0.89), Word-FT (r = 0.92)

Bigrams    Total # bigrams   Intersection of generated & training bigrams
Training   573,845           n/a
Word-T     190,098           142,227
Word-FT    189,479           149,227

Table 4: Bigrams in generated vs. training data
Clinician Evaluation Results:
• Clinician 1: 84/100 scenarios rated plausible
• Clinician 2: 65/100 scenarios rated plausible
• Cohen's kappa = 0.11
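For reference, a hedged sketch of the agreement statistic, assuming scikit-learn. The two label vectors are placeholders: the per-scenario ratings and their alignment are not on the poster, so this will not reproduce kappa = 0.11:

```python
# Cohen's kappa between two raters over 100 binary plausibility labels.
from sklearn.metrics import cohen_kappa_score

clinician_1 = [1] * 84 + [0] * 16   # 84/100 rated plausible (illustrative order)
clinician_2 = [1] * 65 + [0] * 35   # 65/100 rated plausible (illustrative order)
print(cohen_kappa_score(clinician_1, clinician_2))
```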
CONCLUSIONS
• Developed an automated scenario generator based on the GPT-2 model
• Transfer learning remains possible after replacing the model's vocabulary with ICD-9 codes
• Bigram statistics show closer agreement with the training data for the fine-tuned (Word-FT) model
• Clinician plausibility ratings showed only slight inter-rater agreement (Cohen's kappa = 0.11)
REFERENCES
[1] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
ACKNOWLEDGEMENT
This project was funded by the NBME Stemmler Grant.