Extraction Of Structured Data From Electronic Health Records Using Natural Language Processing


27 Feb 2023

An Electronic Health Record (EHR) is a health record documenting each patient visit, called an encounter, followed by supplemental documentation such as laboratory results, radiology results, and patient handouts. Each visit record contains the patient’s demographics, medical history, family history, vital signs, diagnoses, medications, treatment plans, immunizations, allergies, radiology images, laboratory and test results, and administrative and billing data.

Providers periodically send medical record documents to the insurance company in PDF, TIFF, or TXT format as proof of services provided. The information for each visit is appended, and the visits are sent together as a single file. The EHR provides a complete medical history of the patient across multiple providers. An EHR’s collaborative nature is its main advantage: it is made to be shared among healthcare professionals, enabling patients to take their records with them from one clinic to another (including labs, emergency rooms, and pharmacies). Each EHR typically contains hundreds of pages. A Medicare patient with one chronic disease sees an average of nine to 14 different providers in a given year. Providers capture this data in their EHR and then share it with the payer.

Download case study here – HEDIS Audit Management ICD Engine

An EHR contains valuable information about the patient: demographics, visit date/time, family history, diagnoses, diseases detected and their corresponding ICD-10 codes, drugs prescribed, procedure codes, provider information, and more. However, as noted by the physician, this data is usually in an unstructured, free-text format. EHR systems from different vendors have different formats, making it difficult for healthcare providers to access and share patient data. Interoperability standards like FHIR (Fast Healthcare Interoperability Resources) are becoming popular; however, supporting them requires converting historical patient data stored as free text in traditional EHR systems to FHIR format for sharing. Also, due to the unstructured nature of EHR data and non-standard formats, it is not possible to directly run predictive analytics on the patient data or perform aggregation for patient population analytics. To support such analytics, the text needs to be converted to structured data by interpreting the conversational-style language used by the Provider in the EHR.
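As an illustration of the target structured format, here is a minimal sketch of how a single extracted diagnosis could be represented as a FHIR Condition resource; the ICD-10 code, display text, and patient reference are hypothetical placeholders.

```python
# Minimal sketch: representing one extracted diagnosis as a FHIR Condition
# resource. The code, display text, and patient reference are placeholders.
import json

condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/example-123"},        # hypothetical patient id
    "code": {
        "coding": [{
            "system": "http://hl7.org/fhir/sid/icd-10-cm",   # ICD-10-CM code system
            "code": "E11.9",                                  # extracted ICD-10 code
            "display": "Type 2 diabetes mellitus without complications",
        }],
        "text": "type 2 diabetes",                            # free-text mention from the note
    },
}

print(json.dumps(condition, indent=2))
```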

A typical Progress Note within an EHR looks like this –

Recent Natural Language Processing (NLP) advances, augmented with deep learning and novel Transformer-based architectures, offer new avenues for extracting structured data from unstructured clinical records. This structured data can then be used for descriptive and predictive analytics. The various natural language processing techniques used to extract structured EHR data are described below.

File Pre-processing

The first step in extracting structured data from an EHR is to fully extract the text while preserving its context.

The digital or scanned PDF or TIFF file is converted into page images.

Page images are sharpened and straightened using de-skewing.

Text and associated metadata are extracted from the image using Optical Character Recognition (OCR), image processing, and core NLP. The OCR engine metadata includes blocks, paragraphs, lines, and word bounding boxes.

The image processing metadata includes headers, footers, layouts (single or split paragraphs), tables, figures, and signatures; the NLP metadata contains sentence boundaries.

The OCR text from different blocks on a page is aligned based on coordinates. An intermediate file is created with all the metadata embedded alongside the OCR output.
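A minimal sketch of this pre-processing step in Python, assuming the pdf2image and pytesseract packages (with the Poppler and Tesseract binaries installed), might look like the following; the file names and the intermediate-file structure are simplified placeholders.

```python
# Sketch: convert a PDF to page images, run OCR, and keep word bounding boxes
# in a simple intermediate structure. Assumes pdf2image, pytesseract, and the
# underlying Poppler/Tesseract binaries are available.
import json
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("ehr_record.pdf", dpi=300)   # hypothetical input file

intermediate = []
for page_num, image in enumerate(pages, start=1):
    ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words = [
        {
            "text": ocr["text"][i],
            "bbox": [ocr["left"][i], ocr["top"][i], ocr["width"][i], ocr["height"][i]],
            "block": ocr["block_num"][i],
            "line": ocr["line_num"][i],
        }
        for i in range(len(ocr["text"]))
        if ocr["text"][i].strip()
    ]
    intermediate.append({"page": page_num, "words": words})

with open("intermediate.json", "w") as f:
    json.dump(intermediate, f, indent=2)
```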

Document Layout Detection

The documents can have different layouts based on the EHR systems they came from. Page images are sent through the layout detector (based on image processing), and header, footer, title, font, figure, table, signature, and stamp areas are detected. This data is embedded as metadata in the intermediate file.
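The blog does not name a specific layout detector; one possible sketch uses the layoutparser package with a PubLayNet-trained Detectron2 model (assuming detectron2 is installed). The model choice, labels, and score threshold are illustrative assumptions.

```python
# Sketch: detect layout regions (titles, tables, figures, text blocks) on each
# page image. Assumes layoutparser with a Detectron2 backend and a
# PubLayNet-trained model; labels and thresholds are illustrative.
import layoutparser as lp
from pdf2image import convert_from_path

pages = convert_from_path("ehr_record.pdf", dpi=300)

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
)

for page_num, image in enumerate(pages, start=1):
    layout = model.detect(image)
    for block in layout:
        x1, y1, x2, y2 = block.coordinates
        print(page_num, block.type, round(block.score, 2), (x1, y1, x2, y2))
```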

Document Boundaries and Type Detection

Each EHR file consists of multiple documents like Progress Notes, Prescriptions, Lab Reports, Physician Letters, etc. The Progress Note contains the patient visit information and is the most crucial document. Document boundaries are detected based on page numbers in headers and footers, and also on the start of the following document based on its title. All continuously incrementing page numbers are considered part of a single document. The document type is detected based on a repository of titles matching the doc type.
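A simplified sketch of page-number-based boundary detection is shown below; the footer regex and sample page texts are illustrative.

```python
# Sketch: split an EHR file into documents using "Page X of Y" footers.
# Page texts and the regex are illustrative; real footers vary by EHR system.
import re

PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)

def split_documents(page_texts):
    """Group consecutive pages whose footer page numbers keep incrementing."""
    documents, current = [], []
    previous = None
    for text in page_texts:
        match = PAGE_RE.search(text)
        number = int(match.group(1)) if match else None
        # A reset in the page counter (or a missing footer) signals a new document.
        if current and not (previous is not None and number == previous + 1):
            documents.append(current)
            current = []
        current.append(text)
        previous = number
    if current:
        documents.append(current)
    return documents

pages = ["Progress Note ... Page 1 of 2", "... Page 2 of 2", "Lab Report ... Page 1 of 1"]
print([len(doc) for doc in split_documents(pages)])   # -> [2, 1]
```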

Table Detection and Cell Extraction

Tables are information-rich structured objects. However, OCR engines tend to jumble table data, and the structure needs to be recovered. To add back the structure, table and cell boundaries are detected, and the word boundaries from OCR are matched to them to reproduce the table structure in the intermediate file.
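A minimal sketch of matching OCR word boxes to detected cell boxes by their centre points follows; the coordinates are illustrative.

```python
# Sketch: rebuild table structure by assigning OCR word boxes to detected cell
# boxes based on their centre points. Boxes are (x1, y1, x2, y2); the sample
# coordinates are illustrative.
def assign_words_to_cells(cells, words):
    """Return {cell_index: [word texts]} for words whose centre lies in the cell."""
    table = {i: [] for i in range(len(cells))}
    for word in words:
        x1, y1, x2, y2 = word["bbox"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        for i, (cx1, cy1, cx2, cy2) in enumerate(cells):
            if cx1 <= cx <= cx2 and cy1 <= cy <= cy2:
                table[i].append(word["text"])
                break
    return table

cells = [(0, 0, 100, 40), (100, 0, 200, 40)]           # two cells in one row
words = [{"text": "HbA1c", "bbox": (10, 10, 60, 30)},
         {"text": "7.2%", "bbox": (110, 10, 150, 30)}]
print(assign_words_to_cells(cells, words))              # {0: ['HbA1c'], 1: ['7.2%']}
```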

Progress Note Detection

The Progress Note is the most essential document, containing all the information on the patient’s visit. Progress Note documents from different EHR files look different and have different titles. A repository of these titles, for Progress Notes as well as other documents, is used to match document titles and detect Progress Note documents and their boundaries. The section names on the page also indicate that the document is of type Progress Note. A TF-IDF approach that detects commonly occurring words on a page is used as well to determine whether the page belongs to a particular document type.
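A minimal sketch of the TF-IDF page-classification idea, using scikit-learn with a linear classifier, is shown below; the training pages and labels are illustrative placeholders, not the actual title repository.

```python
# Sketch: TF-IDF features plus a linear classifier to predict the document
# type of a page from its text. Training pages and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_pages = [
    "chief complaint history of present illness assessment and plan",
    "specimen collected reference range result flag units",
    "rx sig dispense refills substitution permitted",
]
train_labels = ["progress_note", "lab_report", "prescription"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(train_pages, train_labels)

new_page = "chief complaint: follow up of hypertension. assessment and plan ..."
print(clf.predict([new_page])[0])   # expected: progress_note
```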

Section Detection

The EHR contains Progress Notes that include sections like Chief Complaint, Discharge Summary, Present Illness, Personal History, Family History, Physical Exam, Laboratory Exams, Radiological Report, Impressions, and Recommendations. These sections have headings, but each EHR system follows its own naming convention and hierarchy for section headings. For example, the chief complaint section may be indicated by the headings “chief complaint,” “presenting complaint(s),” “presenting problem(s),” “reason for encounter,” or even the abbreviation “CC” in different EHRs. A repository mapping section headings to their respective normalized section names is kept, and a common section name is derived using this mapping. A linear-chain Conditional Random Field (CRF) model is used to recognize the boundaries of sections sequentially.
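A minimal sketch of the heading-normalization repository follows (the CRF boundary model is not shown); the mapping entries are a small illustrative subset.

```python
# Sketch: normalize section headings with a small mapping repository. The
# headings listed are examples; a production mapping would be much larger.
SECTION_MAP = {
    "chief complaint": "chief_complaint",
    "presenting complaint": "chief_complaint",
    "presenting complaints": "chief_complaint",
    "presenting problem": "chief_complaint",
    "presenting problems": "chief_complaint",
    "reason for encounter": "chief_complaint",
    "cc": "chief_complaint",
    "history of present illness": "present_illness",
    "hpi": "present_illness",
    "family history": "family_history",
    "assessment and plan": "assessment_plan",
}

def normalize_heading(heading):
    key = heading.strip().strip(":").lower()
    return SECTION_MAP.get(key, "unknown")

print(normalize_heading("CC:"))                    # chief_complaint
print(normalize_heading("Reason for Encounter"))   # chief_complaint
```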

Patient Demographics Detection

Patient demographics like age, gender, location, ethnicity, and race are essential factors in individual and population health analysis. These attributes are extracted from the Progress Note using Named Entity Recognition.
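The blog does not name a specific NER model; as a stand-in, here is a minimal rule-based sketch using spaCy's EntityRuler, with illustrative patterns and a made-up sentence. A production system would use a trained clinical NER model instead.

```python
# Sketch: a rule-based stand-in for demographics NER using spaCy's EntityRuler.
# Patterns and the sample sentence are illustrative only.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "AGE", "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": ["year", "years"]}}, {"LOWER": "old"}]},
    {"label": "GENDER", "pattern": [{"LOWER": {"IN": ["male", "female"]}}]},
])

doc = nlp("The patient is a 67 year old female presenting with chest pain.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('67 year old', 'AGE'), ('female', 'GENDER')]
```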

Provider Information Detection

Provider name, designation, and degree are essential attributes extracted from the EHR. These are extracted similarly to demographics, using custom Named Entity Recognition.

Concept Recognition and Detection

Natural Language Processing breaks down the text into smaller components, identifies the links between the pieces, and explores how they are combined to create meaning. Named Entity Recognition (NER) is one of the vital entity detection methods in NLP; it automatically detects concepts in free text and classifies tokens (words) into pre-defined categories.

Different entities from biomedical documents are extracted using a query database and named entity recognition, and linked to a concept in a biomedical database such as UMLS (Unified Medical Language System). UMLS is a set of files and software that combines many health and biomedical vocabularies and standards to enable interoperability between computer systems. UMLS contains 12M different concepts; on average, two different names are assigned to each concept. For example, the concept with ID C0006826 has 16 different assigned names, including cancer, tumor, malignant neoplasm, malignancy, and disease. On average, 90% of these names link to more than one concept in UMLS. Consequently, it is impossible to link a detected entity to a biomedical concept based only on the name. A concept database (CDB) and vocabulary (VCB) file are required for linking the extracted biomedical entity to the database.

The Concept Database (CDB) is built from a biomedical dictionary (e.g., UMLS and SNOMED CT) and stores all the linking-related information such as name, sub-name, Concept Unique Identifier (CUI), type ID, etc. All the concepts we want to identify and link to are stored there.

The Vocabulary (VCB) is a list of all terms that might appear in the texts we wish to annotate. In our instance, it is primarily utilized for spell-checking.
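The CDB/VCB terminology matches the MedCAT toolkit; assuming a pre-built MedCAT model pack (the path below is a placeholder), annotating text might look like this sketch.

```python
# Sketch: concept recognition and linking with a MedCAT-style model pack built
# from a CDB and VCB. The model pack path is a placeholder; the output keys
# follow MedCAT's documented entity format.
from medcat.cat import CAT

cat = CAT.load_model_pack("umls_model_pack.zip")   # hypothetical pre-built pack

text = "Patient has type 2 diabetes mellitus and denies chest pain."
result = cat.get_entities(text)

for ent in result["entities"].values():
    print(ent["source_value"], "->", ent["cui"], ent["pretty_name"])
```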

A knowledge graph can be built using named entities (nodes) and relation classification (edges). Such a knowledge graph can be used for various purposes, like predictive analytics.
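A minimal sketch of assembling such a graph with networkx follows; the entities, CUIs, and relations are illustrative.

```python
# Sketch: building a small knowledge graph from extracted entities and
# relations using networkx. Entities, CUIs, and relations are illustrative.
import networkx as nx

graph = nx.DiGraph()
graph.add_node("C0011860", label="Type 2 diabetes mellitus", type="Disease")
graph.add_node("C0025598", label="Metformin", type="Drug")
graph.add_node("Patient/example-123", type="Patient")

graph.add_edge("Patient/example-123", "C0011860", relation="has_condition")
graph.add_edge("C0025598", "C0011860", relation="treats")

for source, target, data in graph.edges(data=True):
    print(source, f"--{data['relation']}-->", target)
```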

Here is the clean-text OCR portion of an EHR –

After NER has detected all the medical terms –

Each detected entity is then linked to a biomedical database (UMLS) concept and assigned its concept ID. The concept IDs are, in turn, associated with disease codes such as ICD-10 codes.
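A minimal sketch of a CUI-to-ICD-10 lookup follows; the crosswalk entries are illustrative, and a real system would derive the mapping from UMLS source vocabularies.

```python
# Sketch: mapping UMLS concept IDs (CUIs) to ICD-10 codes via a small
# crosswalk table. The entries shown are illustrative only.
CUI_TO_ICD10 = {
    "C0011860": "E11.9",   # Type 2 diabetes mellitus
    "C0020538": "I10",     # Essential (primary) hypertension
}

def to_icd10(cui):
    return CUI_TO_ICD10.get(cui)

print(to_icd10("C0011860"))   # E11.9
```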

Drug-Disease Mapping

The extracted disease entity is linked to its targeted drug entity if the drug is prescribed. Drug-disease associations, drug features, and disease semantic information are used to detect this. The drug detection model detects multiple drug-related concepts such as dosage, drug names, duration, form, frequency, route of administration, and strength.
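A minimal sketch of linking a detected drug mention and its attributes to a target disease follows; the association table and attribute values are illustrative placeholders.

```python
# Sketch: linking an extracted drug mention and its attributes to the disease
# it targets, using a small drug-disease association table. All values are
# illustrative placeholders.
DRUG_DISEASE = {
    "metformin": "Type 2 diabetes mellitus",
    "lisinopril": "Essential hypertension",
}

drug_mention = {
    "name": "metformin",
    "strength": "500 mg",
    "form": "tablet",
    "frequency": "twice daily",
    "route": "oral",
}

disease = DRUG_DISEASE.get(drug_mention["name"])
print(f"{drug_mention['name']} ({drug_mention['strength']}, {drug_mention['frequency']}) -> {disease}")
```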

Negation and History Detection

A challenge with clinical NLP is that the meaning of a clinical entity is heavily influenced by modifiers such as negation. Therefore, negation detection is essential to identify conditions relevant to clinical decision support and knowledge extraction. A mention of a disease in a biomedical document does not necessarily imply that a patient suffers from that disease. Since documents describe a diagnostic process, a document may detail a test being performed to determine whether or not a patient has a condition, relate a discussion of the arguments for whether or not a patient has a problem, or state that a family member has it. Therefore, in EHRs, the detection of negated entities is essential.
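The blog does not specify a negation algorithm; a heavily simplified NegEx-style sketch is shown below, with an illustrative cue list and window size.

```python
# Sketch: a simplified NegEx-style rule for negation detection. A negation cue
# appearing within a few tokens before an entity mention marks it as negated.
# The cue list and window size are illustrative.
import re

NEGATION_CUES = {"no", "denies", "denied", "without", "not"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def is_negated(text, entity, window=3):
    tokens = tokenize(text)
    ent_tokens = tokenize(entity)
    for i in range(len(tokens) - len(ent_tokens) + 1):
        if tokens[i:i + len(ent_tokens)] == ent_tokens:
            preceding = tokens[max(0, i - window):i]
            return any(cue in preceding for cue in NEGATION_CUES)
    return False

note = "Patient denies chest pain but reports shortness of breath."
print(is_negated(note, "chest pain"))            # True
print(is_negated(note, "shortness of breath"))   # False
```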

Here is the output of negation detection –

Signature and Hand-written Text Detection

Payers validate the Progress Note only if the Provider has signed it. The Provider can sign the document physically or electronically; an electronic signature is indicated by the text “Electronically signed by” at the end of each progress note.
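A minimal sketch of detecting the electronic-signature line with a regular expression follows; the pattern and sample text are illustrative.

```python
# Sketch: detecting an electronic signature line at the end of a progress
# note. The regex and sample text are illustrative.
import re

SIGNATURE_RE = re.compile(
    r"Electronically signed by[:\s]+(?P<provider>[A-Z][\w.\- ]+(?:,\s*(?:MD|DO|NP|PA))?)",
    re.IGNORECASE,
)

note_tail = "Assessment and plan reviewed with patient.\nElectronically signed by: Jane Smith, MD"
match = SIGNATURE_RE.search(note_tail)
if match:
    print("Signed:", match.group("provider").strip())
else:
    print("No electronic signature found")
```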

Conclusion

Electronic Health Records contain a great deal of free-form, conversationally formatted, unstructured information about the patient. Once this information is converted to structured data, it can be used in various use cases.

Patient and Provider Data Analytics

The data extracted from the EHR, combined with claims, enrollment, and member data, can be used to draw insights into the patient journey and historical trends using descriptive analytics. This data can also be used to perform predictive analytics to understand patients’ proclivity towards potential diseases and rehospitalization. Data from many patients can be used for population analytics and to explore Social Determinants of Health.

HEDIS Hybrid Measurements

HEDIS quality measurements are used to track value-based care by providers, and ratings are assigned to providers based on compliance. A few of the hybrid measures require attributes that are captured only in EHR documents. Combining these attributes with data from claims allows Payers to track hybrid HEDIS measures.

Payment Integrity

The EHR contains details of the services provided during a visit. These details can be compared against the claims data to ensure that the payment claims by providers match the services provided during the patient visit.

Risk Adjustment Coding

For high-risk diseases, HCC (Hierarchical Condition Category) codes are assigned to some ICD-10 codes. These codes allow for estimating future healthcare costs for patients. Under some government programs, Payers are compensated additionally for treating high-risk patients. Risk Adjustment Coding using EHR data substantiates high-risk patients and treatments and allows payers to be compensated accordingly.

AUTHOR

Siddhi Jain, Software Engineer
Tags: Data Analytics, Data Extraction, Data Structure, EHR, HCC, HEDIS, ICD Engines, NLP

