Content Coding of Psychotherapy Transcripts Using Labeled Topic Models
Abstract: Psychotherapy represents a broad class of medical interventions received by millions of patients each year. Unlike most medical treatments, its primary mechanisms are linguistic; i.e., the treatment relies directly on a conversation between a patient and provider. However, the evaluation of patient-provider conversation suffers from critical shortcomings, including intensive labor requirements, coder error, nonstandardized coding systems, and inability to scale up to larger data sets. To overcome these shortcomings, psychotherapy analysis needs a reliable and scalable method for summarizing the content of treatment encounters. We used a publicly available psychotherapy corpus from Alexander Street Press comprising a large collection of transcripts of patient-provider conversations to compare coding performance for two machine learning methods. We used the labeled latent Dirichlet allocation (L-LDA) model to learn associations between text and codes, to predict codes in psychotherapy sessions, and to localize specific passages of within-session text representative of a session code. We compared the L-LDA model to a baseline lasso regression model using predictive accuracy and model generalizability (measured by calculating the area under the curve (AUC) from the receiver operating characteristic curve). The L-LDA model outperforms the lasso logistic regression model at predicting session-level codes, with average AUC scores of 0.79 and 0.70, respectively. For fine-grained level coding, L-LDA and logistic regression are able to identify specific talk-