Psychiatry Transcript Annotation: Human Factors Process Study and Improvements
Time: Thursday, April 15, 2:34pm - 2:35pm EDT
The demand for psychiatry is increasing each year. Limited research has been performed on using computational methods to improve psychiatrists' work experience and reduce their daily workload. According to the Board of American Medical Colleges, by 2024 there will be an estimated 11.3 psychiatrists per 100,000 people in the US. Prior researchers have explored Artificial Intelligence (AI)/Machine Learning (ML) and Natural Language Processing (NLP) to make input to the electronic medical record more efficient, thereby reducing clinicians' workload while they treat patients. Past research has also developed quality documentation tools for generating consistent and reliable data when rating physician case notes. However, there is currently no validated tool for annotating mental health transcripts to generate gold standard data for such computational methods. Choosing the right data for training a machine learning model is important for obtaining accurate predictions. The purpose of this paper was to characterize the annotation process for mental health transcripts and to determine how it can be improved, considering elements of human factors, to yield more reliable results.
The study used five synthetic mental health transcripts created by the National Alliance on Mental Illness (NAMI) Montana. The transcripts were segmented into 90 (transcript A), 90 (transcript B), 100 (transcript C), 91 (transcript D), and 229 (transcript E) sentences, respectively. Each sentence was annotated with one of six categories: Chief Complaint (CC), Medical History (MH), Family History (FH), Social History (SH), Client Detail (CD), and Other Information (OT).
Three clinicians were recruited to evaluate the transcripts. They were asked to annotate transcript A and transcript B and to provide feedback for further improvements. The gold standard for transcripts A and B was established by comparing the annotations of the three participating clinicians. Kappa statistics were used to measure inter-rater reliability between clinicians.
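As an illustration of how agreement among three raters might be quantified, the following is a minimal sketch that assumes Fleiss' kappa as the kappa statistic and a simple majority vote as the rule for deriving a gold-standard label. The abstract does not state which kappa variant or consensus rule was used, so both are assumptions, and the function names `fleiss_kappa` and `majority_label` are hypothetical.

```python
from collections import Counter

# The six annotation categories named in the study.
CATEGORIES = ["CC", "MH", "FH", "SH", "CD", "OT"]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of sentences, each a list of one label per rater.

    Assumption: every sentence is labeled by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]

    # Observed agreement for each sentence, then averaged over sentences.
    p_i = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall proportion of each category.
    p_j = [sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in CATEGORIES]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:  # every rating falls in a single category
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


def majority_label(row):
    """Hypothetical consensus rule: the label chosen by more than half the raters."""
    (label, votes), = Counter(row).most_common(1)
    return label if votes > len(row) / 2 else None


# Toy data (not from the study): four sentences, three raters each.
example = [
    ["CC", "CC", "CC"],
    ["MH", "MH", "FH"],
    ["SH", "OT", "CD"],
    ["OT", "OT", "OT"],
]
kappa = fleiss_kappa(example)
gold = [majority_label(row) for row in example]
```

Sentences on which no majority exists receive `None` here; in practice such cases would need to be resolved by discussion or an additional rater.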
Five subjects (aged 20-40) were recruited randomly for the pilot study. The pilot study was divided into two phases, phase 1 and phase 2. The same five transcripts (A, B, C, D, and E) were used in both phases. In phase 1, before annotating the transcripts, the subjects were given information about the focus of the study, definitions of the categories, and four example sentences showing how to correctly identify the category for each sentence. Phase 2 was conducted two weeks after phase 1 using the same five transcripts. In addition, phase 2 included a training transcript segmented into 90 sentences, each to be labeled with one of the six categories. The subjects had to complete the training transcript before annotating the phase 2 transcripts. Whenever a subject selected the wrong category for a sentence, the training transcript guided them to the correct option. Kappa statistics were used to measure inter-rater reliability and accuracy between subjects.
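One way the subject-level measures could be computed is sketched below. It assumes that mean inter-rater reliability is the average of pairwise Cohen's kappa values across subjects, and that accuracy is the fraction of sentences on which a subject matches the clinician gold standard. Neither definition is confirmed by the abstract, and the names `cohen_kappa`, `mean_pairwise_kappa`, and `accuracy` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two raters' label sequences of equal length."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    if p_e == 1.0:  # both raters used one identical category throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all subject pairs (assumed definition)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def accuracy(labels, gold):
    """Share of sentences matching the gold label, skipping sentences without one."""
    scored = [(l, g) for l, g in zip(labels, gold) if g is not None]
    return sum(l == g for l, g in scored) / len(scored)


# Toy data (not from the study): three subjects, four sentences.
subjects = {
    "s1": ["CC", "CC", "MH", "OT"],
    "s2": ["CC", "MH", "MH", "OT"],
    "s3": ["CC", "CC", "MH", "OT"],
}
gold = ["CC", "CC", "MH", "OT"]

reliability = mean_pairwise_kappa(subjects)
accuracies = {name: accuracy(labels, gold) for name, labels in subjects.items()}
```

Running the same computation on phase 1 and phase 2 annotations would give the before-and-after comparison that the training transcript is meant to support.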
Results and Discussion
Through this study, the authors found that the overall inter-rater reliability between clinicians was higher for transcript B (k = 0.49 (CI 0.42 to 0.57)) than for transcript A (k = 0.26 (CI 0.19 to 0.33)). This was because the patient in transcript A was more complex, showing erratic behavior and responses, being less cooperative, and ultimately failing to comply with the implicit and explicit rules of proper patienthood. The patient in transcript B was more cooperative and straightforward, and was dealing with a single problem: germaphobia leading to anxiety. The study finds that the type of transcript can affect inter-rater reliability among participating clinicians. The participating clinicians also provided feedback that additional categorical labels, specifically within medical history and mental history (History of Present Illness (HPI), Past Psychiatric History (PPSH), History of Substance Abuse (HSU), and Review of Systems (RS)), could directly help in understanding the patient's problems for a proper diagnosis.
In the pilot testing phases, the mean inter-rater reliability between subjects was higher in phase 2, with the training transcript (k = 0.35 (CI 0.052 to 0.625)), than in phase 1, without the training transcript (k = 0.29 (CI 0.128 to 0.451)). After the training, the accuracy percentage among subjects was significantly higher for transcript A (p=0.04), but not for transcript B (p=0.10).
This study focuses on understanding the annotation process for mental health transcripts, which will be applied in training machine learning models. Through this exploratory study, the research contributes to identifying the appropriate categorical labels to include for transcript annotation, suitable subject selection for data collection, and the importance of training subjects before the survey to achieve high reliability in the data. The contributions of this case study will help psychiatric clinicians and researchers implement the recommended data collection process to develop a more accurate AI model. Future research building on this study should implement the recommended data collection process to develop a novel system for psychiatrists that combines automated case notes with a built-in treatment algorithm module.
To the authors' knowledge, this research is the first of its kind to explore and develop a tool for mental health transcript annotation that delivers consistent and better results when training AI models. This research is also unique because it incorporates human factors to analyze the annotation process and recommend improvements. The study will support the development of a novel AI model and help psychiatric clinicians generate automated case notes with a built-in treatment algorithm module.