P0192 - Predicting Recurrent C. difficile Infection in IBD Patients: An Application of AutoML Tabular and Text Classification Models on Electronic Health Records and Clinical Notes
Raseen Tariq, MBBS, Ankita Sethi, BA, Shivaram Poigai Arunachalam, PhD, DBA, Darrell S. Pardi, MD, MS, William A. Faubion, MD, Sahil Khanna, MBBS, MS; Mayo Clinic, Rochester, MN
Introduction: Applying traditional machine learning models to big data demands substantial expertise. To overcome this challenge, Automated Machine Learning (AutoML) has emerged as a promising tool to automate parts of the machine learning pipeline. We evaluated AutoML to predict recurrent C. difficile infection (rCDI) in Inflammatory Bowel Disease (IBD) patients.
Methods: We included IBD patients with primary CDI and developed two models: a supervised machine learning model trained on structured Electronic Health Record data, and a natural language processing (NLP) model trained on clinical notes to predict recurrent CDI in patients with > 5 clinical notes. Data were processed and uploaded to a HIPAA-compliant Google platform via the Mayo Clinic AI Factory. Data files were formatted to be compatible with the AutoML platform. The dataset was divided 80/10/10 into training, validation, and testing sets, respectively. For the tabular model, performance was evaluated using the Area under the Receiver Operating Characteristic curve (AuROC), accuracy, precision, and recall; for text classification, the metrics included average precision, which is the Area under the Precision-Recall curve (PR AUC).
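The 80/10/10 division described above can be illustrated with a minimal sketch. This is not the AutoML platform's own splitting logic, which the abstract does not describe; the function name `split_80_10_10`, the fixed random seed, and the choice to assign any remainder to the test set are all assumptions made for illustration.

```python
import random


def split_80_10_10(record_ids, seed=0):
    """Shuffle record IDs and split them 80/10/10 (hypothetical helper).

    Integer arithmetic is used for the cut points; any remainder
    from the division falls into the test set (an assumption).
    """
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n = len(ids)
    n_train = n * 8 // 10
    n_val = n // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Example with 2,100 placeholder IDs (the size of the NLP cohort).
train, val, test = split_80_10_10(range(2100))
print(len(train), len(val), len(test))
```

Splitting at the patient level, as here, keeps all records for one patient in a single partition; whether the platform did so is not stated in the abstract.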
Results: Of 2,573 patients, 655 (25.4%) had recurrent CDI. An ML model using structured data was trained, validated, and tested on datasets of 2,058, 257, and 257 patients, respectively. The model showed promising results in predicting rCDI, achieving an AuROC of 0.853, with precision and recall both at 78% (Figure A).
For the NLP model, we used clinical notes from 2,100 patients, of whom 508 (24.1%) had rCDI. The model was trained, validated, and tested on a dataset divided into 1,680, 210, and 210 items, respectively. The NLP model performed well, with a PR AUC of 0.827 and both precision and recall at 77.6% (Figure B). We observed similar accuracy and performance with a supervised learning model for tabular data using XGBoost with manual coding (abstract submitted separately to ACG 2023).
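For readers unfamiliar with the metrics reported above, the following sketch shows one common way to compute them from held-out labels and model scores. It is an illustration only, not the AutoML platform's implementation: the function names are hypothetical, the toy labels and scores are invented for the example, and average precision is computed here as the usual step-wise summary of the precision-recall curve.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = rCDI, 0 = no rCDI)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auroc(y_true, scores):
    """AuROC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(precision_recall(y_true, y_pred))
print(auroc(y_true, scores))
print(average_precision(y_true, scores))
```

Precision and recall depend on the chosen score threshold, whereas AuROC and average precision summarize performance across all thresholds.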
Discussion: Using minimal coding and diverse data, we developed high-performing algorithms to predict recurrent CDI with AutoML. These models demonstrate performance comparable to traditional machine learning approaches. This highlights the potential of AutoML to deliver accurate predictions while streamlining the development process.
Figure: Performance of Automated Machine Learning models using AutoML Tabular and Text Classification
Disclosures:
Raseen Tariq indicated no relevant financial relationships.
Ankita Sethi indicated no relevant financial relationships.
Shivaram Poigai Arunachalam indicated no relevant financial relationships.
Raseen Tariq, MBBS, Ankita Sethi, BA, Shivaram Poigai Arunachalam, PhD, DBA, Darrell S. Pardi, MD, MS, William A. Faubion, MD, Sahil Khanna, MBBS, MS. P0192 - Predicting Recurrent C. difficile Infection in IBD Patients: An Application of AutoML Tabular and Text Classification Models on Electronic Health Records and Clinical Notes, ACG 2023 Annual Scientific Meeting Abstracts. Vancouver, BC, Canada: American College of Gastroenterology.