Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner: Cohort Study

Background: Postacute sequelae of COVID-19 (PASC), also known as long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19. These symptoms can occur across a range of biological systems, leading to challenges in determining risk factors for PASC and the causal etiology...

Bibliographic Details
Main Authors: Zachary Butzin-Dozier (Author), Yunwen Ji (Author), Haodong Li (Author), Jeremy Coyle (Author), Junming Shi (Author), Rachael V Phillips (Author), Andrew N Mertens (Author), Romain Pirracchio (Author), Mark J van der Laan (Author), Rena C Patel (Author), John M Colford (Author), Alan E Hubbard (Author)
Format: Book
Published: JMIR Publications, 2024-08-01T00:00:00Z.

MARC

LEADER 00000 am a22000003u 4500
001 doaj_54e3e207a3e34b1aa772b40f500dee76
042 |a dc 
100 1 0 |a Zachary Butzin-Dozier  |e author 
700 1 0 |a Yunwen Ji  |e author 
700 1 0 |a Haodong Li  |e author 
700 1 0 |a Jeremy Coyle  |e author 
700 1 0 |a Junming Shi  |e author 
700 1 0 |a Rachael V Phillips  |e author 
700 1 0 |a Andrew N Mertens  |e author 
700 1 0 |a Romain Pirracchio  |e author 
700 1 0 |a Mark J van der Laan  |e author 
700 1 0 |a Rena C Patel  |e author 
700 1 0 |a John M Colford  |e author 
700 1 0 |a Alan E Hubbard  |e author 
245 0 0 |a Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner: Cohort Study 
260 |b JMIR Publications,   |c 2024-08-01T00:00:00Z. 
500 |a 2369-2960 
500 |a 10.2196/53322 
520 |a Background: Postacute sequelae of COVID-19 (PASC), also known as long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19. These symptoms can occur across a range of biological systems, leading to challenges in determining risk factors for PASC and the causal etiology of this disorder. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. Objective: Using a sample of 55,257 patients (at a ratio of 1 patient with PASC to 4 matched controls) from the National COVID Cohort Collaborative, as part of the National Institutes of Health Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. The National COVID Cohort Collaborative includes electronic health records for more than 22 million patients from 84 sites across the United States. Methods: We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal combination of gradient boosting and random forest algorithms to maximize the area under the receiver operating characteristic curve. We evaluated variable importance (Shapley values) based on 3 levels: individual features, temporal windows, and clinical domains. We externally validated these findings using a holdout set of randomly selected study sites. Results: We were able to predict individual PASC diagnoses accurately (area under the curve 0.874). The individual features of the length of observation period, number of health care interactions during acute COVID-19, and viral lower respiratory infection were the most predictive of subsequent PASC diagnosis. 
Temporally, we found that baseline characteristics were the most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after acute COVID-19. We found that the clinical domains of health care use, demographics or anthropometry, and respiratory factors were the most predictive of PASC diagnosis. Conclusions: The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings. Across individual predictors and clinical domains, we consistently found that factors related to health care use were the strongest predictors of PASC diagnosis. This indicates that any observational studies using PASC diagnosis as a primary outcome must rigorously account for heterogeneous health care use. Our temporal findings support the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients before acute COVID-19 diagnosis, which could improve early interventions and preventive care. Our findings also highlight the importance of respiratory characteristics in PASC risk assessment. International Registered Report Identifier (IRRID): RR2-10.1101/2023.07.27.23293272 
546 |a EN 
690 |a Public aspects of medicine 
690 |a RA1-1270 
655 7 |a article  |2 local 
786 0 |n JMIR Public Health and Surveillance, Vol 10, p e53322 (2024) 
787 0 |n https://publichealth.jmir.org/2024/1/e53322 
787 0 |n https://doaj.org/toc/2369-2960 
856 4 1 |u https://doaj.org/article/54e3e207a3e34b1aa772b40f500dee76  |z Connect to this object online.