GitHub - XinyuanHu/n3c-longcovid: Scripts used to produce the analysis in our paper on computable phenotypes for long-COVID

Who has long-COVID? A big data approach

Introduction

This is reproducible code for our paper, Who has long-COVID? A big data approach, which uses data from the National COVID Cohort Collaborative’s (N3C) EHR repository to identify potential long-COVID patients. The full citation is:

Pfaff ER, et al. Who has long-COVID? A big data approach. medRxiv 2021; 2021.10.18.21265168.

Abstract

Background Post-acute sequelae of SARS-CoV-2 infection (PASC), otherwise known as long-COVID, have severely impacted recovery from the pandemic for patients and society alike. This new disease is characterized by evolving, heterogeneous symptoms making it challenging to derive an unambiguous long-COVID definition. Electronic health record (EHR) studies are a critical element of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which is addressing the urgent need to understand PASC, accurately identify who has PASC, and identify treatments.

Methods Using the National COVID Cohort Collaborative’s (N3C) EHR repository, we developed XGBoost machine learning (ML) models to identify potential long-COVID patients. We examined demographics, healthcare utilization, diagnoses, and medications for 97,995 adult COVID-19 patients. We used these features and 597 long-COVID clinic patients to train three ML models to identify potential long-COVID patients among (1) all COVID-19 patients, (2) patients hospitalized with COVID-19, and (3) patients who had COVID-19 but were not hospitalized.

Findings Our models identified potential long-COVID patients with high accuracy, achieving areas under the receiver operator characteristic curve of 0.91 (all patients), 0.90 (hospitalized), and 0.85 (non-hospitalized). Important features include rate of healthcare utilization, patient age, dyspnea, and other diagnosis and medication information available within the EHR. Applying the “all patients” model to the larger N3C cohort identified 100,263 potential long-COVID patients.

Interpretation Patients flagged by our models can be interpreted as “patients likely to be referred to or seek care at a long-COVID specialty clinic,” an essential proxy for long-COVID diagnosis in the current absence of a definition. We also achieve the urgent goal of identifying potential long-COVID patients for clinical trials. As more data sources are identified, the models can be retrained and tuned based on study needs.

Funding This study was funded by NCATS and NIH through the RECOVER Initiative.
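The Findings above report model performance as area under the receiver operator characteristic curve (AUROC). As a reading aid only, here is a minimal sketch of how that metric can be computed from labels and predicted scores, using the pairwise-ranking definition. The function name `auroc` and the toy data are hypothetical and are not taken from the scripts in this repository or from N3C data.

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy example: 1 = seen at a long-COVID clinic, 0 = not.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(round(auroc(labels, scores), 3))
```

The pairwise loop is quadratic in the number of patients, so it is only suitable for illustration; a sort-based implementation would normally be used on cohorts of the size described in the paper.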

Issues

Please report issues via email or via the issues page.

Data Sharing Statement

The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol # IRB00249128 or individual site agreements with NIH. The N3C Data Enclave is managed under the authority of the NIH; information can be found at ncats.nih.gov/n3c/resources. Enclave data is protected, and can be accessed for COVID-related research with an approved (1) IRB protocol and (2) Data Use Request (DUR). A detailed accounting of data protections and access tiers is found in [1]. Enclave and data access instructions can be found at https://covid.cd2h.org/for-researchers.

Project Structure

./scripts/ contains all the scripts used in the analysis.
./figures/ contains the figures developed for publication, using the results generated by the ./scripts/ pipeline.

Authors

Emily R. Pfaff (empff)
Andrew T. Girvin (andrewtgirvin)
Tellen D. Bennett
Abhishek Bhatia (abhatia08 | @abhibhatia08)
Ian M. Brooks
Rachel R Deer
Jonathan P Dekermanjian (dekermanjian)
Sarah Elizabeth Jolley (@se_jolley)
Michael G. Kahn
Kristin Kostka (kmkostka | @kricketchirps)
Julie A McMurry (jmcmurry | @figgyjam)
Richard Moffitt
Anita Walden (@awalden20)
Christopher G Chute
Melissa A Haendel (tis-lab | @ontowonka)

Notes

This repository is continually updated for clarity in response to feedback. However, all code will remain public.

For full transparency, we include the state of the repository at the time of submission as the release Publication code v1.0.0. This release is archived on Zenodo.

An early pre-print of this paper is available on medRxiv: Who has long-COVID? A big data approach
