Selected Publications

Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner: Cohort Study

Zach Butzin-Dozier, Ph.D.
Yunwen Ji
Haodong Li
Jeremy Coyle
Seraphina Shi
Rachael Phillips, Ph.D.
Andrew Mertens
Romain Pirracchio, M.D., MPH, Ph.D., FCCM
Mark van der Laan, Ph.D.
Rena C Patel
John M Colford
Alan Hubbard, Ph.D.
2024

Post-acute Sequelae of COVID-19 (PASC), also known as Long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19 infection. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited.

HAL-based Plugin Estimation of the Causal Dose-Response Curve

Seraphina Shi
Wenxin Zhang
Alan Hubbard
Mark van der Laan
2024

Estimating the marginally adjusted dose-response curve for continuous treatments is a longstanding statistical challenge critical across multiple fields. In the context of parametric models, mis-specification may result in substantial bias, hindering the accurate discernment of the true data generating distribution and the associated dose-response curve. In contrast, non-parametric models face difficulties as the dose-response curve isn't pathwise differentiable, and then there is no n...

Data-Adaptive Identification of Subpopulations Vulnerable to Chemical Exposures using Stochastic Interventions

David McCoy, Ph.D.
Wenxin Zhang
Alan Hubbard, Ph.D.
Mark van der Laan, Ph.D.
Alejandro Schuler, Ph.D.
2024

In environmental epidemiology, identifying subpopulations vulnerable to chemical exposures and those who may benefit differently from exposure-reducing policies is essential. For instance, sex-specific vulnerabilities, age, and pregnancy are critical factors for policymakers when setting regulatory guidelines. However, current semi-parametric methods for heterogeneous treatment effects are often limited to binary exposures and function as black boxes, lacking clear, interpretable rules for subpopulation-specific policy interventions. This study introduces a novel method using cross-...

Large Language Models as Co-Pilots for Causal Inference in Medical Studies

Ahmed Alaa, Ph.D.
Rachael Phillips, Ph.D.
Emre Kiciman
Laura Balzer, Ph.D.
Mark van der Laan, Ph.D.
Maya Petersen, M.D., Ph.D.
2024

The validity of medical studies based on real-world clinical data, such as observational studies, depends on critical assumptions necessary for drawing causal conclusions about medical interventions. Many published studies are flawed because they violate these assumptions and entail biases such as residual confounding, selection bias, and misalignment between treatment and measurement times. Although researchers are aware of these pitfalls, they continue to occur because anticipating and addressing them in the context of a specific study can be challenging without a large, often unwieldy,...

Artificial Intelligence–Based Copilots to Generate Causal Evidence

Maya Petersen, M.D., Ph.D.
Ahmed Alaa, Ph.D.
Mark van der Laan, Ph.D.
2024

While there is growing consensus that real-world data should play a larger role in generating causal evidence for health care, it is less clear whether and how AI can help. Current approaches to AI-driven analysis of health data are ill-equipped to account for the many threats to causal validity. However, the current human-reliant pipeline for causal analysis also falls short: analyses are complex, require multidisciplinary expertise, and are slow, labor-intensive and error-prone. Here, we speculate how a “human-in-the-loop” AI-based system could help relieve bottlenecks to high-...

Evaluating and Utilizing Surrogate Outcomes in Covariate-Adjusted Response-Adaptive Designs

Wenxin Zhang
Aaron Hudson
Maya Petersen, M.D., Ph.D.
Mark van der Laan, Ph.D.
2024

This manuscript explores the intersection of surrogate outcomes and adaptive designs in statistical research. While surrogate outcomes have long been studied for their potential to substitute long-term primary outcomes, current surrogate evaluation methods do not directly account for the potential benefits of using surrogate outcomes to adapt randomization probabilities in adaptive randomized trials that aim to learn and respond to treatment effect heterogeneity. In this context, surrogate outcomes can benefit participants in the trial directly (i.e. improve expected outcome of...

Highly Adaptive LASSO: Machine Learning that Provides Valid Nonparametric Inference in Realistic Models

Zach Butzin-Dozier, Ph.D.
Sky Qiu
Alan Hubbard, Ph.D.
Seraphina Shi
Mark van der Laan, Ph.D.
2024

Understanding treatment effects on health-related outcomes using real-world data requires defining a causal parameter and imposing relevant identification assumptions to translate it into a statistical estimand. Semiparametric methods, like the targeted maximum likelihood estimator (TMLE), have been developed to construct asymptotically linear estimators of these parameters. To further establish the asymptotic efficiency of these estimators, two conditions must be met: 1) the relevant components of the data likelihood must fall within a Donsker class, and 2) the estimates of nuisance...