Matthew Aguirre
Ph.D. student in the Department of Biomedical Data Science
Title: Gene regulatory network structure informs the distribution of perturbation effects
Abstract: Gene regulatory networks (GRNs) govern many core developmental and biological processes underlying human complex traits. Yet even with broad-scale efforts to characterize the effects of molecular perturbations and interpret gene co-expression, it remains challenging to infer the architecture of transcriptional regulation in a precise and efficient manner. Key properties of GRNs, like hierarchical structure, modular organization, and sparsity, provide both challenges and opportunities for this task. Here, we seek to better understand these properties using a new approach to simulate the structure and model the function of GRNs. We produce realistic graph structures with a novel generating algorithm based on insights from small-world network theory, and we model gene expression regulation using stochastic differential equations formulated to accommodate molecular perturbations. With these tools, we systematically characterize the effects of gene knockouts within and across simulated GRNs. We describe a coherent set of relationships between properties of GRNs and their susceptibility to perturbations, and discover a subset of networks that recapitulate features of a recent genome-scale perturbation study. These exemplar networks share hallmark properties like sparse regulatory architecture, a high degree of substructure via gene modules, and a heavy-tailed out-degree distribution implying the existence of transcriptional master regulators. We then compare and contrast gene knockouts with co-expression patterns in these networks, finding that while interventional data are needed to elucidate specific regulatory interactions, both data sources are suitable for discovering broad-scale network structure like modules. We conclude by discussing considerations for future efforts to map gene regulatory networks and their constituent parts.
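The two ingredients of the simulation described in the abstract, a small-world-style directed graph generator and a stochastic differential equation model of expression that accommodates knockouts, can be sketched in miniature. This is an illustrative toy, not the speaker's implementation: the function names, the sigmoid activation, and all parameter values are assumptions.

```python
import numpy as np

def small_world_grn(n, k, p, rng):
    """Watts-Strogatz-style directed graph: a ring lattice where each gene
    regulates its k nearest downstream neighbors, with edges rewired to a
    random target with probability p. Returns a set of (regulator, target) pairs."""
    edges = set()
    for i in range(n):
        for j in range(1, k + 1):
            edges.add((i, (i + j) % n))
    rewired = set()
    for (i, j) in edges:
        if rng.random() < p:
            j = int(rng.integers(n))  # rewire to a random target
        rewired.add((i, j))
    return rewired

def simulate_expression(n, edges, knockout=None, steps=500, dt=0.01,
                        sigma=0.05, seed=0):
    """Euler-Maruyama integration of a toy expression SDE:
    dx_i = (sigmoid(sum of regulator inputs) - x_i) dt + sigma dW_i.
    A knockout is modeled by clamping that gene's expression to zero."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for (i, j) in edges:
        W[j, i] = 1.0  # gene i regulates gene j
    x = rng.random(n)
    for _ in range(steps):
        act = 1.0 / (1.0 + np.exp(-(W @ x - 1.0)))
        x = x + (act - x) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
        x = np.clip(x, 0.0, None)  # expression is non-negative
        if knockout is not None:
            x[knockout] = 0.0
    return x
```

Comparing `simulate_expression` runs with and without a knockout, across graphs generated at different rewiring probabilities `p`, mimics the kind of within- and across-network perturbation screen the abstract describes.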
Bio: Matthew Aguirre is a fifth-year Ph.D. student in the Department of Biomedical Data Science, advised by Jonathan Pritchard. His research interests lie at the intersection of network science and quantitative genetics, where he is currently working on methods to investigate the effects of molecular perturbations and genetic variation on gene regulatory networks.
Gautam Machiraju (he/him)
Ph.D. Candidate, Biomedical Informatics
Stanford AI Lab (SAIL)
Prospector Heads: Generalized Feature Attribution for Large Models & Data
Abstract: Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for machine learning models in scientific and biomedical domains. Current methods for feature attribution, which rely on “explaining” the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based methods for feature attribution that can be applied to any data modality and any encoder — including those from foundation models. In experiments on sequences (text), images (pathology), and graphs (protein structures), prospector heads generalize across modalities and outperform baseline attribution methods by up to 30 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in the input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for machine learning models in complex domains.
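The general idea of attribution without explaining an end-to-end classifier, attaching a lightweight head to frozen encoder embeddings and scoring regions of the input directly, can be illustrated with a minimal prototype-based sketch. This is not the prospector-head method itself; it is a generic baseline under assumed inputs (per-region embeddings and sample-level binary labels).

```python
import numpy as np

def prototype_attribution(region_embs, labels):
    """Score each input region by its projection onto the difference between
    positive-class and negative-class mean embeddings.

    region_embs: (n_samples, n_regions, d) array of frozen-encoder embeddings
    labels:      (n_samples,) binary sample-level labels
    Returns:     (n_samples, n_regions) relevance scores."""
    d = region_embs.shape[-1]
    pos = region_embs[labels == 1].reshape(-1, d)
    neg = region_embs[labels == 0].reshape(-1, d)
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    direction /= np.linalg.norm(direction) + 1e-12
    # Project every region onto the class-discriminative direction.
    return region_embs @ direction
```

The appeal, as in the abstract, is that only a tiny head is fit on top of the encoder, so the approach stays cheap and usable at small sample sizes where training or explaining an end-to-end classifier is impractical.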
Biography: Gautam is a final-year Ph.D. student advised by Parag Mallick (Radiology) and Christopher Ré (CS). His work centers on developing “data copilots” — AI that can expand our understanding of scientific and biomedical data, perhaps to better inform our decision-making (e.g. in the clinic, or for drug development) or to (re)discover phenomena in high-dimensional unstructured data (e.g. time-series, text corpora, images, or graphs). Methodologically, his work sits at the intersection of foundation model training and inference, architectures for long-range modeling, geometric ML, and interpretability & explainability. He is currently supported by the Stanford Data Science Scholarship.
Indrani Bhattacharya, Ph.D.
Research Engineer, Stanford University
Learning from Multimodal Data for Improving Radiology Imaging-Based Cancer Detection
Abstract: Radiology imaging-based early cancer detection and risk-stratification play an important role in improving patient care and reducing cancer deaths. Yet, subtle imaging features lead to wide inter-reader variability, missed cancers, and overdiagnosis and overtreatment of indolent disease. Computational methods that leverage the fusion of multimodal machine learning with interdisciplinary domain knowledge have immense potential in helping standardize radiologist interpretations and serving as a diagnostic support tool to clinicians. Key research questions include how to seamlessly integrate and learn from complementary multimodal data in clinically relevant ways, and how to design robust experiments and evaluations to enable translation from the laboratory to the clinic. This talk will highlight how we address these questions as we develop multimodal machine learning systems that leverage radiology-pathology fusion to assist clinicians in urologic cancer detection and risk-stratification.
Biography: Dr. Indrani Bhattacharya is a Research Engineer in the Department of Radiology (Division of Integrative Biomedical Imaging Informatics) at Stanford University School of Medicine. She was a postdoctoral scholar in the same department before transitioning to her current position. Dr. Bhattacharya received her Ph.D. and M.S. in Electrical, Computer, and Systems Engineering from Rensselaer Polytechnic Institute (RPI), NY, USA, and her bachelor’s in electrical engineering from Jadavpur University, India. Her research is highly interdisciplinary, at the intersection of multimodal data fusion, machine learning, image processing, medicine, and human behavioral analytics. During her doctoral and postdoctoral research career, she has been the recipient of multiple awards, including the RPI Founder’s Award of Excellence, selection as one of the ‘Rising Stars in EECS, 2020’, and several travel, poster, and perfect pitch awards.
Erin Craig, PhD student in Stanford’s Department of Biomedical Data Science
Bio:
Erin Craig is a PhD student in Stanford’s Department of Biomedical Data Science, with an MS in Data Science and BA in Mathematics from New College of Florida. Previously, she worked at Wolfram Research where she led the team building math content for Wolfram|Alpha. Her research interest is in the development of statistical methods for biomedical applications with an emphasis on simplicity and interpretability; recent examples include methods to do pretraining with sparse linear models, to train classifiers with incompletely labeled data, and to identify treatment effect heterogeneity in clinical trial data. She particularly enjoys collaboration and communicating ideas from statistics in a friendly, accessible way.
Talk:
Title: Pretraining and the lasso
Abstract: Suppose we want to predict survival times for cancer patients, and we have a dataset spanning 10 cancers. Should we pool our data and train one model for all cancers, or should we train a separate model for each cancer? This is a difficult question to answer in general, as we want to leverage the power of our full dataset, but we also want a model that is specific to each cancer type. Neural networks have a tool called “pretraining” for scenarios just like this: train one neural network using all 10 cancers and then fine tune it individually for each cancer. Pretraining usually results in much better performance than training individual models for each group. Here, we present a method to apply pretraining to the lasso, thereby reaping the performance benefits of pretraining and retaining the interpretability and simplicity of sparse linear modeling. We find that our method is general and is broadly applicable to many tasks common in biomedicine, including time series modeling, prediction for data with multiple responses of different types, and conditional average treatment effect estimation.
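The pretrain-then-fine-tune recipe in the abstract, one model on the pooled data followed by group-specific adjustments, can be sketched for the lasso with a shared fit plus per-group sparse corrections fit to the residuals. This is a simplified illustration of the general idea, not the speaker's method; the coordinate-descent solver, the residual-based fine-tuning step, and all penalty values are assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (no intercept; assumes centered data)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j, then soft-threshold
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (col_sq[j] + 1e-12)
    return beta

def pretrained_lasso(X, y, groups, lam_shared, lam_group):
    """Step 1 ('pretrain'): one sparse model on the pooled data.
    Step 2 ('fine-tune'): a sparse per-group correction fit to the residuals,
    so each group's model = shared effects + group-specific offsets."""
    beta_shared = lasso_cd(X, y, lam_shared)
    resid = y - X @ beta_shared
    offsets = {}
    for g in np.unique(groups):
        m = groups == g
        offsets[g] = lasso_cd(X[m], resid[m], lam_group)
    return beta_shared, offsets
```

In the 10-cancer example from the abstract, `groups` would index cancer type: the shared lasso borrows strength across all patients, while each cancer's offset vector captures only the effects that differ from the pooled model, keeping every per-cancer model sparse and interpretable.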
