Workshop Archives

Fall 2024


Date: 9/26/24
Speaker: Xihong Lin
Title: Empower an end-to-end scalable and interpretable data science ecosystem by integrating statistics, AI, and Genomics and Health
Host: Zihuai He
More info here


Date: 10/3/24
Speaker: Wing Hung Wong and Wanwen Zeng
Title: Leveraging Large Language Models (LLMs) for Disease Risk Prediction and Personalized Gene Expression Modeling with Whole-Genome Sequencing Data
More info here


Date: 10/10/24
Speaker: Todd Coleman
Title: A Generalized Framework for Learning Graphical Models from Traveling Waves
Host: Barbara Engelhardt
More info here


Date: 10/17/24
Speaker: Markus Covert
Title: Whole-cell modeling of E. coli: from simulation to discovery
Host: Barbara Engelhardt
More info here


Date: 10/24/24
Speaker: Hawa Racine Thiam
Title: Neutrophil Biophysics through the Lens of NETosis
Host: Zihuai He
More info here


Date: 10/31/24
Speaker: Sanmi Koyejo
Title: Handling distribution shifts (in healthcare) like a pro!
Host: Barbara Engelhardt
More info here


Date: 11/7/24
Speaker: Lexin Li
Title: Some Recent and Ongoing Work on Intracranial Neurodata Analysis
Host: Zihuai He
More info here


Date: 11/14/24
Speaker: Serena Sanulli
Title: TBD
Host: Barbara Engelhardt


Date: 11/21/24
Speaker: John Witte
Title: TBD
Host: Zihuai He
More info here


Date: 12/5/24
Speaker: Anshul Kundaje
Title: TBD
Host: Zihuai He

Winter 2024


Date: 1/18/24
Speaker: Adit Radhakrishnan
Title: How do neural networks learn features from data?
Host: Aaron Newman
More info here. 


Date: 1/25/24
Speaker: Daniel Mukasa
Title: Computational Design of Wearable Chemical Sensors for Personalized Healthcare
Host: Aaron Newman
More info here


Date: 2/1/24
Speaker: Drago Plecko
Title: Causal Health Equity – Toward a Taxonomy
Host: Aaron Newman
More info here


Date: 2/8/24
Speaker: Emma Pierson
Title: Methods for missing outcomes in healthcare and public health
Host: Aaron Newman
More info here 

Date: 2/15/24
Speaker: Yuzhe Yang
Title: Learning to Assess Chronic Diseases: Early Diagnosis, Severity, Progression, and Medication Response
Host: Aaron Newman
More info here


Date: 2/22/24
Speaker: Brian Trippe
Title: Probabilistic methods for designing functional protein structures
Host: Aaron Newman
More info here



Date: 2/29/24
Speaker: Emily Alsentzer
Title: Few Shot Learning for Rare Disease Diagnosis
Host: Aaron Newman
More info here


Date: 3/4/24
Speaker: Victoria Popic
Title: Learning the signatures of structural variation in the genome
Host: Aaron Newman
More info here

Date: 3/7/24
Speaker: William DeWitt
Title: Dynamics, prediction, and computation for evolutionary mechanisms in immune responses
Host: Aaron Newman
More info here


Date: 3/14/24
Speaker: Bahareh Tolooshams
Title: Deep Unrolling for Inverse Problems in Engineering and Science
Host: Aaron Newman
More info here


Date: 3/21/24
Speaker: Peter Koo
Title: Interpreting rules of gene regulation learned by deep learning
Host: Aaron Newman
More info here

Spring 2023

Abstracts, when available, are included in the drop-down

4/6 Jianqing Fan - Communication-Efficient Distributed Estimation and Inference for Cox's Model

Jianqing Fan

Princeton University

TITLE:

Communication-Efficient Distributed Estimation and Inference for Cox’s Model

ABSTRACT:

Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any of its coordinate elements based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods. (Joint work with Pierre Bayle and Zhipeng Lou.)

Website: https://fan.princeton.edu/

Zoom link: https://stanford.zoom.us/j/94324405118?pwd=WnR3Y1dqK3plYWREN0RNVjRlNnhEUT09&from=addon

Meeting ID: 943 2440 5118

Password: 366430

PDF Flier

4/13 Rhiju Das - Modeling and design of RNA-only structures

BIOMEDICAL DATA SCIENCE PRESENTS:
BIODS 260C
4/13/23 1:30PM-2:50PM
MSOB X303 (SEE ZOOM DETAILS BELOW)
Rhiju Das
Associate Professor of Biochemistry; Stanford University

Title:

Modeling and design of RNA-only structures

Abstract:

The discovery and design of biologically important RNA molecules has lagged behind proteins, in part due to the general difficulty of three-dimensional RNA structural characterization. What are the prospects for an AlphaFold for RNA? I’ll describe some recent progress in modeling RNA structure from old-fashioned and new machine learning, cryoelectron microscopy, and internet-scale competitions hosted on the Eterna, Kaggle, and CASP platforms.

Reading/viewing list:

“RNA structure: a renaissance begins?” https://www.nature.com/articles/s41592-021-01132-4
“RNA secondary structure packages evaluated and improved by high-throughput experiments” https://www.nature.com/articles/s41592-022-01605-0

Zoom link: https://stanford.zoom.us/j/94324405118?pwd=WnR3Y1dqK3plYWREN0RNVjRlNnhEUT09&from=addon

Meeting ID: 943 2440 5118

Password: 366430

PDF Flier

4/20 Mark van der Laan - Targeted Learning and Causal Inference for Integrating Real World Evidence into the Drug Approval Process and Safety Analysis

Mark van der Laan

Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics

University of California, Berkeley

Title:

Targeted Learning and Causal Inference for Integrating Real World Evidence into the Drug Approval Process and Safety Analysis

Abstract:

Targeted Learning represents a general multi-step roadmap for accurately translating the real world into a formal statistical estimation problem, and a corresponding template for construction of optimal machine learning based estimators of any desired target causal estimand combined with formal statistical inference. It is flexible by being able to incorporate high dimensional and diverse data sources. To optimize finite sample performance, it can be tailored towards the precise experiment and statistical estimation problem in question, while being theoretically grounded, optimal, and benchmarked. We provide a motivation, explanation, and overview of targeted learning; the key role of super-learning, the Highly Adaptive Lasso; and discuss SAP construction based on targeted learning. Specifically, we discuss recent theoretical advances on the Higher order Spline Highly Adaptive Lasso. We also discuss a Sentinel and FDA RWE demonstration project of targeted learning.

Zoom link: https://stanford.zoom.us/j/94324405118?pwd=WnR3Y1dqK3plYWREN0RNVjRlNnhEUT09&from=addon
Meeting ID: 943 2440 5118
Password: 366430

PDF Flier

4/27 Su-In Lee - Explainable AI: where we are and how to move forward for cancer pharmacogenomics

Su-In Lee

Professor, Paul G. Allen School of Computer Science & Engineering, University of Washington

Title: Explainable AI: where we are and how to move forward for cancer pharmacogenomics

Abstract: In the first part of the talk, I will go over a number of research projects from my lab on the topic of explainable AI applied to biomedical problems, which exemplify how it addresses new scientific questions, makes new biological discoveries from data, informs clinical decisions, and even opens new research directions in biomedicine.

In the second part of the talk, I will show that explainable AI needs to evolve and improve to solve real-world problems in computational biology and medicine, through a deep dive into our cancer pharmacogenomics project led by our Ph.D. student Joseph Janizek in collaboration with Prof. Kamila Naxerova at Harvard Medical School.

Bio:

Prof. Su-In Lee is a Paul G. Allen Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. She completed her PhD in 2009 at Stanford University with Prof. Daphne Koller in the Stanford Artificial Intelligence Laboratory. Before joining the UW in 2010, Lee was a Visiting Assistant Professor in the Computational Biology Department at Carnegie Mellon University School of Computer Science. She has received the National Science Foundation CAREER Award and been named an American Cancer Society Research Scholar. She has received generous grants from the National Institutes of Health, the National Science Foundation, and the American Cancer Society.

Zoom link: https://stanford.zoom.us/j/94324405118?pwd=WnR3Y1dqK3plYWREN0RNVjRlNnhEUT09&from=addon

Meeting ID: 943 2440 5118 Password: 366430

PDF Flier

5/4 Honoring Richard Olshen on his 80th Birthday

Honoring Richard Olshen on his 80th Birthday

Featured Speakers: Lu Tian, Trevor Hastie, Brad Betts, and Brad Efron with introductory and concluding remarks by Sylvia Plevritis

Talks from 1:30-2:40 at MSOB 303
Cake and other refreshments to follow at DBDS Lounge

Professor Richard Allen Olshen began his illustrious career at Yale University under the guidance of L.J. Savage, where he completed his Ph.D. in Statistics in 1966. He first came to Stanford in 1967. In 1975 he went to UC San Diego, where he was on the faculty in the Math Department until 1989, at which point he returned to Stanford, first in the Department of Health Research and Policy and ultimately in the Department of Biomedical Data Science. During his career he made the transition from mathematical statistics to cutting-edge applications. He has made strong contributions to tree-structured learning, gait analysis, and digital radiography, and is currently working on problems in molecular genetics. The four speakers will highlight some of his research accomplishments during his storied career.

PDF Flier

5/11 Lexin Li - Statistical Neuroimaging Analysis: An Overview

Lexin Li, Ph.D.
Professor of Biostatistics at the Department of Biostatistics and Epidemiology, and Helen Wills Neuroscience Institute, of the University of California, Berkeley

Title: Statistical Neuroimaging Analysis: An Overview

Abstract:

Understanding the inner workings of human brains, as well as their connections with neurological disorders, is one of the most intriguing scientific questions. Studies in neuroscience are greatly facilitated by a variety of neuroimaging technologies, including anatomical magnetic resonance imaging (MRI), functional magnetic resonance imaging (fMRI), electroencephalography (EEG), diffusion tensor imaging, positron emission tomography (PET), among many others. The size and complexity of medical imaging data, however, pose numerous challenges, and call for constant development of new statistical methods. In this talk, I give an overview of a range of neuroimaging topics our group has been investigating, including imaging tensor analysis, brain connectivity network analysis, multimodality analysis, and imaging causal analysis. I also illustrate with a number of specific case studies.

Bio:

Lexin Li, Ph.D., is a Professor of Biostatistics at the Department of Biostatistics and Epidemiology, and Helen Wills Neuroscience Institute, of the University of California, Berkeley. His research interests include neuroimaging analysis, network data analysis, high dimensional regressions, dimension reduction, machine learning, and biomedical applications. He is a Fellow of the American Statistical Association (ASA), a Fellow of the Institute of Mathematical Statistics (IMS), and an Elected Member of the International Statistical Institute (ISI).

Zoom link: https://stanford.zoom.us/j/92124459914?pwd=cFpJYXVLOExUVjMzZkNsYXA0b0RxUT09&from=addon

Meeting ID: 943 2440 5118
Password: 366430

PDF Flier

5/18 Dan Daniel Erdmann-Pham - Probing differential expression patterns efficiently and robustly through adaptive linear multi-rank two-sample tests

Dan Daniel Erdmann-Pham

Stein Fellow in the Statistics Department at Stanford University

Title:

Probing differential expression patterns efficiently and robustly through adaptive linear multi-rank two-sample tests

Abstract:

Two- and K-sample tests are commonly used tools to extract scientific discoveries from data. Naturally, the precise choice of test ought to depend on the specifics of the generating mechanisms producing the data: strong parametric assumptions allow for efficient likelihood-based testing, while non-parametric approaches like Mann-Whitney and Kolmogorov-Smirnov-type tests are popular when such prior knowledge is absent. As this talk will argue, practitioners often find themselves in situations of neither full knowledge of all involved distributions nor full ignorance of them, and therefore are in need of tests that span the spectrum of possible prior knowledge gracefully. It proposes so-called adaptive linear multi-rank statistics as promising candidates for this task, and illustrates their general utility, flexibility (including applications to multiple testing and testing under nuisance alternatives), and computational feasibility on examples from population genetics and single-cell differential expression analysis.

Bio:

I am a statistician working on the rigorous, interpretable, and scalable analysis of data with a specific focus on data arising in biology. Data underpins much modern scientific discovery, which has motivated the development of a rich set of tools to aid its analysis. The field of machine learning in particular has supplied an inventory of quantitative methods ranging from hypothesis testing to function approximation that are available off-the-shelf. However, choosing the most suitable algorithm for a given data set, or indeed whether an algorithm delivering satisfactory performance exists, is often obscured by tacit theoretical assumptions not readily accessible to the user, or a lack of clarity regarding method-specific capabilities and limitations. The broad theme of my work is to bridge such gaps by providing transparent data-analysis schemes for which provable optimality guarantees exist.

Zoom link: https://stanford.zoom.us/j/92124459914?pwd=cFpJYXVLOExUVjMzZkNsYXA0b0RxUT09&from=addon
Meeting ID: 943 2440 5118
Password: 366430

PDF Flier

5/25 Julia Palacios - Inference from single cell lineage tracing data generated via genome editing and a novel test for phylogenetic association.

Julia Adela Palacios
Assistant Professor of Statistics
Assistant Professor of Biomedical Data Science
Stanford University

Title: Inference from single cell lineage tracing data generated via genome editing and a novel test for phylogenetic association.

Abstract: Single cell lineage tracing data obtained via genome editing with Crispr/Cas9 technology enables us to better understand important developmental processes at an unprecedented resolution. In the first part of the seminar, I will present a model that allows us to infer cell lineage phylogenies and lineage population size trajectories in a maximum likelihood or Bayesian framework. We assume an efficient coalescent model on cell phylogenies and propose a mutation model that describes how synthetic CRISPR target arrays generate observed variation after many cell divisions. We apply our method to two different CRISPR technologies. In the second part of the seminar, I will present a model for trait evolution inspired by the Chinese Restaurant process. We use this model to derive a test for phylogenetic binary trait association and apply it to test several hypotheses in phylogenetics, infectious diseases and cancer.

References:

1. Zhang J, Preising GA, Schumer M, Palacios JA. CRP-Tree: A phylogenetic association test for binary traits.
2. Yang et al., Lineage tracing reveals the phylodynamics, plasticity, and paths of tumor evolution, Cell 2022.

Bio:

In her research, Professor Palacios seeks to provide statistically rigorous answers to concrete, data-driven questions in population genetics, epidemiology, and comparative genomics, often involving probabilistic modeling of evolutionary forces and the development of computationally tractable methods that are applicable to big data problems. Past and current research relies heavily on the theory of stochastic processes and recent developments in machine learning and statistical theory for big data; future research plans are aimed at incorporating the effects of selection and population structure in Bayesian inference of evolutionary parameters such as effective population size and recombination rates, and development of more realistic and computationally efficient phylodynamic methods for infectious diseases.

Zoom link: https://stanford.zoom.us/j/92124459914?pwd=cFpJYXVLOExUVjMzZkNsYXA0b0RxUT09&from=addon
Meeting ID: 943 2440 5118
Password: 366430

PDF Flier

6/1 Xiao-Li Meng - Privacy, Data Privacy, and Differential Privacy

Xiao-Li Meng
Founding Editor-in-Chief of Harvard Data Science Review
Whipple V. N. Jones Professor of Statistics, Harvard University

Title: Privacy, Data Privacy, and Differential Privacy

Abstract: This talk invites curious minds to contemplate the notion of data privacy, especially at the individual level. It first traces the evasive concept of privacy to a legal right, apparently derived from the frustration of the husband of a socialite attracting tabloids when yellow journalism and printing photography in newspapers became popular in the 1890s. More than a century later, the rise of digital technologies and data science has made the issue of data privacy a central concern for essentially all enterprises, from medical research to business applications, and to census operations. Differential privacy (DP), a theoretically elegant and methodologically impactful framework developed in cryptography, is a major milestone in dealing with the thorny issue of properly balancing data privacy and data utility. However, the popularity of DP has brought both hype and scrutiny, revealing several misunderstandings and subtleties that have created confusion even among specialists. The technical part of this talk is therefore devoted to explicating such issues from a statistical framework, built upon the prior-to-posterior semantics of DP and a multi-resolution perspective. This framework yields an intuitive statistical interpretation of DP, although it does not correspond in general to the commonly perceived and desired data privacy protection. Ultimately, the talk aims to highlight the challenges and research opportunities in quantifying data privacy, what DP does and does not protect, and the need to properly analyze DP data. (This talk is based on joint work with James Bailie and Ruobin Gong.)

Suggested readings:

1) Harvard Data Science Review: Differential Privacy for the 2020 Census, https://hdsr.mitpress.mit.edu/specialissue2, and, in the same issue, the editorial.

2) Oberski, D. L., & Kreuter, F. (2020). Differential Privacy and Social Science: An Urgent Puzzle. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.63a22079

3) Groshen, E. L., & Goroff, D. (2022). Disclosure Avoidance and the 2020 Census: What Do Researchers Need to Know? Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.aed7f34f

Bio:

Xiao-Li Meng, the Founding Editor-in-Chief of Harvard Data Science Review and the Whipple V. N. Jones Professor of Statistics, was named the best statistician under the age of 40 by the Committee of Presidents of Statistical Societies (COPSS) in 2001, and he is the recipient of numerous awards and honors for his more than 150 publications in at least a dozen theoretical and methodological areas, as well as in areas of pedagogy and professional development. In 2020, he was elected to the American Academy of Arts and Sciences. Meng received his BS in mathematics from Fudan University in 1982 and his PhD in statistics from Harvard in 1990. He was on the faculty of the University of Chicago from 1991 to 2001 before returning to Harvard, where he served as the Chair of the Department of Statistics (2004–2012) and the Dean of the Graduate School of Arts and Sciences (2012–2017).

Zoom link: https://stanford.zoom.us/j/92124459914?pwd=cFpJYXVLOExUVjMzZkNsYXA0b0RxUT09&from=addon
Meeting ID: 943 2440 5118
Password: 366430

PDF Flier

Fall 2022

Abstracts, when available, are included in the drop-down

9/29 Ajit Johnson Nirmal - The Spatial Landscape of Progression and Immunoediting in Primary Melanoma at Single-Cell Resolution

Dr. Ajit Johnson Nirmal, Instructor, Dana-Farber Cancer Institute

Seminar Title: The Spatial Landscape of Progression and Immunoediting in Primary Melanoma at Single-Cell Resolution

Abstract: Cutaneous melanoma is a highly immunogenic malignancy that is surgically curable at early stages but life-threatening when metastatic. Here we integrate high-plex imaging, 3D high-resolution microscopy, and spatially resolved microregion transcriptomics to study immune evasion and immunoediting in primary melanoma. We find that recurrent cellular neighborhoods involving tumor, immune, and stromal cells change significantly along a progression axis involving precursor states, melanoma in situ, and invasive tumor. Hallmarks of immunosuppression are already detectable in precursor regions. When tumors become locally invasive, a consolidated and spatially restricted suppressive environment forms along the tumor–stromal boundary. This environment is established by cytokine gradients that promote expression of MHC-II and IDO1, and by PD1–PDL1-mediated cell contacts involving macrophages, dendritic cells, and T cells. A few millimeters away, cytotoxic T cells synapse with melanoma cells in fields of tumor regression. Thus, invasion and immunoediting can coexist within a few millimeters of each other in a single specimen.

Suggested reading: 

The Spatial Landscape of Progression and Immunoediting in Primary Melanoma at Single-Cell Resolution

Zoom link: https://stanford.zoom.us/j/95499501659?pwd=TLFQTLBRM1JSEGXXWXFFELFIQLBCZZ09&from=addon
Password: 406712

PDF Flier

10/6 Roxana Daneshjou - Precision health for all: developing inclusive datasets and algorithms

Dr. Roxana Daneshjou 

Location: MSOB x303 

Seminar Title: Precision health for all: developing inclusive datasets and algorithms

Abstract: Large biomedical datasets coupled with machine learning tools have the potential to transform the practice of dermatology. For example, analysis of skin disease images could help triage patients prior to the clinical visit, and precision genomic medicine could identify personalized treatments for skin disease. However, biased datasets and algorithms that exclude underrepresented groups could exacerbate existing health disparities in dermatology. This talk will discuss working towards inclusive precision medicine through three examples: 1) assessing fairness in datasets and AI algorithms used for diagnosing disease in dermatology; 2) developing an inclusive patient-facing algorithm to improve the quality of images submitted for teledermatology; and 3) developing a pharmacogenomics algorithm that accounts for population diversity. In order to develop a data-driven approach to dermatology that improves health disparities, rather than exacerbating them, we must be mindful of developing inclusive datasets and algorithms.

Suggested Readings:

Zoom link:
Meeting ID: 983 6641 4259

10/13 Lynn Petukhova - Leveraging Information in the Human Genome to Improve Skin Health and to Advance the Practice of Dermatology

BIODS 260
10/13/22
1:30 pm-2:30 pm
Lynn Petukhova
Assistant Professor, Epidemiology and Dermatology at the Columbia University Medical Center

Title: Leveraging Information in the Human Genome to Improve Skin Health and to Advance the Practice of Dermatology

Abstract: The process of diagnosing a patient historically has largely relied on clinical observations of symptoms by physicians. Limitations of a clinical diagnosis have been identified with the use of genetic and genomic technologies, which demonstrate that a molecular diagnosis derived from biomedical data can provide greater diagnostic accuracy and inform subsequent management. I conduct human genetic studies as a starting point for leveraging information in the human genome to improve the accuracy and utility of a skin disease diagnosis. Statistical evidence for an association between an inherited genetic variant and a disease outcome is a definitive marker for a disease mechanism, but does not provide adequate resolution of the mechanism for clinical translation. The scale and complexity of biomedical data that is available to define disease mechanisms requires data-driven approaches to identify salient features and to detect patterns among them that link disease mechanisms to interventions and outcomes. Using the hair follicle as a model organ to understand mappings between disease mechanisms and clinical diagnoses, our group is using clustering, network, and tensor factorization methods to discover clinically relevant relationships among genetically-derived disease entities. I will present results from three studies that our group is conducting that leverage knowledge about inherited genetic variants, disease genes, pathways, and/or comorbidities to define an underlying causal structure of skin disease pathogenesis and to identify key genetic regulators of hair follicle health.


Suggested Readings:

https://www.nature.com/articles/s41598-017-16050-9

https://dl.acm.org/doi/abs/10.1145/3368555.3384464


Zoom link: https://stanford.zoom.us/j/92865685887?pwd=YjlUM1cxOHZ4UnBZMkhqcG1JYzFNdz09

Meeting ID: 928 6568 5887

Password: 219826

10/20 DBDS SEMINAR Lee Hood - Data-Driven Science of Wellness and Prevention: A 2nd Human Genome Project

Dr. Leroy Hood, MD, PhD, CEO/Founder of Phenome Health and Chief Strategy Officer and Professor at the Institute for Systems Biology

Location: MSOB x303 
Title: Data-Driven Science of Wellness and Prevention:  A 2nd Human Genome Project
Abstract: The vision of this project is to develop the infrastructure to employ a data-driven (genome/phenome analyses) approach to optimizing the health trajectory of individuals for body and brain. We have two large populations (5,000 and 10,000) that have validated this approach for body and brain health, respectively. These studies have led us to pioneer the science of wellness and prevention, as I will discuss in the lecture. This million-person project, termed Beyond the Human Genome, has led to the creation of a non-profit, Phenome Health, which has acquired key partners for execution of this ambitious effort. We are approaching the federal government for funding for this project, as we did for the first Human Genome Project. This project is one of perhaps 10 or so 500,000- to one-million-person projects worldwide, and it is unique in that it will carry out deep longitudinal phenome analyses, return results to participants, and create the infrastructure to spread this approach across the US and world healthcare systems. This project will lead to a powerful data ecosystem that will generate new knowledge about medicine, catalyze the initiation of many start-up companies, and pioneer a paradigm shift in healthcare from its current disease orientation to a wellness and prevention orientation, the largest paradigm shift in medicine ever.

Suggested readings:

Zoom Link: https://stanford.zoom.us/j/92874055477?pwd=aThzNmpmNEQ1L2FjV0E5ZXF5SDR1UT09&from=addon

10/27 Adrian Buganza Tepole - Data-driven skin biophysics

Dr. Adrian Buganza-Tepole is an Associate Professor of Mechanical Engineering and Biomedical Engineering (courtesy) at Purdue University. He obtained his Ph.D. in Mechanical Engineering from Stanford University in 2015 and was a postdoctoral fellow at Harvard University for a year before joining Purdue as a faculty member in 2016. He was also a Miller Visiting Professor at UC Berkeley during Spring 2022. His group studies the interplay between mechanics and mechanobiology of skin. Using computational simulation, machine learning, and experimentation, his group seeks to characterize the multi-scale mechanics of skin to understand the fundamental mechanisms of this tissue’s mechano-adaptation in order to improve clinical diagnostics and interventional tools.

Title: Data-driven skin biophysics  

Abstract: 

The recent explosion in machine learning (ML) and artificial intelligence (AI) algorithms has started a revolution in many engineering fields, including computational biophysics. This talk focuses on our recent efforts to leverage ML methods to increase our fundamental understanding of skin and its unique ability to adapt to mechanical cues. The first project that will be described is skin growth in tissue expansion, a popular reconstructive surgery technique that grows new skin in response to sustained supra-physiological loading. We have created computational models that combine mechanics and mechanobiology to describe the deformation and growth of expanded skin. Together with experiments on a porcine model, and leveraging ML tools such as multi-fidelity Gaussian processes, we have performed Bayesian inference to learn mechanistically how skin grows in response to stretch. The second half of the talk will explore how mechanical cues can be key drivers of wound healing pathologies such as fibrosis. I will show computational models of reconstructive surgery and wound healing for a murine model of wound healing and in patient-specific cases. Once again, ML methods enable new kinds of analyses such as optimization under uncertainty and inverse parameter calibration which are not achieved with traditional approaches.

Suggested readings: 

Han T, et al. Bayesian calibration of a computational model of tissue expansion based on a porcine animal model. Acta Biomaterialia. 2022;137:136-46.

Tac V, Costabal FS, Tepole AB. Data-driven tissue mechanics with polyconvex neural ordinary differential equations. Comput Method Appl Mech Eng. 2022;398:115248.

Sohutskay DO, Tepole AB, Voytik-Harbin SL. Mechanobiological wound model for improved design and evaluation of collagen dermal replacement scaffolds. Acta Biomaterialia. 2021;135:368-82.

Lee T, Bilionis I, Tepole AB. Propagation of uncertainty in the mechanical and biological response of growing tissues using multi-fidelity Gaussian process regression. Comput Method Appl Mech Eng. 2020;359:112724.

Zoom info:

11/3 David Van Valen - Everything as Code

BIODS 260
11/3/22
1:30 pm-2:50 pm
David Van Valen
MSOB x303

Title: Everything as Code

Bio: David Van Valen is an Assistant Professor in the Division of Biology and Bioengineering at Caltech. Before becoming faculty, he studied mathematics (B.S. 2003) and physics (B.S. 2003) at the Massachusetts Institute of Technology, applied physics (Ph.D. 2011) at Caltech, medicine (M.D. 2013) at UCLA, and bioengineering as a postdoctoral fellow at Stanford University. At Caltech, his research group develops new technologies at the intersection of imaging, genomics, and machine learning to produce quantitative measurements of living systems with single-cell resolution. David is the recipient of several awards, including a Hertz Fellowship (2005), a Rita Allen Scholar award (2020), A Pew-Stewart Cancer Research Scholar award (2021), a Heritage Medical Research Investigator award (2021), a Moore Inventor Fellowship (2021), and the NIH New Innovator award (2022).

Abstract: Biological systems are difficult to study because they consist of tens of thousands of parts, vary in space and time, and their fundamental unit—the cell—displays remarkable variation in its behavior. These challenges have spurred the development of genomics and imaging technologies over the past 30 years that have revolutionized our ability to capture information about biological systems in the form of images. Excitingly, these advances are poised to place the microscope back at the center of the modern biologist’s toolkit. Because we can now access temporal, spatial, and “parts list” variation via imaging, images have the potential to be a standard data type for biology.

For this vision to become reality, biology needs a new data infrastructure. Imaging methods are of little use if it is too difficult to convert the resulting data into quantitative, interpretable information. New deep learning methods are proving to be essential to reliable interpretation of imaging data. These methods differ from conventional algorithms in that they learn how to perform tasks from labeled data; they have demonstrated immense promise, but they are challenging to use in practice. The expansive training data required to power them are sorely lacking, as are easy-to-use software tools for creating and deploying new models. Solving these challenges through open software is a key goal of the Van Valen lab. In this talk, I describe DeepCell, a collection of software tools that meet the data, model, and deployment challenges associated with deep learning. These include tools for distributed labeling of biological imaging data, a collection of modern deep learning architectures tailored for biological image analysis tasks, and cloud-native software for making deep learning methods accessible to the broader life science community. I discuss how we have used DeepCell to label large-scale imaging datasets to power deep learning methods that achieve human level performance and enable new experimental designs for imaging-based experiments.

Website: https://vanvalen.caltech.edu

Zoom info:

Password: 705300
Meeting URL: https://stanford.zoom.us/j/92874055477?pwd=aThzNmpmNEQ1L2FjV0E5ZXF5SDR1UT09&from=addon
iPhone one-tap (US Toll): +18333021536,,92874055477# or +16507249799,,92874055477#
Or Telephone: 
Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 (US, Canada, Caribbean Toll Free)
Meeting ID: 928 7405 5477
Password: 705300

PDF Flier

11/10 Liana Lareau - Revealing patterns of alternative splicing in single cells

Liana Lareau
Assistant Professor, Department of Bioengineering, University of California, Berkeley

TITLE:

Revealing patterns of alternative splicing in single cells

ABSTRACT:

Alternative splicing shapes the output of the genome and contributes to each cell’s unique identity, but single-cell RNA sequencing has struggled to capture its impact. We have shown that low recovery of mRNAs from single cells can lead to misleading conclusions about alternative splicing and its regulation. To address this, we have developed a method, Psix, to confidently identify splicing that changes across a landscape of single cells, using a probabilistic model that is robust against the data limitations of scRNA-seq. Its autocorrelation-inspired approach finds patterns of alternative splicing that correspond to patterns of cell identity, such as cell type or developmental stage, without the need for explicit cell clustering, labeling, or trajectory inference. Psix reveals cell type-dependent splicing patterns and the wiring of the splicing regulatory networks that control them, enabling scRNA-seq analysis to go beyond transcription to understand the roles of post-transcriptional regulation in determining cell identity.

SUGGESTED READINGS:

CF Buen Abad Najar, N Yosef, LF Lareau. Coverage-dependent bias creates the appearance of binary splicing in single cells. eLife, 2020. https://elifesciences.org/articles/54603

CF Buen Abad Najar, P Burra, N Yosef, LF Lareau. Identifying cell state–associated alternative splicing events and their coregulation. Genome Research, 2022 https://genome.cshlp.org/content/32/7/1385.short

Zoom link: https://stanford.zoom.us/j/92874055477?pwd=aThzNmpmNEQ1L2FjV0E5ZXF5SDR1UT09&from=addon
Password: 705300

PDF Flier

11/17 Lorin Crawford - Machine Learning for Human Genetics: A Multi-Scale View on Complex Traits and Disease

BIOMEDICAL DATA SCIENCE PRESENTS:
BIODS 260
11/17/22 1:30PM-2:50PM
MSOB X303 (SEE ZOOM DETAILS BELOW)
Lorin Crawford
Principal Researcher, Microsoft Research New England; Associate Professor of Biostatistics, Brown University http://lorincrawford.com/

TITLE:

Machine Learning for Human Genetics: A Multi-Scale View on Complex Traits and Disease

ABSTRACT:

A common goal in genome-wide association (GWA) studies is to characterize the relationship between genotypic and phenotypic variation. Linear models are widely used tools in GWA analyses, in part, because they provide significance measures which detail how individual single nucleotide polymorphisms (SNPs) are statistically associated with a trait or disease of interest. However, traditional linear regression largely ignores non-additive genetic variation, and the univariate SNP-level mapping approach has been shown to be underpowered and challenging to interpret for certain trait architectures. While machine learning (ML) methods such as neural networks are well known to account for complex data structures, these same algorithms have also been criticized as “black box” since they do not naturally carry out statistical hypothesis testing like classic linear models. This limitation has prevented ML approaches from being used for association mapping tasks in GWA applications. In this talk, we present flexible and scalable classes of Bayesian feedforward models which provide interpretable probabilistic summaries such as posterior inclusion probabilities and credible sets, which allow researchers to simultaneously perform (i) fine-mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. While analyzing real data assayed in diverse self-identified human ancestries from the UK Biobank, the Biobank Japan, and the PAGE consortium, we demonstrate that interpretable ML has the power to increase the return on investment in multi-ancestry biobanks. Furthermore, we highlight that by prioritizing biological mechanism we can identify associations that are robust across ancestries—suggesting that ML can play a key role in making personalized medicine a reality for all.

SUGGESTED READINGS:

A.R. Martin, M. Kanai, Y. Kamatani, Y. Okada, B.M. Neale, and M.J. Daly (2019). Clinical use of current polygenic risk scores may exacerbate health disparities. Nature Genetics. 51: 584–591.

S.P. Smith, S. Shahamatdar, W. Cheng, S. Zhang, J. Paik, M. Graff, C. Haiman, T.C. Matise, K.E. North, U. Peters, E. Kenny, C. Gignoux, G. Wojcik, L. Crawford, and S. Ramachandran (2022). Enrichment analyses identify shared associations for 25 quantitative traits in over 600,000 individuals from seven diverse ancestries. American Journal of Human Genetics. 109: 871-884.

P. Demetci, W. Cheng, G. Darnell, X. Zhou, S. Ramachandran, and L. Crawford (2021). Multi-scale inference of genetic architecture using biologically annotated neural networks. PLOS Genetics. 17(8): e1009754.

Zoom link: https://stanford.zoom.us/j/92874055477?pwd=aThzNmpmNEQ1L2FjV0E5ZXF5SDR1UT09&from=addon
Password: 705300

PDF Flier

12/8 Lior Pachter - Fireside chat with Lior Pachter, moderated by Barbara Engelhardt

BIOMEDICAL DATA SCIENCE PRESENTS:

BIODS 260

12/08/22 1:30PM-2:50PM

MSOB X303

Lior Pachter Fireside chat

Moderated by Barbara Engelhardt

Lior Pachter was born in Ramat Gan, Israel, and grew up in Pretoria, South Africa, where he attended Pretoria Boys High School. After receiving a B.S. in Mathematics from Caltech in 1994, he left for MIT, where he was awarded a PhD in applied mathematics in 1999. He then moved to the University of California at Berkeley where he was a postdoctoral researcher (1999-2001), assistant professor (2001-2005), associate professor (2005-2009), and until 2018 the Raymond and Beverly Sackler professor of computational biology and professor of mathematics and molecular and cellular biology with a joint appointment in computer science. Since January 2017 he has been the Bren professor of computational biology at Caltech.

His research interests span the mathematical and biological sciences, and he has authored over 100 research articles in the areas of algorithms, combinatorics, comparative genomics, algebraic statistics, molecular biology and evolution. He has taught a wide range of courses in mathematics, computational biology and genomics. He is a Fellow of the International Society of Computational Biology and has been awarded a National Science Foundation CAREER award, a Sloan Research Fellowship, the Miller Professorship, and a Federal Laboratory Consortium award for the successful technology transfer of widely used sequence alignment software developed in his group.

12/1 Marina Sirota - Leveraging Molecular and Clinical Data to Improve Women’s Health in the Era of Precision Medicine

BIODS 260
12/1/22
1:30 pm-2:30 pm
MSOB x303
Marina Sirota, PhD
Associate Professor and Associate Director of Advocacy and Outreach, Bakar Computational Health Sciences Institute (BCHSI), University of California, San Francisco

TITLE:

Leveraging Molecular and Clinical Data to Improve Women’s Health in the Era of Precision Medicine

ABSTRACT:

Each year, 15 million babies (representing 10% of the world’s births) are born preterm, defined as before the 37th week of gestation. Survival for most children born preterm has improved considerably, but surviving children remain at increased risk for a variety of serious complications, many of which contribute to lifelong challenges for individuals and their families, as well as to burdensome economic costs to society. The exact mechanism of spontaneous preterm birth is unknown, though a variety of social, environmental, and maternal factors have been implicated in its cause. We are in particular interested in applying computational integrative methods to investigate the role of the immune system in pregnancy (Cell Press Sneak Peek 2021) and elucidating genetic (Sci Rep 2018), transcriptomic (Front Immunol 2018), microbiome (Front Microbio 2020), environmental (Environ Health 2018), and clinical determinants of preterm birth. Moreover, through the March of Dimes (MOD) Database for Preterm Birth Research, we are leading efforts to organize scientific data and research across all MOD-funded Prematurity Research Centers with the goal of enhancing research collaboration and coordination to accelerate the overall pace of discovery in this field (Sci Data 2018). This work is funded by the National Library of Medicine at NIH, March of Dimes and The Burroughs Wellcome Fund.

SUGGESTED READING:

https://www.nature.com/articles/s41598-021-91625-1

https://doi.org/10.1186/s12916-022-02522-x

WEBSITE: http://sirotalab.ucsf.edu

Zoom link: https://stanford.zoom.us/j/92874055477?pwd=aThzNmpmNEQ1L2FjV0E5ZXF5SDR1UT09&from=addon
Password: 705300

PDF Flier

Spring 2022

Abstracts, when available, are included in the drop-down

3/31 Matthew Jones, Bioinformatics PhD candidate at UC San Francisco and UC Berkeley, advised by Jonathan Weissman and Nir Yosef - Algorithmic tools for single-cell lineage tracing to illuminate the phylodynamics, plasticity, and transcriptional paths of tumor evolution

4/7 Dianbo Liu, PhD, Postdoctoral fellow and leader of the humanitarian AI team, Prof. Yoshua Bengio's group, Mila - Quebec AI Institute, Canada, and Research affiliate, The Broad Institute of MIT and Harvard, USA - Improve accessibility and fairness in healthcare using generalizable artificial intelligence

4/14 Jennifer Listgarten, PhD, Professor, UC Berkeley Department of Electrical Engineering and Computer Science and Center for Computational Biology - Machine Learning-Based Protein Engineering

Abstract (PDF)

4/21 Collin M. Stultz, MD, PhD, Nina T. and Robert H. Rubin Professor in Medical Engineering and Science, Professor of Electrical Engineering and Computer Science, Harvard-MIT Division of Health Sciences and Technology, MIT, Division of Cardiovascular Medicine, Massachusetts General Hospital - Artificial Intelligence in Clinical Medicine: Challenges, Obstacles, and Opportunities

4/28 Serena Wang, PhD student in Computer Science, UC Berkeley, advised by Rediet Abebe and Michael I. Jordan - Out of Scope, Out of Mind: Expanding Frontiers for Fair ML in Social Decision Making

5/5 Aaron Newman, PhD, Assistant Professor of Biomedical Data Science, Stanford - Decoding stem cell hierarchies and cellular ecosystems in cancer

5/12 James Zou, PhD, Assistant Professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering, Stanford - AI for clinical trials and clinical trials for AI

5/19 Steven E. Brenner, PhD, Professor at the Department of Plant and Microbial Biology, University of California Berkeley - Prediction potential and pitfalls in pervasive population personal genomics: Interpreting newborn genomes with Notes on privacy timebombs in functional genomics data.

5/26 Elizabeth Stuart, PhD, Associate Dean for Education, and Professor of Mental Health, of Biostatistics, and of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health - Study designs to estimate policy effects using large-scale data: Applications to COVID-19 and opioid policies.

Winter 2022

Abstracts, when available, are included in the drop-down

Anshul Kundaje, Assistant Professor of Genetics and of Computer Science, Stanford - Deep learning oracles for genomic discovery

The human genome sequence contains the fundamental code that defines the identity and function of all the cell types and tissues in the human body. Genes are functional sequence units that encode for proteins. But they account for just about 2% of the 3 billion long human genome sequence. What does the rest of the genome encode? How is gene activity controlled in each cell type? Where do the regulatory control elements lie and what is their sequence composition? How do variants and mutations in the genome sequence affect cellular function and disease? These are fundamental questions that remain largely unanswered. The regulatory code that controls gene activity is encoded in the DNA sequence of millions of cell type specific regulatory DNA elements in the form of functional sequence syntax. This regulatory code has remained largely elusive despite exciting developments in experimental techniques to profile molecular properties of regulatory DNA. To address this challenge, we have developed neural networks that can learn de-novo representations of regulatory DNA sequence to map genome-wide molecular profiles of protein DNA interactions and chromatin state at unprecedented resolution and accuracy across diverse cellular contexts. We have developed methods to interpret DNA sequences through the lens of the models and extract local and global predictive syntactic patterns revealing many insights into the regulatory code. Our models also serve as in-silico oracles to predict the effects of natural and disease-associated genetic variation i.e. how differences in DNA sequence across healthy and diseased individuals are likely to affect molecular mechanisms associated with common and rare diseases. These models enable optimized design of genome perturbation approaches to decipher functional properties of DNA and variants and serve as a powerful lens for genomic discovery.

David Kurtz, Assistant Professor of Medicine (Oncology), Stanford - Developing and translating liquid biopsies to create a personalized dynamic risk model during cancer therapy

Predicting an individual’s response to treatment remains a major challenge in the care of patients with cancer. Liquid biopsies – a group of biomarkers to detect cancer from circulating tumor DNA (ctDNA) – are promising tools to measure treatment response and residual disease. Moreover, due to their ease of evaluation, liquid biopsies readily allow serial evaluation over time to monitor response and emerging resistance to therapy.

Despite this promise, current approaches for implementing liquid biopsies are limited in two key ways. First, the sensitivity of current approaches for detecting minimal residual disease is limited. Second, current clinical paradigms for risk stratification largely rely on static features measured at a single point in time. To address these, our group has developed novel molecular and statistical frameworks to improve the performance of liquid biopsies. First, we developed Phased Variant Enrichment and Detection by Sequencing (PhasED-Seq), a novel platform to detect ctDNA in the parts-per-million range. This method allows for measurement of residual disease that is undetectable by other approaches. Later, we developed a framework to integrate serial ctDNA measurements with conventional prognostic features to produce a single personalized prediction of likely outcome to cancer therapy. This method, called the Continuous Individualized Risk Index (CIRI), adds additional information as it is measured throughout a patient’s course of therapy to dynamically update the probability of outcomes for an individual patient. We demonstrate that CIRI can improve on risk predictions from conventional risk tools in diverse cancers, including lymphomas, leukemias, and breast cancer.

Sherri Rose, Associate Professor of Health Policy, Stanford - Fairness and Generalizing to Target Populations

Data science solutions tailored to problems in the health care system are particularly salient for social policy given the size and impact of the health care sector. This presentation will discuss specific challenges related to developing and deploying statistical machine learning algorithms for health economics and outcomes research, with emphasis on groups marginalized by the health care system. These considerations go beyond typical statistical methodology and focus on concepts such as algorithmic fairness and generalizability, as well as the intersection between them.

Major examples in this talk will examine health care costs. Financing changes in the health care system can lead to improved health outcomes and gains in access to care. More than 50 million people in the U.S. are enrolled in an insurance product that risk adjusts payments, and this has huge financial implications—hundreds of billions of dollars. Risk adjustment for health plan payment is known to undercompensate insurers for certain groups of enrollees, creating incentives for insurers to discriminate against these groups. The impact of individual plans with varying benefit design on health spending among low-income individuals has also been underexplored. We will discuss and demonstrate new algorithmic fairness methods and generalizability approaches for these problems.

Bianca Dumitrascu, Departmental Early Career Fellow, Computer Laboratory, Cambridge University - Statistical machine learning for genetics and health: multi-modality, interpretability, mechanism

Genomic and medical data are available at unprecedented scales. This is due, in part, to improvements and developments in data collection, high throughput sequencing, and imaging technologies. How can we extract lower dimensional representations from these high dimensional data in a way that retains fundamental biological properties across different scales? Three main challenges arise in this context: how to aggregate information across different experimental modalities, how to enforce that such representations are interpretable, and how to leverage prior dynamical knowledge to provide new insight into mechanism. I will present my work on developing statistical machine learning models and algorithms to answer this question and address these challenges. First, I will present a generative model for learning representations that jointly model information from gene expression and tissue morphology in a population setting. Then, I will describe a method for making multi-modal representations interpretable using a label-aware compressive classification approach for gene panel selection in single cell data. Finally, I will discuss inference methods for models which encode mechanistic assumptions, a need that arises naturally in gene regulatory networks, predator-prey systems, and electronic health care records. Throughout this work, recent advances in machine learning and statistics are harnessed to bridge two worlds — the world of real, messy biological data and that of methodology and computation. This talk describes the importance of domain knowledge and data-centric modeling in motivating new statistical venues and introduces new ideas that touch upon improving experimental design in biomedical contexts.

Molei Liu, Ph.D. Candidate, Harvard T.H. Chan School of Public Health - Integrative Analysis of Multi-Institutional Electronic Health Record Data: overcoming noisiness, heterogeneity and privacy constraints

Precise statistical modeling and inference often rely on integrative analysis of datasets from multiple sites. Such modern meta-analysis can be uniquely challenging for electronic health record (EHR) data due to noisiness, high dimensionality, heterogeneity, and privacy constraints. I will present a novel statistical framework and approaches to overcome these practical challenges. Specifically, we develop three methods for individual-information-protected aggregation of multi-institutional, large-scale, and heterogeneous EHR data sets, aimed at sparse regression, multiple testing, and surrogate-assisted semi-supervised learning, respectively. Through both asymptotic analysis and numerical experiments, we demonstrate that our proposed methods outperform existing options and perform closely to the ideal individual-patient-data pooling analysis that is not feasible due to the privacy constraint. We illustrate the use of our methods in real EHR-based studies, including EHR phenotyping for cardiovascular disease and inferring genetic associations of type II diabetes linked with biobank data.

Stephanie Hicks, Assistant Professor, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health - Making single-cell data science accessible, reproducible, and scalable to improve human health

Single-cell RNA-Seq (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. However, single-cell data present unique challenges that have required the development of specialized methods and software infrastructure to successfully derive biological insights. Compared to bulk RNA-seq, there is an increased scale of the number of observations (or cells) that are measured and there is increased sparsity of the data, or fraction of observed zeros. Furthermore, as single-cell technologies mature, the increasing complexity and volume of data require fundamental changes in data access, management, and infrastructure alongside specialized methods to facilitate scalable analyses. I will discuss some challenges in the analysis of scRNA-seq data and present some solutions that we have made towards addressing these challenges.

Daniela Witten, Professor of Biostatistics and of Statistics, University of Washington School of Public Health - Inference for single-cell RNA-sequencing data

When analyzing single-cell RNA-sequencing data, we often wish to perform unsupervised learning of latent structure among the cells, and then to test for association between this latent structure and gene expression. For example, we might cluster the cells into cell types, and then test whether gene expression differs between the clusters. Or we might estimate a low-dimensional subspace representing a cellular developmental trajectory, and then test whether gene expression is correlated with this trajectory. However, a classical statistical test of the association between gene expression and the latent structure will not control the Type 1 error, since the latent structure was estimated on the same data used for hypothesis testing. Furthermore, a straightforward sample splitting approach does not fix the problem.

In this talk, I will discuss two solutions to this problem. The first involves selective inference, and the second involves “count splitting”, a simple variant of sample splitting that does control the Type 1 error.

This is joint work with current PhD students Anna Neufeld and Yiqun Chen, PhD alum Lucy Gao (now at U. Waterloo), and collaborators Jacob Bien (USC) and Alexis Battle and Joshua Popp (Johns Hopkins).
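As a rough illustration of the count-splitting idea described above, the sketch below thins each count binomially so that, under a Poisson assumption, the two resulting matrices are independent: one is used to estimate the latent structure (e.g., clusters), the other is reserved for inference. This is a minimal sketch with simulated data, not the speakers' implementation.

```python
import numpy as np

def count_split(X, eps=0.5, seed=0):
    """Binomial thinning: split a count matrix X into two parts that are independent
    when the counts are Poisson. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    X_train = rng.binomial(X, eps)   # use to estimate latent structure (e.g., clusters)
    X_test = X - X_train             # reserve for downstream hypothesis testing
    return X_train, X_test

# Toy usage: 100 cells x 50 genes of Poisson counts.
X = np.random.default_rng(1).poisson(5.0, size=(100, 50))
X_tr, X_te = count_split(X)
```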

Nicholas Tatonetti, Associate Professor of Biomedical Informatics (in System Biology and Medicine), Columbia University - Observational data mining for biomedical discoveries

Data is transforming the scientific method across many domains. In drug safety, data from electronic health records and search logs are now being collected alongside traditional sources like case studies, spontaneous reporting systems, and model systems. These data sources present new opportunities for studying the phenomenological and molecular effects of active small molecules. However, they also present new challenges in data integration, statistical analysis, and the nature of hypothesis generation. I will discuss these opportunities and challenges and their role in the future of drug safety science.

Jingyi (Jessica) Li, Associate Professor, Biostatistics, Statistics, Human Genetics, UCLA - Opening Black Boxes: Enhancing Statistical Rigor in Genomics Data Science

The rapid development of genomics technologies has propelled fast advances in genomics data science. While new computational algorithms have been continuously developed to address cutting-edge biomedical questions, a critical but largely overlooked aspect is the statistical rigor. In this talk, I will introduce our recent work that aims to enhance the statistical rigor by addressing three issues: 1. Large-scale feature screening (i.e., enrichment and differential analysis of high-throughput data) relying on ill-posed p-values; 2. Double-dipping (i.e., statistical inference on biasedly altered data); 3. Gaps between black-box generative models and statistical inference.

F. William Townes, Postdoctoral Researcher, Princeton Computer Science Department - Nonnegative Spatial Factorization

Gaussian processes are widely used for the analysis of spatial data due to their nonparametric flexibility and ability to quantify uncertainty, and recently developed scalable approximations have facilitated application to massive datasets. For multivariate outcomes, linear models of coregionalization combine dimension reduction with spatial correlation. However, their real-valued latent factors and loadings are difficult to interpret because, unlike nonnegative models, they do not recover a parts-based representation. We present nonnegative spatial factorization (NSF), a spatially-aware probabilistic dimension reduction model that naturally encourages sparsity. We compare NSF to real-valued spatial factorizations such as MEFISTO and nonspatial dimension reduction methods using simulations and high-dimensional spatial transcriptomics data. NSF identifies generalizable spatial patterns of gene expression. Since not all patterns of gene expression are spatial, we also propose a hybrid extension of NSF that combines spatial and nonspatial components, enabling quantification of spatial importance for both observations and features.

Susan Holmes, Professor of Statistics, & Laura Symul, Postdoctoral Scholar, Statistics - Topic modeling for understanding multi-domain data from the Vaginal Microbiome

Diverse and non-Lactobacillus-dominated vaginal microbial communities are associated with adverse health outcomes such as preterm birth and acquisition of sexually transmitted infections. Despite the importance of recognizing and understanding the key risk-associated features of these communities, their heterogeneous structure and properties remain ill-defined. Clustering approaches have been commonly used to characterize vaginal communities, but they lack sensitivity and robustness in resolving community substructures and revealing transitions between potential sub-communities. We used a more highly resolved approach based on mixed membership topic models with multi-domain longitudinal data from cohorts of pregnant and non-pregnant subjects. We have developed a specific tool, alto, that facilitates the visualization and choice of the number and interpretation of the topics involved.

Overall, our analyses based on mixed membership models revealed new substructures of the vaginal ecosystems which may have potentially important clinical or biological associations.
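For readers unfamiliar with mixed membership (topic) models in this setting, the sketch below fits a standard latent Dirichlet allocation model to a toy samples-by-taxa count table. It only illustrates the model class, not the alto workflow described above, and all data are simulated.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy mixed-membership (topic) model on a samples-by-taxa count table: each "topic"
# is a sub-community of taxa, and each sample is a mixture of topics.
rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(40, 25))            # 40 samples, 25 taxa
lda = LatentDirichletAllocation(n_components=4, random_state=0)
sample_topic_weights = lda.fit_transform(counts)    # per-sample topic proportions
topic_taxa_weights = lda.components_                # per-topic taxa weights
print(sample_topic_weights.shape, topic_taxa_weights.shape)
```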

Julia Simard, Associate Professor of Epidemiology & Population Health and, by courtesy, of Medicine (Immunology & Rheumatology) - Misclassification in “big data” in epidemiologic research: examples from reproductive outcomes and pharmacoepidemiology

Based on ongoing work examining the potential protective effect of hydroxychloroquine on preeclampsia in lupus pregnancies, this talk will present two instances of potential misclassification.

The first will focus on the challenges of phenotyping preeclampsia, particularly early-onset preeclampsia. The second example considers how adherence to medication is measured (and interpreted). In both cases, we’ll show how complementary data can help characterize the misclassification, brainstorm potential solutions, and discuss the implications from clinical care to the basic sciences.

Autumn 2021

Abstracts, when available, are included in the drop-down

Chiara Sabatti, Professor of Biomedical Data Science and of Statistics, Stanford - Genetic variants across human populations—how similarities and differences play a role in our understanding of the genetic basis of traits

Identifying which genetic variants influence medically relevant phenotypes is an important task both for therapeutic development and for risk prediction. In the last decade, genome wide association studies have been the most widely-used instrument to tackle this question. One challenge that they encounter is in the interplay between genetic variability and the structure of human populations. In this talk, we will focus on some opportunities that arise when one collects data from diverse populations and present statistical methods that allow us to leverage them. The presentation will be based on joint work with M. Sesia, S. Li, Z. Ren, Y. Romano and E. Candes.

Emiley Eloe-Fadrosh, Metagenome Program Head & Environmental Genomics Group Lead, DOE Joint Genome Institute, Lawrence Berkeley National Lab - Big data from tiny microbes across Earth’s ecosystems

Genome-resolved metagenomics has enabled unprecedented insights into the ecology and evolution of environmental and host-associated microbiomes. This powerful approach is scalable and was applied to over 10,000 metagenomes collected from diverse habitats to generate an extensive catalog of microbial diversity. In collaboration with a large research consortium, we highlight how this genomic catalog can support the discovery of new biosynthetic gene clusters and the association of environmental viruses with their microbial hosts.

Pang Wei Koh, PhD Student, Computer Science, Stanford - Modeling the spread of COVID-19 with large-scale dynamic mobility networks

We develop epidemiological models on top of dynamic mobility networks, derived from US cell phone data, that capture the hourly movements of millions of people from local neighborhoods to points of interest such as restaurants, grocery stores, or religious establishments. These models correctly predict higher infection rates among disadvantaged racial and socioeconomic groups, and enable fine-grained analysis of disease spread that can inform more effective and equitable policy responses to COVID-19. *This presentation is based on joint work with Serina Chang, Emma Pierson, Jaline Gerardin, Beth Redbird, David Grusky, Jure Leskovec, and many others.

Stefan Wager, Associate Professor of Operations, Information and Technology, and (by courtesy) of Statistics, Stanford - Noise-Induced Randomization in Regression Discontinuity Designs

Regression discontinuity designs are used to estimate causal effects in settings where treatment is determined by whether an observed running variable crosses a pre-specified threshold. While the resulting sampling design is sometimes described as akin to a locally randomized experiment in a neighborhood of the threshold, standard formal analyses do not make reference to probabilistic treatment assignment and instead identify treatment effects via continuity arguments. Here we propose a new approach to identification, estimation, and inference in regression discontinuity designs that exploits measurement error in the running variable. Under an assumption that the measurement error is exogenous, we show how to consistently estimate causal effects using a class of linear estimators that weight treated and control units so as to balance a latent variable of which the running variable is a noisy measure. We find this approach to facilitate identification of both familiar estimands from the literature, as well as policy-relevant estimands that correspond to the effects of realistic changes to the existing treatment assignment rule. We demonstrate the method with a study of retention of HIV patients and evaluate its performance using simulated data and a regression discontinuity design artificially constructed from test scores in early childhood. *This presentation is based on joint work with Dean Eckles, Nikolaos Ignatiadis, and Han Wu.

Moisés Expósito Alonso, Staff Associate, Departments of Plant Biology and Global Ecology, Carnegie Institution for Science; Assistant Professor (by courtesy) of Biology, Department of Biology, Stanford University - The genomics of climate adaptation (and extinction)

Ongoing climate change has put a spotlight on rapid evolutionary processes that could help species adapt to new environments. But what is the architecture of fitness across environments? Is it predictable? Can we understand genetic constraints across multiple adaptive traits? How is genetic variation lost during extinction? To address these questions, we combine statistical genomics with experimental ecology and genetic engineering approaches, using the plant species Arabidopsis thaliana as our experimental climate change genetics model and scaling up insights with publicly available genomes of diverse plant species.

Serena Yeung, Assistant Professor of Biomedical Data Science and, by courtesy, of Computer Science and of Electrical Engineering, Stanford - Using Computer Vision to Augment Clinician Capabilities Across the Spectrum of Healthcare Delivery

Clinicians often work under highly demanding conditions to deliver complex care to patients. As our aging population grows and care becomes increasingly complex, physicians and nurses are now also experiencing feelings of burnout at unprecedented levels. In this talk, I will discuss possibilities for computer vision to augment clinician capabilities in various examples across the spectrum of healthcare delivery. I will present ongoing work in settings ranging from surgical procedures to ambient ICU monitoring to telemedicine at home.

Mallory Harris, PhD Student in Biology and 2019 Knight-Hennessy Scholar, Stanford - Scientific Integrity & Academic Accountability

During the COVID-19 pandemic, the rapid, mass dissemination of research has accelerated scientific progress and likely saved many lives. However, existing scientific infrastructures (e.g., peer review) have struggled to adapt to fast-paced, highly polarized political and media ecosystems. This seminar will examine case studies involving retracted papers, pre-prints that garnered significant press coverage, and public scientific communications. We will discuss the responsibilities of researchers in the context of a public health crisis and consider proposed reforms related to scientific publishing and communications.

Jan Schellenberger, Senior Staff Software Engineer; Nan Zhang, Vice President of Biostatistics and Data Management; and Jing Zhang, Associate Biostatistical Director, GRAIL, LLC - A Targeted Methylation-based Multi-Cancer Early Detection Test and the Modeled Clinical Utility with a Novel Microsimulation Method

Cancer screening aims to prevent cancer death by detecting cancer early, when treatment is more effective. However, most cancers lack available screening modalities, and the diagnosis of these cancers (typically prompted by clinical signs and symptoms) often occurs in later stages, when the cancer has already spread to other parts of the body and the chance of successful treatment and survival is much lower. GRAIL, LLC developed a noninvasive multi-cancer early detection (MCED) test, utilizing targeted methylation analysis of blood-based circulating tumor cell-free DNA and machine learning techniques. This MCED test can detect cancer signals for more than 50 types of cancer as defined by the American Joint Committee on Cancer (AJCC) across all stages with a false-positive rate of 0.5%. In this workshop, we will talk about the machine learning classifier development process and the clinical validation of this MCED test. We will also discuss a novel microsimulation method that was proposed for this MCED test to mimic the world’s largest pragmatic randomized controlled trial (PRCT) and to evaluate the clinical utility. Individual-level cancer state transitions were simulated for all participants in the two arms of the PRCT to assess stage shift and mortality reduction. The implementation utilizes the programming language Julia, which is fast and enables straightforward parallelization.
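A heavily simplified sketch of the stage-shift idea behind such a microsimulation is given below, written in Python rather than Julia, with made-up transition and detection probabilities; it is not GRAIL's model.

```python
import numpy as np

def stage_at_diagnosis(n, p_progress, p_screen_detect, screened, seed=0):
    """Toy microsimulation: each person starts at stage 1 and moves up one stage per
    year with probability p_progress; in the screened arm, an annual test detects the
    cancer at its current stage with probability p_screen_detect. Without detection,
    diagnosis happens clinically at stage 4. Illustrates stage shift only."""
    rng = np.random.default_rng(seed)
    stages = np.empty(n, dtype=int)
    for i in range(n):
        stage = 1
        while stage < 4:
            if screened and rng.random() < p_screen_detect:
                break                                  # detected early by screening
            if rng.random() < p_progress:
                stage += 1
        stages[i] = stage
    return stages

control = stage_at_diagnosis(5000, 0.4, 0.0, screened=False)
screened_arm = stage_at_diagnosis(5000, 0.4, 0.25, screened=True)
print(control.mean(), screened_arm.mean())             # screened arm: earlier stages
```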

John Witte, Vice Chair and Professor in the Department of Epidemiology & Population Health, and Professor of Biomedical Data Science and, by courtesy, of Genetics - Polygenic Risk Scores: Methods and Models

Polygenic risk scores (PRS) provide a promising avenue for incorporating germline genetic information into prediction models for traits and diseases. However, most PRS have been developed in European ancestry populations and can have poor predictive performance in other populations, which may in turn exacerbate health disparities. To try and address such limitations, several different PRS methods and models have been developed, ranging from including tens to hundreds of genome-wide significant variants using ‘Pruning and Thresholding’ approaches to including millions of variants from across the genome using Bayesian shrinkage. In this talk, Dr. Witte will first show how PRS can provide valuable evidence for predicting cancer risk beyond known non-genetic risk factors (e.g., age, smoking) by application to the UK Biobank. Second, he will present simulation results highlighting the limited transferability of PRS across admixed populations. Third, he will contrast the performance of different PRS methods and models, including efforts to improve accuracy across diverse populations. Finally, Dr. Witte will discuss PRS approaches being considered by large-scale consortia, and by Stanford Health in a pilot study of PRS in cardiovascular disease, breast cancer, and prostate cancer.
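For context, under a pruning-and-thresholding style construction a PRS is essentially an effect-size-weighted sum of allele counts over selected variants. The toy sketch below shows that weighted sum (thresholding only, with illustrative numbers and no LD pruning); it is not a substitute for the full methods compared in the talk.

```python
import numpy as np

def prs(genotypes, betas, pvals, p_threshold=5e-8):
    """Toy thresholding PRS: keep GWAS-significant variants and score each
    individual as the effect-size-weighted allele count."""
    keep = pvals < p_threshold
    return genotypes[:, keep] @ betas[keep]

# Toy data: 3 individuals x 5 variants (0/1/2 allele counts), with made-up effects.
G = np.array([[0, 1, 2, 0, 1],
              [1, 1, 0, 2, 0],
              [2, 0, 1, 1, 1]])
beta = np.array([0.12, -0.05, 0.30, 0.08, -0.20])
p = np.array([1e-9, 0.2, 3e-8, 1e-6, 4e-10])
print(prs(G, beta, p))   # one risk score per individual
```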

Donald Redelmeier, Professor of Medicine at the University of Toronto; Canada Research Chair in Medical Decision Sciences; Director of Clinical Epidemiology at Sunnybrook Health Sciences Centre; Senior Scientist at the Institute for Clinical Evaluative Sciences in Ontario; Staff physician in the Division of General Internal Medicine at Sunnybrook Hospital - COVID Vaccine Hesitancy and Risk of a Traffic Crash

COVID vaccine hesitancy is a reflection of judgment, reasoning, and other psychological influences that may also contribute to traffic safety. We tested whether COVID vaccine hesitancy was associated with an increased risk of a serious traffic crash. A total of 11,270,763 adults were identified, of whom 16% had not received a COVID vaccine and 84% had received a COVID vaccine. Those who had not received the vaccine accounted for a disproportionate number of crashes, equivalent to a significantly increased traffic risk. The association between a lack of COVID vaccination and increased traffic risks extended to diverse patient subgroups, persisted after adjusting for measured baseline differences, applied across a spectrum of crash severity, and was similar to the relative risk associated with a diagnosis of sleep apnea. We suggest that COVID vaccine hesitancy is associated with a significantly increased risk of a serious traffic crash. An awareness of this counterintuitive finding might contribute to more public support for the COVID vaccine.

Spring 2021

Abstracts, when available, are included in the drop-down

Hilary Finucane, Co-Director of the Program in Medical and Population Genetics, Broad Institute - Leveraging local and polygenic signal for GWAS gene prioritization

Prioritizing likely causal genes from GWAS data is a fundamental problem. There are many methods for GWAS gene prioritization, including methods that map candidate SNPs to their target genes, and methods that leverage patterns of enrichment from across the genome. In this talk, I will introduce a new method for leveraging genome-wide patterns of enrichment to prioritize genes at GWAS loci, incorporating information about genes from many sources. I will then discuss the problem of benchmarking gene prioritization methods, and I will describe a large-scale analysis to benchmark many different methods and combinations of methods on data from the UK Biobank. Our analyses show that the highest confidence can be achieved by combining multiple lines of evidence, and I will conclude by giving examples of genes prioritized in this way.

Access Zoom recording.

Julia Salzman, Associate Professor of Biochemistry and of Biomedical Data Science, Stanford - RNA splicing at single cell resolution

Single cell sequencing experiments almost exclusively analyze cells’ “gene expression”. Yet, more than 95% of human genes are alternatively spliced or subject to regulated RNA processing that can result in isoforms of the same “gene” with opposite functions. Despite the need to quantify RNA variants of the same gene, methods to do so are lacking in the field. Dr. Salzman will discuss two new statistical methods to identify cell type-specific RNA regulation: 1) the “Splicing Z score” (SpliZ), a generalization of “Percent Spliced In” (PSI), a metric on which the splicing field has long relied; and 2) a new metric to detect alternative RNA processing. Both methods can be applied to any RNA-seq data; however, our focus is on using them to study single cell RNA-seq. Dr. Salzman will discuss analysis of single cell RNA-seq data, including findings from analysis of >100K carefully curated single cells from 12 human tissues and spermatogenic development.
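For reference, the classical Percent Spliced In metric that the SpliZ generalizes is simply the fraction of junction reads supporting exon inclusion; a toy calculation is shown below (the SpliZ itself is not reproduced here).

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent Spliced In: fraction of junction reads supporting exon inclusion."""
    total = inclusion_reads + exclusion_reads
    return inclusion_reads / total if total > 0 else float("nan")

print(psi(30, 10))  # 0.75: the exon is included in ~75% of observed transcripts
```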

Access Zoom recording.

Hongzhe Li (Lee), Perelman Professor of Biostatistics, Epidemiology and Informatics, University of Pennsylvania School of Medicine - Interrogating the Gut Microbiome: Estimation of Bacterial Growth Rates and Prediction of Biosynthetic Gene Clusters

The gut microbiome plays an important role in maintenance of human health. High-throughput shotgun metagenomic sequencing of a large set of samples provides an important tool to interrogate the gut microbiome. Besides providing footprints of taxonomic community composition and genes, these data can be further explored to study the bacterial growth dynamics and metabolic potentials via generation of small molecules and secondary metabolites. In this talk, Dr. Lee will present several computational and statistical methods for estimating the bacterial growth rate for metagenome-assembled genomes (MAGs) and for predicting all biosynthetic gene clusters (BGCs) in bacterial genomes. The key statistical and computational tools used include optimal permutation recovery based on low-rank matrix projection and improved LSTM deep learning methods to improve prediction of BGCs. He will demonstrate the application of these methods using several ongoing microbiome studies of inflammatory bowel disease at the University of Pennsylvania.

Access Zoom recording.

Katherine S. Pollard, Professor, Department of Epidemiology & Biostatistics, Institute for Human Genetics, and Institute for Computational Health Sciences, UC San Francisco - CellWalker: a network model to resolve gene regulatory elements in single cells

Genomics can be used to quantify gene expression and characterize gene regulatory elements in human tissues, including how genetic mutations alter gene regulation in disease. Single-cell and bulk tissue experiments have complementary strengths and weaknesses, and alone neither strategy can fully capture regulatory elements across the diversity of cells in complex tissues. To solve this problem, we developed CellWalker, a method that integrates single-cell open chromatin (scATAC-seq) data with gene expression (RNA-seq) and epigenetic data using a network model. We demonstrate the model’s robustness to sparse annotations and measurement noise using simulations and combined RNA-seq and ATAC-seq in individual cells. Applying CellWalker to the developing brain and to heart disease, we identify cells transitioning between states, resolve regulatory elements to cell types, and map disease mutations to specific kinds of cells. Our modeling approach has been implemented in an R package called CellWalkR.

Access Zoom recording.

Vineet Bafna, Professor of Computer Science, UC San Diego Health Sciences - Extrachromosomal and other mechanisms of oncogene amplification in cancer

Increase in the number of copies of tumor promoting (onco-) genes is a hallmark of many cancers, and cancers with copy number amplifications are often associated with poor outcomes. Despite their importance, the mechanisms causing these amplifications are incompletely understood. In this talk, we describe our recent results suggesting that a large fraction of amplification is due to formation of extrachromosomal DNA (ecDNA). EcDNA play a critical role in tumor heterogeneity, accelerated cancer evolution, and drug resistance through their unique mechanism of non-chromosomal inheritance. While predominant, ecDNA are not the only mechanism to cause amplification. We also describe recent algorithmic methods required to distinguish ecDNA from other mechanisms including Breakage Fusion Bridge formation, Chromothripsis, and simpler events such as tandem duplications and translocations. The talk is a mix of published and unpublished work, largely in collaboration with Paul Mischel’s lab at UCSD. EcDNA was recently recognized as one of the grand challenges of cancer research by Cancer Research UK and the National Cancer Institute.

Access Zoom recording.

Rumi Chunara, Associate Professor of Engineering (Computer Science) and Global Public Health (Biostatistics/Epidemiology), New York University - Machine Learning for Health and Equity & Health and Equity for Machine Learning

As machine learning methods become embedded in society, it has become clear that the data used, objectives selected, and questions we ask are all critical. My work looks at data and machine learning from a public health and equity lens. First, this motivates the design and development of data mining and machine learning methods to address challenges related to the data and goals of public health, such as generating better hyper-local features to represent environmental attributes while addressing challenges of sparsity, irregularity, and representativeness of the data. Second, principles of community and equity inspire innovations in machine learning. In this realm my work has leveraged causal models and machine learning to address realistic challenges of data collection and model use across environments, such as domain adaptation that improves prediction in under-represented population sub-groups by leveraging invariant information across groups when possible, and developing models that specifically incorporate structural factors to better account for and address sources of bias and disparities. A focus on public health, which is concerned with the individual, collective, and environmental factors that affect the health of human populations, provides a principled approach spanning data, algorithms, and questions to both mitigate bias and proactively design inclusive innovations.

Access Zoom recording.

Peter Szolovits, Professor of Computer Science and Engineering, Massachusetts Institute of Technology - How can computers provide clinical decision support?

The history of computers collecting and analyzing clinical data goes back to the 1950s, beginning with Alan Turing’s speculations and continuing with Morris Collen’s pioneering collection of multiphasic health checkup data at Kaiser, and Robert Ledley and Lee Lusted’s outline of how one could use symbolic logic and decision analysis to help understand physicians’ reasoning. The 1960s to 1980s saw development of new models for clinical reasoning based on rules, matching of prototypical patterns, explicit modeling of human decision making, sophisticated methods of probabilistic reasoning, etc., and applications of these ideas to clinical decision support. Since the 1990s, large collections of experiential data from electronic health records have become more widely available, and consequently we have turned to machine learning methods to build classification and prediction models from such data. The next challenges, in my view, focus on how one can combine biomedical knowledge with clinical data collections of heterogeneous data types to learn models that are accurate, explainable, fair, and clinically useful.

Access Zoom recording.

Susan Murphy, Professor of Statistics and of Computer Science, and Radcliffe Alumnae Professor at the Radcliffe Institute, Harvard University - We used RL but…. Did it work?!

Digital healthcare is a growing area of importance in modern healthcare due to its potential in helping individuals improve their behaviors so as to better manage chronic health challenges such as hypertension, mental health, cancer and so on. Digital apps and wearables observe the user’s state via sensors/self-report, deliver treatment actions (reminders, motivational messages, suggestions, social outreach,…) and observe rewards repeatedly on the user across time. This area is seeing increasing interest from reinforcement learning (RL) researchers with the goal of including in the digital app/wearable an RL algorithm that “personalizes” the treatments to the user. But after RL is run on a number of users, how do we know whether the RL algorithm actually personalized the sequential treatments to the user? In this talk, we report on our first efforts to address this question after our RL algorithm was deployed on individuals with hypertension.

Access Zoom recording.

Jian Ma, Associate Professor, Computational Biology, School of Computer Science, Carnegie Mellon University - Probing the Nuclear Organization with Machine Learning

The chromosomes of the human genome are organized in three dimensions by compartmentalizing the cell nucleus, and different genomic loci also interact with each other. However, the principles underlying such nuclear genome organization and its functional impact remain poorly understood. In this talk, I will introduce some of our recent work in developing machine learning methods by utilizing whole-genome mapping data to study the higher-order genome organization. In particular, I will highlight a new method based on hypergraph representation learning to probe the 3D genome structures in single cells. Our methods reveal the spatial localization of chromosome regions and exploit chromatin interactome patterns within the cell nucleus in different cellular conditions and at different scales. We hope that these algorithms will provide new insights into the structure and function of nuclear organization in health and disease.

Access Zoom recording.

Molly Przeworski, Professor of Biological Sciences and of Systems Biology, Columbia University - What drives the dependence of human germline mutation on sex, age and time?

Germline mutation is the source of all heritable differences and therefore of fundamental importance. In mammals, we know from longstanding analyses of phylogenetic patterns on the X-chromosome and autosomes, and from recent studies of human pedigrees, that most mutations come from fathers, and more are transmitted from older parents than from younger ones. The textbook view is that these patterns reflect replication errors, as there are both more mutations and more cell divisions in the male than in the female germline. I will present multiple lines of evidence that call this view into question. I will argue instead that current data are best explained by a much larger role of DNA damage in the genesis of germline mutations than previously appreciated, and draw out implications for why mutation rates depend on sex and age and how they evolve over time.

Winter 2021

Abstracts, when available, are included in the drop-down

Kathryn Roeder, UPMC Professor of Statistics and Life Sciences in the Departments of Statistics and Data Science and Computational Biology, Carnegie Mellon University - Statistical challenges in the analysis of single-cell RNA-seq from brain cells

Quantification of gene expression using single cell RNA-sequencing of brain tissues can be a critical step in understanding cell development and differences between cells sampled from case and control subjects. We describe statistical challenges encountered analyzing expression of brain cells in the context of two projects. First, over-correction has been one of the main concerns in employing various data integration methods, which risk removing biological distinctions and thereby harming cell type identification. Here, we present a simple yet surprisingly effective transfer learning model named cFIT for removing batch effects across experiments, technologies, subjects, and even species. Second, gene co-expression networks yield critical insights into biological processes, and single-cell RNA sequencing provides an opportunity to target inquiries at the cellular level. However, due to the sparsity and heterogeneity of transcript counts, it is challenging to construct accurate gene co-expression networks. We develop an alternative approach that estimates cell-specific networks for each single cell. We use this method to identify differential network genes in a comparison of cells from brains of individuals with autism spectrum disorder and those without.

Access Zoom recording.

Stephanie Hicks, Assistant Professor in the Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health - Scalable statistical methods and software for single-cell data science

Single-cell RNA-Seq (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. However, single-cell data present unique challenges that have required the development of specialized methods and software infrastructure to successfully derive biological insights. Compared to bulk RNA-seq, there is an increased scale of the number of observations (or cells) that are measured and there is increased sparsity of the data, or fraction of observed zeros. Furthermore, as single-cell technologies mature, the increasing complexity and volume of data require fundamental changes in data access, management, and infrastructure alongside specialized methods to facilitate scalable analyses. I will discuss some challenges in the analysis of scRNA-seq data and present some solutions that we have made towards addressing these challenges.

Access Zoom recording.

Peter Kharchenko, Gilbert S. Omenn Associate Professor of Biomedical Informatics, Harvard University - Comparative analysis of disease-oriented single-cell transcriptional collections

Single-cell RNA-seq assays are being increasingly applied in complex study designs, which involve measurements of many samples, commonly spanning multiple individuals, conditions, or tissue compartments. Joint analysis of such extensive, and often heterogeneous, sample collections requires a way of identifying and tracking recurrent cell subpopulations across the entire collection, and an effective way of exploring contrasts between samples. We develop comparative approaches for analysis of case-control study designs, commonly used to study the impact of disease or a drug on a particular tissue. The analysis starts by establishing probabilistic mapping – a joint graph – connecting all cells within the collection. The graph can then be used to propagate information between samples and to identify cell communities that show consistent grouping across broad subsets of the collected samples. The contrast between conditions is then formulated in terms of i) compositional shifts between different cell populations, ii) transcriptional shifts within the distinct cell populations, and iii) within-group expression state variability. The compositional data analysis techniques are applied in the context of cell type hierarchies to suggest the most parsimonious explanation of the changes. The differential expression analysis is used to identify the most affected cell types and to characterize the likely functional interpretation of the cell type-specific changes. We illustrate the application of these methods in the context of different studies of human disease, including cancer and diseases affecting the brain.

Access Zoom recording.

Richard Bonneau, Professor of Biology and Computer Science, and Director, Center for Genomics and Systems Biology, New York University - Inference of biological networks with biophysically motivated methods

Via a confluence of genomic technology and computational developments, the possibility of network inference methods that automatically learn large, comprehensive models of cellular regulation is closer than ever. This talk will focus on enumerating the elements of computational strategies that, when coupled to appropriate experimental designs, can lead to accurate large-scale models of chromatin-state and transcriptional regulatory structure and dynamics. We highlight four research questions that require further investigation in order to make progress in network inference: using overall constraints on network structure like sparsity, use of informative priors and data integration to constrain individual model parameters, estimation of latent regulatory factor activity under varying cell conditions, and new methods for learning and modeling regulatory factor interactions. I will contrast two recent studies from the lab that focus on inference from single-cell and spatial transcriptomics aimed at healthy and diseased brain and spinal tissues.

Access Zoom recording.

Xihong Lin, Professor of Biostatistics, Harvard T.H. Chan School of Public Health - Learning from COVID-19 Data on Transmission, Health Outcomes, Interventions and Vaccination

COVID-19 is an emerging respiratory infectious disease that has become a pandemic. In this talk, I will first provide a historical overview of the epidemic in Wuhan. I will then provide the analysis results of 32,000 lab-confirmed COVID-19 cases in Wuhan to estimate the transmission rates using Poisson Partial Differential Equation based transmission dynamic models. This model is also used to evaluate the effects of different public health interventions on controlling the COVID-19 outbreak, such as social distancing, isolation and quarantine. I will present the results on the epidemiological characteristics of the cases. The results show that multi-faceted intervention measures successfully controlled the outbreak in Wuhan. I will next present transmission regression models for estimating transmission rates in the USA and other countries, as well as the factors that affect transmission rates, including the effects of interventions such as social distancing and test-trace-isolate strategies. I will discuss estimation of the proportion of undetected cases, including asymptomatic, pre-symptomatic, and mildly symptomatic cases, the chances of resurgence in different scenarios, prevalence, and the factors that affect transmission. I will also present the US county-level analysis to study the demographic, social-economic, and comorbidity factors that are associated with COVID-19 case and death rates. I will also present the analysis results of >500,000 participants of the HowWeFeel project on health outcomes and behaviors in the US, and discuss the factors associated with infection, behavior, and vaccine hesitancy. I will provide several takeaways and discuss priorities.
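As background for readers unfamiliar with transmission dynamic models, the sketch below runs a bare-bones discrete-time SIR model. It is only meant to illustrate what a transmission rate is; it is not the Poisson partial-differential-equation model used in the talk, and the parameter values are made up.

```python
import numpy as np

def sir(beta, gamma, s0, i0, days):
    """Discrete-time SIR toy model with transmission rate beta and recovery rate gamma,
    tracking susceptible, infected, and recovered fractions of the population."""
    s, i, r = [s0], [i0], [0.0]
    for _ in range(days):
        new_inf = beta * s[-1] * i[-1]
        new_rec = gamma * i[-1]
        s.append(s[-1] - new_inf)
        i.append(i[-1] + new_inf - new_rec)
        r.append(r[-1] + new_rec)
    return np.array(s), np.array(i), np.array(r)

# Toy run: basic reproduction number R0 = beta / gamma = 2.5.
S, I, R = sir(beta=0.5, gamma=0.2, s0=0.99, i0=0.01, days=120)
print(I.max())   # peak infected fraction
```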

Access Zoom recording.

Benjamin Langmead, Associate Professor, Department of Computer Science, Johns Hopkins University - Advances in pan-genomics for addressing reference bias

DNA and RNA sequencing data analysis often begins with aligning sequencing reads to a reference genome, with the reference represented as a linear string of bases. But linearity leads to reference bias, a tendency to miss or misreport alignments containing non-reference alleles, which can confound downstream statistical and biological results. This is a major concern in human genomics; we do not want diagnostics and therapeutics to be differentially effective depending on a patient’s genetic background. Meanwhile, recent Bioinformatics advances allow us to index and align sequencing reads to references that include many population variants. I will describe this journey from the early days of efficient genome indexing, continuing through graph-shaped references and references that include many genomes. I will emphasize recent results, including results from my group and collaborators showing how to optimize simple and complex pan-genome representations for effective avoidance of reference bias. Much of this work is collaborative with Travis Gagie, Christina Boucher, Alan Kuhnle and others.

Access Zoom recording.

Edward H. (Ted) Shortliffe MD, PhD, Chair Emeritus and Adjunct Professor, Department of Biomedical Informatics, Columbia University; President and CEO Emeritus, American Medical Informatics Association (AMIA); Senior Executive Consultant, IBM Watson Health; Former Professor of Medicine and Computer Science, Stanford University (1979-2000); and Founding Director, Biomedical Informatics Training Program @ Stanford - Biomedical AI: Its Roots, Evolution, and Early Days at Stanford

Five decades have passed in the evolution of Artificial Intelligence in Medicine (AIM), a field that has evolved substantially while tracking the corresponding changes in computer science, hardware technology, communications, and biomedicine. Emerging from medical schools and computer science departments in its early years, the AIM field is now more visible and influential than ever before, paralleling the enthusiasm and accomplishments of AI more generally. This talk will briefly summarize some of AIM history, providing an update on the status of the field as we enter our second half-century. My remarks on this subject will emphasize the role that Stanford played in the emergence of the field. They will also offer the perspective of an informatics journal editor-in-chief who has seen many state-of-the-art AIM papers and thereby recognizes the tension between applying existing methods to new problems and developing new science that advances the field in a generalizable way. In addition, the inherent complexity of medicine and of clinical care necessitates that we address not only decision-making performance but also issues of usability, workflow, transparency, safety, and the pursuit of persuasive results from formal clinical trials. These requirements contribute to an ongoing investigative agenda that means fundamental AIM research will continue to be crucial and will define our accomplishments in the decades ahead.

Access Zoom recording. Access presentation slides (PDF).

Dr. Judy W Gichoya, Assistant Professor of Interventional Radiology And Informatics, Emory University - Operationalizing Fairness in Medical Algorithms: A grand challenge

The year 2020 has brought into focus a second pandemic of social injustice and systemic bias with the disproportionate deaths observed for minority patients infected with COVID. As we observe an increase in development and adoption of AI for medical care, we note variable performance of the models when tested on previously unseen datasets, and also bias when the outcome proxies such as healthcare costs are utilized. Despite progressive maturity in AI development with increased availability of large open source datasets and regulatory guidelines, operationalizing fairness is difficult and remains largely unexplored. In this talk, we review the background/context for FAIR and UNFAIR sequelae of AI algorithms in healthcare, describe practical approaches to FAIR Medical AI, and issue a grand challenge with open/unanswered questions.

Access Zoom recording.

Ryan Tibshirani, Associate Professor, Department of Statistics and Machine Learning Department, Carnegie Mellon University - An Ecosystem for Tracking and Forecasting the Pandemic

Data is the foundation on which statistical modeling rests. To provide a better foundation for COVID-19 tracking and forecasting, the Delphi group launched an effort called COVIDcast, which has many parts: 1. Unique relationships with partners in tech and healthcare granting us access to real-time data on pandemic activity. 2. Code and infrastructure to build COVID-19 indicators, continuously-updated and geographically comprehensive. 3. A historical database of all indicators, including revision tracking, currently with hundreds of millions of observations. 4. A public API serving new indicators daily (and R and Python packages for client support). 5. Interactive maps and graphics to display our indicators. 6. Forecasting and modeling work building on the indicators. In this talk, I’ll summarize the various parts, and highlight some interesting findings so far. I’ll also describe ways you can get involved yourself, access the data we’ve collected, and leverage the tools we’ve built.

Access Zoom recording.

Kristin Swanson, Professor of Neurosurgery, Mayo Clinic - Sex, Drugs and Radiomics of Brain Cancer

Glioblastomas are notoriously aggressive, malignant primary brain tumors that have variable responses to treatment. This presentation will focus on the integrative role of 1) biological sex differences, 2) heterogeneity in drug delivery, and 3) intra-tumoral molecular diversity (revealed by radiomics) in capturing and predicting this variable response to treatment. Specifically, I will highlight burgeoning insights into sex differences in tumor incidence, outcomes, propensity, and response to therapy. I will further quantify the degree to which heterogeneity in drug delivery, even for drugs that are able to bypass the blood-brain barrier, contributes to differences in treatment response. Lastly, I will propose an integrative role for spatially resolved MRI-based radiomics models to reveal the intra-tumoral biological heterogeneity that can be used to guide treatment targeting and management.

Access Zoom recording.

Fall 2020

Abstracts, when available, are included in the drop-down

Bin Yu (Professor of Statistics, UC Berkeley) - Curating a COVID-19 data repository and forecasting county-level death counts in the United States

As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative death counts at the county level in the United States up to two weeks ahead. Using data from January 22 to June 20, 2020, we develop and combine multiple forecasts using ensembling techniques, resulting in an ensemble we refer to as Combined Linear and Exponential Predictors (CLEP). Our individual predictors include county-specific exponential and linear predictors, a shared exponential predictor that pools data together across counties, an expanded shared exponential predictor that uses data from neighboring counties, and a demographics-based shared exponential predictor. We use prediction errors from the past five days to assess the uncertainty of our death predictions, resulting in generally-applicable prediction intervals, Maximum (absolute) Error Prediction Intervals (MEPI). MEPI achieves a coverage rate of more than 94% when averaged across counties for predicting cumulative recorded death counts two weeks in the future. Our forecasts are currently being used by the non-profit organization, Response4Life, to determine the medical supply need for individual hospitals and have directly contributed to the distribution of medical supplies across the country. We hope that our forecasts and data repository at https://covidseverity.com can help guide necessary county-specific decision-making and help counties prepare for their continued fight against COVID-19.
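A minimal sketch of the MEPI construction described above appears below: widen the current point forecast by the largest absolute error made over a recent window. This is a simplified absolute-error version with made-up numbers, not the authors' code.

```python
import numpy as np

def mepi(point_forecast, recent_preds, recent_actuals):
    """Maximum (absolute) Error Prediction Interval, simplified: widen the current
    point forecast by the largest absolute error made over a recent window."""
    worst = np.max(np.abs(np.asarray(recent_preds) - np.asarray(recent_actuals)))
    return point_forecast - worst, point_forecast + worst

# Toy usage with a 5-day window of past forecasts vs. recorded cumulative deaths.
lo, hi = mepi(120.0,
              recent_preds=[95, 102, 110, 108, 118],
              recent_actuals=[97, 100, 115, 105, 121])
print(lo, hi)
```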

Philip Stark (Professor, Associate Dean of Mathematical and Physical Sciences at UC Berkeley) - Evidence-Based Elections

Elections rely on people, hardware, and software, all of which are fallible and subject to manipulation. Well-resourced nation-states continue to attack U.S. elections and domestic election fraud is not unheard of. Voting equipment is built by private vendors–some foreign, but all using foreign parts. Many states even outsource election reporting to foreign firms. How can we conduct and check elections in a way that provides evidence that the reported winners really won–despite malfunctions and malfeasance? Evidence-based elections require voter-verified (generally, hand-marked) paper ballots kept demonstrably secure throughout the canvass and manual audits of election results against the trustworthy paper trail. Hand-marked paper ballots are far more trustworthy than machine-marked ballots for a variety of reasons. Two kinds of audits are required to provide affirmative evidence that outcomes are correct: compliance audits to establish whether the paper trail is complete and trustworthy, and risk-limiting audits (RLAs). RLAs test the hypothesis that an accurate manual tabulation of the votes would find that one or more reported winners did not win. To reject that hypothesis means there is convincing evidence that a full hand tally would confirm the reported results. For a broad variety of social choice functions, including plurality, multi-winner plurality, supermajority, proportional representation rules such as D’Hondt, Borda count, approval voting, and instant-runoff voting (aka ranked-choice voting), the hypothesis that one or more outcomes is wrong can be reduced to the hypothesis that the means of one or more lists of nonnegative numbers are not greater than 1/2. Martingale methods for testing such nonparametric hypotheses sequentially are especially practical. RLAs are in law in several states and have been piloted in more than a dozen; there have been roughly 60 pilots in jurisdictions of all sizes, including roughly 10 audits of statewide contests. Open-source software to support RLAs is available.
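To make the last point concrete, the sketch below runs a simple betting-style supermartingale test of the hypothesis that the mean of a nonnegative list is at most 1/2, assuming sampling with replacement. It only illustrates the martingale idea; it is not one of the audit procedures used in practice, and the ballot data and tuning parameter are made up.

```python
import numpy as np

def martingale_audit(samples, lam=0.2, alpha=0.05):
    """Sequential test of H0: the mean of a nonnegative population is <= 1/2, using a
    betting supermartingale (sampling with replacement). By Ville's inequality the
    chance the wealth ever reaches 1/alpha under H0 is at most alpha, so crossing that
    threshold lets the audit stop and certify the reported outcome."""
    wealth = 1.0
    for t, x in enumerate(samples, 1):
        wealth *= 1.0 + lam * (x - 0.5)   # supermartingale under H0 for 0 <= lam <= 2
        if wealth >= 1.0 / alpha:
            return True, t                # reject H0: strong evidence the outcome is right
    return False, len(samples)            # not enough evidence: keep sampling or hand count

# Toy usage: values near 1 support the reported winner, values near 0 the runner-up.
rng = np.random.default_rng(0)
ballots = rng.choice([0.0, 1.0], size=5000, p=[0.45, 0.55])
print(martingale_audit(ballots))
```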

Brad Efron (Max H. Stein Professor and Professor of Statistics and of Biomedical Data Science at Stanford University) - Prediction, Estimation, and Attribution

The scientific needs and computational limitations of the Twentieth Century fashioned classical statistical methodology. Both the needs and limitations have changed in the Twenty-First, and so has the methodology. Large-scale prediction algorithms – neural nets, deep learning, boosting, support vector machines, random forests – have achieved star status in the popular press. They are recognizable as heirs to the regression tradition, but ones carried out at enormous scale and on titanic data sets. How do these algorithms compare with standard regression techniques such as Ordinary Least Squares or logistic regression? Several key discrepancies will be examined, centering on the differences between prediction and estimation or prediction and attribution (that is, significance testing). Most of the discussion is carried out through small numerical examples. The talk does not assume familiarity with prediction algorithms.

Eugene Katsevich (Assistant Professor of Statistics at the Wharton School, University of Pennsylvania) - Statistical analysis of single cell CRISPR screens

Mapping gene-enhancer regulatory relationships is key to unraveling molecular disease mechanisms based on GWAS associations in non-coding regions. This problem is notoriously challenging: there is a many-to-many mapping between genes and enhancers, and enhancers can be located far from their target genes. Recently developed CRISPR regulatory screens (CRSs) based on single cell RNA-seq (scRNA-seq) are a promising high-throughput experimental approach to this problem. They operate by infecting a population of cells with thousands of CRISPR guide RNAs (gRNAs), each targeting an enhancer. Each cell receives a random combination of CRISPR gRNAs, which suppress the action of their corresponding enhancers. The gRNAs and whole transcriptome in each cell are then recovered through scRNA-seq. CRSs provide more direct evidence of regulation than existing methods based on epigenetic data or even chromatin conformation. However, the analysis of these screens presents significant statistical challenges, some inherited from scRNA-seq analysis (modeling single cell gene expression) and some unique to CRISPR perturbation screens (the confounding effect of sequencing depth). In this talk, I will first give some background on single cell CRISPR screen technology. I will then present the first genome-wide single cell CRS dataset (Gasperini et al. 2019) and discuss challenges that arose in its initial analysis. Finally, I will present a novel methodology for the analysis of this data based on the conditional randomization test. The key idea is to base inference on the randomness in the assortment of gRNAs among cells rather than on the randomness in single cell gene expression, since the former is easier to model than the latter.
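As a simplified illustration of randomization-based inference in this setting, the sketch below reshuffles gRNA labels across cells to build a null distribution for a difference in mean expression. Unlike the conditional randomization test in the talk, it does not condition on technical factors such as sequencing depth, and all data are simulated.

```python
import numpy as np

def randomization_test(expr, grna, n_perm=2000, seed=0):
    """Toy randomization test: compare mean expression in cells that received a given
    gRNA vs. the rest, with a null distribution built by reshuffling the gRNA labels."""
    rng = np.random.default_rng(seed)
    obs = expr[grna == 1].mean() - expr[grna == 0].mean()
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(grna)
        null[b] = expr[perm == 1].mean() - expr[perm == 0].mean()
    return (1 + np.sum(np.abs(null) >= abs(obs))) / (n_perm + 1)

# Toy usage: 1,000 cells, 100 of which received the gRNA targeting an enhancer.
rng = np.random.default_rng(1)
g = np.zeros(1000, dtype=int); g[:100] = 1
e = rng.poisson(5 - 2 * g)   # expression drops when the enhancer is suppressed
print(randomization_test(e, g))
```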

Marzyeh Ghassemi (Assistant Professor at the University of Toronto in Computer Science and Medicine) - Don’t Expl-AI-n Yourself: Exploring "Healthy" Models in Machine Learning for Health

Despite the importance of human health, we do not fundamentally understand what it means to be healthy. Health is unlike many recent machine learning success stories – e.g., games or driving – because there are no agreed-upon, well-defined objectives. In this talk, Dr. Marzyeh Ghassemi will discuss the role of machine learning in health, argue that the demand for model interpretability is dangerous, and explain why models used in health settings must also be “healthy”. She will focus on a progression of work that encompasses prediction, time series analysis, and representation learning.

Francesca Dominici (Clarence James Gamble Professor of Biostatistics, Population and Data Science Harvard T.H. Chan School of Public Health) - Air Pollution, COVID19, and Race: Data Science Challenges and Opportunities

The coronavirus will likely kill thousands of Americans. But what if I told you about a serious threat to American national security? This emergency comes from climate change and air pollution. To help address this threat, we have developed an artificial neural network model that uses on-the-ground air-monitoring data and satellite-based measurements to estimate daily pollution levels, dividing the country into 1-square-kilometer zones across the continental U.S. We have paired this information with health data contained in Medicare claims records from the last 12 years, which includes 97% of the population ages 65 or older. We also developed statistical methods for causal inference and computationally efficient algorithms for the analysis of over 550 million health records. The result? This data science platform is telling us that federal limits on the nation’s most widespread air pollutants are not stringent enough. Our research shows that short- and long-term exposure to air pollution is killing thousands of senior citizens each year. Our research shows the critical new role of data science in public health and the associated methodological challenges. For example, with enormous amounts of data, the threat of unmeasured confounding bias is amplified, and causality is even harder to assess with observational studies. We will discuss these and other challenges.

Dr. Danielle Belgrave (Principal Research Manager, Microsoft Research, Cambridge (UK)) - Machine learning for personalised healthcare: a human-centred approach

Machine learning advances are opening new routes to more precise healthcare, from the discovery of disease subtypes for stratified interventions to the development of personalised interactions supporting self-care between clinic visits. This offers an exciting opportunity for machine learning techniques to impact healthcare in a meaningful way. In this workshop, I will present recent work on probabilistic graphical modelling to enable a more personalised approach to healthcare. The underlying motivation of these methods is to understand patient heterogeneity in order to provide more personalised treatment and intervention strategies. An important element of developing these models is collaboration with domain experts such as social scientists who have a deep understanding of the user perspective of these algorithms. We will use motivating examples from mental healthcare and asthma and allergic diseases.

Katherine Heller (AI Researcher at Google Medical Brain, and Assistant Professor in Statistical Science at Duke University) - Machine Learning in Real-world Healthcare Settings: How Far We've Come and Where We Are Going

In this talk, I will discuss three real-world Health applications of Machine Learning research, the progress that we have made in deployment to hospitals or directly to individuals, and where we hope to be heading next. In the first part, I will discuss Sepsis Watch, our Sepsis prediction system that has been deployed to the emergency departments of Duke University hospitals. This system performs prediction for incoming patients through a combination of Gaussian Processes, which estimate patient features in continuous time from uneven measurements, and Recurrent Neural Networks. Next, I discuss Graph-coupled HMMs, work that we have done making individual-level predictions of influenza spread through a social network, and how this might affect prediction abilities for other diseases, such as Coronavirus. Lastly, I will discuss the iOS app developed to record data on people with Multiple Sclerosis outside of a clinic environment, what collected data and basic analyses imply for our ability to do symptom and subpopulation prediction in this setting, and where we are headed in the future.

Alexandra Chouldechova (Estella Loomis McCandless Assistant Professor of Statistics and Public Policy, Carnegie Mellon University) - Algorithm-assisted decision making in child welfare

Every year, there are more than 4 million referrals made to child protection agencies across the US. The practice of screening calls is left to each jurisdiction to follow local practices and policies, potentially leading to large variation in the way in which referrals are treated across the country. While access to linked administrative data is increasing, it is difficult for workers to make systematic use of historical information about all the children and adults on a single referral call. Jurisdictions around the country are thus increasingly turning to predictive modeling approaches to help distill this rich information. The end result is typically a single risk score reflecting the likelihood of a near-term adverse event. Yet the use of predictive analytics in the area of child welfare remains highly contentious. There is concern that some communities—such as those in poverty or from particular racial and ethnic groups—will be disadvantaged by the reliance on government administrative data. In this talk, I will describe some of the work we have done both in the lab and in the community as part of developing, deploying and evaluating a prediction tool currently in use in the Allegheny County Office of Children, Youth and Families.

Finale Doshi-Velez (Associate Professor of Computer Science, Harvard University) - Interpretability and Human Validation of Machine Learning

As machine learning systems become ubiquitous, there is a growing interest in interpretable machine learning — that is, systems that can provide human-interpretable rationale for their predictions and decisions. In this talk, I’ll first give examples of why interpretability is needed in some of our work in machine learning for health, discussing how human input (which would be impossible without interpretability) is crucial for getting past fundamental limits of statistical validation. Next, I’ll speak about some of the work we are doing to understand interpretability more broadly: what exactly is interpretability, and how can we assess it? By formalizing these notions, we can hope to identify universals of interpretability and also rigorously compare different kinds of systems for producing algorithmic explanations.

Includes joint work with Been Kim, Andrew Ross, Mike Wu, Michael Hughes, Menaka Narayanan, Sam Gershman, Emily Chen, Jeffrey He, Isaac Lage, Roy Perlis, Tom McCoy, Gabe Hope, Leah Weiner, Erik Sudderth, Sonali Parbhoo, Marzyeh Ghassemi, Pete Szolovits, Mornin Feng, Leo Celi, Nicole Brimmer, Tristan Naumann, Rohit Joshi, Anna Rumshisky, Omer Gottesman, Emma Brunskill, Yao Liu, Joe Futoma, and the Berkman Klein Center.

Spring 2020

Abstracts, when available, are included in the drop-down

Sheng Wang (Stanford Postdoctoral Researcher) - Learning for Never-before-seen Biomedicine

We are all going through a hard time with COVID-19. In fact, COVID-19 is just the tip of the iceberg among many other unsolved biomedical problems, such as early cancer detection and finding side effects of new drugs. These problems seem to be independent of each other and have so far been tackled by different biologists. In this talk, I will argue that behind these different problems lies the same computational challenge: how to understand and predict in never-before-seen situations. In addition to powerful predictive models, what is really needed are tools that generalize well to new drugs, new diseases, and new cohorts.

My talk will focus on our novel machine learning method developed to tackle two kinds of never-before-seen situations: never-before-seen class and never-before-seen cohort. I will first introduce how we classify samples into never-before-seen classes by embedding noisy and large-scale biomedical ontologies, resulting in new discoveries in protein functions, cell types, and rare diseases. Next, I will introduce our solution to understand and characterize a never-before-seen cohort. Instead of finding which features are important, we answer the question of why these features are important using a novel multiscale biomedical knowledge graph. This multiscale knowledge graph is constructed using millions of scientific papers and millions of experimental associations, providing up-to-date and scalable evidence for observations in our multi-scale biomedical world. I will conclude with a vision of future directions for never-before-seen biomedicine.

Chiara Sabatti (Professor of Biomedical Data Science and of Statistics) - Fairness and uncertainty assessment

Recent progress in machine learning (ML) provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do we have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? I will introduce examples of diverse domain applications where these questions are important, as well as statistical ideas to ensure that the learned models apply to individuals in an equitable manner. In recent work with Yaniv Romano, Rina Barber, and Emmanuel Candes, we show that, to achieve some fairness objectives, we do not need to "open up the black box" and try to understand its underpinnings. Rather, we discuss broad methodologies, e.g., conformal inference, that can be wrapped around any black box to produce results that can be trusted and that are "fair."
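Split conformal prediction is one concrete instance of a procedure that can be wrapped around an arbitrary black box to obtain trustworthy intervals; a minimal sketch follows. The gradient-boosting regressor, synthetic data, and 90% level are illustrative assumptions rather than the specific methods discussed in the talk.

```python
# Split conformal prediction: distribution-free intervals around any black-box regressor.
# Illustrative sketch; the model, data, and 90% level are arbitrary choices.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=2000)     # synthetic outcome

X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

black_box = GradientBoostingRegressor().fit(X_fit, y_fit)

# Calibration: absolute residuals on held-out data.
resid = np.abs(y_cal - black_box.predict(X_cal))
alpha = 0.10
k = int(np.ceil((1 - alpha) * (len(resid) + 1)))         # finite-sample quantile index
q = np.sort(resid)[min(k, len(resid)) - 1]

# Prediction interval for a new point: [f(x) - q, f(x) + q], with >= 90% marginal coverage.
x_new = rng.normal(size=(1, 5))
pred = black_box.predict(x_new)[0]
print(pred - q, pred + q)
```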

Hau-tieng Wu (Associate Professor, Department of Mathematics and Department of Statistical Science at Duke University) - Modern signal processing tools for high-frequency biomedical time series and clinical applications

Compared with snapshot health information, long-term and high-frequency physiological time series provide health information along a different dimension. I will discuss recently developed signal processing tools in nonlinear-type time-frequency analysis and manifold learning motivated by this kind of time series. The developed tools simultaneously handle several challenges that arise when extracting useful biorhythm features: the time series is usually single-channel and composed of multiple oscillatory components with complicated statistical features, such as time-varying amplitude, time-varying frequency, and non-sinusoidal patterns, and the signal quality is often compromised by nonstationary noise and artifacts. I will demonstrate how to apply these tools to some clinical challenges. The supporting theory will also be discussed, time permitting.

Jonathan Pritchard (Professor of Genetics and of Biology) - Why are human complex traits so enormously polygenic? Lessons from molecular biomarker traits

One of the central challenges in genetics is to understand the mapping from genetic variation to phenotypic variation. During the past 15 years, genome-wide association studies (GWAS) have been used to study the genetic basis of a wide variety of complex diseases and other traits. One striking finding from this work has been that for a wide range of complex traits such as height or schizophrenia, even the most important loci in the genome contribute just a small fraction of the phenotypic variance. Instead, most of the variance comes from tiny contributions from tens to hundreds of thousands of variants spread across most of the genome. Our group has argued that these observations do not fit neatly into standard conceptual models of genetics. In recent papers, we have proposed one model to explain this, which we refer to as the “omnigenic” model.

In this talk, I will review our past work in this area, and describe new work that we have done on the genetic basis of three molecular traits—urate, IGF-1, and testosterone—that are biologically simpler than most diseases, and for which we know a great deal in advance about the core genes and pathways. For these molecular traits, we observe huge enrichment of significant signals near genes involved in the relevant biosynthesis, transport, or signaling pathways. However, even these molecular traits are highly polygenic, with most of the variance coming not from core genes, but from thousands to tens of thousands of variants spread across most of the genome. In summary, our models help to illustrate why so many variants affect risk for any given disease.

Fei Jiang (Assistant Professor of Epidemiology & Biostatistics, UCSF) - Bayesian Change Point Detection for Signal Processing and Functional Censored Quantile Regression for Stroke Study

The speaker will present a Bayesian model selection (BMS) approach to detect abnormalities in data from magnetic resonance imaging guided radiation therapy devices. The BMS method effectively identifies the true abnormalities and suppresses the spurious ones. The speaker will discuss several extensions, including detecting structural changes in heat-maps and extracting dynamic resting-state functional connectivity from brain images. The speaker will also introduce a functional censored quantile regression model to describe the time-varying relationship between time-to-event outcomes and corresponding functional covariates. The method was used to analyze the functional relationship between ambulatory blood pressure trajectories and clinical outcomes in stroke patients.

Donald Redelmeier (Senior Scientist, Sunnybrook Health Sciences Centre) - Sweet-Spot Analysis in Randomized Trials

Randomized trials in clinical research tend to recruit diverse patients, including some individuals who may be unresponsive to the treatment. Here we introduce a methodology to test whether patients near a sweet-spot experience more relative benefit than patients at the extremes. We then demonstrate this methodology based on a randomized trial of defibrillators to reduce all-cause mortality in patients with heart failure. An awareness of this methodology, we suggest, may help identify a sweet-spot in a randomized trial where some patients experience more relative benefit than other patients.

James Zou (Assistant Professor of Biomedical Data Science) - How to make AI forget you? Adventures in making algorithms more grown-up

This talk explores what it means to develop “grown-up” machine learning algorithms and why they are critical for applications such as medicine and healthcare. I propose three desiderata for grown-up learning algorithms: accountability for mistake/success, transparency in predictions, and flexibility in giving individuals control over their data. These properties turn out to be quite different from what is typically used or studied in machine learning and statistics. I will propose some technical definitions, example applications in healthcare and new algorithms that take an initial step in this direction. We will also teach people (briefly) how to read heart ultrasound.

Manuel Rivas (Assistant Professor of Biomedical Data Science) - Genetics of 35 blood and urine biomarkers in the UK Biobank

Clinical laboratory tests are a critical component of the continuum of care and provide a means for rapid diagnosis and monitoring of chronic disease. In this study, we systematically evaluated the genetic basis of 35 blood and urine laboratory tests measured in 358,072 participants in the UK Biobank and identified 1,857 independent loci associated with at least one laboratory test, including 488 large-effect protein truncating, missense, and copy-number variants. We then causally linked the biomarkers to medically relevant phenotypes through genetic correlation and Mendelian randomization. Finally, we developed polygenic risk scores (PRS) for each biomarker and built multi-PRS models using all 35 PRSs simultaneously. We assessed sex-specific genetic effects and found striking patterns for testosterone, with marked improvements in prediction when training a sex-specific model. We found substantially improved prediction of disease incidence in FinnGen (n=135,500) with the multi-PRS relative to single-disease PRSs for renal failure, myocardial infarction, type 2 diabetes, gout, and alcoholic cirrhosis. Together, our results show the genetic basis of these biomarkers, which tissues contribute to biomarker function, the causal influences of the biomarkers, and how we can use this to predict disease. For the last 15 minutes of the presentation, I'll briefly touch on recent progress in COVID-19 host genetics efforts.
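The multi-PRS idea, combining many single-biomarker scores into one disease predictor, can be sketched in a few lines. The data below are simulated and the plain logistic model is an illustrative stand-in for the study's actual pipeline.

```python
# Toy multi-PRS sketch: combine several biomarker PRSs into one disease predictor.
# Synthetic data for illustration; not the UK Biobank / FinnGen analysis itself.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, n_prs = 5000, 35
prs = rng.normal(size=(n, n_prs))                      # 35 standardized biomarker PRSs
logit = 0.6 * prs[:, 0] - 0.4 * prs[:, 1] + 0.3 * prs[:, 2]
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # synthetic incident disease

X_tr, X_te, y_tr, y_te = train_test_split(prs, disease, random_state=1)

single = LogisticRegression().fit(X_tr[:, [0]], y_tr)          # best single PRS only
multi = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)      # all 35 PRSs jointly

print("single-PRS AUC:", roc_auc_score(y_te, single.predict_proba(X_te[:, [0]])[:, 1]))
print("multi-PRS AUC: ", roc_auc_score(y_te, multi.predict_proba(X_te)[:, 1]))
```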

Winter 2020

Abstracts, when available, are included in the drop-down

Andrew Peterson, Chief Scientific Officer, MedGenome - The Challenges and Benefits of Diversity in Data

A problem that has received attention in recent years is that genetic data is heavily and disproportionately biased towards individuals of European origin. Equally apparent upon consideration is that data on disease incidence, progression and treatment outcomes is heavily biased towards that coming from North American and European health care delivery systems. Why would we want to change this situation and bring more diversity into the data that we have available for improving health outcomes and treatment options? If it is a problem that should be solved, what are the barriers to solving it? Finally, what are the data analysis challenges that arise as we diversify the contributors to the available data?

Jelena Bradic, Associate Professor of Mathematics, UC San Diego - Time to event data: New approaches for prediction and inference

Estimating causal effects for survival outcomes in the high-dimensional setting is an extremely important topic for many biomedical applications as well as areas of social science. We propose a new orthogonal score method and a new Hazards Difference (HDi) estimator for treatment effect estimation and inference that results in asymptotically valid confidence intervals. We apply our methods to study the treatment effect of radical prostatectomy versus conservative management for prostate cancer patients using the SEER-Medicare Linked Data. Time permitting, I would also like to discuss a new deep learning algorithm we have recently developed for time-to-event data and demonstrate its effectiveness on a number of openly available real datasets.

Dominik Rothenhäusler, Assistant Professor of Statistics at Stanford - Invariance, causality, and replicability

Heterogeneity across sub-populations can be beneficially exploited for causal inference and to improve the replicability of feature selections across distributions. The key is to encourage models to exhibit invariance across settings and interventions. The novel methodology potentially offers more robustness and ‘causal-oriented’ interpretation of results, compared to standard regression and classification methods.

Priya Moorjani, Assistant Professor of Genetics, Genomics and Development at UC Berkeley - An Evolutionary Perspective on the Human Mutation Rate

Germline mutations are the main drivers of evolution and the source of many heritable diseases. Understanding the rate and mechanisms by which mutations occur is of paramount importance for studies of human genetics (to interpret the incidence of rare diseases) and evolutionary biology (to date evolutionary events). Despite strong constraints on the replication machinery across species, recent studies have documented considerable interspecies and inter-individual variation. Both the overall mutation rate and mutation spectra (the relative proportions of different mutation types) have been shown to differ across species and among human populations. Whole-genome sequencing of pedigrees has enabled direct surveys of newly arising mutations (i.e., de novo mutations) and, rather than answering questions about the mechanisms involved, it has revealed many puzzles about the mutation rate and its evolution. In this talk, I will present analysis of whole genome sequences of pedigrees of humans and other primates and discuss how sex, age and time impact the evolution of the mutation rate. These analyses provide insights about the process of mutagenesis and help build a reliable molecular clock for dating evolutionary events.

Liang Liang (Basic Life Science Researcher in Genetics at Stanford) & Robert Tibshirani, Professor, Stanford Department of Biomedical Data Science and of Statistics - The Metabolic Clock of Human Pregnancy

Metabolism during pregnancy is a dynamic and precisely programmed process, the failure of which can bring devastating consequences to the mother and fetus. To define a high-resolution temporal profile of metabolites during healthy pregnancy and the postpartum period, we carried out an untargeted metabolome investigation of 784 blood samples collected weekly from 30 healthy pregnant Danish women. The study revealed broad metabolome changes during normal pregnancy: of 9,651 detected metabolic features, 4,995 were significantly changed (FDR < 0.05). Of these, 460 annotated compounds (of 687 total) and 34 human metabolic pathways (of 48 total) were significantly altered during pregnancy, revealing a highly choreographed metabolic profile. Using linear models, we built a metabolic clock with metabolites that times gestational age in high accordance with first-trimester ultrasound. Our study represents the first weekly characterization of the human pregnancy metabolome, providing a high-resolution view for understanding human pregnancy with potential clinical utility.
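As a rough illustration of what a "metabolic clock" looks like computationally, one can fit a sparse linear model that predicts gestational age from metabolite features. The simulated data and the LassoCV choice below are assumptions; the study's preprocessing and validation are considerably more involved.

```python
# Sketch of a metabolic clock: sparse linear regression of gestational age on metabolites.
# Simulated data; illustrative of the modeling idea only.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_samples, n_features = 784, 1000                      # samples x metabolic features
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:30] = rng.normal(scale=1.0, size=30)        # a few informative metabolites
gest_age = 20 + X @ true_coef + rng.normal(scale=1.0, size=n_samples)   # weeks

X_tr, X_te, y_tr, y_te = train_test_split(X, gest_age, random_state=2)
clock = LassoCV(cv=5).fit(X_tr, y_tr)

err = np.abs(clock.predict(X_te) - y_te)
print("metabolites retained:", int(np.sum(clock.coef_ != 0)))
print("median absolute error (weeks):", round(float(np.median(err)), 2))
```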

Hajime Uno, Assistant Professor, Department of Data Sciences at Dana-Farber Cancer Institute/Harvard Medical School - Adaptive long-term restricted mean survival time approach to compare time-to-event outcomes in randomized controlled trials for immunotherapy

The logrank/hazard ratio (HR) test/estimation approach has been used routinely in almost all cancer clinical trials with time-to-event outcomes. Although the logrank test is an asymptotically valid nonparametric test, it is not the most powerful test when the pattern of the difference is non-proportional hazards (PH). Also, interpretation of the HR is not obvious when the PH assumption does not hold. For immunotherapy trials, we often see a delayed-difference pattern, where the conventional logrank/HR approach is appropriate neither for testing equality nor for estimating the magnitude of the treatment effect. Restricted mean survival time (RMST)-based analysis has been proposed as an alternative to the logrank/HR approach. It provides a robust and interpretable summary of the treatment effect. However, it is known that the standard RMST-based approach offers lower power than the logrank/HR approach in delayed-difference scenarios. We propose a new prespecified RMST-based test and a corresponding treatment effect estimation procedure for when a delayed-difference pattern is expected at the design stage. Simulation studies show how effectively the proposed method can detect the treatment difference, compared to the logrank test, various weighted logrank tests, the MaxCombo test, standard RMST-based tests, and so on.
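The quantity at the center of this approach, the restricted mean survival time, is simply the area under the Kaplan-Meier curve up to a truncation time tau. The numpy sketch below computes an RMST difference on simulated delayed-difference data; it does not implement the adaptive test proposed in the talk, and tau and the data-generating choices are arbitrary.

```python
# Sketch: restricted mean survival time (RMST) as the area under a Kaplan-Meier
# curve up to a truncation time tau, compared between two arms. Illustrative only.
import numpy as np

def km_curve(time, event):
    """Kaplan-Meier estimate; returns event times and survival probabilities."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, s = [], 1.0
    uniq = np.unique(time[event == 1])
    for t in uniq:
        at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        s *= 1.0 - d / at_risk
        surv.append(s)
    return uniq, np.array(surv)

def rmst(time, event, tau):
    """Integrate the step-function KM curve from 0 to tau."""
    t, s = km_curve(time, event)
    grid = np.concatenate([[0.0], t[t < tau], [tau]])
    steps = np.concatenate([[1.0], s[t < tau]])
    return float(np.sum(steps * np.diff(grid)))

rng = np.random.default_rng(3)
tau = 24.0
# Delayed-difference pattern: the treatment benefit appears only after month 6.
ctrl_t = rng.exponential(12, 300)
trt_t = 6 + rng.exponential(18, 300)
ctrl = (np.minimum(ctrl_t, 30), (ctrl_t <= 30).astype(int))
trt = (np.minimum(trt_t, 30), (trt_t <= 30).astype(int))

print("RMST difference (months):", rmst(*trt, tau) - rmst(*ctrl, tau))
```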

Kyle Gaulton, Assistant Professor of Pediatrics, UC San Diego - Interpreting complex disease genetics using epigenomics

Determining the regulatory activity of non-coding genetic variants in specific cell types and cellular contexts is key to understanding the genetic basis of complex disease. In our work we profile the epigenome and annotate the function of disease risk variants using high-throughput molecular assays in combination with statistical approaches including those developed in our lab. I will describe several recent studies where we have used epigenomics to interpret the genetics of complex disease. First, we have mapped regulatory programs of cell types in the human pancreas and numerous other tissues using single cell accessible chromatin, from which we derived disease-relevant cell types and inferred cell type-specific regulatory programs of fine-mapped diabetes and other complex disease risk variants. Second, we annotated regulatory variants at diabetes risk loci by combining high-throughput, allele-specific binding assays of hundreds of TFs with accessible chromatin and dense fine-mapping data. Third, using a novel approach for estimating allelic imbalance we identified disease variants with allelic effects on TF binding and accessible chromatin generated from cells exposed to different in vitro stimuli. Together these studies represent examples of how mapping the epigenome can provide the cell type- and context-specific regulatory activity of non-coding variants and reveal their contribution to disease.

Le Cong, Assistant Professor of Pathology at Stanford - Data-Inspired CRISPR Barcoding and Single-Cell Decoding for Dissecting Cancer Evolution

Emerging genomics technology such as CRISPR gene-editing and single-cell sequencing provide us with increasing resolution to unravel the dynamic biology in human health and diseases. Our group aim to develop computation-optimized CRISPR designs for cell barcoding and editing that allows the probing of cancer biology at single-cell resolution. This “experimental + data-analytical” toolkit will enable studies on cancer cell evolution and interaction, as well as potential applications in gene therapy, immunology, and other fields. Ultimately, our long-term goal is to work collaboratively with our collaborating teams to bridge advances in gene-editing, genomics, and data science.

Devan Mehrotra, Vice President, Biostatistics at Merck & Co. - Survival Analysis using a 5-STAR Approach in Randomized Clinical Trials

Randomized clinical trials are commonly designed to assess whether a test treatment prolongs survival relative to a control treatment. Increased patient heterogeneity, while desirable for generalizability of results, weakens the ability of common statistical approaches to detect treatment differences, potentially hampering the regulatory approval of safe and efficacious therapies. A novel solution to this problem is proposed. A list of baseline covariates that have the potential to be prognostic for survival under either treatment is pre-specified in the analysis plan (Step 1). At the analysis stage, using observed survival times but blinded to patient-level treatment assignment, the covariate list is shortened via elastic net Cox regression (Step 2) for input into a conditional inference tree algorithm that segments the heterogeneous trial population into subpopulations (strata) of prognostically homogeneous patients (Step 3). After patient-level treatment unblinding, a treatment comparison is done within each formed stratum (Step 4) and stratum-level results are combined for statistical inference using an adaptive strategy (Step 5). The impressive power-boosting performance of our proposed 5-step stratified testing and amalgamation routine (5-STAR) relative to the logrank test and other methods for survival analysis is illustrated using two real datasets and simulations. An R package is available for implementation.
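A highly simplified sketch of the stratify-then-combine logic (roughly Steps 3 to 5) is shown below: strata come from a treatment-blinded prognostic variable, a Cox log hazard ratio is estimated within each stratum, and the stratum-level estimates are pooled by inverse-variance weighting. The actual 5-STAR procedure uses elastic-net Cox selection, conditional inference trees, and an adaptive pooling strategy; everything here, including the use of the lifelines package and the simulated data, is an illustrative assumption.

```python
# Highly simplified sketch of the stratify-then-combine idea behind 5-STAR:
# form prognostic strata blinded to treatment, estimate a Cox log-HR per stratum,
# then pool by inverse-variance weighting. Not the 5-STAR procedure itself.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # assumed available for this sketch

rng = np.random.default_rng(4)
n = 1200
age = rng.normal(65, 10, n)                        # pre-specified baseline covariate
arm = rng.integers(0, 2, n)                        # randomized treatment assignment
haz = np.exp(0.03 * (age - 65) - 0.4 * arm)
T = rng.exponential(1.0 / haz)
E = (T < 5).astype(int)
T = np.minimum(T, 5.0)
df = pd.DataFrame({"T": T, "E": E, "age": age, "arm": arm})

# Step 3 (simplified): strata from a treatment-blinded prognostic variable.
df["stratum"] = pd.qcut(df["age"], q=3, labels=False)

# Steps 4-5 (simplified): per-stratum Cox log-HR, pooled by inverse variance.
logs, ses = [], []
for _, sub in df.groupby("stratum"):
    cph = CoxPHFitter().fit(sub[["T", "E", "arm"]], duration_col="T", event_col="E")
    logs.append(cph.params_["arm"])
    ses.append(cph.standard_errors_["arm"])
logs, w = np.array(logs), 1.0 / np.array(ses) ** 2
pooled = np.sum(w * logs) / np.sum(w)
z = pooled / np.sqrt(1.0 / np.sum(w))
print("pooled log-HR:", round(pooled, 3), " z-statistic:", round(z, 2))
```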

Lu Tian, Professor of Biomedical Data Science and, by courtesy, of Statistics at Stanford, & Steve Yadlowsky, PhD Candidate in Electrical Engineering, Stanford - Estimation and Validation of a Class of Conditional Average Treatment Effects Using Observational Data

While sample sizes in randomized clinical trials are large enough to estimate the average treatment effect well, they are often insufficient for estimation of the treatment-covariate interactions critical to studying data-driven precision medicine. Observational data from real-world practice may play an important role in alleviating this problem. One common approach in trials is to predict the outcome of interest with separate regression models in each treatment arm, and recommend interventions based on the contrast of the predicted outcomes. Unfortunately, this simple approach may induce spurious treatment-covariate interaction in observational studies when the regression model is misspecified. Motivated by the need to model the number of relapses in multiple sclerosis patients, where the ratio of relapse rates is a natural choice of the treatment effect, we propose to estimate the conditional average treatment effect (CATE) as the relative ratio of the potential outcomes, and derive a doubly robust estimator of this CATE in a semiparametric model of treatment-covariate interactions. We also provide a validation procedure to check the quality of the estimator on an independent sample. We conduct simulations to demonstrate the finite sample performance of the proposed methods, and illustrate the advantage of this approach on real data examining the treatment effect of dimethyl fumarate compared to teriflunomide in multiple sclerosis patients.

Fall 2019

Abstracts, when available, are included in the drop-down

Gilmer Valdes, Assistant Professor of Radiation Oncology at UCSF - Breaking the tradeoff between interpretability and accuracy of machine learning algorithms

Machine learning algorithms that are both interpretable and accurate are essential in applications such as medicine, where errors can have dire consequences. Unfortunately, there is currently a tradeoff between accuracy and interpretability among state-of-the-art machine learning methods. Decision trees and linear models are interpretable and are therefore used extensively throughout medicine. They are, however, consistently outperformed in accuracy by other, less-interpretable algorithms, such as ensemble methods or neural networks. Here we present three algorithms that aim to address the tradeoff between interpretability and accuracy: 1) the Additive Tree (AddTree), a novel framework for constructing decision trees with the same architecture as CART but with improved accuracy; 2) the Conditional Super Learner (CSL), an algorithm which selects the best model candidate from a library conditional on the covariates; 3) Expert Augmented Machine Learning (EAML), an algorithm that automatically extracts clinical priors and combines them with machine-learned models to detect hidden confounders and build robust models with significantly less data. Extensive empirical evidence illustrating the advantages and disadvantages of these three algorithms will be presented. Theoretical results will also be highlighted. Finally, we will use the prediction of hospital mortality for intensive care unit (ICU) patients to illustrate the points discussed throughout the presentation.

Eric Jorgenson, PhD, Research Scientist at the Division of Research (DOR), Kaiser Permanente Northern California (KPNC) - Genetic variation in the SIM1 locus is associated with erectile dysfunction

Erectile dysfunction affects millions of men worldwide. Twin studies support the role of genetic risk factors underlying erectile dysfunction, but no specific genetic variants have been identified. We conducted a large-scale genome-wide association study of erectile dysfunction in 36,649 men in the multiethnic Kaiser Permanente Northern California Genetic Epidemiology Research in Adult Health and Aging cohort. We also undertook replication analyses in 222,358 men from the UK Biobank. In the discovery cohort, we identified a single locus (rs17185536-T) on chromosome 6 near the single-minded family basic helix-loop-helix transcription factor 1 (SIM1) gene that was significantly associated with the risk of erectile dysfunction (odds ratio = 1.26, P = 3.4 × 10^-25). The association replicated in the UK Biobank sample (odds ratio = 1.25, P = 6.8 × 10^-14), and the effect is independent of known erectile dysfunction risk factors, including body mass index (BMI). The risk locus resides on the same topologically associating domain as SIM1 and interacts with the SIM1 promoter, and the rs17185536-T risk allele showed differential enhancer activity. SIM1 is part of the leptin–melanocortin system, which has an established role in body weight homeostasis and sexual function. Because the variants associated with erectile dysfunction are not associated with differences in BMI, our findings suggest a mechanism that is specific to sexual function.

Jingshen Wang, Assistant Professor in Biostatistics at UC Berkeley - Inference on Treatment Effects after Model Selection with application to subgroup analysis

Inferring cause-effect relationships between variables is of primary importance in many sciences. In this talk, we start by discussing two approaches for making valid inference on treatment effects when a large number of covariates are present. The first approach is to perform model selection and then to deliver inference based on the selected model. If the inference is made ignoring the randomness of the model selection process, then there could be severe biases in estimating the parameters of interest. While the estimation bias in an under-fitted model is well understood, a lesser-known bias that arises from an over-fitted model will be addressed. The over-fitting bias can be eliminated through data splitting at the cost of statistical efficiency, and we propose a repeated data splitting approach to mitigate the efficiency loss. The second approach concerns the existing methods for debiased inference. We show that the debiasing approach is an extension of OLS to high dimensions. A comparison between these two approaches provides insights into their intrinsic bias-variance trade-off, and the debiasing approach may lose efficiency in observational studies. For the second part of the talk, we discuss a generalization on how to estimate treatment effects for groups of individuals that share similar features in observational studies. More importantly, even if the estimated treatment effects suggest a promising subgroup (i.e., the group with the maximal treatment effect), we address the question of how good the subgroup really is.
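A single-split version of the selection-then-inference idea can be sketched quickly: select covariates on one half of the data, run ordinary OLS inference for the treatment effect on the other half, and repeat over random splits. The simulated data, the lasso selector, and the simple averaging over splits are illustrative assumptions, not the aggregation studied in the talk.

```python
# Sketch of selection-then-inference via data splitting: select covariates on one half,
# run OLS inference for the treatment effect on the other half, repeat over splits.
# Illustrative only; the talk's repeated-splitting aggregation is more careful.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n, p = 600, 200
X = rng.normal(size=(n, p))
treat = rng.binomial(1, 0.5, n)
y = 1.0 * treat + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # true effect = 1

effects = []
for seed in range(10):                                           # repeated random splits
    idx_sel, idx_inf = train_test_split(np.arange(n), test_size=0.5, random_state=seed)
    # Selection half: lasso over the covariates (treatment is always retained).
    sel = LassoCV(cv=5).fit(X[idx_sel], y[idx_sel])
    keep = np.flatnonzero(sel.coef_ != 0)
    # Inference half: plain OLS on treatment plus the selected covariates.
    design = sm.add_constant(np.column_stack([treat[idx_inf], X[idx_inf][:, keep]]))
    fit = sm.OLS(y[idx_inf], design).fit()
    effects.append(fit.params[1])                                # treatment coefficient
print("average treatment-effect estimate over splits:", round(float(np.mean(effects)), 3))
```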

Scott Linderman, Assistant Professor of Statistics at Stanford - Models and Algorithms for Understanding Neural and Behavioral Data

The trend in neural recording capabilities is clear: we can record orders of magnitude more neurons now than we could only a few years ago, and technological advances do not seem to be slowing. Coupled with rich behavioral measurements, genetic sequencing, and connectomics, these datasets offer unprecedented opportunities to learn how neural circuits function. But they also pose serious modeling and algorithmic challenges. How do we develop probabilistic models for such heterogeneous data? How do we design models that are flexible enough to capture complex spatial and temporal patterns, yet interpretable enough to provide new insight? How do we construct algorithms to efficiently and reliably fit these models? I will present some of our recent work on recurrent switching linear dynamical systems and corresponding Bayesian inference algorithms that aim to overcome these challenges, and I will show how these methods can help us gain insight into complex neural and behavioral data.

Erick Matsen, Associate Professor of Genome Sciences and of Statistics at University of Washington - Phylogenetic Variational Bayes

Bayesian posterior distributions on phylogenetic trees remain difficult to sample despite decades of effort. The complex discrete and continuous model structure of trees means that recent inference methods developed for Euclidean space are not easily applicable to the phylogenetic case. Thus, we are left with random-walk Markov Chain Monte Carlo (MCMC) with uninformed tree modification proposals; these traverse tree space slowly because phylogenetic posteriors are concentrated on a small fraction of the very many possible trees. In this talk, I will start by motivating Bayesian phylogenetics via our work on how anti-HIV antibodies gain broad and potent binding properties. I will then describe our wild adventure developing efficient alternatives to random-walk MCMC, which has concluded successfully with the development of a variational Bayes formulation of Bayesian phylogenetics. This formulation leverages a “factorization” of phylogenetic posterior distributions that we show is rich enough to capture the shape of posteriors inferred from real data. Our proof-of-concept implementation of variational inference using this method gives very promising results, and I will describe our ongoing efforts to develop an efficient implementation that integrates with modern modeling frameworks.

Sonia Petrone, Professor and PhD Director of Statistics at Bocconi University, Milano - Bayes, empirical Bayes and recursive quasi-Bayes learning in mixture models

Mixture models are popular tools for inference with heterogeneous data in a wide range of fields. We consider the problem of prediction and inference in mixture models with streaming data. The Bayesian approach can provide a clean solution, as we will briefly review in the first part of the talk; however, the analytic computations involved are demanding. Today's pressure for fast computation, especially with streaming data and online learning, brings renewed interest in faster, although possibly sub-optimal, solutions. Approximate algorithms may offer such solutions, but they often lose clear (Bayesian) statistical properties. Embedding the algorithm in a full probabilistic framework can be illuminating.

This is what we do here. We reconsider a recursive algorithm proposed by M. Newton and collaborators for sequential learning in nonparametric mixture models. The so-called Newton algorithm is simple and fast, but theoretically intriguing. Although proposed as an approximation of a Bayesian solution, its quasi-Bayes properties remain an unsolved question. By framing the algorithm into a probabilistic setting, we can shed light on the underlying statistical model, that we show to be, asymptotically, an exchangeable mixture model, with a novel prior on densities. In this clean probabilistic framework, several applications and extensions become fairly natural, as we also illustrate in simulation studies. This is joint work with Sandra Fortini.
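For concreteness, a toy version of a Newton-type recursion for a Gaussian location mixture on a fixed grid of support points might look as follows; the grid, kernel bandwidth, and weight sequence are arbitrary illustrative choices.

```python
# Sketch of a Newton-type recursive estimate of the mixing distribution in a
# Gaussian location mixture, on a fixed grid of support points. Illustrative only.
import numpy as np

rng = np.random.default_rng(6)
# Data: a two-component Gaussian location mixture, observed as a stream.
y = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
rng.shuffle(y)

grid = np.linspace(-8, 8, 401)            # support points for the mixing distribution
G = np.full(grid.size, 1.0 / grid.size)   # initial guess: uniform mixing weights

def kernel(yi, theta, sigma=1.0):
    return np.exp(-0.5 * ((yi - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for i, yi in enumerate(y, start=1):
    w = 1.0 / (i + 1)                         # decreasing weight sequence
    lik = kernel(yi, grid)
    posterior = lik * G / np.sum(lik * G)     # "posterior" over support points given yi
    G = (1 - w) * G + w * posterior           # recursive quasi-Bayes update

print("highest-weight support points:", np.round(np.sort(grid[np.argsort(G)[-5:]]), 2))
```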

Emma Pierson, Ph.D. Candidate in Computer Science at Stanford - Inequality in criminal justice and healthcare

Statistical analysis offers a powerful tool for quantifying and reducing social inequality. This talk describes statistical analyses of racial inequality in two domains: policing and pain.

Serena Yeung, Assistant Professor of Biomedical Data Science at Stanford - Computer vision methods for observational study of hospital care processes

Many aspects of hospital care ranging from performing surgery to patient care in the ICU are complex and have significant impact on patient outcomes, yet data for studying the execution of these care pathways is typically sparse and not amenable to data-driven analyses. A major challenge has been the lack of a scalable means to collect rich data describing care execution. In this talk I will describe the potential of using computer vision interpretation of video captured in hospital environments to provide this source of data. I will discuss computer vision based observational study of hospital care processes in two environments, the operating room and the ICU. I will present key technical challenges of interpreting human behavior from video capture, and describe ongoing work on approaches for tackling them.

Babak Shahbaba, Professor of Statistics and Computer Science at UC Irvine - Neural Data Analysis— From Stochastic Process Modeling to Deep Learning

In this talk, I will discuss an ongoing project aimed at understanding the neural basis of complex behaviors and temporal organization of memories. More specifically, I will focus on a unique electrophysiological experiment designed to address fundamental and unresolved questions about hippocampal function. Our goal is to elucidate the neural mechanisms underlying the memory for sequences of events, a defining feature of episodic memory. To this end, we have used high-density electrophysiological techniques to record neural activity (spikes and local field potentials) in hippocampal region CA1 as rats perform an odor sequence memory task. Importantly, this nonspatial approach allows us to determine whether spatial coding properties (thought to be fundamental to hippocampal memory function) extend to the nonspatial domain. To answer this question, we have developed a set of flexible inferential methods based on Gaussian process models for detecting neural patterns and a set of powerful predictive models based on deep learning algorithms for neural decoding. Our findings could lead to unprecedented insight into the neural mechanisms underlying memory impairments.

Spring 2019

Abstracts, when available, are included in the drop-down

Steve Yadlowsky, Ph.D. Candidate, Electrical Engineering at Stanford - Bounds on the conditional and average treatment effect with unobserved confounding factors

We study estimation of causal effects when the dependence of treatment assignments on unobserved confounding factors is bounded. First, we quantify bounds on the conditional average treatment effect under a bounded unobserved confounding model, first studied by Rosenbaum for the average treatment effect. Then, we propose a semi-parametric model to bound the average treatment effect and provide a corresponding inferential procedure, allowing us to derive confidence intervals of the true average treatment effect. Our semi-parametric method extends Chernozhukov et al.’s double machine learning method for the average treatment effect, which assumes all confounding variables are observed. As a result, our method allows applications in problems involving covariates of a higher dimension than traditional sensitivity analyses, e.g., covariate matching, allow. We complement our methodological development with optimality results showing that in certain cases, our proposed bounds are tight. In addition to our theoretical results, we perform simulation and real data analyses to investigate the performance of the proposed method, demonstrating the accuracy of the new confidence intervals in practical finite sample regimes.

Jimmie Ye, Assistant Professor of Epidemiology & Biostatistics, UCSF - Multiplexed Single Cell RNA-sequencing

Droplet single-cell RNA-sequencing (dscRNA-seq) has enabled rapid, massively parallel profiling of transcriptomes. However, assessing differential expression across multiple individuals has been hampered by inefficient sample processing and technical batch effects. Here we describe a computational tool, demuxlet, that harnesses natural genetic variation to determine the sample identity of each droplet containing a single cell (singlet) and detect droplets containing two cells (doublets). These capabilities enable multiplexed dscRNA-seq experiments in which cells from unrelated individuals are pooled and captured at higher throughput than in standard workflows. Using simulated data, we show that 50 single-nucleotide polymorphisms (SNPs) per cell are sufficient to assign 97% of singlets and identify 92% of doublets in pools of up to 64 individuals. Given genotyping data for each of eight pooled samples, demuxlet correctly recovers the sample identity of >99% of singlets and identifies doublets at rates consistent with previous estimates. We apply demuxlet to assess cell-type-specific changes in gene expression in 8 pooled lupus patient samples treated with interferon (IFN)-β and perform eQTL analysis on 23 pooled samples.
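A toy likelihood calculation conveys the core demultiplexing idea (this is not the demuxlet software): score a droplet's SNP reads against each donor, and against each donor pair for doublets, and keep the highest-likelihood hypothesis. Genotypes, the error rate, and read counts below are simulated assumptions.

```python
# Toy sketch of genotype-based droplet demultiplexing (not the demuxlet software):
# score each droplet's SNP reads against every donor (singlet) and donor pair (doublet)
# and assign it to the highest-likelihood hypothesis. All data are simulated.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n_donors, n_snps, err = 8, 50, 0.01
geno = rng.integers(0, 3, size=(n_donors, n_snps))        # 0/1/2 alt-allele dosage

def alt_prob(dosage):
    """Expected alt-read fraction for a dosage, with a small error rate."""
    return np.clip(dosage / 2.0, err, 1.0 - err)

def log_lik(alt, total, p):
    """Binomial log-likelihood of alt reads out of total reads, given alt fraction p."""
    return np.sum(alt * np.log(p) + (total - alt) * np.log(1.0 - p))

# Simulate one droplet that actually contains a 50/50 mix of donors 2 and 5 (a doublet).
total = rng.poisson(3, n_snps) + 1
p_true = 0.5 * alt_prob(geno[2]) + 0.5 * alt_prob(geno[5])
alt = rng.binomial(total, p_true)

scores = {}
for d in range(n_donors):                                  # singlet hypotheses
    scores[(d,)] = log_lik(alt, total, alt_prob(geno[d]))
for d1, d2 in combinations(range(n_donors), 2):            # doublet hypotheses
    p_mix = 0.5 * alt_prob(geno[d1]) + 0.5 * alt_prob(geno[d2])
    scores[(d1, d2)] = log_lik(alt, total, p_mix)

print("best assignment:", max(scores, key=scores.get))     # likely (2, 5)
```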

Karol Estrada, Director of Statistical Genetics at BioMarin Pharmaceutical, Inc. - Using human genetics to identify other genetic forms of short stature

The cost of developing a new drug has been increasing dramatically over the last forty years. Many reasons can be attributed to this. The major challenge is that easy-to-solve diseases have already been tackled and now, more advanced technologies and scientific breakthroughs are needed to treat the diseases for which there is high medical need. In this talk, I'll showcase how we are using human genetic data from large-scale studies to identify opportunities to validate and repurpose already existing drugs.

James Zou (Assistant Professor of Biomedical Data Science at Stanford), Amirata Ghorbani (Ph.D. Student, Electrical Engineering), & Abubakar Abid (Ph.D. Student, Electrical Engineering) - What is your data worth? Quantifying the value of data in machine learning

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, Gov. Newsom recently proposed a "data dividend" whereby consumers are compensated by companies for the data that they generate. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, our experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it gives actionable insights on what types of data benefit or harm the prediction model; 2) weighting training data by Shapley value improves domain adaptation. This is joint work with Amirata Ghorbani.

In the second part, Abubakar Abid, my PhD student, will present Gradio, a new framework to efficiently share and test ML models in the wild.
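The Monte Carlo estimate of data Shapley described in the first part of the talk can be sketched as follows: sample random permutations of the training points and average each point's marginal contribution to validation performance. The model, dataset, and number of permutations are illustrative, and the truncation and convergence checks of the published TMC-Shapley algorithm are omitted.

```python
# Monte Carlo sketch of data Shapley: average each training point's marginal
# contribution to validation accuracy over random permutations. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X, y = make_classification(n_samples=240, n_features=10, random_state=8)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=40, random_state=8)

def value(idx):
    """Validation accuracy of a model trained on the subset idx (0.5 if degenerate)."""
    if len(idx) < 2 or len(set(y_tr[idx])) < 2:
        return 0.5
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    return clf.score(X_val, y_val)

n = len(y_tr)
shapley = np.zeros(n)
n_perms = 30
for _ in range(n_perms):
    perm = rng.permutation(n)
    prev = value([])
    for k in range(n):
        cur = value(perm[: k + 1])
        shapley[perm[k]] += (cur - prev) / n_perms   # marginal contribution of point perm[k]
        prev = cur

print("most valuable training points:", np.argsort(shapley)[-5:])
print("least valuable training points:", np.argsort(shapley)[:5])
```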

Stephen Montgomery, Associate Professor of Pathology and of Genetics at Stanford - Partying with the outliers: identifying large-effect rare variants using functional genomics

We are using functional genomics (i.e. gene expression, methylation, RNA-seq) to identify large effect rare variants that influence human traits. These variants would be challenging to identify through genome-wide association studies (GWAS) due to the low numbers of observations of each allele and the large number of variants. By looking at functional genomics outliers, our general approach is to identify rare variants in specific genes and individuals with outlier levels to use as a candidate subset for trait association testing; thereby reducing the multiple testing burden. This has been a promising approach but is more ad hoc in its application. When is regression sufficient? When should one consider using molecular outliers? What constitutes an outlier? How do we effectively integrate across multiple layers of molecular data? Let’s discuss!

Xin Shi, Reader in Applied Statistics at Manchester Metropolitan University and Director of Manchester Met China Centre - Big data for health management—early diagnosis, intervention and prevention

Health is the most important human need, and a long and healthy life is one of the primary subjects of human health research. However, it is difficult to accurately assess health status at a very early stage, with the aim of determining appropriate interventions to maintain good health and wellbeing. Therefore, it is essential to optimize human health management policies and assess the risk factors associated with health status. Human health management is the process and means for health risk factor monitoring, prognostics, intervention and control, based on our knowledge of human health and prevention and using linked non-clinical and clinical data. Symptoms that could indicate potential advanced disease or chronic disease are often ignored or missed. This can lead to serious delays in clinical diagnosis and timely treatment intervention. It also increases medical treatment costs as well as the patient's physical, mental and financial burden.

Our study aims to develop a systematic approach which integrates statistical and artificial-intelligence-based modeling of health big data into optimal health management decision-making with a mobile application. By developing statistical modeling methods for health big data on early diagnosis, prevention and intervention, we are developing a multi-stage delay-time model to investigate risk factors and predict health status at an earlier stage of disease/illness progression using linked clinical and non-clinical data. In this talk, we will present our recent research outcomes and discuss the challenges for future study.

John Witte, Professor of Epidemiology & Biostatistics at UCSF - Construction and application of polygenic risk scores

Over the past decade, genome-wide association studies (GWAS) have found thousands of variants associated with hundreds of phenotypes. The conventional GWAS approach evaluates each variant individually. However, individual variants almost always have a small effect on a given phenotype. To address this limitation, one can instead combine variants together into a polygenic risk score (PRS), which can be more strongly associated with phenotypes. This suggests that a PRS may be useful for predicting phenotypes, for example, indicating which individuals are at a substantially increased risk of cancer and should undergo more active screening. While there is much hope and excitement surrounding the potential use of PRS, there also remain a number of unanswered questions and concerns with this approach. I will consider some of the critical issues, including how to best construct PRS from GWAS data, with an application to multiple different cancers in the UK Biobank cohort.
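The simplest way to construct a PRS from GWAS summary statistics, thresholding on p-values and summing effect-size-weighted allele dosages, can be sketched as follows. The summary statistics and genotypes are simulated, and LD clumping, a key step in practice, is omitted.

```python
# Minimal "thresholding" sketch of PRS construction from GWAS summary statistics:
# keep variants with p below a cutoff and sum beta-weighted allele dosages.
# Simulated inputs; real pipelines also perform LD clumping, omitted here.
import numpy as np

rng = np.random.default_rng(9)
n_people, n_snps = 1000, 5000
dosage = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)   # 0/1/2 alleles

# Pretend GWAS summary statistics: most SNPs null, the first 50 truly associated.
beta = np.zeros(n_snps)
beta[:50] = rng.normal(scale=0.05, size=50)
pval = np.where(np.arange(n_snps) < 50,
                rng.uniform(1e-12, 1e-9, n_snps),     # associated SNPs: tiny p-values
                rng.uniform(0.0, 1.0, n_snps))        # null SNPs: uniform p-values

threshold = 5e-8
keep = pval < threshold
prs = dosage[:, keep] @ beta[keep]                    # weighted sum of risk alleles
prs = (prs - prs.mean()) / prs.std()                  # standardize for interpretability

print("variants included:", int(keep.sum()))
print("individuals with highest PRS:", np.argsort(prs)[-5:])
```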

Jennifer Listgarten, Professor of Electrical Engineering and Computer Science at UC Berkeley - Accelerating protein and molecule engineering with machine learning-based optimization

We present a new method for design problems wherein the goal is to maximize or specify the value of one or more properties of interest. For example, in protein design, one may wish to find the protein sequence that maximizes fluorescence. We assume access to one or more, potentially black box, stochastic "oracle" predictive functions, each of which maps from input (e.g., protein sequences) design space to a distribution over a property of interest (e.g. protein fluorescence). At first glance, this problem can be framed as one of optimizing the oracle(s) with respect to the input. However, many state-of-the-art predictive models, such as neural networks, are known to suffer from pathologies, especially for data far from the training distribution. Thus we need to modulate the optimization of the oracle inputs with prior knowledge about what makes 'realistic' inputs (e.g., proteins that stably fold). Herein, we propose a new method to solve this problem, Conditioning by Adaptive Sampling, which yields state-of-the-art results on a simulated protein fluorescence problem, as compared to other recently published approaches. Formally, our method achieves its success by using model-based adaptive sampling to estimate the conditional distribution of the input sequences given the desired properties.
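A much-simplified, cross-entropy-style version of the adaptive sampling loop on a continuous toy problem is sketched below. It is not the authors' Conditioning by Adaptive Sampling algorithm, which conditions a generative model of realistic sequences on the desired property; the toy oracle and Gaussian search distribution are assumptions.

```python
# Simplified cross-entropy-style adaptive sampling on a toy continuous design problem.
# Not the CbAS method itself; purely an illustration of the adaptive sampling loop.
import numpy as np

rng = np.random.default_rng(10)

def oracle(x):
    """Noisy black-box property predictor; its maximum is near x = (2, -1)."""
    target = np.array([2.0, -1.0])
    return -np.sum((x - target) ** 2, axis=1) + rng.normal(scale=0.1, size=len(x))

mu, sigma = np.zeros(2), np.ones(2) * 3.0            # initial search distribution
for it in range(30):
    samples = rng.normal(mu, sigma, size=(200, 2))   # propose candidate designs
    scores = oracle(samples)
    elite = samples[np.argsort(scores)[-20:]]        # keep the top 10% of candidates
    mu = elite.mean(axis=0)                          # refit the search distribution
    sigma = elite.std(axis=0) + 1e-3

print("estimated optimum:", np.round(mu, 2))         # should approach (2, -1)
```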

Adam Auton, Principal Scientist at 23andMe - Participant-Powered Research

23andMe's mission is to help people access, understand, and benefit from the human genome. In this talk, I will provide an overview of research studies conducted at 23andMe, and outline how we engage our customers in scientific research via the 23andMe service. Using this approach, 23andMe has developed the world's largest consented, re-contactable database for genetic research, with more than 5 million customers, a research consent rate over 80%, and over one billion phenotypic data points. I will discuss how the 23andMe Research team has leveraged this database to drive scientific discovery that can lead to novel therapies offering benefits for patients.

Winter 2019

Abstracts, when available, are included in the drop-down

Mengdi Wang, Assistant Professor of Operations Research and Financial Engineering at Princeton University - Statistical State Compression and Primal-Dual Pi Learning

Recent years have witnessed increasing empirical successes in reinforcement learning. However, many statistical questions about reinforcement learning are not well understood even in the most basic setting. For example, how many sample transitions are needed and sufficient for estimating a near-optimal policy for a Markov decision problem (MDP)? In the first part, we survey recent advances on the methods and complexity for MDPs with finitely many states and actions, a most basic model for reinforcement learning. In the second part, we study the statistical state compression of general finite-state Markov processes. We propose a spectral state compression method for learning state features and aggregation structures from data. The state compression method is able to "sketch" a black-box Markov process from its empirical data, for which both minimax statistical guarantees and scalable computational tools are provided. In the third part, we propose a bilinear primal-dual pi learning method for learning the optimal policy of an MDP, which utilizes given state features. The method is motivated by a saddle point formulation of the Bellman equation. Its sample complexity depends only on the number of parameters and is invariant with respect to the dimension of the problem, making high-dimensional reinforcement learning possible using "small" data.
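A toy sketch of the spectral state compression step in the second part might look like this: estimate an empirical transition matrix from observed transitions, truncate its SVD, and cluster the leading singular vectors into meta-states. The lumpable chain, the chosen rank, and the use of k-means are illustrative assumptions.

```python
# Toy sketch of spectral state compression: estimate an empirical transition matrix,
# truncate its SVD, and cluster the leading singular vectors into meta-states.
# The chain, rank, and use of k-means are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
n_states, n_blocks = 12, 3
# Ground truth: states within the same block share one row of transition probabilities.
block = np.repeat(np.arange(n_blocks), n_states // n_blocks)
row_proto = rng.dirichlet(np.ones(n_states), size=n_blocks)
P_true = row_proto[block]                            # lumpable transition matrix

# Simulate a trajectory and count the observed transitions.
counts = np.zeros((n_states, n_states))
s = 0
for _ in range(100_000):
    s_next = rng.choice(n_states, p=P_true[s])
    counts[s, s_next] += 1
    s = s_next
P_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)   # empirical matrix

# Rank-r SVD, then cluster states by their leading left singular components.
U, S, Vt = np.linalg.svd(P_hat)
features = U[:, :n_blocks] * S[:n_blocks]
labels = KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit_predict(features)
print("recovered meta-states:", labels)              # should match 'block' up to relabeling
print("true blocks:          ", block)
```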

Samiran Ghosh, Associate Professor of Biostatistics at Wayne State University - Non-Inferiority Design in Comparative Effectiveness Research: Should We be Bayesian for a While?

Randomized controlled trials (RCTs) are an indispensable source of information about the efficacy of treatments in almost any disease area. With the availability of multiple treatment options, comparative effectiveness research (CER) is gaining importance for better and informed health care decisions. However, the design and analysis of an effectiveness trial is much more complex than that of an efficacy trial. The effect of including one or more active comparator arms in an RCT is immense; this gives rise to superiority and non-inferiority trials. The non-inferiority (NI) RCT design plays a fundamental role in CER and will also be the focus of this talk. In the past decade many statistical methods have been developed, though largely in the frequentist setup. However, historical placebo-controlled trial data are useful and, if integrated into the current NI trial design, can provide better precision for CER. This may reduce the sample-size burden and significantly improve statistical power in the current trial. The Bayesian paradigm provides a natural path to integrate historical as well as current trial data via sequential learning in the NI setup. In this talk we will discuss both fraction-margin and fixed-margin Bayesian approaches for three-arm NI trials. We will also discuss some interesting open problems related to CER using NI trials.

Aaron Newman, Assistant Professor of Biomedical Data Science at Stanford University - In silico dissection of complex tissues from genomic data

Tissue composition is a major determinant of phenotypic variation and a key factor influencing disease outcomes. Although single-cell RNA sequencing has emerged as a powerful technique for characterizing cellular heterogeneity, it is currently impractical for large sample cohorts and cannot be applied to fixed specimens collected as part of routine clinical care. Over the last decade, a number of computational techniques have been described for dissecting cellular content directly from genomic profiles of mixture samples. In this talk, I will review key computational and statistical considerations for “digital cytometry” applications. I will also discuss basic and translational efforts from our group to leverage cell signatures derived from diverse sources, including single-cell reference profiles, to infer cell type abundance and cell type-specific gene expression profiles from bulk tissue transcriptomes. Digital cytometry has the potential to augment single cell profiling efforts, enabling cost-effective, high throughput tissue characterization without the need for antibodies, disaggregation, or viable cells.
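The core deconvolution step behind "digital cytometry" can be sketched as a non-negative least squares problem: a bulk expression profile is modeled as a mixture of cell-type signature profiles. The signature matrix and mixing weights below are simulated; production tools add marker-gene selection, normalization, and robust regression on top of this basic idea.

```python
# Minimal deconvolution sketch: bulk profile ~ signature matrix x cell fractions,
# solved with non-negative least squares. Simulated data, illustrative only.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(12)
n_genes, n_cell_types = 500, 5
signature = rng.gamma(shape=2.0, scale=50.0, size=(n_genes, n_cell_types))

true_frac = np.array([0.40, 0.25, 0.20, 0.10, 0.05])        # hidden tissue composition
bulk = signature @ true_frac + rng.normal(scale=5.0, size=n_genes)   # noisy bulk profile

est, _ = nnls(signature, bulk)                               # non-negative mixing weights
est = est / est.sum()                                        # normalize to fractions

print("true fractions:     ", true_frac)
print("estimated fractions:", np.round(est, 3))
```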

Jeremy Freeman, Director of Computational Biology at Chan Zuckerberg Initiative - Open and collaborative computational biology

Across modern biology, open-source scientific software is increasingly critical for progress. At the Chan Zuckerberg Initiative, we are supporting this ecosystem through both grantmaking and software development. I’ll describe specific ongoing efforts for the storage, analysis, and visualization of cell biology sequencing and imaging data. I’ll also highlight our broader ideas for building and supporting data sharing and open ecosystems for computational biology more generally.

Ying Qing Chen, Affiliate Faculty of Biostatistics at University of Washington - Extraordinary Power of Statistics for a Scientific Breakthrough in HIV/AIDS Prevention Research: Design and Methods for the HPTN 052 Study

The HIV Prevention Trial Network (HPTN) 052 Study is a Phase III, controlled, randomized clinical trial to assess the effectiveness of immediate versus delayed antiretroviral therapy strategies on sexual transmission of HIV-1 (Cohen, et al., 2016). It was selected by the “Science Magazine” as the Scientific Breakthrough of the Year 2011 (Alberts, 2011). In this talk, we will focus on the design and methods that underlie this landmark study in HIV Treatment-as-Prevention, and discuss the lessons that we have learned for future prevention research. References: Alberts, B (2011) Science breakthroughs, Science, 334: 1604; Cohen, MS, Chen, YQ, McCauley, M, et al. (2016) Antiretroviral therapy for the prevention of HIV transmission. New England Journal of Medicine, 375: 830-839.

Mohsen Bayati, Associate Professor of Operations, Information & Technology at Stanford University Graduate School of Business - Reducing Exploration in Personalized Decision-Making

Recently, there has been a surge in applying statistical learning methods in healthcare to build models for predicting adverse outcomes from patient covariates. These predictions are then used to optimize the allocation of scarce resources or treatment decisions. However, when the treatments are new, decision-making should optimize a trade-off between two objectives: (1) learning decision outcomes as functions of individual-specific covariates (exploration) and (2) maximizing the benefit of the decisions. The current literature on this problem, the theory of contextual multi-armed bandits, focuses on algorithms that rely on forced exploration to address this trade-off. However, forced exploration can be considered costly or unethical in certain decision-making tasks (e.g., hospital quality improvement initiatives). In this talk, we first introduce an algorithm that leverages free exploration from patient covariates and achieves a rate-optimal objective. We also show, empirically, that the algorithm significantly reduces exploration compared to existing benchmarks. Next, we focus on settings where past data on decision outcomes is available. Motivated by recent literature on low-rank matrix estimation, we design algorithms that avoid unnecessary exploration by targeting the learning towards shared similarities among decisions or patients. We then demonstrate the performance of the proposed methods by estimating the personalized effect of a glucose inhibitor drug (Metformin) for pre-diabetic treatment.
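A toy version of the exploration-free idea is a greedy linear contextual bandit: per-arm ridge regressions are updated online and each patient simply receives the arm with the highest predicted benefit, relying on natural variation in covariates to keep learning. This is an illustration only, not the authors' algorithm or its guarantees; the linear reward model and simulated data are assumptions.

```python
# Toy greedy contextual bandit: per-arm ridge regression, always choose the arm with
# the highest predicted reward, relying on covariate diversity for "free" exploration.
# Simulated data; illustrative only.
import numpy as np

rng = np.random.default_rng(13)
d, n_arms, n_patients = 5, 2, 5000
theta = rng.normal(size=(n_arms, d))                 # true per-arm reward coefficients

lam = 1.0
A = np.stack([lam * np.eye(d) for _ in range(n_arms)])   # per-arm ridge Gram matrices
b = np.zeros((n_arms, d))

regret = 0.0
for _ in range(n_patients):
    x = rng.normal(size=d)                           # patient covariates
    est = np.array([np.linalg.solve(A[k], b[k]) @ x for k in range(n_arms)])
    k = int(np.argmax(est))                          # greedy: no forced exploration
    reward = theta[k] @ x + rng.normal(scale=0.5)
    A[k] += np.outer(x, x)                           # online ridge update
    b[k] += reward * x
    regret += np.max(theta @ x) - theta[k] @ x

print("average per-patient regret:", round(regret / n_patients, 4))
```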

Haiyan Huang, Associate Professor of Statistics at UC Berkeley - GeneFishing: a method to reconstruct context-specific portraits of biological processes and its application to cholesterol metabolism

Rapid advances in genomic technologies have led to a wealth of diverse data, from which novel discoveries can be gleaned through the application of robust statistical and computational methods. Here we describe GeneFishing, a computational approach to reconstruct context-specific portraits of biological processes by leveraging gene-gene co-expression information. GeneFishing incorporates multiple high-dimensional statistical ideas, including dimensionality reduction, clustering, subsampling and results aggregation, to produce robust results. To illustrate the power of our method, we applied it using 21 genes involved in cholesterol metabolism as "bait", to "fish out" (or identify) genes not previously identified as being connected to cholesterol metabolism. Using simulation and real datasets, we found the results obtained through GeneFishing were more interesting for our study than those provided by related gene-prioritization methods. In particular, application of GeneFishing to the GTEx liver RNA-seq data not only re-identified many known cholesterol-related genes, but also pointed to glyoxalase I (GLO1) as a novel gene implicated in cholesterol metabolism. In a follow-up experiment, we found that GLO1 knock-down in human hepatoma cell lines increased levels of cellular cholesterol ester, validating a role for GLO1 in cholesterol metabolism. In addition, we performed pan-tissue analysis by applying GeneFishing on various tissues and identified many potential tissue-specific cholesterol metabolism related genes. GeneFishing appears to be a powerful tool for identifying novel related components of complex biological systems and may be employed across a wide range of applications.
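A much-simplified sketch of the subsample, cluster, and aggregate loop is given below: bait genes are mixed with random candidate subsamples, co-expression profiles are clustered, and each candidate's "capture frequency" in the bait-dominated cluster is recorded. The data are simulated and plain k-means stands in for the spectral clustering used in GeneFishing.

```python
# Simplified sketch of a GeneFishing-style subsample / cluster / aggregate loop:
# mix bait genes with random candidate subsamples, cluster co-expression profiles,
# and count how often each candidate is "captured" in the bait-dominated cluster.
# Simulated data; k-means stands in for the paper's spectral clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(14)
n_samples, n_bait, n_candidates = 200, 21, 300
latent = rng.normal(size=n_samples)                               # shared pathway activity
bait = latent[:, None] + 0.5 * rng.normal(size=(n_samples, n_bait))
cand = rng.normal(size=(n_samples, n_candidates))
cand[:, :10] = latent[:, None] + 0.7 * rng.normal(size=(n_samples, 10))   # 10 hidden hits
expr = np.column_stack([bait, cand])                              # genes in columns

captures, rounds = np.zeros(n_candidates), np.zeros(n_candidates)
for _ in range(200):
    sub = rng.choice(n_candidates, size=40, replace=False)
    genes = np.concatenate([np.arange(n_bait), n_bait + sub])
    corr = np.corrcoef(expr[:, genes].T)                          # co-expression features
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(corr)
    bait_cluster = np.bincount(labels[:n_bait]).argmax()          # cluster holding the bait
    captures[sub] += labels[n_bait:] == bait_cluster
    rounds[sub] += 1

rate = captures / np.maximum(rounds, 1)
print("candidates fished out:", np.flatnonzero(rate > 0.9))       # expect mostly 0..9
```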

Caroline Colijn, Professor of Mathematics at Simon Fraser University - Trees and infectious disease outbreaks

With the development of rapid, low-cost and readily available sequencing technologies, there is a need for quantitative methods to help interpret sequence datasets and relate them to the dynamics of biological systems. Trees (in the sense of graphs with no cycles) are a mainstay of how we represent and understand sequence data. I will introduce several flavours of trees with their motivating applications, and will describe a metric — in the sense of a true distance function — on unlabelled binary trees; this metric is derived from polynomials on the unlabelled trees. In the second part of the talk I will describe inference tools using trees, in the context of infectious disease: we use a mapping between phylogenetic trees and transmission trees to construct a Bayesian MCMC approach to estimate who infected whom and when. I will describe extensions of this inference approach to simultaneous reconstruction of outbreaks in different clusters, and conclude with a description of open problems and challenges in this area.

Lexin Li, Professor of Biostatistics and Epidemiology at UC Berkeley - Some Statistical Problems in Brain Functional Connectivity Analysis

Brain functional connectivity maps the intrinsic functional architecture of the brain through correlations in neurophysiological measures of brain activity. Accumulating evidence suggests that it holds crucial insights into the pathologies of a wide range of neurological disorders. Brain functional connectivity analysis is at the foreground of neuroscience research, and is drawing increasing attention in the statistics field as well. A connectivity network is characterized by a graph, where nodes represent brain regions, and links represent statistical dependence that is often encoded by partial correlation. Such a graph is inferred from matrix-valued neuroimaging data such as electroencephalography and functional magnetic resonance imaging. In this talk, we examine a number of statistical problems arising in brain connectivity analysis, including multigraph penalized estimation, graph-based hypothesis testing, and dynamic connectivity network modeling.
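As background for the partial-correlation graphs mentioned above, a standard sparse estimate can be obtained with the graphical lasso: off-diagonal entries of the estimated precision matrix, suitably rescaled, are partial correlations and define the network's edges. The simulated region-by-time data and the use of scikit-learn's GraphicalLassoCV are illustrative assumptions, not the estimators developed in the talk.

```python
# Sketch: estimate a sparse partial-correlation network among brain regions with the
# graphical lasso. Simulated region-by-time data; GraphicalLassoCV is one standard
# choice rather than the specific methods discussed in the talk.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(15)
n_regions, n_timepoints = 10, 400
# Ground-truth sparse precision matrix (chain structure among regions).
prec = np.eye(n_regions)
for i in range(n_regions - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.4
cov = np.linalg.inv(prec)
ts = rng.multivariate_normal(np.zeros(n_regions), cov, size=n_timepoints)

model = GraphicalLassoCV().fit(ts)
P = model.precision_
# Partial correlation between regions i and j: -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

edges = np.argwhere(np.triu(np.abs(partial_corr) > 0.1, k=1))
print("recovered edges (region pairs):", edges.tolist())
```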