**Spring 2019**

All colloquia will be held at 4pm in AUST 108, unless otherwise noted. Coffee will be served at 3:30pm in room 326.

**Information about past colloquia is available here.**

Colloquium is organized by Professor Yuwen Gu.

## Karthik Bharath; University of Nottingham

### Invariant aspects of functional data and their statistical implications

#### May 1, 2019

While analysing the shape of brain tumours from images, methods should be agnostic to variations due to rotations, scalings and reparameterisations of the curves representing the tumours. Similar considerations regarding phase/temporal variation may be applicable when analysing growth curves and temperature measurements over time across different locations. How can one isolate those features in the data that are unaffected by such (possibly) nuisance variations?

In this talk, I will describe a group-theoretic setting, compatible with the metric structure of the data space of functions and curves, that provides a natural way to model such variation. The ensuing invariant properties offer insight into the dimensionality of the data, and impact various statistical models and methods for functional data.

## Dipankar Bandyopadhyay; Virginia Commonwealth University

### Bayesian Nonparametric Policy Search with Application to Periodontal Recall Intervals

#### April 24, 2019

Tooth loss from periodontal disease is a major public health burden in the United States. Standard clinical practice is to recommend a dental visit every six months; however, this practice is not evidence based, and poor dental outcomes and increasing dental insurance premiums indicate room for improvement. We consider a tailored approach that recommends recall time based on patient characteristics and medical history to minimize disease progression without increasing resource expenditures. We formalize this method as a dynamic treatment regime which comprises a sequence of decisions, one per stage of intervention, that follow a decision rule mapping current patient information to a recommendation for their next visit time. The dynamics of periodontal health, visit frequency, and patient compliance are complex, yet the estimated optimal regime must be interpretable to domain experts if it is to be integrated into clinical practice. In this talk, we combine non-parametric Bayesian dynamic modeling with policy-search algorithms to estimate the optimal dynamic treatment regime within an interpretable class of regimes. Both simulation experiments and application to a rich database of electronic dental records from the HealthPartners HMO show that our proposed method may lead to better dental health without increasing the average recommended recall time, relative to competing methods.

## Francis Zwiers; University of Victoria

### On the challenge of making reliable inferences about extreme precipitation events

#### April 22, 2019

The design, management and adaptation of infrastructure to a changing climate make extensive use of estimates of the expected frequency and intensity of extreme precipitation events. Such information is also used for disaster response planning, land use planning and regulation, and many other purposes. The primary source of information that is used to meet this requirement is from meteorological stations that are irregularly scattered across the landscape and have operated for variable periods of time, measured in years to multiple decades. While many methods have been proposed to extract as much information as possible from the available data about the far right-hand tail of the precipitation distribution at a given location, the limitations on our ability to constrain estimates of high quantiles well, and to determine whether they are changing over time, as might be expected in a warming climate, are not well appreciated. The advent of large ensemble climate simulations with global, and now higher resolution regional, climate models allows us to develop an improved appreciation of the amount of observational data that may be required to estimate high quantiles well, and to determine how and why they are changing. This talk will provide an overview of work undertaken in collaboration with my group on extreme precipitation and its possible future change. Topics will include (i) an assessment of difficult-to-test stability assumptions that are made as a matter of course in the analysis of extreme precipitation, (ii) quantification of how much information is required to reliably constrain non-stationary models of extreme precipitation, and, time permitting, (iii) a proposed approach to the estimation of probable maximum precipitation, a quantity that is often used in dam design.

## Bhramar Mukherjee; University of Michigan

### Handling Sampling and Selection Bias in Phenome-wide Association Studies

#### April 17, 2019

In this talk I will discuss statistical challenges and opportunities with joint analysis of electronic health records and genomic data through “Phenome-Wide Association Studies (PheWAS)”. I will posit a modeling framework that helps us to understand the effect of both selection bias and outcome misclassification in assessing genetic associations across the medical phenome. To illustrate the analytic framework, I will use data from the UK Biobank and the Michigan Genomics Initiative, a longitudinal biorepository at Michigan Medicine launched in 2012. The examples illustrate that understanding sampling design and selection bias matters for big data and is at the heart of doing good science with data. This is joint work with Lauren Beesley and Lars Fritsche at the University of Michigan.

## Debanjan Bhattacharjee; Utah Valley University

### Sequentially Estimating the Required Optimal Observed Number of Tagged Items with Bounded Risk in the Recapture Phase Under Inverse Binomial Sampling

#### April 12, 2019

In the context of statistical ecology, a biologist may want to estimate the number of fish of a certain species residing in a specific part of a large lake or an ocean. A forester working on behalf of the department of parks and wildlife may want to estimate how many deer there are in a large habitat. These are a couple of motivating examples where a common thread lies in the estimation of the size (N) of a closed and finite population. Estimation of a closed population size under inverse binomial sampling consists of four basic steps: first, one captures t items, then tags these t items, and then releases the t tagged items back into the population. Finally, one draws items from the population one by one until s tagged items are recaptured, where s is fixed in advance. In the recapturing stage (the fourth step), items are normally drawn with replacement, but sampling without replacement will not change much if N is large. Under squared error loss (SEL) as well as weighted SEL, we propose sequential methodologies to come up with bounded-risk point estimators of an optimal choice of s, leading to an appropriate sequential estimator of N. The sequential estimation methodologies are supplemented with appropriate first-order asymptotic properties, which are followed by extensive data analyses.
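
The four sampling steps in this abstract can be illustrated with a small simulation. The sketch below is not the speaker's sequential methodology: it fixes s in advance, draws with replacement, and uses the classical estimator t × (number of draws) / s, which is unbiased in that setting because the number of draws is negative binomial with success probability t/N. The function names and the values of N, t and s are illustrative assumptions.

```python
import random


def recapture_draws(N, t, s, rng):
    """Draw items with replacement until s tagged items have been seen.

    Items 0..t-1 play the role of the t tagged items in a population of size N.
    Returns the total number of draws needed.
    """
    draws = 0
    tagged_seen = 0
    while tagged_seen < s:
        draws += 1
        if rng.randrange(N) < t:  # a tagged item was drawn
            tagged_seen += 1
    return draws


def estimate_population(t, s, draws):
    """Classical point estimate of N: t * draws / s."""
    return t * draws / s


# Hypothetical values chosen only for illustration.
rng = random.Random(0)
N, t, s = 5000, 200, 50

estimates = [
    estimate_population(t, s, recapture_draws(N, t, s, rng))
    for _ in range(500)
]
mean_estimate = sum(estimates) / len(estimates)
print(f"true N = {N}, average estimate over 500 runs = {mean_estimate:.0f}")
```

Averaged over many replications the estimate should sit close to the true N, while any single run can be far off. That variability is what motivates choosing s to control the risk.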

## Patrick Flaherty; University of Massachusetts

### MAP Estimation for the Gaussian Mixture Model via Modern Optimization Methods

#### April 10, 2019

The Gaussian mixture model is a prototype for studying model-based clustering methods and a practical tool for real data analysis. The maximum likelihood or maximum a posteriori (MAP) parameter estimates are typically computed using the expectation-maximization (EM) algorithm. From an optimization perspective, we know that the EM algorithm can be viewed as a solution of a particular relaxation of a mixed-integer nonlinear optimization problem. Taking this perspective further, we find that many other methods for estimating the parameters can also be viewed as particular relaxations of the optimization problem. Using the optimization perspective, we develop some novel methods for MAP estimation for the Gaussian mixture model. These methods naturally handle complex prior constraints that would be difficult to formulate using standard distributions. Though this is a work in progress, we show some numerical experiments that give us confidence that the approach can be made scalable to large data sets.
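
For readers unfamiliar with the EM algorithm mentioned in this abstract, here is a minimal sketch of textbook maximum likelihood EM for a two-component, one-dimensional Gaussian mixture. It is not the speaker's MAP method and it includes no prior. The function name, the initialization and the simulated data are assumptions made for illustration.

```python
import numpy as np


def em_gmm_1d(x, n_iter=200):
    """Textbook EM for a two-component 1-D Gaussian mixture (maximum likelihood)."""
    # Crude initialization: means at the data extremes, shared variance.
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma2 = np.array([x.var(), x.var()], dtype=float)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. soft cluster memberships in [0, 1].
        dens = (
            pi
            * np.exp(-((x[:, None] - mu) ** 2) / (2 * sigma2))
            / np.sqrt(2 * np.pi * sigma2)
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of weights, means and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma2 = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, sigma2


# Simulated data: 30% from N(-2, 1) and 70% from N(3, 1).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
pi, mu, sigma2 = em_gmm_1d(x)
print("weights:", pi, "means:", mu, "variances:", sigma2)
```

The responsibilities `resp` are fractional memberships rather than hard 0/1 assignments. One way to read the abstract is that the mixed-integer problem is relaxed by replacing binary assignment variables with such fractional values.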

## Donald Berry; University of Texas

### The Bayesian Revolution in Medical Research

#### April 3, 2019

Randomization was introduced into clinical trials by Bradford Hill in the 1940s. The RCT was revolutionary. It changed the stuff of medicine from case studies and expert opinion into a real science. The RCT became the gold standard. It was so revered that nobody wanted to change it or let others change it. As a consequence the RCT has remained pretty much the same over the last 80 years. Until now.

Bayesians have always thought they knew better. But only recently have Bayesians made inroads into clinical trial design, taking the RCT to new levels. I will explain why and how this happened. I will give examples of clinical trials in the new millennium. I will focus on adaptive basket trials and adaptive platform trials because they seem to be the niche most clearly having a role for the Bayesian approach. Both belie the old saw, “Keep it simple, stupid.” In medicine at least, KISS will come to mean, “Keep it simple and stupid.”

## Linglong Kong; University of Alberta

### A General Framework for Quantile Estimation with Incomplete Data

#### March 27, 2019

Quantile estimation has attracted significant research interest in recent years. However, there is only limited literature on quantile estimation in the presence of incomplete data. In this paper, we propose a general framework to address this problem. Our framework combines the two widely adopted approaches for missing data analysis, the imputation approach and the inverse probability weighting approach, via the empirical likelihood method. The proposed method is capable of dealing with many different missingness settings. We mainly study three of them: (i) estimating the marginal quantile of a response that is subject to missingness while there are fully observed covariates; (ii) estimating the conditional quantile of a fully observed response while the covariates are partially available; and (iii) estimating the conditional quantile of a response that is subject to missingness with fully observed covariates and extra auxiliary variables. The proposed method allows multiple models for both the missingness probability and the data distribution. The resulting estimators are multiply robust in the sense that they are consistent if any one of these models is correctly specified. The asymptotic distributions are established using empirical process theory.

## Victor Pena; CUNY

### Criteria for Bayesian hypothesis testing in two-sample problems

#### March 13, 2019

We define new criteria for prior choice in two-sample hypothesis tests and find classes of prior distributions that satisfy them (and classes that don't). The criteria have a common starting point: a hypothetical situation where perfect knowledge about one of the groups is attained, while the data for the other group are assumed to be fixed. In such a scenario, the Bayes decision of the two-sample problem should "converge" to the Bayes decision of a one-sample test where we know the distribution of the group for which we gain perfect information. The first criterion is based on a limiting argument where the sample size of one of the groups grows to infinity, whereas the second criterion is based upon conditioning on the "true" value of the parameters. We find priors where the limiting argument and conditioning give rise to equivalent Bayes decisions under perfect knowledge, and cases where they give rise to different Bayes decisions. We show that, with some prior specifications, the limiting Bayes decisions are not compatible with any prior specification for the one-sample problem.

## Erin Conlon; University of Massachusetts

### Parallel Markov chain Monte Carlo for Bayesian hierarchical models with big data, in two stages

#### March 6, 2019

Due to the recent growth of big data sets, new Bayesian Markov chain Monte Carlo (MCMC) parallel computing methods have been created. These methods divide large data sets by observations into subsets. However, many Bayesian hierarchical models have only a small number of parameters that are common to the full data set, with the majority of parameters being group specific. Therefore, techniques that split the full data set by groups rather than by observations are a more natural analysis approach.

Here, we adapt and extend such a two-stage Bayesian hierarchical modelling method. In stage 1, each group is evaluated independently in parallel; the stage 1 posteriors are used as proposal distributions in stage 2, where the full model is estimated. We illustrate our approach using both simulation and real data sets, with both three-level and four-level models. Our results show considerable increases in MCMC efficiency and large reductions in computation times compared to the full data analysis.

## Stephanie Hicks; Johns Hopkins University

### Making data science accessible world-wide in the Johns Hopkins Data Science Lab

#### February 27, 2019

In this talk, I will introduce the Johns Hopkins Data Science Lab: who we are, what our goals are, and the types of projects we are working on to make data science accessible world-wide. Then, I will discuss projects that I have focused on related to data science education. Despite unprecedented and growing interest in data science on campuses, there are few courses and course materials that provide meaningful opportunities for students to learn about real-world challenges. Most courses provide unrealistically clean data sets that fit the assumptions of the methods in an unrealistic way. The result is that students are left unable to effectively analyze data and solve real-world challenges outside of the classroom. To address this problem, I am leveraging an idea from Nolan and Speed in 1999, who argued that the solution is to teach courses through in-depth case studies derived from interesting scientific questions with nontrivial solutions that leave room for different analyses of the data. I will share a set of general principles and offer a detailed guide derived from my successful experience developing and teaching graduate-level, introductory data science courses centered entirely on case studies. Furthermore, I will present Open Case Studies, an educational resource of case studies that educators can use in the classroom to teach students how to effectively derive knowledge from data arising from real-world challenges.

## Joseph Cappelleri; Pfizer

### Advancing Interpretation of Patient-Reported Outcomes

#### February 20, 2019

A patient-reported outcome is any report on the status of a patient’s health condition that comes directly from the patient. Clear and meaningful interpretation of patient-reported outcome scores is fundamental to their use, as they can be valuable in designing studies, evaluating interventions, educating consumers, and informing health policy makers involved with regulatory, reimbursement, and advisory agencies. Interpretation of patient-reported outcome scores, however, is often not well understood because of insufficient data or lack of experience or clinical understanding to draw from.

This presentation provides an updated review of two broad approaches, anchor-based and distribution-based, aimed at enriching the understanding and meaning of patient-reported outcome scores. Anchor-based approaches use a measure (external to the targeted patient-reported outcome of interest) that is readily interpretable and correlated with the targeted patient-reported outcome. Examples include percentages based on thresholds, criterion-group interpretation, content-based interpretation, and clinically important difference. Distribution-based approaches rely strictly on the distribution of the data. Examples include effect size, probability of relative benefit, and cumulative distribution functions. Applications are based on real-life and simulated examples.

## Jun Yan; University of Connecticut

### Generalized scale-change models for recurrent event processes under informative censoring

#### February 13, 2019

Two major challenges arise in regression analyses of recurrent event data: first, popular existing models, such as the Cox-type models, may not fully capture the covariate effects on the underlying recurrent event process; second, the censoring time remains informative about the risk of experiencing recurrent events after accounting for covariates. We tackle both challenges by a general class of semiparametric scale-change models that allow a scale-change covariate effect as well as a multiplicative covariate effect. The proposed model is flexible and nests several existing models, including the popular proportional rates model, the accelerated mean model, and the accelerated rate model. Moreover, it accommodates informative censoring through subject-level latent frailty whose distribution is left unspecified. A robust approach is proposed to estimate the model parameters, which does not need a parametric assumption on the distribution of the frailty and the recurrent event process. The asymptotic properties of the resulting estimator are established, with the asymptotic variance estimated from a novel resampling approach. As a byproduct, the structure of the model provides a model selection approach among the submodels via hypothesis testing of model parameters. Numerical studies show that the proposed estimator and the model selection procedure perform well under both noninformative and informative censoring scenarios. The methods are applied to data from two transplant cohorts to study the risk of infections after transplantation.

## Nicholas Henderson; Johns Hopkins University

### Estimating heterogeneous treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models

#### February 4, 2019

Individuals often respond differently to identical treatments, and characterizing such variability in treatment response is an important aim in the practice of personalized medicine. In this article, we describe a nonparametric accelerated failure time model that can be used to analyze heterogeneous treatment effects (HTE) when patient outcomes are time-to-event. By utilizing Bayesian additive regression trees and a mean-constrained Dirichlet process mixture model, our approach offers a flexible model for the regression function while placing few restrictions on the baseline hazard. Our nonparametric method leads to natural estimates of individual treatment effect and has the flexibility to address many major goals of HTE assessment. Moreover, our method requires little user input in terms of model specification for treatment-covariate interactions or for tuning parameter selection. Our procedure shows strong predictive performance while also exhibiting good frequentist properties in terms of parameter coverage and mitigation of spurious findings of HTE. We illustrate the merits of our proposed approach with a detailed analysis of two large clinical trials for the prevention and treatment of congestive heart failure using an angiotensin-converting enzyme inhibitor. The analysis revealed considerable evidence for the presence of HTE in both trials, as demonstrated by substantial estimated variation in treatment effect and by high proportions of patients exhibiting strong evidence of having treatment effects that differ from the overall treatment effect.

## Wen Zhou; Colorado State University

### Estimation and Inference of Heteroskedasticity Models with Latent Semiparametric Factors for Multivariate Time Series

#### January 28, 2019

This paper considers estimation and inference of a flexible heteroskedasticity model for multivariate time series, which employs semiparametric latent factors to simultaneously account for the heteroskedasticity and contemporaneous correlations. Specifically, the heteroskedasticity is modeled by the product of unobserved stationary processes of factors and subject-specific covariate effects. Serving as the loadings, the covariate effects are further modeled through additive models. We propose a two-step procedure for estimation. First, the latent processes of factors and their nonparametric loadings are estimated via projection-based methods. The estimation of regression coefficients is further conducted through generalized least squares. Theoretical validity of the two-step procedure is documented. By carefully examining the convergence rates for estimating the latent processes of factors and their loadings, we further study the asymptotic properties of the estimated regression coefficients. In particular, we establish the asymptotic normality of the proposed two-step estimates of regression coefficients. The proposed regression coefficient estimator is also shown to be asymptotically efficient. This leads us to a more efficient confidence set of the regression coefficients. Using a comprehensive simulation study, we demonstrate the finite sample performance of the proposed procedure, and numerical results corroborate our theoretical findings. Finally, we illustrate the use of our proposal through application to the air quality data.

## Yao Zheng; Purdue University

### Finite Time Analysis of Vector Autoregressive Models under Linear Restrictions

#### January 25, 2019

We develop a unified finite-time theory for the OLS estimation of possibly unstable and even slightly explosive VAR models under linear restrictions, with the applicable region $\rho(A)\leq 1+c/T$, where $\rho(A)$ is the spectral radius of the transition matrix $A$ in the VAR(1) representation, $T$ is the time horizon and $c>0$ is a universal constant. This linear restriction framework encompasses various existing models in the literature such as banded/network VAR models. We show that the restrictions reduce the error bounds through not only the reduced dimensionality but also a scale factor that resembles the asymptotic covariance matrix of the estimator in the fixed dimensional setup; as long as the model is correctly specified, this scale factor is decreasing in the number of restrictions. Our analysis reveals that the phase transition between the slow and fast error rate regimes is determined by the smallest singular value of $A$, a measure of the least excitable mode of the system. The minimax lower bounds are also derived across different regimes. The developed finite-time theory not only bridges the theoretical gap between stable and unstable regimes but also precisely characterizes the effect of the restrictions and its interplay with other model parameters. Simulations support our theoretical results in both small and large samples.
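
To make the objects in this abstract concrete, the sketch below simulates a stable VAR(1) process, estimates the transition matrix by unrestricted OLS, and computes the spectral radius. It does not implement the linear restrictions or the finite-time bounds discussed in the talk. The matrix `A`, the horizon `T` and the helper `spectral_radius` are assumptions chosen for illustration.

```python
import numpy as np


def spectral_radius(M):
    """Largest absolute eigenvalue of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


# Hypothetical stable transition matrix (eigenvalues 0.5 and 0.9).
A = np.array([[0.5, 0.2],
              [0.0, 0.9]])
T = 2000
d = A.shape[0]

# Simulate x_{t+1} = A x_t + noise with standard normal innovations.
rng = np.random.default_rng(0)
X = np.zeros((T + 1, d))
for t in range(T):
    X[t + 1] = A @ X[t] + rng.standard_normal(d)

# Unrestricted OLS: least squares solves X[:-1] @ B = X[1:], so B estimates A^T.
B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
A_hat = B.T

print("rho(A)     =", spectral_radius(A))
print("rho(A_hat) =", spectral_radius(A_hat))
print("max abs estimation error =", np.max(np.abs(A_hat - A)))
```

Moving the largest eigenvalue of `A` toward 1 takes the simulation into the near-unstable regime that the abstract's condition on $\rho(A)$ is designed to cover.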

## Youngdeok Hwang; Sungkyunkwan University

### Statistical Estimation of Air Pollution Through Integration of Physical Knowledge

#### January 23, 2019

Air pollution is driven by non-local dynamics, in which the air quality at a site is determined by the transport of pollutants from distant pollution emission sources by atmospheric processes. To understand the underlying nature of pollution generation, it is crucial to employ physical knowledge to account for the transport of pollutants by the wind. In this talk, I will discuss methods for estimating the pollution emission from the area of interest through the use of physical knowledge and observed data. The proposed methods use a statistical approach to utilize large-scale data from a numerical weather prediction model, while integrating the dynamics of the physical processes into the model. Also discussed are some extensions and related problems.