**Spring 2017**

Coffee will be served in room 326: at 3:30PM for each Wednesday colloquium, at 10:30AM for each 11AM colloquium, and at 10AM for each 9AM colloquium. TAs are responsible for set-up according to the schedule.


| Date | Speaker | Title | Location |
| --- | --- | --- | --- |
| Wednesday, January 18, 4:00pm | Fangfang Wang, University of Connecticut | On the Estimation of Integrated Volatility in the Frequency Domain | AUST 105 |
| Wednesday, January 25, 4:00pm | Victor Hugo Lachos Davila, University of Connecticut | Heavy-tailed longitudinal regression models for censored data: A likelihood based perspective | AUST 105 |
| Wednesday, February 1, 4:00pm | | | AUST 105 |
| Monday, February 6, 11:00am | Yuwen Gu, School of Statistics, University of Minnesota | High-dimensional Generalizations of Asymmetric Least Squares and Their Applications | AUST 344 |
| Wednesday, February 8, 4:00pm | | | AUST 105 |
| Friday, February 10, 11:00am | Jon Steingrimsson, Johns Hopkins Bloomberg School of Public Health | Doubly Robust Survival Trees and Forests | AUST 105 |
| Monday, February 13, 9:00am | Nhat Ho, University of Michigan | Parameter Estimation and Multilevel Clustering with Mixture and Hierarchical Models | AUST 344 |
| Wednesday, February 15, 4:00pm | Vishesh Karwa, Harvard University | Differentially Private Statistical Inference | AUST 105 |
| Friday, February 17, 11:00am | HaiYing Wang, University of New Hampshire | Information-Based Optimal Subdata Selection for Big Data Linear Regression | AUST 105 |
| Monday, February 20, 9:00am | Kuang-Yao Lee, Yale School of Public Health | On Additive Conditional Independence for High-Dimensional Statistical Analysis | AUST 344 |
| Wednesday, February 22, 10:00am | Jon Steingrimsson, Johns Hopkins Bloomberg School of Public Health | Doubly Robust Survival Trees and Forests *(rescheduled from Feb. 10)* | AUST 344 |
| Wednesday, March 1, 4:00pm | | | AUST 105 |
| Wednesday, March 8, 4:00pm *(joint event with School of Education)* | Dan McNeish, University of North Carolina, Chapel Hill | Is Bayes a Solution for Small Samples? | Gentry 144 |
| Wednesday, March 22, 4:00pm | Gongjun Xu, University of Michigan | Identifying Latent Structures in Restricted Latent Class Models | AUST 105 |
| Wednesday, March 29, 4:00pm | Andrea Troxel, NYU School of Medicine | Statistical Issues in Pragmatic Behavioral Clinical Trials | AUST 105 |
| Wednesday, April 5, 4:00pm | Michael Lavine, University of Massachusetts | WHIM: Function Approximation WHere It Matters | AUST 105 |
| Wednesday, April 12, 4:00pm | Donghui Zhang and Xiwen Ma, Sanofi | Statistical Collaborations in Drug Discovery – Challenges and Opportunities<br>Some Statistical Issues in Biomarker Identification for Clinical Study | AUST 105 |
| Wednesday, April 19, 4:00pm | Bani Mallick, Texas A&M University | | AUST 105 |
| Wednesday, April 26, 4:00pm | Gen Li, Columbia University | A General Framework for the Association Analysis of Heterogeneous Data | AUST 105 |

The colloquium series is organized by Professor Xiaojing Wang.

## Fangfang Wang; University of Connecticut

### On the Estimation of Integrated Volatility in the Frequency Domain

#### January 18, 2017

This talk discusses frequency-domain analysis of integrated volatility using intraday information. By exploring the informational content of the power spectrum of ultra-high-frequency data, the speaker considers a realized periodogram-based estimator of the ex-post price variation. When intraday equity prices are sampled at ultra-high frequency and are contaminated with market microstructure noise, the proposed estimator behaves like a low-pass filter: it removes the noise by filtering out the high-frequency periodogram ordinates, converting the high-frequency data into low-frequency periodograms. A numerical study shows that the proposed estimator is insensitive to the choice of sampling frequency and is competitive with other existing noise-corrected volatility measures.

## Victor Hugo Lachos Davila; University of Connecticut

### Heavy-tailed longitudinal regression models for censored data: A likelihood based perspective

#### January 25, 2017

HIV RNA viral load measures are often subject to upper and lower detection limits depending on the quantification assay. Hence, the responses are either left- or right-censored. Moreover, it is quite common to observe viral load measurements collected irregularly over time. A complication arises when these continuous repeated measures have a heavy-tailed behaviour. For such data structures, we propose a robust nonlinear censored regression model based on the scale mixtures of normal (SMN) distributions. To take into account the autocorrelation existing among irregularly observed measures, a damped exponential correlation structure is considered. A stochastic approximation of the EM (SAEM) algorithm is developed to obtain the maximum likelihood estimates of the model parameters. The main advantage of this new procedure is that it allows us to estimate the parameters of interest and evaluate the log-likelihood function in an easy and fast way. Furthermore, the standard errors of the fixed effects and predictions of unobservable values of the response can be obtained as a by-product. The practical utility of the proposed method is exemplified using both simulated and real data.

## Yuwen Gu; School of Statistics, University of Minnesota

### High-dimensional Generalizations of Asymmetric Least Squares and Their Applications

#### February 6, 2017

Asymmetric least squares (ALS) regression is a convenient and effective method for summarizing the conditional distribution of a response variable given the covariates. Recent years have seen a growing interest in ALS amongst statisticians, biostatisticians, econometricians and financial analysts. However, existing work on ALS only considers the traditional low-dimension-and-large-sample setting. In this talk, we systematically explore the Sparse Asymmetric LEast Squares (SALES) regression under high dimensionality. We develop a complete theory for penalties such as the lasso, MCP and SCAD. A unified efficient algorithm for fitting SALES is proposed and is shown to have a guaranteed linear convergence.

An important application of SALES is to detect heteroscedasticity in high-dimensional data and from that perspective it provides a computationally friendlier alternative to sparse quantile regression (SQR). However, when the goal is to separate the set of significant variables for the mean and that for the standard deviation of the conditional distribution, both SALES and SQR can fail when overlapping variables exist. To that end, we further propose a Coupled Sparse Asymmetric LEast Squares (COSALES) regression. We show that COSALES can consistently identify the two important sets of significant variables for the mean and standard deviation simultaneously, even when the two sets have overlaps.
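The asymmetric squared (expectile) loss that underlies ALS is easy to state concretely. As background not drawn from the talk itself, the sketch below (function names are our own) shows the loss and a simple iteratively reweighted averaging scheme for a single τ-expectile of a sample; τ = 0.5 recovers the ordinary mean:

```python
def asymmetric_sq_loss(u, tau):
    """rho_tau(u) = |tau - 1{u < 0}| * u**2, the asymmetric squared loss of ALS."""
    w = tau if u >= 0 else 1 - tau
    return w * u * u

def expectile(y, tau, tol=1e-10, max_iter=1000):
    """tau-th expectile of a sample: the minimizer of sum_i rho_tau(y_i - m).

    Computed by iteratively reweighted averaging: points above the current
    estimate get weight tau, points below get 1 - tau.
    """
    m = sum(y) / len(y)
    for _ in range(max_iter):
        w = [tau if yi > m else 1 - tau for yi in y]
        m_new = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m
```

SALES minimizes this loss over regression coefficients plus a sparsity-inducing penalty (lasso, MCP, or SCAD); the one-dimensional fixed point above is only the unpenalized special case.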

## Jon Steingrimsson; Johns Hopkins Bloomberg School of Public Health

### Doubly Robust Survival Trees and Forests

~~February 10, 2017~~

#### February 22, 2017

Survival trees use recursive partitioning to separate patients into distinct risk groups when some observations are right-censored. Survival forests average multiple survival trees creating more flexible prediction models. In the absence of censoring, the algorithms rely heavily on the choice of loss function used in the decision making process. Motivated by semiparametric efficiency theory, we replace the loss function used in the absence of censoring by doubly robust loss functions. We derive properties of these loss functions and show how the doubly robust survival trees and forest algorithms can be implemented using a certain form of response transformation. Furthermore, we discuss practical issues related to the implementation of the algorithms. The performance of the resulting survival trees and forests is evaluated through simulation studies and analyzing data on death from myocardial infarction.

## Nhat Ho; University of Michigan

### Parameter Estimation and Multilevel Clustering with Mixture and Hierarchical Models

#### February 13, 2017

This talk addresses statistical inference with mixture and hierarchical models: efficiency of parameter estimation in finite mixtures, and scalable clustering of multilevel structured data.

It is well-known that due to weak identifiability and singularity structures of latent variable models’ parameter space, the convergence behaviors of parameter estimation procedures for mixture models remain poorly understood. In the first part of the talk, we describe a general framework for characterizing impacts of weak identifiability and singularity structures on the convergence behaviors of the maximum likelihood estimator in finite mixture models. This allows us to resolve several open questions regarding popular models such as Gaussian and Gamma mixtures, as well as to explicate the behaviors of complex models such as mixtures of skew normal distributions.

In the second part of the talk, we address a clustering problem with multilevel structured data, with the goal of simultaneously clustering a collection of data groups and partitioning the data within each group. By exploiting the optimal transport distance as a natural metric on distributions and on collections of distributions, we propose an optimization formulation that allows us to discover the multilevel clustering structures in grouped data in an efficient way. We illustrate the performance of our clustering method in a number of application domains, including computer vision.

## Vishesh Karwa; Harvard University

### Differentially Private Statistical Inference

#### February 15, 2017

Differential privacy has emerged as a powerful tool to reason rigorously about privacy and confidentiality issues. In its purest form, differential privacy limits direct access to raw data, allowing interaction only through a noisy interface. This requires new approaches to statistical inference. In this talk, I will introduce the definition of differential privacy, followed by some of its key properties. I will then present a framework for performing statistical inference under the constraint of differential privacy and its connections to measurement error and missing data models, with several examples. I will end with a demonstration of a differentially private interface to access data, developed as a part of ongoing collaboration between computer scientists, political scientists, and lawyers at Harvard.
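As background for the "noisy interface" idea (this example is standard and not taken from the talk), the classic Laplace mechanism releases a query answer with noise calibrated to the query's sensitivity and the privacy budget ε. A minimal sketch with illustrative names:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value + Laplace(0, sensitivity/epsilon) noise.

    This satisfies epsilon-differential privacy for a query whose answer
    changes by at most `sensitivity` when one record is added or removed.
    """
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform on (-0.5, 0.5)
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    return true_value - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Example: privately release a count query. Adding or removing one person
# changes a count by at most 1, so sensitivity = 1.
private_count = laplace_mechanism(true_value=1000, sensitivity=1.0, epsilon=0.5)
```

Statistical inference on such releases must account for the injected noise, which is what connects differential privacy to the measurement-error and missing-data models mentioned above.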

## HaiYing Wang; University of New Hampshire

### Information-Based Optimal Subdata Selection for Big Data Linear Regression

#### February 17, 2017

Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable to extraordinarily large data sets due to computational limitations. A critical step in Big Data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to existing methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, i.e., the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data.
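A rough sketch of the D-optimality-motivated selection rule behind IBOSS: for each covariate, keep the rows with the most extreme (smallest and largest) values. The tie and overlap handling below is our own simplification for illustration, not necessarily the algorithm's exact specification:

```python
def iboss_doptimal(X, r):
    """Deterministically select about r of the n rows of X (a list of
    length-p rows) for a subdata linear regression fit.

    For each covariate j, take the k = r // (2p) not-yet-chosen rows with
    the smallest values of covariate j and the k with the largest values.
    No random subsampling is involved, so there is no sampling error.
    """
    n, p = len(X), len(X[0])
    k = r // (2 * p)
    taken = set()
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        picked = 0
        for i in order:              # k smallest remaining values of covariate j
            if i not in taken:
                taken.add(i)
                picked += 1
                if picked == k:
                    break
        picked = 0
        for i in reversed(order):    # k largest remaining values of covariate j
            if i not in taken:
                taken.add(i)
                picked += 1
                if picked == k:
                    break
    return sorted(taken)
```

Because selection only requires (partial) sorting per covariate, the cost is roughly O(np) to O(n log n per covariate), which is what makes the approach fast and amenable to distributed computation.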

## Kuang-Yao Lee; Yale School of Public Health

### On Additive Conditional Independence for High-Dimensional Statistical Analysis

#### February 20, 2017

With the advance of high-throughput technologies, massive and complex data are routinely collected, and these data need to be processed and analyzed differently from conventional data. In this presentation I will discuss a nascent concept for analyzing big data — additive conditional independence (ACI) — a three-way statistical relation that shares many similarities with conditional independence. However, its nonparametric characterization does not involve multivariate kernels, so it enjoys the flexibility of nonparametric estimators while avoiding the curse of dimensionality in high-dimensional settings. We illustrate the implementation of ACI via a case study on nonparametric graphical models, and describe a general framework for adopting ACI in a broader scope. Additionally, to emphasize the increasing impact of ACI, we also introduce several recent developments under various statistical settings. We investigate the properties of the proposed estimators through both theoretical and simulation analyses. The usefulness of our procedures is also demonstrated through an application to gene regulatory network (GRN) inference using a DREAM Challenge dataset. This is joint work with Bing Li (Penn State), Hongyu Zhao (Yale), Lexin Li (UC Berkeley) and Tianqi Liu (Yale).

## Dan McNeish; University of North Carolina, Chapel Hill

### Is Bayes a Solution for Small Samples?

#### March 8, 2017

In educational research, small-sample data are extremely common, especially when data have a hierarchical structure. Recent meta-analyses have found that between 20% and 50% of studies are classified as having small samples. As barriers to software implementation continue to fall, Bayesian methods are becoming an increasingly popular way to accommodate small-sample data, and such a strategy is often suggested. Although it is true that Bayesian methods have advantages over frequentist methods with small samples, these advantages are not acquired automatically. This talk discusses how typical applications of Bayesian methods in empirical studies are not sufficient to effectively capitalize on small-sample advantages and can actually exacerbate small-sample issues known to affect frequentist methods. The relevance of small-sample methods for emerging methodological developments is discussed. Growth models and multilevel mediation are shown as examples.

## Gongjun Xu; University of Michigan

### Identifying Latent Structures in Restricted Latent Class Models

#### March 22, 2017

This talk focuses on a family of restricted latent structure models with wide applications in psychological and educational assessments, where the model parameters are restricted via a latent structure matrix to reflect pre-specified assumptions on the latent attributes. Such a latent structure matrix is often provided by experts and assumed to be correct upon construction, yet it may be subjective and misspecified. Recognizing this problem, researchers have been developing methods to estimate the structure matrix from data. However, the fundamental issue of the identifiability of the structure matrix has not been addressed until now. In this work, we first introduce identifiability conditions that ensure the estimability of the structure matrix. The results provide theoretical justification for the existing estimation methods as well as a guideline for the related experimental designs. With the theoretical development, we further propose an information-based model selection method to estimate the latent structure. Simulation studies and data analysis are also presented to examine the performance of the proposed method.

## Andrea B. Troxel; NYU School of Medicine

### Statistical Issues in Pragmatic Behavioral Clinical Trials

#### March 29, 2017

Randomized clinical trials provide gold-standard evidence for the effectiveness of interventions. Explanatory trials, aimed at providing information on mechanisms of action, often involve highly selected populations. Pragmatic trials, aimed at generating information to enable decision-making, attempt to enroll a broader group of participants. We propose two innovations in pragmatic clinical trial design. The use of an opt-out framework for consent can increase the enrolled fraction of the target population, but brings with it new design challenges. Rapid-cycle adaptive testing offers a way to enhance the speed of evidence production while preserving the crucial benefits of randomization. We provide several examples of trials of behavioral interventions in which these ideas are implemented.

## Michael Lavine; University of Massachusetts

### WHIM: Function Approximation WHere It Matters

#### April 5, 2017

If f is a function of interest, typically either a likelihood or posterior density function on a parameter space Θ, and θ̂ is the MLE (maximum likelihood estimate) or MAP (maximum a posteriori estimate), it is often of interest to find the region of Θ where f(θ) or log f(θ) is nearly as large as f(θ̂) or log f(θ̂). Typical tools for working with f, such as optimizers and MCMC, can fail when f is multimodal or has a plateau. This paper describes an algorithm called WHIM — for function approximation WHere It Matters — that finds the region of Θ where f is large and that is guaranteed not to fail for f’s arising from the large class of models described here, even when those f’s are multimodal or have plateaus.

WHIM was introduced in Lavine, Bray, and Hodges (2015), where the focus was on linear mixed models with exactly two variances. Here, the focus is on the algorithm and what features of f are needed to make it work. We show how those features arise in several classes of statistical models. Understanding the features and the logic behind them will enable other researchers to determine whether they exist and how to make WHIM work for other classes of functions. In addition, our examples show what can be learned by finding the region where f(θ) is large, rather than finding just θ̂.

## Donghui Zhang; Sanofi, U.S.

### Statistical Collaborations in Drug Discovery – Challenges and Opportunities

#### April 12, 2017

New drug discovery is a lengthy, expensive and challenging process, from target identification to clinical trials in patients. In this talk, I will provide a high-level overview of the various steps in the drug discovery process. Then, I will go into some details of these steps, including their scientific objectives, the scientific approaches used, and the statistical issues involved. Most importantly, I will illustrate how statisticians can contribute significantly throughout the whole process.

Bio: Dr. Donghui Zhang is the head of Translational Medicine and Preclinical Biostatistics at Sanofi's North American Hub. She has over 20 years of pharmaceutical industry experience at Merck and Sanofi. She has more than 40 publications in peer-reviewed statistical and scientific journals.

## Xiwen Ma; Sanofi, U.S.

### Some Statistical Issues in Biomarker Identification for Clinical Study

#### April 12, 2017

For pharmaceutical interventions, it is well known that a one-size-fits-all strategy is hardly applicable to most common and rare diseases. With recent advances in biological science, personalized medicine and biomarker identification have become a very hot topic in the pharmaceutical industry. In this talk, we present several statistical challenges in biomarker research for clinical studies. Novel statistical methods were developed to address these issues.

Bio: Dr. Xiwen Ma graduated from the Department of Statistics at the University of Wisconsin-Madison. He joined Sanofi in March 2015. He has worked in the pharmaceutical industry for more than 6 years in both pre-clinical and clinical statistics. He has extensive experience in different therapeutic areas including oncology, neuroscience and rare diseases. His primary research interests are personalized medicine and biomarker identification.

## Gen Li; Columbia University

### A General Framework for the Association Analysis of Heterogeneous Data

#### April 26, 2017

Multivariate association analysis is of primary interest in many applications. Despite the prevalence of high-dimensional and non-Gaussian data (such as count-valued or binary), most existing methods only apply to low-dimensional datasets with continuous measurements. Motivated by the Computer Audition Lab 500-song (CAL500) music annotation study, we develop a new framework for the association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data. We model heterogeneous random variables using exponential family distributions, and exploit a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two datasets. We also introduce a new measure of the strength of association, and a permutation-based procedure to test its significance. An alternating iteratively reweighted least squares algorithm is devised for model fitting, and several variants are developed to expedite computation and achieve variable selection. The application to the CAL500 data sheds light on the relationship between acoustic features and semantic annotations, and provides an effective means for automatic annotation and music retrieval.