Spring 2018

All colloquia will be held at 4pm in AUST 108, unless otherwise noted. Coffee will be served at 3:30pm in room 326.


Wednesday, January 17 No colloquium
Wednesday, January 24 Jessica Cisewski, Yale University A preferential attachment model for the stellar initial mass function via approximate Bayesian computation AUST 108
Wednesday, January 31 Michael Jordan, UC Berkeley

NESS Colloquium; sponsored jointly by NESS and the UCONN departments of Statistics and CSE

On Computational Thinking, Inferential Thinking and Data Science MONT 104
Wednesday, February 7 Lucas Janson, Harvard University

Postponed to March 28

Using Knockoffs to find important variables with statistical guarantees AUST 108
Wednesday, February 14 Fei Miao, University of Connecticut Data-Driven Dynamic Robust Resource Allocation for Efficient Transportation AUST 108
Wednesday, February 21 Fei Wang, Cornell University Is Your Data Cheating You? Towards Explainable AI in Medicine with Knowledge Empowerment AUST 108
Wednesday, February 28 Kelly Zou, Pfizer Real-World Evidence in the Era of Big Data AUST 108
Wednesday, March 7 Nalini Ravishanker, University of Connecticut Modeling Financial Durations using Penalized Estimating Functions AUST 108
Wednesday, March 21 Mengyang Gu, Johns Hopkins University Robust calibration, emulation and scalable computation for imperfect mathematical models with massive output AUST 108
Wednesday, March 28 Lucas Janson, Harvard University Using Knockoffs to find important variables with statistical guarantees AUST 108
Wednesday, April 4 Paul Albert, NIH/NCI

Robert W. Makuch Distinguished Lecture in Biostatistics

Innovative Approaches to the Statistical Analysis of Circadian Rhythm Data: Uncovering the Patterns of Life AUST 108
Wednesday, April 11 Reneé Moore, Emory University

NESS Colloquium; sponsored jointly by NESS and the UCONN Department of Statistics

The Roles of Statisticians in Clinical Trials: Here, There, Everywhere ITE C80

Coffee at 3:15 in AUST 326

Wednesday, April 18 Brian Hobbs, Cleveland Clinic Trial design in the presence of non-exchangeable subpopulations AUST 108
Friday, April 20 Tyler Kleykamp, Chief Data Officer, State of Connecticut The Emerging Role of Data in Government 11 am in AUST 163

Coffee at 10:30 in AUST 326

Wednesday, April 25 James Z. Wang, Pennsylvania State University

Sponsored jointly by the CSE and Statistics Departments

Modeling Aesthetics and Emotions in Visual Content – From Vincent van Gogh to Robotics and Vision 10:30 in Homer Babbidge Library – Class of 1947 Room
Wednesday, April 25 Vince Lyzinski, University of Massachusetts

UCONN/UMASS Joint Colloquium

Information Recovery in Shuffled Graphs via Graph Matching AUST 108

Pizza after colloquium, AUST 326

Colloquium is organized by Professor Xiaojing Wang.

Jessi Cisewski-Kehe; Yale University

A preferential attachment model for the stellar initial mass function via approximate Bayesian computation

January 24, 2018

Explicitly specifying a likelihood function is becoming increasingly difficult for many problems in astronomy.  Astronomers often specify a simpler approximate likelihood – leaving out important aspects of a more realistic model.  Estimation of a stellar initial mass function (IMF) is one such example.  The stellar IMF is the mass distribution of stars initially formed in a particular volume of space, but is typically not directly observable due to stellar evolution and other disruptions of a cluster. Several difficulties associated with specifying a realistic likelihood function for the stellar IMF will be addressed in this talk.

Approximate Bayesian computation (ABC) provides a framework for performing inference in cases where the likelihood is not available.  I will introduce ABC, and demonstrate its merit through a simplified IMF model where a likelihood function is specified and exact posteriors are available.  To aid in capturing the dependence structure of the data, a new formation model for stellar clusters using a preferential attachment framework will be presented.  The proposed formation model, along with ABC, provides a new mode of analysis of the IMF.
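The rejection-ABC idea described above can be sketched on a toy problem. The following is a minimal illustration only, not the IMF or preferential-attachment model from the talk: it infers the mean of a Gaussian (where the exact posterior is known) by keeping prior draws whose simulated summary statistic lands close to the observed one. The tolerance, summary statistic, and prior are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observed" data: in the talk this would be observed stellar masses;
# here we use a plain normal sample so the answer is known in advance.
observed = rng.normal(loc=2.0, scale=1.0, size=200)
obs_summary = observed.mean()

def simulate(mu, rng, n=200):
    """Forward model: draw a synthetic data set given parameter mu."""
    return rng.normal(loc=mu, scale=1.0, size=n)

# Rejection ABC: accept proposals whose simulated summary is within
# a tolerance epsilon of the observed summary.
accepted = []
for _ in range(20000):
    mu = rng.uniform(-5.0, 5.0)                # draw from the prior
    sim_summary = simulate(mu, rng).mean()     # summarize simulated data
    if abs(sim_summary - obs_summary) < 0.05:  # tolerance epsilon
        accepted.append(mu)

posterior_mean = np.mean(accepted)             # approximate posterior mean
```

The accepted draws approximate the posterior without the likelihood ever being evaluated; shrinking the tolerance (at the cost of more simulations) tightens the approximation.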

Michael I. Jordan; UC Berkeley

On Computational Thinking, Inferential Thinking and Data Science

January 31, 2018

The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in Data Science is apparent from their sharply divergent nature at an elementary level—in computer science, the growth of the number of data points is a source of “complexity” that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of “simplicity” in that inferences are generally stronger and asymptotic results can be invoked.  On a formal level, the gap is made evident by the lack of a role for computational concepts such as “runtime” in core statistical theory and the lack of a role for statistical concepts such as “risk” in core computational theory.  I present several research vignettes aimed at bridging computation and statistics, including the problem of inference under privacy and communication constraints, and including a surprising cameo role for symplectic geometry.


Michael I. Jordan is the Pehong Chen Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley.

His research interests bridge the computational, statistical, cognitive and biological sciences, and have focused in recent years on Bayesian nonparametric analysis, probabilistic graphical models, spectral methods, kernel machines and applications to problems in distributed computing systems, natural language processing, signal processing and statistical genetics. Prof. Jordan is a member of the National Academy of Sciences, a member of the National Academy of Engineering and a member of the American Academy of Arts and Sciences. He is a Fellow of the American Association for the Advancement of Science. He has been named a Neyman Lecturer and a Medallion Lecturer by the Institute of Mathematical Statistics. He received the IJCAI Research Excellence Award in 2016, the David E. Rumelhart Prize in 2015 and the ACM/AAAI Allen Newell Award in 2009. He is a Fellow of the AAAI, ACM, ASA, CSS, IEEE, IMS, ISBA and SIAM.

Lucas Janson; Harvard University

Using Knockoffs to find important variables with statistical guarantees

March 28, 2018 (postponed from February 7, 2018)

Many contemporary large-scale applications, from genomics to advertising, involve linking a response of interest to a large set of potential explanatory variables in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively select important variables while controlling the fraction of false discoveries, even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, we propose a new framework of model-X knockoffs, which reads from a different perspective the knockoff procedure (Barber and Candès, 2015) originally designed for controlling the false discovery rate in linear models. Model-X knockoffs can deal with arbitrary (and unknown) conditional models and any dimensions, including when the number of explanatory variables p exceeds the sample size n. Our approach requires the design matrix be random (independent and identically distributed rows) with a known distribution for the explanatory variables, although we show preliminary evidence that our procedure is robust to unknown/estimated distributions. As we require no knowledge/assumptions about the conditional distribution of the response, we effectively shift the burden of knowledge from the response to the explanatory variables, in contrast to the canonical model-based approach which assumes a parametric model for the response but very little about the explanatory variables. To our knowledge, no other procedure solves the controlled variable selection problem in such generality, but in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
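A minimal numerical sketch of the knockoff filter, under strong simplifying assumptions not made in the talk: independent standard-normal features with known distribution (so an independent fresh draw is a valid model-X knockoff copy) and marginal-correlation feature statistics rather than the lasso-based statistics typically recommended.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 500, 50, 0.2

# Independent Gaussian design: because the feature distribution is a known
# product measure, an independent fresh draw is a valid knockoff copy.
X = rng.standard_normal((n, p))
X_knock = rng.standard_normal((n, p))

beta = np.zeros(p)
beta[:10] = 1.0                          # first 10 variables truly matter
y = X @ beta + rng.standard_normal(n)

# Feature statistics: original-vs-knockoff marginal correlation difference.
W = np.abs(X.T @ y) - np.abs(X_knock.T @ y)

# Knockoff+ threshold for target FDR level q.
tau = np.inf
for t in np.sort(np.abs(W[W != 0])):
    fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
    if fdp_hat <= q:
        tau = t
        break
selected = np.where(W >= tau)[0]         # indices of selected variables
```

Large positive W is evidence a variable matters; the threshold uses the symmetry of null W statistics to bound the estimated false discovery proportion.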

Fei Miao; University of Connecticut

Data-Driven Dynamic Robust Resource Allocation for Efficient Transportation

February 14, 2018

Ubiquitous sensing in smart cities enables large-scale multi-source data to be collected in real time, which poses several challenges and requires a paradigm shift to capture the complexity and dynamics of these systems. Data-driven cyber-physical systems (CPSs) integrating machine learning, optimization, and control are highly desirable for this paradigm shift, since existing model-based techniques for CPSs have become inadequate. For instance, how to identify and analyze the dynamical interplay between urban-scale phenomena (such as mobility demand and supply) from data, and how to take actions to improve system-level service efficiency, remain challenging problems in transportation systems. In this talk, we present a data-driven dynamic robust resource allocation framework to match supply to spatial-temporally uncertain demand while seeking to reduce total resource allocation cost. First, we present a receding horizon control framework that incorporates large-scale historical and real-time sensing data in demand prediction and dispatch decisions under practical constraints. However, demand prediction error is not negligible and affects the system’s performance. Therefore, with spatial-temporal demand uncertainty models constructed from data, we then develop two computationally tractable robust resource allocation methods that provide probabilistic guarantees for the system’s worst-case and expected performances. As a case study, we evaluate the proposed framework using real taxi operational data. Lastly, I will provide an overview of my research that uses knowledge of system dynamics to guarantee security and resiliency properties of CPSs. I will introduce my work on coding schemes for detecting stealthy data injection attacks, and on stochastic game schemes for resilient control strategies.

Fei Wang; Weill Cornell Medical College

Is Your Data Cheating You? Towards Explainable AI in Medicine with Knowledge Empowerment

February 21, 2018

With the arrival of the big data era, more and more data in different real-world applications are becoming readily available. Artificial Intelligence (AI), which aims to provide computers with the capability of learning from data like humans, is becoming ubiquitous. Many sophisticated AI models, such as deep learning, have become very popular. However, the success of these methods usually requires a huge amount of data, while in medicine it is usually costly or even impossible to collect data at such a scale. Therefore only limited data samples are available, in which case existing AI methodologies can easily overfit and thus be misled by the data. In this talk, I will present some of the research from my lab on how to armor AI algorithms with domain knowledge so that they more effectively discover genuine insights from data. Specifically, I will talk about how to enhance data-driven algorithms with knowledge, along with concrete examples of knowledge acquisition and integration. I will also present examples of how these methods can be used in real-world medical problems.

Kelly Zou; Pfizer Inc

Real-World Evidence in the Era of Big Data

February 28, 2018

Given the desire to enhance the effectiveness and efficiency of health care systems, it is important to understand and evaluate the risk factors for disease progression, treatment patterns such as medication use, and utilizations such as hospitalization. Statistical analyses via observational studies and data mining may help evaluate patients’ diagnostic and prognostic outcomes, as well as inform policies to improve patient outcomes and control costs. In the era of big data, real-world longitudinal patient-level databases containing the insurance claims of commercially insured adults, electronic health records, or cross-sectional surveys provide useful inputs for such analyses. Within the healthcare industry, rapid queries to inform development and commercialization strategies, as well as pre-specified non-interventional observational studies, are commonly performed. In addition, pragmatic studies are increasingly being conducted to examine health-related outcomes. In this presentation, selected published examples of real-world data analyses are illustrated. Results typically suggest that paying attention to patient comorbidities and pre-index or at-index health care service utilization may help identify patients at higher risk and with unmet treatment needs. Finally, fruitful collaborative opportunities exist across different sectors among academia, industry and government.

Nalini Ravishanker; University of Connecticut

Modeling Financial Durations using Penalized Estimating Functions

March 7, 2018

Accurate modeling of patterns in inter-event durations is of considerable interest in high-frequency financial data analysis. The class of logarithmic autoregressive conditional duration (Log ACD) models provides a rich framework for analyzing durations, and recent research is focused on developing fast and accurate methods for fitting these models to long time series of durations under least restrictive assumptions. This talk describes a semi-parametric modeling approach using Godambe-Durbin martingale estimating functions. This approach has wide applicability to several classes of linear and nonlinear time series. It only requires assumptions on the first few conditional moments of the process and does not require specification of its probability distribution. We discuss three approaches for parameter estimation: solution of nonlinear estimating equations, recursive formulas for the vector-valued parameter estimates, and iterated component-wise scalar recursions. Effective starting values from an approximating time series model increase the accuracy of the final estimates. We illustrate our approach via a simulation study and a real data example based on high-frequency transaction level data on several stocks. We may use this approach for structural break detection in a retrospective and an online way.

Mengyang Gu; Johns Hopkins University

Robust calibration, emulation and scalable computation for imperfect mathematical models with massive output

March 21, 2018

We focus on the problem of calibrating imperfect mathematical models using experimental data. To compensate for the misspecification of the mathematical model, a discrepancy function is usually included and modeled via a Gaussian stochastic process (GaSP), leading to better results of prediction. The calibrated mathematical model itself, however, sometimes fits the experimental data poorly, as the calibration parameters become unidentifiable. In this work, we propose the scaled Gaussian stochastic process (S-GaSP), a novel stochastic process for calibration and prediction. This new approach bridges the gap between two predominant methods, namely the L2 calibration and GaSP calibration. A computationally feasible approach is introduced for this new model under the Bayesian paradigm. New robust and computationally efficient statistical models will also be discussed for emulating computationally expensive mathematical models with massive output. The spatio-temporal outputs from TITAN2D, a computer model that simulates volcanic eruption, and the Interferometric synthetic aperture radar (InSAR) data will be used to demonstrate the performance of the proposed statistical methods for emulation and calibration.
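As a toy illustration of the emulation idea, the sketch below fits a noise-free Gaussian-process emulator, in plain NumPy with a squared-exponential kernel, to a hypothetical cheap stand-in for an expensive simulator. The talk's methods handle massive multivariate output and robust calibration; none of that is attempted here.

```python
import numpy as np

# Cheap stand-in for an expensive simulator (purely illustrative).
def simulator(x):
    return np.sin(3 * x) + 0.5 * x

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

# Train the emulator on a handful of simulator runs.
x_train = np.linspace(0.0, 2.0, 8)
y_train = simulator(x_train)

nugget = 1e-8                            # jitter for numerical stability
K = rbf(x_train, x_train) + nugget * np.eye(len(x_train))
alpha = np.linalg.solve(K, y_train)      # GP posterior-mean weights

# Predict at new inputs without re-running the simulator.
x_new = np.array([0.3, 1.1, 1.7])
y_pred = rbf(x_new, x_train) @ alpha
```

After eight simulator runs, predictions at new inputs cost only a kernel evaluation and a dot product, which is the computational point of emulation.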

Paul S. Albert; National Cancer Institute

Innovative Approaches to the Statistical Analysis of Circadian Rhythm Data: Uncovering the Patterns of Life

April 4, 2018

Circadian rhythms are endogenous biological processes that repeat at an approximately 24-hour period. The importance of these processes for understanding disease is increasingly recognized. In 2017, for example, the Nobel Prize in Physiology or Medicine was awarded for discoveries of the molecular mechanisms controlling these rhythms. This talk will focus on our recent work on the statistical modeling of longitudinally collected circadian rhythm data. I will begin with a discussion of a shape invariant model for Gaussian data that can easily be fit with standard software (Albert and Hunsberger, Biometrics, 2005). This model was subsequently extended for modeling longitudinal count data (Ogbagaber et al., Journal of Circadian Rhythms, 2012). More recently we developed a statistical model for assessing the degree of disturbance or irregularity in a circadian pattern for count sequences that are observed over time in a population of individuals (Kim and Albert, Journal of the American Statistical Association, in press). We develop a latent variable Poisson modeling approach with both circadian and stochastic short-term trend (autoregressive latent process) components that allow for individual variation in the degree of each component. A parameterization is proposed for modeling covariate dependence on the proportion of these two model components across individuals. In addition, we incorporate covariate dependence in the overall mean, the magnitude of the trend, and the phase-shift of the circadian pattern. Innovative Markov chain Monte Carlo sampling is used to carry out Bayesian posterior computation. Several variations of the proposed models are considered and compared using the deviance information criterion. We illustrate this methodology with physical activity count data measured in a longitudinal cohort of adolescents.
Lastly, I will describe our recent methodological work focusing on examining the circadian rhythms of metabolites in a controlled environment. A majority of this work is joint with Dr. Sungduk Kim at the NCI.

Reneé H. Moore; Emory University

The Roles of Statisticians in Clinical Trials: Here, There, Everywhere

April 11, 2018

Most are familiar with the fact that statisticians develop new statistical methodology that can be applied to real projects via data analysis. However, statisticians contribute much more to research, and in particular help maintain the highest level of scientific evidence, the clinical trial. Dr. Reneé Moore has collaborated in clinical trials research in multiple ways: as study biostatistician for single-site trials, as study statistician for one site in a multi-site clinical trial, as the lead statistician in the data coordinating center (DCC) of a multi-site clinical trial, as a member of the data safety and monitoring board (DSMB) for several clinical trials, and of course as the statistician who analyzes data from a completed clinical trial. In this talk, Dr. Moore will describe the multi-faceted contributions that statisticians make in clinical trials by highlighting her experience leading a DCC (Marcus, Moore, et al. NEJM 2013). From clinical trial design, to performing power analyses and ensuring the trial has rigor and reproducibility, to being responsible for the conduct of the trial, to presenting the progress of recruitment, data quality, and interim analyses of ongoing trials, to dissemination of results – the best clinical trials research is performed with statisticians present here, there, and everywhere.

Brian P. Hobbs; Cleveland Clinic

Trial design in the presence of non-exchangeable subpopulations

April 18, 2018

Advances in biology and immunology continue to refine our understanding of cancer pathogenesis, elucidating potential mechanisms of tumor-cell growth, survival, angiogenesis and the systematic suppression of cancer immunity. With FDA approval bestowed on less than 6% of oncology drugs entering human testing between 2006 and 2015, however, the process for translating advances in preclinical knowledge into effective cancer treatment strategies has had limited success historically. Basket trials comprise a class of experimental designs that endeavor to test the effectiveness of a therapeutic strategy among patients defined by the presence of a particular biomarker target (often a molecular feature) rather than a particular cancer type. Acknowledging the potential for differential effectiveness on the basis of traditional criteria for cancer subtyping, evaluations of treatment effectiveness are conducted with respect to the “baskets,” which collectively represent a partition of the targeted patient population into discrete subtypes. Yet designs of early basket trials have been criticized for their reliance on basketwise analysis strategies, which suffered from limited power in the presence of imbalanced enrollment and failed to convey to the clinical community evidentiary measures for consistent effectiveness among the studied clinical subtypes. This presentation is intended to elucidate issues that limit the effectiveness of traditional and “response-adaptive” designs used in precision medicine contexts in oncology. Additionally, I will present a novel class of Bayesian sequential designs based on multi-source exchangeability modeling. The methodology is demonstrated with both analysis and permutation studies based on data reported from a recent basket trial designed to estimate the effectiveness of vemurafenib in BRAF-mutant non-melanoma cancers across six clinical sites.

Tyler Kleykamp; Chief Data Officer, State of Connecticut

The Emerging Role of Data in Government

April 20, 2018

Governments collect large amounts of data in order to manage a variety of public programs. Until recently, however, there has been little effort to leverage this data beyond administering the program or service for which it is collected. Governments at all levels, both in the U.S. and internationally, are in the early stages of leveraging data as a strategic asset to address pressing policy issues, improve efficiency, enhance service delivery, and ultimately improve people’s quality of life. Tyler will discuss some of the broad trends across government in the use of data, with an emphasis on open data. This will include an overview of the Connecticut Open Data Portal and some specific examples of how open data is being used both by the public and by state agencies. In addition, Tyler will address some of the challenges governments face in using data, and how the public can benefit from increased access to open government data.

James Z. Wang; Pennsylvania State University

Modeling Aesthetics and Emotions in Visual Content – From Vincent van Gogh to Robotics and Vision

April 25, 2018

The talk will focus on recent efforts in my research group on computerized analysis of oil paintings and digital visual content using a data-driven approach. Art historians have long observed the highly characteristic brushstroke styles of Vincent van Gogh and have relied on discerning these styles for authenticating and dating his works. In our work, we compared van Gogh with his contemporaries by statistically analyzing a massive set of automatically extracted brushstrokes. A novel extraction method is developed by exploiting an integration of edge detection and clustering-based segmentation. Evidence substantiates that van Gogh's brushstrokes are strongly rhythmic. The talk will also describe an effort to model the aesthetic and emotional characteristics in visual content such as photographs. By taking a data-driven approach, using the Internet as the data source, we show that computers can be trained to recognize various characteristics that are highly relevant to aesthetics and emotions. Future computer systems equipped with such capabilities are expected to help millions of users in unimagined ways. Finally, the talk will briefly cover a core statistical learning method we have developed that has broad potential applications.

Vince Lyzinski; University of Massachusetts, Amherst

Information Recovery in Shuffled Graphs via Graph Matching

April 25, 2018

While many multiple graph inference methodologies operate under the implicit assumption that an explicit vertex correspondence is known across the vertex sets of the graphs, in practice these correspondences may only be partially or errorfully known. Herein, we provide an information theoretic foundation for understanding the practical impact that errorfully observed vertex correspondences can have on subsequent inference, and the capacity of graph matching methods to recover the lost vertex alignment and inferential performance. Working in the correlated stochastic blockmodel setting, we establish a duality between the loss of mutual information due to an errorfully observed vertex correspondence and the ability of graph matching algorithms to recover the true correspondence across graphs. In the process, we establish a phase transition for graph matchability in terms of the correlation across graphs, and we conjecture the analogous phase transition for the relative information loss due to shuffling vertex labels. We lastly demonstrate the practical effect that graph shuffling—and matching—can have on subsequent inference, with examples from two-sample graph hypothesis testing and joint spectral graph clustering.
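The shuffled-graph recovery problem above can be sketched numerically with SciPy's FAQ graph-matching solver. This is an illustrative toy only: an Erdős–Rényi graph rather than a stochastic blockmodel, an exactly isomorphic (rather than correlated) copy, and a few "seed" correspondences supplied to help the heuristic, which is an assumption of this sketch rather than part of the talk's setting.

```python
import numpy as np
from scipy.optimize import quadratic_assignment

rng = np.random.default_rng(3)
n = 30

# An Erdos-Renyi graph A and an isomorphic copy whose labels were shuffled.
A = np.triu((rng.random((n, n)) < 0.3).astype(float), 1)
A = A + A.T
perm = rng.permutation(n)
B = A[np.ix_(perm, perm)]            # B[i, j] = A[perm[i], perm[j]]

# True correspondence: vertex i of A is vertex inv[i] of B.
inv = np.argsort(perm)

# FAQ graph matching, helped by 5 known "seed" correspondences.
seeds = np.column_stack((np.arange(5), inv[:5]))
res = quadratic_assignment(
    A, B, options={"maximize": True, "partial_match": seeds}
)
frac_recovered = np.mean(res.col_ind == inv)   # fraction of labels recovered
```

With edge noise added to B (the correlated regime studied in the talk), recovery degrades as the correlation drops, which is exactly the phase-transition phenomenon the abstract describes.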