All colloquia will be held at 4pm in AUST 108, unless otherwise noted. Coffee will be served at 3:30pm in room 326.
Information about past colloquia is available here.
|Wednesday, January 23||Youngdeok Hwang, Sungkyunkwan University||Statistical Estimation of Air Pollution Through Integration of Physical Knowledge||4PM in BPB 130; coffee at 3:30 in AUST 326|
|Friday, January 25||Yao Zheng, Purdue University||Finite Time Analysis of Vector Autoregressive Models under Linear Restrictions||11AM in ROWE 122; coffee at 10:30 in AUST 326|
|Monday, January 28||Wen Zhou, Colorado State University||Estimation and Inference of Heteroskedasticity Models with Latent Semiparametric Factors for Multivariate Time Series||11AM in ROWE 122; coffee at 10:30 in AUST 326|
|Monday, February 4||Nicholas Henderson, Johns Hopkins University||Estimating heterogeneous treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models||11AM in ROWE 122|
|Wednesday, February 13||Jun Yan, University of Connecticut||Generalized scale-change models for recurrent event processes under informative censoring||4PM in AUST 108|
|Wednesday, February 20||Joseph Cappelleri, Pfizer||Advancing Interpretation of Patient-Reported Outcomes||4PM in AUST 108|
|Wednesday, February 27||Stephanie Hicks, Johns Hopkins University||Making data science accessible world-wide in the Johns Hopkins Data Science Lab||4PM in AUST 108|
|Wednesday, March 6||Erin Conlon, University of Massachusetts||Parallel Markov chain Monte Carlo for Bayesian hierarchical models with big data, in two stages||4PM in AUST 108|
|Wednesday, March 13||Victoria Pena, CUNY||–||4PM in AUST 108|
|Wednesday, March 27||Linglong Kong, University of Alberta||–||4PM in AUST 108|
|Wednesday, April 3||Donald Berry, MD Anderson||–||4PM in CHEM A203|
|Wednesday, April 10||Patrick Flaherty, University of Massachusetts||–||4PM in AUST 108|
|Friday, April 12||Debanjan Bhattacharjee, Utah Valley University||–||11AM, location TBD|
|Wednesday, April 17||Bhramar Mukherjee, University of Michigan||–||4PM in AUST 108|
|Wednesday, April 24||Dipankar Bandyopadhyay, Virginia Commonwealth University||–||4PM in AUST 108|
|Wednesday, May 1||Karthik Bharath, Nottingham University||–||4PM in AUST 108|
Colloquium is organized by Professor Yuwen Gu.
Youngdeok Hwang; Sungkyunkwan University
Statistical Estimation of Air Pollution Through Integration of Physical Knowledge
January 23, 2019
Air pollution is driven by non-local dynamics, in which the air quality at a site is determined by the transport of pollutants from distant pollution emission sources by atmospheric processes. To understand the underlying nature of pollution generation, it is crucial to employ physical knowledge to account for the transport of pollutants by the wind. In this talk, I will discuss methods for estimating the pollution emission from the area of interest through the use of physical knowledge and observed data. The proposed methods use a statistical approach to utilize large-scale data from a numerical weather prediction model, while integrating the dynamics of the physical processes into the model. Also discussed are some extensions and related problems.
Yao Zheng; Purdue University
Finite Time Analysis of Vector Autoregressive Models under Linear Restrictions
January 25, 2019
We develop a unified finite-time theory for the OLS estimation of possibly unstable and even slightly explosive VAR models under linear restrictions, with the applicable region $\rho(A) \leq 1 + c/T$, where $\rho(A)$ is the spectral radius of the transition matrix $A$ in the VAR(1) representation, $T$ is the time horizon, and $c > 0$ is a universal constant. This linear restriction framework encompasses various existing models in the literature, such as banded/network VAR models. We show that the restrictions reduce the error bounds through not only the reduced dimensionality but also a scale factor that resembles the asymptotic covariance matrix of the estimator in the fixed dimensional setup; as long as the model is correctly specified, this scale factor is decreasing in the number of restrictions. Our analysis reveals that the phase transition between the slow and fast error rate regimes is determined by the smallest singular value of $A$, a measure of the least excitable mode of the system. The minimax lower bounds are also derived across different regimes. The developed finite-time theory not only bridges the theoretical gap between stable and unstable regimes but also precisely characterizes the effect of the restrictions and its interplay with other model parameters. Simulations support our theoretical results in both small and large samples.
Wen Zhou; Colorado State University
Estimation and Inference of Heteroskedasticity Models with Latent Semiparametric Factors for Multivariate Time Series
January 28, 2019
This paper considers estimation and inference of a flexible heteroskedasticity model for multivariate time series, which employs semiparametric latent factors to simultaneously account for the heteroskedasticity and contemporaneous correlations. Specifically, the heteroskedasticity is modeled by the product of unobserved stationary processes of factors and subject-specific covariate effects. Serving as the loadings, the covariate effects are further modeled through additive models. We propose a two-step procedure for estimation. First, the latent processes of factors and their nonparametric loadings are estimated via projection-based methods. The estimation of regression coefficients is further conducted through generalized least squares. Theoretical validity of the two-step procedure is documented. By carefully examining the convergence rates for estimating the latent processes of factors and their loadings, we further study the asymptotic properties of the estimated regression coefficients. In particular, we establish the asymptotic normality of the proposed two-step estimates of regression coefficients. The proposed regression coefficient estimator is also shown to be asymptotically efficient. This leads us to a more efficient confidence set of the regression coefficients. Using a comprehensive simulation study, we demonstrate the finite sample performance of the proposed procedure, and numerical results corroborate our theoretical findings. Finally, we illustrate the use of our proposal through application to the air quality data.
Nicholas Henderson; Johns Hopkins University
Estimating heterogeneous treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models
February 4, 2019
Individuals often respond differently to identical treatments, and characterizing such variability in treatment response is an important aim in the practice of personalized medicine. In this article, we describe a nonparametric accelerated failure time model that can be used to analyze heterogeneous treatment effects (HTE) when patient outcomes are time-to-event. By utilizing Bayesian additive regression trees and a mean-constrained Dirichlet process mixture model, our approach offers a flexible model for the regression function while placing few restrictions on the baseline hazard. Our nonparametric method leads to natural estimates of individual treatment effect and has the flexibility to address many major goals of HTE assessment. Moreover, our method requires little user input in terms of model specification for treatment covariate interactions or for tuning parameter selection. Our procedure shows strong predictive performance while also exhibiting good frequentist properties in terms of parameter coverage and mitigation of spurious findings of HTE. We illustrate the merits of our proposed approach with a detailed analysis of two large clinical trials for the prevention and treatment of congestive heart failure using an angiotensin-converting enzyme inhibitor. The analysis revealed considerable evidence for the presence of HTE in both trials as demonstrated by substantial estimated variation in treatment effect and by high proportions of patients exhibiting strong evidence of having treatment effects which differ from the overall treatment effect.
Jun Yan; University of Connecticut
Generalized scale-change models for recurrent event processes under informative censoring
February 13, 2019
Two major challenges arise in regression analyses of recurrent event data: first, popular existing models, such as the Cox-type models, may not fully capture the covariate effects on the underlying recurrent event process; second, the censoring time remains informative about the risk of experiencing recurrent events after accounting for covariates. We tackle both challenges by a general class of semiparametric scale-change models that allow a scale-change covariate effect as well as a multiplicative covariate effect. The proposed model is flexible and nests several existing models, including the popular proportional rates model, the accelerated mean model, and the accelerated rate model. Moreover, it accommodates informative censoring through subject-level latent frailty whose distribution is left unspecified. A robust approach is proposed to estimate the model parameters, which does not need a parametric assumption on the distribution of the frailty and the recurrent event process. The asymptotic properties of the resulting estimator are established, with the asymptotic variance estimated from a novel resampling approach. As a byproduct, the structure of the model provides a model selection approach among the submodels via hypothesis testing of model parameters. Numerical studies show that the proposed estimator and the model selection procedure perform well under both noninformative and informative censoring scenarios. The methods are applied to data from two transplant cohorts to study the risk of infections after transplantation.
Joseph Cappelleri; Pfizer
Advancing Interpretation of Patient-Reported Outcomes
February 20, 2019
A patient-reported outcome is any report on the status of a patient’s health condition that comes directly from the patient. Clear and meaningful interpretation of patient-reported outcome scores is fundamental to their use, as the scores can be valuable in designing studies, evaluating interventions, educating consumers, and informing health policy makers involved with regulatory, reimbursement, and advisory agencies. Interpretation of patient-reported outcome scores, however, is often not well understood because of insufficient data or a lack of experience or clinical understanding to draw from.
This presentation provides an updated review of two broad approaches, anchor-based and distribution-based, aimed at enriching the understanding and meaning of patient-reported outcome scores. Anchor-based approaches use a measure (external to the targeted patient-reported outcome of interest) that is readily interpretable and correlated with the targeted patient-reported outcome. Examples include percentages based on thresholds, criterion-group interpretation, content-based interpretation, and clinically important difference. Distribution-based approaches rely strictly on the distribution of the data. Examples include effect size, probability of relative benefit, and cumulative distribution functions. Applications are based on real-life and simulated examples.
Stephanie Hicks; Johns Hopkins University
Making data science accessible world-wide in the Johns Hopkins Data Science Lab
February 27, 2019
In this talk, I will introduce the Johns Hopkins Data Science Lab: who we are, what our goals are, and the types of projects we are working on to make data science accessible world-wide. Then, I will discuss projects that I have focused on related to data science education. Despite unprecedented and growing interest in data science on campuses, there are few courses and course materials that provide meaningful opportunities for students to learn about real-world challenges. Most courses provide unrealistically clean data sets that fit the assumptions of the methods in an unrealistic way. The result is that students are left unable to effectively analyze data and solve real-world challenges outside of the classroom. To address this problem, I build on the idea of Nolan and Speed (1999), who argued that the solution is to teach courses through in-depth case studies derived from interesting scientific questions with nontrivial solutions that leave room for different analyses of the data. I will share a set of general principles and offer a detailed guide derived from my successful experience developing and teaching graduate-level, introductory data science courses centered entirely on case studies. Furthermore, I will present Open Case Studies, an educational resource of case studies that educators can use in the classroom to teach students how to effectively derive knowledge from data derived from real-world challenges.
Erin Conlon; University of Massachusetts
Parallel Markov chain Monte Carlo for Bayesian hierarchical models with big data, in two stages
March 6, 2019
Due to the recent growth of big data sets, new Bayesian Markov chain Monte Carlo (MCMC) parallel computing methods have been created. These methods divide large data sets by observations into subsets. However, many Bayesian hierarchical models have only a small number of parameters that are common to the full data set, with the majority of parameters being group specific. Therefore, techniques that split the full data set by groups rather than by observations are a more natural analysis approach.
Here, we adapt and extend such a two-stage Bayesian hierarchical modelling method. In stage 1, each group is evaluated independently in parallel; the stage 1 posteriors are used as proposal distributions in stage 2, where the full model is estimated. We illustrate our approach using both simulation and real data sets, with both three-level and four-level models. Our results show considerable increases in MCMC efficiency and large reductions in computation times compared to the full data analysis.