Coffee will be served at 3:30 PM in room 326 for each Wednesday colloquium. TAs are responsible for set-up according to this schedule.
Information about past colloquia is available here.
For a searchable listing of past colloquia, use our search tool.
Colloquium is organized by Professor Xiaojing Wang.
Zhengqing Ouyang; Jackson Lab
Statistical modeling of the RNA structurome from next-generation sequencing
In addition to its expression, the structure of RNA plays important yet underexplored roles in almost every step of cellular processes, including transcription, splicing, degradation, transport, localization, and translation. Emerging technologies utilizing next-generation sequencing have begun to dissect the structures of the whole transcriptome (the so-called RNA structurome). This facilitates high-throughput functional studies of RNAs, especially the thousands of noncoding RNAs. However, it remains challenging to analyze high-throughput RNA structure probing data because of various statistical issues. We present a probabilistic modeling framework for high-throughput RNA structure probing data. The framework combines a novel joint Poisson-gamma mixture model with a hidden Markov model, allowing RNA structures to be inferred at the genome scale and at single-nucleotide resolution. We demonstrate its advantages on both simulated and real datasets. It can be applied to analyze the RNA structurome of any transcriptome.
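To make the HMM layer concrete, here is a minimal sketch of decoding per-nucleotide structure states from probing read counts: two hidden states with Poisson emissions and Viterbi decoding. All counts and rates below are invented for illustration; the talk's joint Poisson-gamma mixture is not reproduced.

```python
import numpy as np
from scipy.stats import poisson

counts = np.array([0, 1, 0, 7, 9, 8, 1, 0, 6, 7])   # toy read counts
rates = np.array([0.5, 6.0])    # Poisson means: state 0 = low, state 1 = high
logA = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))   # sticky transitions
logE = poisson.logpmf(counts[:, None], rates)       # (T, 2) emission log-liks

# Viterbi decoding of the most likely state path.
T = len(counts)
V = np.zeros((T, 2))
ptr = np.zeros((T, 2), dtype=int)
V[0] = np.log([0.5, 0.5]) + logE[0]
for t in range(1, T):
    scores = V[t - 1][:, None] + logA      # scores[i, j]: from state i to j
    ptr[t] = scores.argmax(axis=0)
    V[t] = scores.max(axis=0) + logE[t]
path = [int(V[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(ptr[t, path[-1]]))
print(path[::-1])   # decoded states along the toy transcript
```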
Qiqi Deng; Boehringer Ingelheim
Story of Alpha and Beta in Clinical Trials
Alpha and beta, also referred to as type I and type II error, are critical concepts in clinical trial design and in the interpretation of trial results. In this talk, I will share with you how these concepts are applied in the modern drug development process, and why they are important. Many advanced statistical methodologies used in complicated clinical trials are essentially tackling this fundamental topic. Two types of clinical trial design, one using a futility analysis at interim and the other using a seamless phase II/III design, will be discussed to illustrate the idea.
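As a concrete illustration of how alpha and beta jointly determine trial size, here is a standard normal-approximation sample size calculation for a two-arm trial; the effect size and standard deviation are illustrative assumptions, not from the talk.

```python
from scipy.stats import norm

alpha = 0.05   # type I error (two-sided)
beta = 0.20    # type II error, i.e. power = 0.80
delta = 0.5    # assumed treatment effect (difference in means)
sigma = 1.0    # assumed common standard deviation

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(1 - beta)

# Per-arm sample size for a two-sample z-test:
# n = 2 * ((z_alpha + z_beta) * sigma / delta)^2
n_per_arm = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
print(f"n per arm: {n_per_arm:.1f}")   # about 62.8 -> 63 subjects per arm
```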
Enterprise business metric forecasting
Regularly issued forecasts of various business metrics, such as revenue, cost, and event occurrences, play a key role in tracking and managing business performance. Some forecasts are for the next several time periods, while others target specific term summaries, such as quarterly totals. In an enterprise business, processes are often organized in a hierarchical, high-dimensional cube structure based on features such as product offerings or geography. Typically, forecasts need to be issued for every cell of the cube and must satisfy certain consistency relationships across the different dimensions. Forecasting for a hierarchical structure of time series is challenging because components at different levels of the hierarchy can interact in complex ways. A large number of time-dependent covariates are usually available, although the measurement periods may vary, and missing or unstable measurements are expected in some cells. Desirable forecast properties include reasonable accuracy at various levels or slices of the cube, and stability over time periods. In this talk, we present actual enterprise forecasting problems and their challenges, and describe the development and implementation of methods for an operational solution effectively used by the business.
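A toy sketch of the consistency requirement: in a bottom-up scheme, forecasts are made for the leaf cells and then aggregated, so every slice of the cube sums correctly by construction. The cube dimensions and numbers below are invented.

```python
import numpy as np

# Hypothetical cube slice: 2 products x 3 regions, 4 future periods.
leaf_forecasts = np.random.default_rng(0).gamma(2.0, 50.0, size=(2, 3, 4))

by_product = leaf_forecasts.sum(axis=1)     # product totals per period
by_region = leaf_forecasts.sum(axis=0)      # region totals per period
grand_total = leaf_forecasts.sum(axis=(0, 1))

# Bottom-up forecasts satisfy the hierarchy constraints automatically:
assert np.allclose(by_product.sum(axis=0), grand_total)
assert np.allclose(by_region.sum(axis=0), grand_total)
```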
Estimating Standard Errors for Importance Sampling Estimators with Multiple Markov Chains
The naive importance sampling (IS) estimator, based on samples from a single importance density, can be numerically unstable. We consider multiple-distribution IS estimators, in which samples from more than one probability distribution are combined to consistently estimate means with respect to given target distributions. These generalized IS estimators are more stable than the naive IS estimator. We consider the Markov chain Monte Carlo context, where independent samples are replaced with Markov chains. If these Markov chains converge to their respective target distributions at a polynomial rate, then under two finite moment conditions, we show that a central limit theorem holds for the IS estimators. Further, we develop an easy-to-implement, consistent method to calculate valid asymptotic standard errors based on the batch means (BM) method. We also provide a BM estimator for calculating asymptotically valid standard errors of the reverse logistic regression estimator of Geyer (1994). We illustrate the method with an application to Bayesian variable selection in linear regression. In particular, the multi-chain IS estimator is used to perform empirical Bayes variable selection, and the BM estimator is used to obtain standard errors in the large p situation where current methods are not applicable.
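The batch means idea for a single chain can be sketched in a few lines; the talk's multi-chain IS version generalizes this. The batch size floor(sqrt(n)) is one common default, assumed here.

```python
import numpy as np

def batch_means_se(x):
    """Asymptotic standard error of the mean of a correlated sequence,
    via non-overlapping batch means with batch size floor(sqrt(n))."""
    n = len(x)
    b = int(np.floor(np.sqrt(n)))       # batch size
    a = n // b                          # number of batches
    means = x[: a * b].reshape(a, b).mean(axis=1)
    var_asym = b * np.var(means, ddof=1)   # estimates n * Var(sample mean)
    return np.sqrt(var_asym / n)

# Usage with an AR(1) chain as a stand-in for MCMC output:
rng = np.random.default_rng(1)
x = np.zeros(10000)
for t in range(1, len(x)):
    x[t] = 0.7 * x[t - 1] + rng.normal()
print(batch_means_se(x))   # larger than the naive iid standard error
```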
Hongyuan Cao; University of Missouri
Change point estimation: another look at multiple testing problems
We consider the problem of large-scale multiple testing for data that have locally clustered signals. With this structure, we apply techniques from change point analysis and propose a boundary detection algorithm so that the local clustering information can be utilized. We show that by exploiting the local structure, the precision of a multiple testing procedure can be improved substantially. We study tests with independent as well as dependent p-values. Monte Carlo simulations suggest that the methods perform well with realistic sample sizes and demonstrate improved detection ability compared with competing methods. The practical utility of our methods is illustrated with a genome-wide association study of blood lipids.
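A toy illustration (not the authors' boundary detection algorithm) of why local clustering helps: scanning a moving average of test statistics picks out a contiguous signal block that per-site thresholding might miss. The window width and cutoff below are rough assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=1000)
z[400:450] += 2.0                  # a contiguous cluster of signals

w = 25                             # window width (tuning assumption)
scan = np.convolve(z, np.ones(w) / w, mode="valid")
thresh = np.sqrt(2 * np.log(len(scan)) / w)   # rough Bonferroni-type cutoff
hits = np.where(scan > thresh)[0]
print(hits.min(), hits.max() + w)  # approximate cluster boundaries
```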
Zuoheng Wang; Yale University
Longitudinal SNP Set Association Analysis of Quantitative Phenotypes
Traditional association analyses of genetic variants mainly focus on cross-sectionally measured phenotypes. Many health studies have collected phenotypes at multiple time points. Repeated measurements from the same subject over a period of time not only provide a more accurate assessment of disease condition, but also allow us to explore the genetic influence on disease development and progression. Thus there is growing interest in studying longitudinally measured data. Here we propose LSKAT (Longitudinal SNP-set Kernel Association Test), a robust association test of rare and common variants with longitudinal phenotypes, which extends the SKAT method from a single measurement to repeated measurements. LSKAT uses several variance components to account for the within-subject correlation in longitudinal data and for the contributions of all genetic variants (rare and common) in a region. In simulation studies, we evaluate the performance of LSKAT and demonstrate that, by making full use of the repeated measurements, it has improved power compared with previously proposed tests based on a single measurement or on the average of each subject's measurements. Finally, we apply LSKAT to test for association in a veteran cohort of HIV-infected and uninfected patients to understand the genetic risk underlying alcohol dependence.
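For intuition, here is a cross-sectional SKAT-style variance-component score statistic on simulated genotypes; LSKAT extends this to repeated measures with additional variance components for within-subject correlation. Everything below (sample sizes, flat weights, intercept-only null model) is an illustrative assumption, and the null distribution needed for a p-value is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 10
G = rng.binomial(2, 0.1, size=(n, m)).astype(float)  # toy genotypes, MAF 0.1
y = rng.normal(size=n)                               # phenotype under the null

resid = y - y.mean()          # residuals from a null (intercept-only) model
W = np.eye(m)                 # variant weights (flat, an assumption)
K = G @ W @ G.T               # kernel combining all variants in the region
Q = resid @ K @ resid         # variance-component score statistic
print(Q)
```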
Bodhisattva Sen; Columbia University
Adaptation in Shape Constrained Regression
We consider nonparametric least squares estimation of a shape-constrained (e.g., monotone or convex) regression function, with both univariate and multivariate predictors. We discuss the characterization, computation, and consistency of the least squares estimator (LSE) in these problems. An appealing property of the LSE is that it is completely tuning-parameter-free. To quantify the accuracy of the LSE, we consider the behavior of its risk under squared error loss. We derive worst-case (upper) risk bounds in these problems and highlight the adaptive behavior of the LSE. In particular, we show that the LSE automatically adapts to “sparsity” in the underlying true regression function. Another interesting feature of the LSE in the multi-dimensional examples is that it adapts to the intrinsic dimension of the problem.
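For instance, the monotone LSE is computable by the pool-adjacent-violators algorithm with no tuning parameter anywhere; a minimal sketch on simulated data, using scikit-learn:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 200))
y = x**2 + rng.normal(scale=0.1, size=200)   # true f is monotone

fit = IsotonicRegression().fit(x, y)         # no bandwidth, no penalty
print(fit.predict([0.25, 0.5, 0.75]))        # close to 0.0625, 0.25, 0.5625
```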
Narayana Balakrishna; Cochin University of Science and Technology
Multiplicative Error Models for Non-negative Time Series
The evolution of a non-negative time series is better explained by a multiplicative model than by its additive counterpart. For example, a sequence of durations between events in financial markets is modelled by Autoregressive Conditional Duration (ACD) models, which are members of a larger class known as Multiplicative Error Models (MEMs), introduced by Engle (2002). Most ACD models are discussed in a parametric setup with specified innovation distributions such as exponential, Weibull, and gamma. In the present talk, we introduce a parametric ACD/SCD model with inverse Gaussian innovations and discuss its properties and applications. We will also discuss some of our recent work on nonparametric kernel estimation of the marginal distribution of a non-negative time series. These results will be used for the nonparametric analysis of ACD/SCD models.
This is joint work with Ian Johnston.
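For readers unfamiliar with ACD models, below is a minimal simulation sketch of a standard ACD(1,1) with unit-mean exponential innovations; the talk's model instead uses inverse Gaussian innovations (e.g., scipy.stats.invgauss could supply those draws). All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
omega, a, b = 0.1, 0.2, 0.7      # psi_t = omega + a*x_{t-1} + b*psi_{t-1}
n = 1000
x = np.zeros(n)                  # durations
psi = np.zeros(n)                # conditional expected durations
psi[0] = omega / (1 - a - b)     # start at the unconditional mean
x[0] = psi[0] * rng.exponential()
for t in range(1, n):
    psi[t] = omega + a * x[t - 1] + b * psi[t - 1]
    x[t] = psi[t] * rng.exponential()   # duration = cond. mean * innovation
print(x.mean())                  # near omega / (1 - a - b) = 1.0
```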
Alan Gelfand; Duke University
Spatial data and Gaussian processes: A beautiful marriage
In the past twenty years, the analysis of spatial data has become increasingly model-based. Full specification of stochastic models for the spatial process being investigated enables full inference and uncertainty assessment regarding the process. Gaussian processes on subsets of R² have become a fundamental specification for such modeling, particularly in settings where prediction is a primary goal. Focusing on the point-referenced case, we elaborate on the substantial range of spatial settings where Gaussian processes have enabled rich and flexible modeling. We start with the basic geostatistical model in hierarchical form, moving to generalized spatial regression models, multivariate process models, and spatially varying coefficient models. We consider the use of Gaussian processes to handle skewed distributions as well as nonparametric distributional models, and also their role in dimension reduction strategies that accommodate large datasets. We will also look at less standard contexts, including spatial extremes, spatial directional data, and spatial quantile regression. Modeling details, model fitting, and examples will be provided.
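As a minimal point-referenced example, here is a basic GP fit on toy 2-D locations with a Matern covariance plus a nugget, giving a kriging-style predictive mean and standard deviation. The data and kernel settings are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(6)
S = rng.uniform(0, 1, size=(50, 2))     # spatial locations in R^2
y = np.sin(3 * S[:, 0]) + np.cos(3 * S[:, 1]) + rng.normal(scale=0.1, size=50)

kernel = Matern(length_scale=0.2, nu=1.5) + WhiteKernel(0.01)  # + nugget
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(S, y)
mu, sd = gp.predict([[0.5, 0.5]], return_std=True)   # kriging mean and sd
print(mu, sd)
```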
Ting Zhang; Boston University
Semiparametric Model Building for Regression Models with Time-Varying Parameters
I consider the problem of semiparametric model building for linear regression models with potentially time-varying coefficients. By allowing the response and explanatory variables to be jointly a nonstationary process, the proposed methods are widely applicable to nonstationary and dependent observations, for example, time-varying autoregressive processes with heteroscedastic errors. We propose a local linear shrinkage method that achieves variable selection and parameter estimation simultaneously in a computationally efficient manner. Its selection consistency, along with the favorable oracle property, is established. Since treating a time-invariant coefficient as time-varying sacrifices efficiency, an information criterion is further proposed for distinguishing time-varying from time-invariant components. Numerical examples are presented to illustrate the proposed methods.
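A stripped-down sketch of local estimation of a time-varying coefficient (without the talk's shrinkage penalty or information criterion): a kernel-weighted least squares slope computed at any time point. The bandwidth is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
t = np.linspace(0, 1, n)
x = rng.normal(size=n)
y = np.sin(2 * np.pi * t) * x + 0.2 * rng.normal(size=n)  # beta(t) = sin(2*pi*t)

def beta_hat(t0, h=0.1):
    """Local constant estimate of beta(t0) with a Gaussian kernel."""
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)
    return np.sum(w * x * y) / np.sum(w * x * x)   # weighted LS slope

print(beta_hat(0.25), np.sin(2 * np.pi * 0.25))   # both close to 1
```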
Erin Conlon; University of Massachusetts
Parallel Markov Chain Monte Carlo Methods for Bayesian Analysis of Big Data
Recently, new parallel Markov chain Monte Carlo (MCMC) methods have been developed for massive data sets that are too large for traditional statistical analysis. These methods partition big data sets into subsets and implement parallel Bayesian MCMC computation independently on the subsets. The posterior MCMC samples from the subsets are then joined to approximate the full-data posterior distributions. Current strategies for combining the subset samples include averaging, weighted averaging, and kernel smoothing approaches. Here, I will discuss our new method for combining subset MCMC samples, which directly takes the product of the subset densities.
While our method is applicable to both Gaussian and non-Gaussian posteriors, we show in simulation studies that it outperforms existing methods when the posteriors are non-Gaussian. I will also discuss computational tools we have developed for carrying out parallel MCMC computing in Bayesian analysis of big data.
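Under a Gaussian approximation, the product of subset densities has a closed form: precisions add, and means pool by precision weights. The sketch below shows that scalar case; the talk's method operates on the MCMC samples themselves rather than on fitted Gaussians.

```python
import numpy as np

def combine_gaussian(subset_means, subset_vars):
    """Product of K Gaussian subset posteriors is Gaussian with
    precision = sum of precisions, mean = precision-weighted average."""
    means = np.asarray(subset_means)
    varis = np.asarray(subset_vars)
    prec = np.sum(1.0 / varis)
    mean = np.sum(means / varis) / prec
    return mean, 1.0 / prec

# Usage: three subset posteriors for a scalar parameter.
print(combine_gaussian([1.1, 0.9, 1.0], [0.04, 0.05, 0.045]))
```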
Melanie Wall; Columbia University
In search of an optimally valid criteria set for diagnosing alcohol use disorder
In the field of psychiatry, the current classification system is the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5). The DSM-5 provides diagnostic criteria associated with each mental disorder. Following the prototypal theory, not all listed criteria for a specific disorder must be present in order to receive a diagnosis; rather, a specified subset of criteria must be present. To be diagnosed with an alcohol use disorder (AUD) under the DSM-5, an individual meeting any two of the 11 criteria receives a diagnosis. The question we will address is whether the validity of the diagnosis based on the 2-out-of-11 rule can be improved upon by some other rule based on the same criteria or a subset of them. In this presentation we will: 1) define validity in the absence of a gold standard by formalizing ideas of content validity from psychometrics; 2) compare different measures of diagnostic performance from biostatistics, e.g., sensitivity, specificity, misclassification rate, and the diagnostic odds ratio; and 3) demonstrate an empirical search algorithm, incorporating uncertainty through bootstrapping, that can identify an optimally valid rule for diagnosing alcohol use disorder. The data come from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), conducted by the National Institute on Alcohol Abuse and Alcoholism in 2001/2002.
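A toy version of such a search: simulate 11 binary criteria, compare each "m of 11" threshold rule to a stand-in reference status by misclassification rate, and bootstrap to reflect sampling uncertainty. All data-generating numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
truth = rng.binomial(1, 0.3, n)                 # stand-in "valid" status
p_crit = np.where(truth == 1, 0.5, 0.08)        # criterion endorsement rates
criteria = rng.binomial(1, p_crit[:, None], (n, 11))
counts = criteria.sum(axis=1)                   # criteria met per person

def misclass(m, idx):
    """Misclassification rate of the 'm of 11' rule on a resample idx."""
    return np.mean((counts[idx] >= m) != truth[idx])

for m in range(1, 12):                          # candidate thresholds
    boot = [misclass(m, rng.integers(0, n, n)) for _ in range(200)]
    print(m, round(float(np.mean(boot)), 3))    # pick the best-performing m
```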
Steven Chiou; Harvard University
Joint scale-change models for recurrent events and failure time
Recurrent event data arise frequently in fields such as biomedical sciences, public health, engineering, and the social sciences. In many instances, observation of the recurrent event process can be stopped by the occurrence of a correlated failure event, such as treatment failure or death. In this article, we propose a joint scale-change model for the recurrent event process and the failure time, where a shared frailty variable models the association between the two types of outcomes. In contrast to the popular Cox-type joint modeling approaches, the regression parameters in the proposed joint scale-change model have marginal interpretations. The proposed approach is robust in the sense that no parametric assumption is imposed on the distribution of the unobserved frailty and that we do not need the strong Poisson-type assumption for the recurrent event process. We establish consistency and asymptotic normality of the proposed semiparametric estimators under suitable regularity conditions. To estimate the corresponding variances of the estimators, we develop a computationally efficient resampling-based procedure. Simulation studies and an analysis of hospitalization data from the Danish Psychiatric Central Register illustrate the performance of the proposed method.
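The shared-frailty structure can be illustrated with a small simulation: one gamma frailty per subject scales both the recurrent event rate and the failure hazard, inducing correlation between the two outcomes. This is a generic frailty setup for intuition, not the paper's full scale-change specification.

```python
import numpy as np

rng = np.random.default_rng(9)
for i in range(5):
    Z = rng.gamma(2.0, 0.5)                # subject frailty, mean 1
    D = rng.exponential(1.0 / (0.2 * Z))   # failure time, hazard 0.2 * Z
    t, events = 0.0, []
    while True:                            # recurrent events, rate Z, until D
        t += rng.exponential(1.0 / Z)
        if t >= D:
            break
        events.append(round(t, 2))
    print(f"subject {i}: failure at {D:.2f}, events {events}")
```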