All colloquia will be held at 4pm in AUST 108, unless otherwise noted. Coffee will be served at 3:30pm in room 326.
Information about past colloquia is available here.
Colloquium is organized by Professor Xiaojing Wang.
Abdus Sattar; Case Western Reserve University
Modeling of High-Dimensional Clinical Longitudinal Oxygenation Data from Retinopathy of Prematurity
August 31, 2018
Many remarkable advances have been made in non-parametric and semiparametric methods for high-dimensional longitudinal data. However, these methods lack a way to address missing data. Motivated by an oxygenation study of retinopathy of prematurity (ROP), we developed a penalized spline mixed effects model for a high-dimensional nonlinear longitudinal continuous response variable using a Bayesian approach. The ROP study is complicated by the fact that there are non-ignorable missing response values. To address the non-ignorable missing data in the Bayesian penalized spline model, we applied a selection model. Properties of the estimators are studied using Markov chain Monte Carlo (MCMC) simulation. In the simulation study, data were generated with three different percentages of non-ignorable missing values and three different sample sizes, and parameters were estimated under various scenarios. The proposed approach performed better, in terms of bias and percent bias, than the semiparametric mixed effects model fit under the missing at random (MAR) assumption in all scenarios of non-ignorable missing longitudinal data. We performed a sensitivity analysis of the hyper-prior distribution choices for the variance parameters of the spline coefficients in the proposed joint model. The results indicated that a half-t hyper-prior with three different degrees of freedom did not influence the posterior distribution, whereas an inverse-gamma hyper-prior did. We applied our method to the sample entropy data in the ROP study to handle nonlinearity and the non-ignorable missing response variable, and we also analyzed the sample entropy data under the missing at random assumption.
Tim Killeen, Professor Emeritus; University of Connecticut
A Bivariate Test for Location Based on Data Depth
September 5, 2018
Starting with the ideas contained in the groundbreaking data depth paper by Liu (1990), we formulate the simplicial data depth concept into a bivariate test of location that is distribution free under the assumption of angular symmetry with a known median. The test statistic is a count of the number of data triangles that contain the null hypothesized value. A straightforward method for computing this statistic is provided. The exact null distribution is computed and tabled for n < 16, the asymptotic null distribution is derived, and a formula for an approximate critical value is provided. Simulations show that the power of the test compares favorably with other bivariate sign-type tests. The analogous one-dimensional distribution free test is discussed. These ideas are then extended to three dimensions, where we show that, surprisingly, the test is not generally distribution free under H0.
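Since the statistic is just a count of data triangles containing the hypothesized point, it can be computed directly. Below is a minimal brute-force sketch (my own illustration; the abstract's straightforward method may differ, and here boundary points are counted as inside):

```python
import itertools
import numpy as np

def sign_area(a, b, c):
    """Twice the signed area of triangle abc (sign encodes orientation)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def triangle_contains(a, b, c, p):
    """True if point p lies inside (or on the boundary of) triangle abc."""
    d1 = sign_area(p, a, b)
    d2 = sign_area(p, b, c)
    d3 = sign_area(p, c, a)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)  # all same sign (or zero) => inside

def simplicial_depth_count(X, theta0):
    """Test statistic: number of data triangles (triples of rows of X)
    that contain the null hypothesized value theta0."""
    return sum(
        triangle_contains(a, b, c, theta0)
        for a, b, c in itertools.combinations(X, 3)
    )
```

With n points there are n-choose-3 triangles, so this naive version runs in O(n^3) time.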
Xinwei Deng; Virginia Tech
Convex Clustering for Generalized Linear Models with Applications to Purchase Likelihood Prediction
September 12, 2018
In IT service pricing, it is essential to have accurate predictions of the purchase likelihood of potential clients. However, heterogeneity related to both clients and products results in very different purchase behaviors, so it is not appropriate to use one global model for all the data. There is a great need to construct distinctive models for different data segments. Toward this aim, we propose a convex clustering approach that performs data segmentation and model fitting simultaneously. The proposed method ensures that data points with a common model structure are grouped into the same segment. An efficient algorithm with desirable asymptotic properties is developed for parameter estimation. The performance of the proposed approach and its merits are illustrated by numerical examples and a case study with business data from a major IT service provider.
This is joint work with several researchers at the IBM T.J. Watson Research Center.
Junxian Geng; Boehringer Ingelheim
Probabilistic Community Detection with Unknown Number of Communities
September 19, 2018
A fundamental problem in network analysis is clustering the nodes into groups which share a similar connectivity pattern. Existing algorithms for community detection assume knowledge of the number of clusters or estimate it a priori using various selection criteria, and subsequently estimate the community structure. Ignoring the uncertainty in the first stage may lead to erroneous clustering, particularly when the community structure is vague. We instead propose a coherent probabilistic framework for simultaneous estimation of the number of communities and the community structure, adapting recently developed Bayesian nonparametric techniques to network models. An efficient Markov chain Monte Carlo (MCMC) algorithm is proposed which obviates the need to perform reversible jump MCMC on the number of clusters. The methodology is shown to outperform recently developed community detection algorithms in a variety of synthetic data examples and in benchmark real datasets. Using an appropriate metric on the space of all configurations, we develop nonasymptotic Bayes risk bounds even when the number of clusters is unknown. En route, we develop concentration properties of nonlinear functions of Bernoulli random variables, which may be of independent interest in the analysis of related models.
Vladimir Pozdnyakov; University of Connecticut
Discretely Observed Brownian Motion Governed by a Telegraph Process: Estimation
October 3, 2018
A Brownian motion whose infinitesimal variance alternates according to a telegraph process is considered. This stochastic process can be employed to model a variety of real-world situations; in this work we apply our findings to animal movement analysis, and another possible application is stochastic volatility modeling in mathematical finance. The main goal is to develop an estimation procedure for the underlying model parameters when the Brownian motion governed by the telegraph process is observed discretely. The resulting sequence of observations is not Markov, but since the location-state process is Markov, likelihood estimation can be done with the help of hidden Markov model tools.
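To fix ideas, here is a small simulation sketch (parameter names and the two-state setup are my own illustration, not the speaker's code) of a Brownian motion whose infinitesimal standard deviation switches with a two-state telegraph process, observed on a discrete time grid:

```python
import numpy as np

def simulate_telegraph_bm(n_obs, dt, lam, sigma, x0=0.0, seed=0):
    """Simulate a Brownian motion whose infinitesimal standard deviation
    alternates between sigma[0] and sigma[1] according to a two-state
    telegraph process with switching rates lam[0] and lam[1].
    Returns positions observed on the grid 0, dt, 2*dt, ..., (n_obs-1)*dt."""
    rng = np.random.default_rng(seed)
    state = 0
    next_switch = rng.exponential(1.0 / lam[state])  # absolute switch time
    x, obs = x0, [x0]
    for k in range(1, n_obs):
        t, t_end = (k - 1) * dt, k * dt
        var = 0.0
        # accumulate the increment's variance piecewise between switches
        while next_switch < t_end:
            var += sigma[state] ** 2 * (next_switch - t)
            t = next_switch
            state = 1 - state
            next_switch = t + rng.exponential(1.0 / lam[state])
        var += sigma[state] ** 2 * (t_end - t)
        # given the telegraph path, the increment is Gaussian
        x += rng.normal(0.0, np.sqrt(var))
        obs.append(x)
    return np.array(obs)
```

Note that only the positions are recorded: the telegraph state is hidden, which is exactly why hidden Markov model machinery enters the likelihood computation.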
Annie Qu; UIUC
Multilayer Tensor Factorization with Applications to Recommender Systems
October 10, 2018
Recommender systems have been widely adopted by electronic commerce and entertainment industries for individualized prediction and recommendation, which benefit consumers and improve business intelligence. In this article, we propose an innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems. The proposed method utilizes the structure of a tensor response to integrate information from multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the “cold-start” issue in the absence of information from new customers, new products or new contexts. Specifically, it provides more effective recommendations through sub-group information. To achieve scalable computation, we develop a new algorithm for the proposed method, which incorporates a maximum block improvement strategy into the cyclic block-wise-coordinate-descent algorithm. In theory, we investigate both algorithmic properties for global and local convergence, along with the asymptotic consistency of estimated parameters. Finally, the proposed method is applied to simulations and to IRI marketing data with 116 million observations of product sales. Numerical studies demonstrate that the proposed method outperforms existing competitors in the literature. This is joint work with Xuan Bi and Xiaotong Shen.
Lu Lu; Smith College
Applications of Statistics and Machine Learning in Credit Scoring: A Case Study
October 17, 2018
Credit scoring is the process of assessing default risk based on a borrower’s characteristics. Lending Club (LC) is a peer-to-peer (P2P) platform that issues personal loans. Its datasets include several features about each loan: the borrower’s FICO score, employment length, the LC-assigned loan grade, etc. The main goal here is to predict the default risk of a loan using these factors. We will apply several machine learning algorithms, such as random forests, to LC’s data and compare their performance. Since most loans are in good status, the modeling also involves dealing with the class imbalance problem.
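A minimal sketch of this kind of workflow, using synthetic imbalanced data in place of the Lending Club files (the feature counts and sizes here are invented, and class weighting is just one of several ways to handle imbalance):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan features (FICO score, employment length,
# loan grade, ...); weights=[0.9, 0.1] mimics "most loans are in good status".
X, y = make_classification(n_samples=5000, n_features=8,
                           weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight="balanced" upweights the rare default class during training
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=1)
rf.fit(X_tr, y_tr)

# With imbalanced classes, rank-based metrics such as AUC are more
# informative than raw accuracy.
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

Other standard options for the imbalance problem include downsampling the majority class or oversampling the minority class (e.g., SMOTE).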
Feng Guo; Virginia Tech
Driving Risk Assessment with the Naturalistic Driving Study
October 19, 2018
Traffic crashes are a leading cause of death, with 35,000 to 40,000 fatalities annually. This talk will introduce driving risk assessment using the naturalistic driving study (NDS), an innovative way of investigating traffic safety and driving behaviors. The NDS monitors the driving process continuously through video recording devices and various radar and kinematic sensors installed on subject vehicles. The large amount of data collected provides rich information but also brings challenges in data analysis and modeling. This talk will introduce the characteristics of NDSs, epidemiological approaches for video-based risk modeling, recurrent event models, and rare-event prediction. I will also introduce the latest findings from the second Strategic Highway Research Program Naturalistic Driving Study, the largest NDS, with more than 3,500 drivers and more than one million hours of driving data.
Victor Hugo Lachos Davila; University of Connecticut
Censored Regression Models for Complex Data
October 24, 2018 @ UMass
Measurement data can be subject to some upper and/or lower detection limits because of the restriction/limitation of experimental apparatus. A complication arises when these continuous measures present a heavy-tailed behavior because inference can be seriously affected by the misspecification of their parametric distribution. For such data structures, we discuss some useful models and estimation strategies for robust estimation. The practical utility of the proposed methods is exemplified using real data sets and simulation studies.
Xia Wang; University of Cincinnati
Bayesian Hidden Markov Models for Dependent Large-Scale Multiple Testing
October 26, 2018
An optimal and flexible multiple hypothesis testing procedure is constructed for dependent data based on Bayesian techniques, aiming at handling two challenges, namely the dependence structure and the specification of the non-null distribution. Ignoring dependence among hypothesis tests may lead to loss of efficiency and bias in decisions. Misspecification of the non-null distribution, on the other hand, can result in both false positive and false negative errors. Hidden Markov models are used to accommodate the dependence structure among the tests. A Dirichlet process mixture prior is applied to the non-null distribution to overcome the potential pitfalls of distribution misspecification. The testing algorithm, based on Bayesian techniques, minimizes the false negative rate (FNR) while controlling the false discovery rate (FDR). The procedure is applied to pointwise and clusterwise analysis. Its performance is compared with existing approaches using both simulated and real data examples. This is joint work with Ali Shojaie and Jian Zou.
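As a rough illustrative sketch (not the talk's implementation): with a two-state HMM whose non-null emission is a single fixed normal standing in for the Dirichlet process mixture, the posterior null probabilities (local FDRs) follow from the forward-backward algorithm, and an FDR-controlling decision is a step-up rule on the sorted posterior null probabilities:

```python
import numpy as np
from scipy.stats import norm

def hmm_lfdr(z, pi, A, mu1, sd1):
    """Posterior null probabilities P(state_i = null | z) for a two-state
    HMM: state 0 emits N(0,1) (null), state 1 emits N(mu1, sd1) (non-null;
    a single normal stands in for the talk's Dirichlet process mixture)."""
    n = len(z)
    emit = np.column_stack([norm.pdf(z, 0.0, 1.0), norm.pdf(z, mu1, sd1)])
    alpha = np.zeros((n, 2))
    beta = np.ones((n, 2))
    alpha[0] = pi * emit[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                       # forward pass, scaled
        alpha[t] = (alpha[t - 1] @ A) * emit[t]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):              # backward pass, scaled
        beta[t] = A @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 0]                           # lfdr_i = P(null | data)

def reject_at_fdr(lfdr, q):
    """Step-up rule: reject the tests with smallest lfdr for as long as the
    running mean of the rejected lfdr values stays below q."""
    order = np.argsort(lfdr)
    running = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    k = int(np.sum(running <= q))
    reject = np.zeros(len(lfdr), dtype=bool)
    reject[order[:k]] = True
    return reject
```

The HMM transition matrix is what lets neighboring tests borrow strength from each other, which is the gain over treating the tests as independent.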
T.S.G. Peiris; University of Moratuwa
Impact of Mathematics (Level 1 and Level 2) on the Performance of Engineering Undergraduates in Level 2: A Case Study
October 31, 2018
Mathematics plays a major role in developing the analytical thinking of students in a wide range of disciplines, especially in the engineering sciences, and consequently it helps to improve the performance of engineering students at each level of their undergraduate studies. Although a few such studies have been conducted, the concept of a covariate in a multivariate setting has not been considered in past studies. This study therefore aims (i) to examine the effect of mathematics in Level 1 and Level 2 simultaneously, (ii) to examine the effect of mathematics in Level 1 and Level 2 separately, using unadjusted and adjusted Canonical Correlation Analysis (CCA), and (iii) to develop an index to compare the impact of mathematics on students’ performance among different engineering disciplines. The study was conducted using the raw marks of 1256 engineering students from seven disciplines at the Faculty of Engineering, University of Moratuwa, Sri Lanka, for the two academic years 2010 and 2011. A theoretical model of the relationship between the two measurements, mathematics performance and engineering performance, was developed based on a literature review. Structural Equation Modeling (SEM) based on the Partial Least Squares technique was used to validate the conceptual model. The first canonical variate of engineering was found to be the best proxy indicator for engineering performance. The impact of mathematics in semester 2 on engineering performance in Level 2 is significantly higher than the impact of mathematics in semester 1. Mathematics in Level 1 and Level 2 jointly influenced engineering performance in Level 2 irrespective of the engineering discipline, and the level of impact of mathematics varies among disciplines. The individual effect of mathematics in Level 2 on engineering performance in Level 2 is significantly higher than the individual effect of mathematics in Level 1.
Mathematics in Level 1 remains important for students’ engineering performance in Level 2, as it has a significant indirect effect. The results obtained in this study can be utilized in curriculum development for mathematics modules. This is joint work with K.A.D.S.A. Nanayakkara.
Mengyang Gu; Johns Hopkins University
Calibration of imperfect mathematical models by multiple sources of data with measurement bias
November 7, 2018
Model calibration involves using experimental or field data to estimate the unknown parameters of a mathematical model. This task is complicated by discrepancy between the model and reality, and by possible bias in the field data. We consider model calibration in the presence of both model discrepancy and measurement bias using multiple sources of data. Model discrepancy is often estimated using a Gaussian stochastic process (GaSP), but it has been observed in many studies that the calibrated mathematical model can be far from reality. Here we show that modeling the discrepancy function via a GaSP often leads to inconsistent estimation of the calibration parameters, even with an infinite number of repeated experiments and an infinite number of observations in each experiment. We introduce the scaled Gaussian stochastic process (S-GaSP) to model the discrepancy function. We establish an explicit connection between the GaSP and the S-GaSP through an orthogonal series representation. We show that the predictive mean estimator in the S-GaSP calibration model converges to the reality at the same rate as the GaSP, with a suitable choice of the regularization and scaling parameters. We also show that the calibrated mathematical model in the S-GaSP calibration converges to the one that minimizes the L2 loss between the reality and the mathematical model with the same regularization and scaling parameters, whereas the GaSP model does not have this property.
The scientific goal of this work is to use multiple radar satellite interferograms to calibrate a geophysical model of Kilauea Volcano, Hawai`i. We investigate the use of models calibrated using all the data sets simultaneously, and also using stacks (averages), a commonly used approach in geoscience research. The connection and differences between these two approaches are studied. We derive distributions for the maximum likelihood estimator and for Bayesian inference, both implemented in the “RobustCalibration” package available on CRAN. Analyses of both simulated and real data confirm that our approach can identify the measurement bias and model discrepancy using multiple sources of data.
Joshua L. Warren; Yale University
Spatiotemporal Statistical Methods for Monitoring Glaucoma Progression Using Visual Field Data
November 14, 2018
Diagnosing glaucoma progression early is critical for limiting irreversible vision loss. A common method for assessing glaucoma progression relies on a longitudinal series of visual fields (VF) acquired from a patient at regular intervals. VF data are characterized by a complex spatiotemporal correlation structure due to the data generating process and ocular anatomy. Thus, advanced statistical methods are needed to make clinical determinations regarding progression status and for monitoring the disease over time. We introduce a spatiotemporal boundary detection model that allows the underlying anatomy of the optic disc to define the spatial structure of the VF data across time. Based on this model, we define a diagnostic metric and verify that it explains a novel pathway in glaucoma progression. A spatially varying change points model is also developed to facilitate the prediction of VF data and to estimate the timing of future vision loss. Models are applied to data from the Vein Pulsation Study Trial in Glaucoma and the Lions Eye Institute trial registry. Simulations are presented, showing that the proposed methodology is preferred over existing models. This is joint work with Samuel I. Berchuck and Jean-Claude Mwanza.
Daniel B. Neill; New York University
Subset Scanning for Event and Pattern Detection
November 28, 2018
Building on the prior literature on (spatial) scan statistics, subset scanning is an accurate and computationally efficient framework for detecting events and other patterns in both spatial and non-spatial datasets, through constrained optimization of a score function (e.g., a likelihood ratio statistic) over subsets of the data. Many score functions of interest satisfy the linear-time subset scanning property (Neill, 2012), enabling exact and efficient optimization over subsets. This efficient unconstrained optimization step, the fast subset scan, can be used as a building block for scalable solutions to event and pattern detection problems incorporating a variety of real-world constraints. In this talk, I will introduce the fundamental theory and methodology of subset scanning, and various extensions and generalizations of this approach. I will also describe a number of real-world applications of subset scanning, ranging from public health (early detection of disease outbreaks and emerging patterns of drug overdose deaths) to algorithmic fairness (discovering and correcting systematic biases in risk prediction, with applications to criminal justice and many other domains).
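To illustrate the linear-time subset scanning idea, here is a sketch under the expectation-based Poisson score (one scoring function known to satisfy the LTSS property; the data and variable names are my own illustration): sort records by the priority c_i/b_i, and the highest-scoring of the 2^n subsets is guaranteed to be one of the n priority-ordered prefixes.

```python
import numpy as np

def ebp_score(c, b):
    """Expectation-based Poisson log-likelihood ratio score for an
    aggregate observed count c against an aggregate baseline b."""
    return c * np.log(c / b) + b - c if c > b else 0.0

def fast_subset_scan(counts, baselines):
    """Fast subset scan via LTSS: only the n prefixes of the
    priority-sorted records need to be scored, not all 2^n subsets."""
    order = np.argsort(-counts / baselines)      # priority c_i / b_i, descending
    c_cum = np.cumsum(counts[order])
    b_cum = np.cumsum(baselines[order])
    scores = [ebp_score(c, b) for c, b in zip(c_cum, b_cum)]
    k = int(np.argmax(scores))
    return scores[k], sorted(order[: k + 1].tolist())
```

This unconstrained O(n log n) optimization is the building block that the constrained, real-world variants described in the talk are layered on.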
Bio: Daniel B. Neill is Associate Professor of Computer Science and Public Service at NYU’s Courant Institute Department of Computer Science and Robert F. Wagner Graduate School of Public Service, and Associate Professor of Urban Analytics at NYU’s Center for Urban Science and Progress. He was previously a tenured faculty member at Carnegie Mellon University’s Heinz College, where he was the Dean’s Career Development Professor, Associate Professor of Information Systems, and Director of the Event and Pattern Detection Laboratory. He received his M.Phil. from Cambridge University and his M.S. and Ph.D. in Computer Science from Carnegie Mellon University. Dr. Neill’s research focuses on developing new methods for machine learning and event detection in massive and complex datasets, with applications ranging from medicine and public health to law enforcement and urban analytics. He works closely with organizations including public health, police departments, hospitals, and city leaders to create and deploy data-driven tools and systems to improve the quality of public health, safety, and security, for example, through the early detection of disease outbreaks and through predicting and preventing hot-spots of violent crime. He was the recipient of an NSF CAREER award and an NSF Graduate Research Fellowship, and was named one of the “top ten artificial intelligence researchers to watch” by IEEE Intelligent Systems.
Lorin Crawford; Brown University
Variable Prioritization in “Black Box” Statistical Methods
December 5, 2018
A consistent theme of the work done in the Crawford Lab is to take modern computational approaches and develop theory that enables their interpretations to be related back to classical genomic principles. The central aim of this talk is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel, interpretable, and computationally efficient way to summarize the relative importance of predictor variables. Methodologically, we develop the “RelATive cEntrality” (RATE) measure to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem from significant covarying relationships with other variants in the data. We will illustrate RATE through Bayesian Gaussian process regression, although the proposed innovations apply to other nonlinear methods (e.g., deep neural networks). It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for phenotypes generated by complex genetic architectures. With detailed simulations and applications to real genome-wide association mapping studies, we show that applying RATE enables an explanation for this improved performance.