The colloquium is organized by Professor Xiaojing Wang.
Michael Katehakis; Rutgers University
On the Asymptotically Optimal Behavior of Sequential Allocation Policies
Consider the problem of sampling sequentially from a finite number of N ≥ 2 populations or ‘bandits’, where each population i is specified by a sequence of iid random variables representing the reward received each time that population is sampled. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞. In our constructions, the function g effectively controls the ‘exploration’ side of the classical ‘exploration/exploitation’ tradeoff.
When additional parametric assumptions can be made, one can construct policies that are asymptotically optimal in the sense of achieving the Burnetas and Katehakis (1996) lower bound on the logarithmic rate of increase of the regret. We present such asymptotically optimal policies for the cases in which the iid random variables are Normal with unknown means and unknown variances, and Uniform with unknown supports.
Finally, we present asymptotically optimal policies for the case in which the iid random variables have unknown, and not necessarily finite, means and supports, as in the case of Pareto bandits.
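As a rough illustration of the exploration/exploitation tradeoff described above, the following minimal R sketch implements a generic inflated-sample-mean allocation rule for Normal bandits; the inflation term sqrt(2*g(n)/pulls) and all numerical values are illustrative choices, not the exact g-Forcing or g-Inflated Sample Mean constructions of the talk.

    set.seed(1)
    N <- 3
    mu <- c(0.2, 0.5, 0.4)          # unknown true means (used only to simulate rewards)
    g <- function(n) log(n)         # a slowly increasing exploration function
    horizon <- 5000
    pulls <- rep(0, N); sums <- rep(0, N); regret <- 0
    for (n in 1:horizon) {
      if (n <= N) {
        a <- n                      # sample each population once to start
      } else {
        index <- sums / pulls + sqrt(2 * g(n) / pulls)   # inflated sample means
        a <- which.max(index)
      }
      reward <- rnorm(1, mean = mu[a], sd = 1)
      pulls[a] <- pulls[a] + 1
      sums[a] <- sums[a] + reward
      regret <- regret + (max(mu) - mu[a])
    }
    c(regret = regret, pulls = pulls)

With g growing slowly (here log n), the suboptimal populations are sampled only often enough to keep the regret growing at roughly the rate g(n).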
Estimating Standard Errors for Importance Sampling Estimators with Multiple Markov Chains
The naive importance sampling (IS) estimator, based on samples from a single importance density, can be numerically unstable. We consider multiple-distribution IS estimators, in which samples from more than one probability distribution are combined to consistently estimate means with respect to given target distributions. These generalized IS estimators are more stable than naive IS estimators. We consider the Markov chain Monte Carlo context, where independent samples are replaced with Markov chains. If these Markov chains converge to their respective target distributions at a polynomial rate, then under two finite moment conditions we show that a central limit theorem holds for the IS estimators. Further, we develop an easy-to-implement, consistent method for calculating valid asymptotic standard errors based on the batch means (BM) method. We also provide a BM estimator for calculating asymptotically valid standard errors of Geyer's (1994) reverse logistic estimator. We illustrate the method with an application to Bayesian variable selection in linear regression. In particular, the multi-chain IS estimator is used to perform empirical Bayes variable selection, and the BM estimator is used to obtain standard errors in the large-p situation where current methods are not applicable.
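A minimal R sketch of the general idea, simplified to a single Markov chain targeting a single importance density (the talk's estimator combines several chains): an IS estimate of a target mean together with a batch means standard error. The target, importance density, chain length, and batch count are arbitrary illustrative choices.

    set.seed(1)
    n <- 1e5
    x <- numeric(n)                         # Markov chain targeting the importance density q = t_5
    for (i in 2:n) {
      prop <- x[i - 1] + rnorm(1, sd = 2)   # random-walk Metropolis step
      accept <- runif(1) < dt(prop, df = 5) / dt(x[i - 1], df = 5)
      x[i] <- if (accept) prop else x[i - 1]
    }
    w <- dnorm(x) / dt(x, df = 5)           # importance weights: target N(0,1) over q = t_5
    y <- w * x^2                            # terms of the IS estimate of E_pi[X^2] = 1
    est <- mean(y)
    B <- 100; m <- n / B                    # B batches of m consecutive iterations
    batch_means <- colMeans(matrix(y, nrow = m))
    se_bm <- sd(batch_means) / sqrt(B)      # batch means standard error of the IS estimate
    c(estimate = est, bm_se = se_bm)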
John W. Emerson; Yale University
Topics in High-Performance Computing with R for EveRyone
This talk will explore a range of topics related to high-performance computing with R, accessible to R users at all levels working on standard (i.e. laptop or workstation) hardware. An understanding of a few subtle features of the language can go a long way towards improving speed and memory efficiency, and the implications for so-called ‘Big Data’ problems will be obvious. The talk will revolve around a small number of reproducible toy examples: minimal sufficient (in a sense) for achieving the goals. There will be no theorems or proofs.
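One classic toy example of the kind of language subtlety referred to above (not necessarily one used in the talk): growing a vector inside a loop versus preallocating or vectorizing.

    n <- 1e5
    system.time({                  # grows the result one element at a time (slow: repeated copying)
      x <- numeric(0)
      for (i in 1:n) x <- c(x, sqrt(i))
    })
    system.time({                  # preallocates and fills in place (much faster)
      y <- numeric(n)
      for (i in 1:n) y[i] <- sqrt(i)
    })
    system.time(z <- sqrt(1:n))    # vectorized: fastest, and the clearest code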
Haim Bar; University of Connecticut
A Scalable Empirical Bayes Approach to Variable Selection
We develop a model-based empirical Bayes approach to variable selection problems in which the number of predictors is very large, possibly much larger than the number of responses (the so-called “large p, small n” problem). We consider the multiple linear regression setting, where the response is assumed to be a continuous variable that is a linear function of the predictors plus error. The explanatory variables in the linear model can have a positive effect on the response, a negative effect, or no effect. We model the effects of the linear predictors as a three-component mixture, in which a key assumption is that only a small (unknown) fraction of the candidate predictors have a non-zero effect on the response variable. By treating the coefficients as random effects we obtain an approach that is computationally efficient, because the number of parameters that have to be estimated is small and remains constant regardless of the number of explanatory variables. The model parameters are estimated using the EM algorithm, which is scalable and converges significantly faster than simulation-based methods.
Joint work with James Booth and Martin T. Wells.
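A simplified illustration of the three-component mixture idea, run on simulated single-coefficient effect estimates rather than the full random-effects regression model of the talk; the EM updates below are the textbook Gaussian-mixture ones with the null component mean fixed at zero, and all parameter values are arbitrary.

    set.seed(1)
    p <- 2000
    truth <- sample(c("null", "pos", "neg"), p, replace = TRUE, prob = c(0.9, 0.05, 0.05))
    b <- rnorm(p, mean = c(null = 0, pos = 2, neg = -2)[truth], sd = 0.5)  # simulated effect estimates

    pi_k <- c(0.8, 0.1, 0.1); mu <- c(0, 1, -1); s <- c(1, 1, 1)  # null, positive, negative components
    for (iter in 1:200) {
      ## E-step: posterior probability that each estimate belongs to each component
      dens <- sapply(1:3, function(k) pi_k[k] * dnorm(b, mu[k], s[k]))
      resp <- dens / rowSums(dens)
      ## M-step: update proportions, the non-null means, and the standard deviations
      nk <- colSums(resp)
      pi_k <- nk / p
      mu[2] <- sum(resp[, 2] * b) / nk[2]
      mu[3] <- sum(resp[, 3] * b) / nk[3]
      s <- sqrt(colSums(resp * (b - matrix(mu, p, 3, byrow = TRUE))^2) / nk)
    }
    table(truth, fitted = c("null", "pos", "neg")[max.col(resp)])  # classify by largest posterior probability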
Tahir Ekin; Texas State University
Statistical Issues in Medical Fraud Assessment
U.S. governmental agencies report that three to ten percent of annual health care spending is lost to fraud, waste, and abuse. These fraudulent transactions have direct cost implications for taxpayers and diminish the overall quality of medical services. In this talk, the use of statistical sampling and estimation methods for medical fraud assessment will be discussed first. The skewness and multi-modality of the payment populations make over-payment estimation a challenging task. Proposed Bayesian inflated-mixture models will be presented, and their conformance with existing governmental guidelines will be investigated. Second, the use of data mining approaches for medical fraud detection will be discussed. The main objective of these approaches is to characterize billing behavior and detect unusual patterns. The focus will be on unsupervised approaches such as latent Dirichlet allocation and Bayesian co-clustering.
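For context, a toy design-based over-payment projection from a simple random sample of claims, showing the zero-inflated, skewed payment data the abstract refers to; all population and sample figures are made up, and the Bayesian inflated-mixture models of the talk are a model-based alternative to this kind of estimate.

    set.seed(1)
    N_claims <- 50000                        # claims in the audited population (hypothetical)
    n <- 300                                 # audited sample size
    overpay <- ifelse(runif(n) < 0.85, 0, rgamma(n, shape = 0.5, scale = 400))  # mostly zero, skewed otherwise
    point <- N_claims * mean(overpay)        # projected total over-payment
    se <- N_claims * sd(overpay) / sqrt(n)
    lower <- point - qt(0.95, df = n - 1) * se   # lower limit of a two-sided 90% interval
    c(point = point, lower = lower)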
Luis Carvalho; Boston University
A Hierarchical Statistical Model and Computational Methods for Genome-Wide Association Studies
Genome-wide association studies (GWAS) attempt to determine which genomic markers are predictors of genetic traits, most commonly human diseases. In practice, despite the extreme imbalance of having millions of markers recorded for only a few thousand individuals, it is of great interest to glean as much information as possible from this type of data. To this end, we propose a novel statistical model that exploits a hierarchical structure between markers and genes to leverage information between levels and alleviate the “large-p small-n” regime while still attaining a reasonably complex and realistic model. Fitting the model is challenging due to the large number of variables to select, so we discuss efficient computational approaches that we explored to estimate the parameters. Finally, we illustrate the proposed model and estimation procedures on simulated data and on a real-world data set from the Wellcome Trust Consortium. If time permits, we also discuss a latent genotype procedure that aims to correct for genotypic correlations.
This is joint work with Ian Johnston.
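A small simulation sketching the kind of marker-within-gene hierarchical spike-and-slab structure described above, in which markers can carry non-zero effects only when their parent gene is selected; the hyperparameter values are arbitrary and this is not the authors' exact prior.

    set.seed(1)
    G <- 200                                     # genes
    mg <- sample(3:20, G, replace = TRUE)        # markers per gene
    gene <- rep(1:G, times = mg)                 # gene label for each marker
    p <- length(gene)                            # total number of markers

    gamma_gene <- rbinom(G, 1, 0.05)                        # gene-level inclusion indicators
    gamma_marker <- rbinom(p, 1, 0.3) * gamma_gene[gene]    # marker inclusion, given its gene is in
    beta <- ifelse(gamma_marker == 1, rnorm(p, 0, 0.5), 0)  # spike-and-slab marker effects

    c(genes_selected = sum(gamma_gene), markers_selected = sum(gamma_marker), p = p)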
Alan Gelfand; Duke University
Spatial data and Gaussian processes: A beautiful marriage
In the past twenty years, analysis of spatial data has become increasingly model-based. Full specification of stochastic models for the spatial process being investigated enables full inference and uncertainty assessment regarding the process. Gaussian processes on subsets of R^2 have become a fundamental specification for such modeling, particularly in settings where prediction is a primary goal. Therefore, focusing on the point-referenced case, we elaborate the substantial range of spatial settings where Gaussian processes have enabled rich and flexible modeling. We start with the basic geostatistical model, in hierarchical form, moving to generalized spatial regression models, multivariate process models, and spatially varying coefficient models. We will consider the use of Gaussian processes to handle skewed distributions as well as nonparametric distributional models, and also the role of Gaussian processes in dimension reduction strategies to accommodate large datasets. We will also look at less standard contexts, including spatial extremes, spatial directional data, and spatial quantile regression. Modeling details, model fitting, and examples will be provided.
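A minimal geostatistical sketch in the spirit of the basic point-referenced model mentioned above: simulate a mean-zero Gaussian process with exponential covariance at observed locations and krige at new sites. The covariance parameters are fixed here, whereas the hierarchical models in the talk would place priors on them and estimate them.

    set.seed(1)
    n <- 200
    s <- cbind(runif(n), runif(n))            # observed locations in the unit square
    s0 <- cbind(runif(20), runif(20))         # prediction locations
    sigma2 <- 1; phi <- 5; tau2 <- 0.1        # partial sill, decay, nugget (illustrative values)

    D <- as.matrix(dist(rbind(s, s0)))        # all pairwise distances
    K <- sigma2 * exp(-phi * D)               # exponential covariance function
    Kyy <- K[1:n, 1:n] + tau2 * diag(n)       # covariance of the observed data (with nugget)
    K0y <- K[(n + 1):(n + 20), 1:n]           # cross-covariance, new sites vs observed

    y <- drop(t(chol(Kyy)) %*% rnorm(n))      # one realization of the observed process
    pred <- K0y %*% solve(Kyy, y)             # kriging (conditional mean) predictions
    head(pred)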
Ting Zhang; Boston University
Semiparametric Model Building for Regression Models with Time-Varying Parameters
I consider the problem of semiparametric model building for linear regression models with potentially time-varying coefficients. By allowing the response and explanatory variables to jointly form a nonstationary process, the proposed methods are widely applicable to nonstationary and dependent observations, for example time-varying autoregressive processes with heteroscedastic errors. We propose a local linear shrinkage method that achieves variable selection and parameter estimation simultaneously in a computationally efficient manner. Its selection consistency, along with the favorable oracle property, is established. To guard against the efficiency loss that comes from treating time-invariant coefficients as time-varying, an information criterion is further proposed for distinguishing time-varying and time-invariant components. Numerical examples are presented to illustrate the proposed methods.
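A bare-bones sketch of local linear estimation of a single time-varying coefficient (the estimation step only; the shrinkage penalty and the information criterion of the talk are omitted). The data-generating process, kernel, and bandwidth are illustrative choices.

    set.seed(1)
    n <- 500
    u <- (1:n) / n                                  # rescaled time
    x <- rnorm(n)
    beta <- sin(2 * pi * u)                         # true time-varying coefficient
    y <- beta * x + rnorm(n, sd = 0.5)

    h <- 0.1                                        # bandwidth
    beta_hat <- sapply(u, function(u0) {
      w <- dnorm((u - u0) / h)                      # Gaussian kernel weights around u0
      fit <- lm(y ~ 0 + x + I(x * (u - u0)), weights = w)  # local linear fit in the coefficient
      coef(fit)[1]                                  # local estimate of beta(u0)
    })
    plot(u, beta, type = "l"); lines(u, beta_hat, lty = 2)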
Brad Carlin; University of Minnesota
Bayesian Adaptive Clinical Trial Methods for Incorporating Auxiliary Data and Identifying Interesting Subgroups, with Application to Orphan Diseases
Bayesian methods have a long history of success in clinical trial settings where patients and other resources are scarce, where good reliable external information is available, or both. In this talk we review several real-life settings where Bayesian methods have paid such dividends. First, we describe recent Bayesian advances in the adaptive incorporation of historical information in clinical trials through what are known as commensurate priors, showing connections with traditional meta-analytic methods and illustrating their potential for improved power while maintaining acceptable Type I error. We then extend the approach to the cases of survival and functional data, utilizing penalized splines in both cases, and illustrating with a dataset from diagnostic radiology. We then describe our experiences to date using adaptive Bayesian methods in the context of recent work with the University of Minnesota Center for Orphan Drug Research. Applications include trials for drugs treating epilepsy and neurodegenerative diseases such as Parkinson’s and adrenoleukodystrophy (ALD), in the latter case with Lorenzo’s Oil. We briefly describe current research on the cautious combination of randomized and nonrandomized data sources, and also introduce methods for selecting patient subsets for which the treatment is very likely to be effective. While, like Nebraska’s Platte River, this talk will be “a mile wide and an inch deep,” all of our methods will be illustrated in the context of real trial settings arising in both academia and industry.
This talk represents joint work with Drs. Brian Hobbs and Tom Murray of the University of Texas M.D. Anderson Cancer Center, and Dr. Jim Cloyd, Ms. Cynthia Basu, Ms. Hong Zhao, and Mr. Patrick Schnell of the University of Minnesota.
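A toy normal-normal calculation conveying the commensurate-prior idea of borrowing from historical data: the current-trial mean gets a prior centered at the historical estimate, with precision governed by a commensurability parameter tau. In the actual methodology tau itself receives a prior and is estimated from the data; here it is simply varied by hand, and all numbers are hypothetical.

    hist_mean <- 0.40; hist_se <- 0.05     # historical estimate and its standard error
    curr_mean <- 0.25; curr_se <- 0.10     # current-trial estimate and standard error

    borrow <- function(tau) {
      prior_prec <- 1 / (hist_se^2 + 1 / tau)      # prior precision on the current-trial mean
      post_prec <- prior_prec + 1 / curr_se^2
      post_mean <- (prior_prec * hist_mean + curr_mean / curr_se^2) / post_prec
      c(post_mean = post_mean, post_se = sqrt(1 / post_prec))
    }
    sapply(c(0.1, 1, 10, 100), borrow)     # larger tau (more commensurability) means more borrowing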
Melanie Wall; Columbia University
In search of an optimally valid criteria set for diagnosing alcohol use disorder
In the field of psychiatry, the current classification system is the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5). The DSM-5 provides diagnostic criteria associated with each mental disorder. Following the prototypal theory, not all listed criteria for a specific disorder must be present in order to receive a diagnosis; rather, a specified subset of criteria must be present. To be diagnosed with an alcohol use disorder (AUD) under the DSM-5, an individual meeting any two of the 11 criteria receives a diagnosis. The question we will address is whether the validity of the diagnosis using the rule of 2 out of 11 can be improved upon using some other rule based on the same 11 criteria or a subset of them. In this presentation we will: 1) define validity in the absence of a gold standard by formalizing ideas of content validity from psychometrics; 2) compare different measures of diagnostic performance from biostatistics, e.g. sensitivity, specificity, misclassification rate, and the diagnostic odds ratio; and 3) demonstrate an empirical search algorithm, incorporating uncertainty through bootstrapping, that can identify an optimally valid rule for diagnosing alcohol use disorder. The data we will use come from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), conducted by the National Institute on Alcohol Abuse and Alcoholism in 2001/2002.
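A stripped-down sketch of the kind of empirical rule search described in item 3, evaluating "at least k of the 11 criteria" rules and bootstrapping the diagnostic odds ratio; the reference status is simulated from a hypothetical latent severity, whereas the talk defines validity without a gold standard.

    set.seed(1)
    n <- 2000
    severity <- rnorm(n)                                    # hypothetical latent severity
    crit <- matrix(rbinom(n * 11, 1, plogis(severity - 1)), n, 11)  # 11 binary criteria
    ref <- as.integer(severity > 0.5)                       # simulated reference status

    dor <- function(rule, ref) {                            # diagnostic odds ratio
      tp <- sum(rule == 1 & ref == 1); fp <- sum(rule == 1 & ref == 0)
      fn <- sum(rule == 0 & ref == 1); tn <- sum(rule == 0 & ref == 0)
      (tp * tn) / (fp * fn)
    }
    for (k in 1:6) {
      rule <- as.integer(rowSums(crit) >= k)                # "at least k of 11" rule
      boot <- replicate(200, {                              # bootstrap the DOR
        idx <- sample(n, replace = TRUE)
        dor(rule[idx], ref[idx])
      })
      cat(sprintf("k = %d: sens %.2f, spec %.2f, DOR %.1f (boot SE %.1f)\n",
                  k, mean(rule[ref == 1]), 1 - mean(rule[ref == 0]),
                  dor(rule, ref), sd(boot)))
    }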
Steven Chiou; Harvard University
Joint scale-change models for recurrent events and failure time
Recurrent event data arise frequently in various fields such as biomedical sciences, public health, engineering, and social sciences. In many instances, the observation of the recurrent event process can be stopped by the occurrence of a correlated failure event, such as treatment failure and death. In this article, we propose a joint scale-change model for the recurrent event process and the failure time, where a shared frailty variable is used to model the association between the two types of outcomes. In contrast to the popular Cox-type joint modeling approaches, the regression parameters in the proposed joint scale-change model have marginal interpretations. The proposed approach is robust in the sense that no parametric assumption is imposed on the distribution of the unobserved frailty and that we do not need the strong Poisson-type assumption for the recurrent event process. We establish consistency and asymptotic normality of the proposed semiparametric estimators under suitable regularity conditions. To estimate the corresponding variances of the estimators, we develop a computationally efficient resampling-based procedure. Simulation studies and an analysis of hospitalization data from the Danish Psychiatric Central Register illustrate the performance of the proposed method.
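A generic illustration of subject-level resampling for a recurrent-event summary, conveying only the resampling-for-standard-errors idea; the talk's procedure resamples for the joint scale-change model estimators, which this sketch does not implement, and the frailty-driven counts below are simulated.

    set.seed(1)
    n <- 300
    n_events <- rpois(n, lambda = rgamma(n, shape = 2, rate = 2))  # frailty-driven event counts per subject
    est <- mean(n_events)                                          # mean number of events per subject

    boot <- replicate(1000, mean(n_events[sample(n, replace = TRUE)]))  # resample subjects
    c(estimate = est, boot_se = sd(boot))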