Please send comments for publication to firstname.lastname@example.org.
Paper of the Month, February 2020: Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32, 407–499.
In many problems, we have available a large collection of predictors from which we hope to select a parsimonious set for accurate prediction of a response variable. Classical model selection algorithms include forward selection, backward elimination, stepwise selection, and their stagewise versions, while in the era of big data, methods under the learning scheme of “regularization + optimization”, as exemplified by Lasso, have undergone exciting development. Given that these seemingly very different approaches often produce models with striking similarity, a natural question arises: is there any intrinsic connection between them?
Efron et al. (2004) proposed Least Angle Regression (LARS), a novel model selection algorithm, to bridge Lasso and forward stagewise regression (FSR). They showed that the solution paths of either Lasso or FSR can be efficiently produced via the LARS algorithm with some simple modifications. Ever since this work, there has been a revival of interest in so-called stagewise learning. Generally speaking, a stagewise algorithm builds a model from scratch and gradually increases the model complexity in a sequence of simple learning steps. It was subsequently realized that LARS and stagewise learning also connect to various optimization and machine learning approaches such as steepest descent, boosting, and path-following algorithms. As noted in one of the paper’s discussion pieces, “the LARS–Lasso–boosting relationship opens the door for new insights on existing methods’ underlying statistical mechanisms and for the development of new and promising methodology.”
We will discuss the main idea of LARS, the intriguing connections between regularized estimation and stagewise learning, and some recent advances.
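To make the phrase "a sequence of simple learning steps" concrete, here is a minimal sketch of incremental forward stagewise regression, not the LARS algorithm itself. The function name `forward_stagewise`, the step size, the number of steps, and the simulated data are all assumptions made for illustration; the sketch also assumes the predictors are on comparable scales.

```python
import random

def forward_stagewise(X, y, step=0.01, n_steps=2000):
    """Incremental forward stagewise regression (illustrative sketch)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)                      # start from the empty model
    for _ in range(n_steps):
        # Inner product of every predictor with the current residual.
        score = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(score[k]))
        if abs(score[j]) < 1e-8:
            break
        delta = step if score[j] > 0 else -step
        beta[j] += delta                 # nudge only the best-aligned predictor
        for i in range(n):
            resid[i] -= delta * X[i][j]
    return beta

# Simulated data (hypothetical): y depends on predictors 0 and 2 only.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(100)]
y = [2.0 * row[0] - 1.0 * row[2] + rng.gauss(0, 0.1) for row in X]
beta_hat = forward_stagewise(X, y)
print([round(b, 2) for b in beta_hat])
```

Each pass moves one coefficient by a tiny amount in the direction most aligned with the residual, so the coefficient paths grow gradually from zero; stopping early gives a sparser, more regularized fit.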
Notes Preparer: Kun Chen
Paper of the Month, November 2019: Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.
Albert and Chib (1993) introduced a data-augmentation idea for Bayesian inference in regression models for binary and polychotomous response data. This paper is a seminal work on the introduction of latent data into Bayesian analysis of categorical response data.
By using continuous latent data to connect the binary response in a probit model with the normal linear model, this approach has several advantages. First, it allows us to perform exact inference for binary regression models, which is likely preferable to maximum likelihood methods for small sample sizes. Second, sampling from the joint posterior of the unknowns in the model then requires only Gibbs sampling, in which every draw comes directly from a standard distribution such as the multivariate normal or the truncated normal, and which is thus easy to implement. To sum up, this data-augmentation scheme leads to simple, effective methods for Bayesian posterior inference, which circumvent the need for analytic approximations, numerical integration or Metropolis-Hastings in probit models.
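The two Gibbs steps can be sketched for the simplest case: a probit model with a single coefficient, no intercept, and a flat prior on the coefficient. These simplifications, the function names `draw_latent` and `probit_gibbs`, and the simulated data are assumptions for illustration, not the general sampler of the paper.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_latent(mean, y, rng):
    """Draw z ~ N(mean, 1) truncated to z > 0 if y == 1, else to z <= 0."""
    p_neg = STD_NORMAL.cdf(-mean)            # P(z <= 0)
    u = rng.uniform(p_neg, 1.0) if y == 1 else rng.uniform(0.0, p_neg)
    # Keep inv_cdf away from 0 and 1; adequate for moderate |mean| only.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(u)

def probit_gibbs(x, y, n_iter=600, burn_in=100, seed=1):
    rng = random.Random(seed)
    sxx = sum(xi * xi for xi in x)
    beta, draws = 0.0, []
    for it in range(n_iter):
        # Step 1: latent normal responses given beta and the observed y.
        z = [draw_latent(beta * xi, yi, rng) for xi, yi in zip(x, y)]
        # Step 2: with a flat prior, beta | z ~ N(sum(x*z)/sum(x^2), 1/sum(x^2)).
        sxz = sum(xi * zi for xi, zi in zip(x, z))
        beta = rng.gauss(sxz / sxx, sxx ** -0.5)
        if it >= burn_in:
            draws.append(beta)
    return draws

# Simulated data (hypothetical) with true coefficient 1.0.
data_rng = random.Random(0)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [1 if xi + data_rng.gauss(0, 1) > 0 else 0 for xi in x]
draws = probit_gibbs(x, y)
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 2))
```

Both conditional draws are from standard distributions, which is the whole point of the augmentation: no tuning and no accept/reject step is needed.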
Moreover, we can easily extend this approach to model the latent data beyond the probit link. For example, our faculty members Dr. Ming-Hui Chen and Dr. Dipak K. Dey used this kind of data-augmentation idea in their paper (2008) titled “Flexible Generalized T-Link Models for Binary Response Data”. I also employed this idea in my paper (2013) titled “Bayesian Analysis of Dynamic Item Response Models in Educational Testing” for logistic regression.
However, in comparison to the probit model, Bayesian inference for the logistic regression model has long been recognized as a hard problem because of the inconvenient analytic form of the model’s likelihood function. The paper “Bayesian inference for logistic models using Pólya–Gamma latent variables” by Polson, Scott and Windle (2013) introduced a Pólya-Gamma latent variable as a new data-augmentation scheme for binomial likelihoods. Their approach avoids the Metropolis-Hastings steps for logistic regression used in my paper (2013), which makes their method very useful for logistic regression, negative binomial regression, nonlinear mixed-effect models, and spatial models of count data.
The article of Albert and Chib (1993) is one of the papers I always encourage my graduate students to read when they begin to study Bayesian methods. It reflects a very important idea for designing an effective Markov chain Monte Carlo algorithm in Bayesian analysis.
- 1. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.
- 2. Kim, S., Chen, M. H., & Dey, D. K. (2008). Flexible generalized t-link models for binary response data. Biometrika, 95(1), 93-106.
- 3. Wang, X., Berger, J. O., & Burdick, D. S. (2013). Bayesian analysis of dynamic item response models in educational testing. The Annals of Applied Statistics, 7(1), 126-153.
- 4. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339-1349.
Notes Preparer: Wang Xiaojing
Paper of the Month, October 2019: A. Azzalini and A. Dalla Valle (1996). The multivariate skew normal distribution. Biometrika, Volume 83, Number 4, 715-726.
In this paper Azzalini and Dalla Valle introduced the multivariate skew normal (MSN) distribution, which extends the class of multivariate normal distributions by the addition of a shape parameter to regulate skewness. Several properties are discussed, with emphasis on the bivariate case. In a subsequent paper, published in the JRSS B [Azzalini and Capitanio (1999). Statistical applications of the multivariate skew normal distribution. Volume 61, Number 3, 579-602], further probabilistic properties of the distribution are examined, with special emphasis on aspects of statistical relevance.
Since the introduction of the MSN, several multivariate distributions have been proposed, such as the multivariate skew-t distribution and the class of skew-elliptical distributions. Our faculty member Dipak Dey published an interesting paper in JMVA [Branco and Dey (2001). A general class of multivariate skew-elliptical distributions. Volume 79, Number 1, 99-113]. Most importantly, distributions built on the MSN distribution have been widely used in many applications such as mixed-effects models, finite mixture models, spatial models, and measurement error models, just to mention a few. Personally, I wrote my PhD thesis using this distribution in 2004, and recently I have written a book entitled “Finite Mixtures of Skewed Distributions” published by Springer, which is based on the multivariate skew-elliptical distributions and the R package mixsmsn.
Notes Preparer: Victor Hugo Lachos
Paper of the Month, September 2019: Thomas S. Ferguson (1973). A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, Volume 1, Number 2 (1973), 209-230.
Bayesian statisticians were traditionally silent on nonparametric problems until this pioneering work by Tom Ferguson, published in 1973. Basically, we need to construct a workable prior on a space of distribution functions. As pointed out by the author: “There are two desirable properties of a prior distribution for nonparametric problems. (I) The support of the prior distribution should be large–with respect to some suitable topology on the space of probability distributions on the sample space. (II) Posterior distributions given a sample of observations from the true probability distribution should be manageable analytically. … This paper presents a class of prior distributions, called Dirichlet process (DP) priors, broad in the sense of (I), for which (II) is realized, and for which treatment of many nonparametric statistical problems may be carried out, yielding results that are comparable to the classical theory.” In applications, the author illustrates the DP prior on nonparametric estimation of a distribution function, a mean, a quantile, a variance, or a covariance. He further applies the DP prior to a two-sample problem, quantal bio-assay, empirical Bayes and discrimination problems.
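As a small illustration of property (II), the posterior mean of the distribution function under a DP prior is a weighted average of the prior guess and the empirical distribution function. The sketch below assumes a standard normal prior guess; the function name `dp_posterior_cdf`, the symbol `alpha` for the total mass of the base measure, and the data are hypothetical choices for illustration.

```python
from statistics import NormalDist

def dp_posterior_cdf(t, data, alpha, base_cdf):
    """Posterior mean of F(t) under a DP prior with total mass alpha."""
    n = len(data)
    empirical = sum(1 for x in data if x <= t) / n
    weight = alpha / (alpha + n)          # weight given to the prior guess
    return weight * base_cdf(t) + (1 - weight) * empirical

data = [-1.0, 0.0, 1.0, 2.0]              # hypothetical observations
prior_guess = NormalDist().cdf            # assumed prior guess for F

estimate = dp_posterior_cdf(1.5, data, alpha=4.0, base_cdf=prior_guess)
no_prior = dp_posterior_cdf(1.5, data, alpha=0.0, base_cdf=prior_guess)
print(round(estimate, 4), no_prior)
```

A small `alpha` expresses little confidence in the prior guess, so the estimate approaches the empirical distribution function; a large `alpha` pulls it toward the prior guess.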
This article has stimulated a lot of interest, not only in statistics but also in the machine learning community. It is generally considered difficult for graduate students to read on their own for the first time. I will present the basic idea of the article in a simple way, so that it will be easier for you to read. Looking back, I still consider the paper to be one of the best papers I have read. I am glad I read it while I was in graduate school.
Notes Preparer: Lynn Kuo
Paper of the Month, April 2019: Shuo-Yen Robert Li (1980). A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments. Annals of Probability, 8(6), 1171-1176.
If a monkey types only capital letters, and is on every occasion equally likely to type any of the 26, how long on average will it take the monkey to produce the sequence ‘ABRACADABRA’? This seemingly childish question has a lot of connections to important real-world problems. For instance, the distribution of scan statistics can be related to this type of waiting time. The study of occurrence of patterns is a classical problem in probability theory. In the first volume of his famous book, Feller has several sections devoted to this topic. The distribution of the waiting time can be derived in many different ways. But my personal favorite is the martingale approach developed in Li (1980). Li’s key observation is that information on the occurrence times of patterns can be obtained from the values assumed by a specially constructed auxiliary martingale at a certain well-chosen time. Li’s method is a textbook example of a clever application of Doob’s fundamental Optional Stopping Theorem. This elegant mathematical trick did not change the world, but I like it a lot. And by the way, the expected time is 26¹¹ + 26⁴ + 26.
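The answer can be reproduced with a few lines of code. For equally likely letters, the martingale argument reduces to a simple rule: every length k at which the pattern's first k letters coincide with its last k letters contributes (alphabet size)ᵏ to the expected waiting time. The function name `expected_waiting_time` is a hypothetical choice, and the sketch covers only this uniform-letter special case.

```python
def expected_waiting_time(pattern, alphabet_size):
    """Expected number of i.i.d. uniform letters until `pattern` first appears."""
    n = len(pattern)
    # Sum alphabet_size**k over every k where prefix and suffix of length k agree.
    return sum(alphabet_size ** k
               for k in range(1, n + 1)
               if pattern[:k] == pattern[-k:])

monkey = expected_waiting_time("ABRACADABRA", 26)
print(monkey == 26 ** 11 + 26 ** 4 + 26)

# Fair coin: HH takes longer on average than HT because HH overlaps itself.
print(expected_waiting_time("HH", 2), expected_waiting_time("HT", 2))
```

For ‘ABRACADABRA’ the matching lengths are 1 (‘A’), 4 (‘ABRA’) and 11 (the whole word), which gives the three terms in the answer.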
Notes Preparer: Vladimir Pozdnyakov
Paper of the Month, March 2019: Woodroofe, M. (1977). Second order approximations for sequential point and interval estimation. Ann. Statist., 5, 984-995.
Many crucial properties associated with first boundary crossing depend intricately on the distribution of the corresponding stopping time, the magnitude of an overshoot, and their complex relationship. Under a very broad structure and a minimal set of assumptions, Woodroofe (1977) developed nonlinear renewal theory to tackle these issues more generally than one would normally expect. In the same paper, Michael Woodroofe used his tools to precisely formulate second-order asymptotic approximations for the risk functions associated with a number of classical sequential point and interval estimation problems. The impact has been far-reaching, helping researchers to investigate second-order approximations for risk functions in a very wide range of problems. Woodroofe’s results are truly pretty, involve deep probability theory, and are difficult to prove. On the other hand, their applications are numerous. Many have provided surprising and practically useful conclusions in a large variety of inference problems – parametric, nonparametric, multivariate tests and estimation. Frequentists and Bayesians (empirical, hierarchical) have both reaped well-deserved benefits from Woodroofe’s techniques over the last four decades. I have been a direct beneficiary of Woodroofe’s fundamental approach combined with the Lai-Siegmund (1977, 1979, Annals of Statistics) approach over the past 40 years. My students, my colleagues, and I, along with others, have taken those fundamentals, with requisite updates as needed, to many corners of statistical inference by handling problems in clinical trials, reliability, quality control, multiple comparisons, multiple decision theory and others. Woodroofe’s 1977 paper continues to lay Golden Eggs without “stopping” (pun totally intended!).
Notes Preparer: Nitis Mukhopadhyay
In this paper, Stein (1981) derived the famous Stein’s Lemma for multivariate normal distributions. Though simple, this lemma is so remarkable that it was used in the paper to derive an unbiased estimator of the risk of an arbitrary almost differentiable estimator under squared error loss. This is known as Stein’s Unbiased Risk Estimate (SURE). SURE is an extremely useful tool for selecting tuning parameters or choosing between estimators so as to minimize the estimated risk, and as a theoretical tool for proving dominance results. For example, SURE can be applied to estimate the degrees of freedom for a large class of estimators, such as the various types of linear smoothers, LARS (Efron et al. 2004), the LASSO (Zou, Hastie, and Tibshirani 2007), reduced rank regression (Mukherjee et al. 2015), and many more. SURE can also be used to show the dominance of the James-Stein estimator over the standard least squares estimator when the dimension is above two. It is a really nice tool to have in our statistician’s toolbox.
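To illustrate the tuning-parameter use of SURE, here is a minimal sketch for the soft-thresholding estimator when each observation is normal with unit variance. In that case SURE has the closed form n − 2·#{i : |xᵢ| ≤ λ} + Σ min(xᵢ², λ²). The choice of soft thresholding, the unit-variance assumption, the function name `sure_soft_threshold`, and the data are mine, made for illustration.

```python
def sure_soft_threshold(x, lam):
    """SURE for soft thresholding at level lam, assuming x_i ~ N(theta_i, 1)."""
    n = len(x)
    n_zeroed = sum(1 for xi in x if abs(xi) <= lam)   # coordinates set to zero
    return n - 2 * n_zeroed + sum(min(xi * xi, lam * lam) for xi in x)

x = [0.5, -2.0, 3.0]                                   # hypothetical observations
print(sure_soft_threshold(x, 1.0))

# Choose the threshold with the smallest estimated risk among natural candidates.
candidates = [0.0] + sorted(abs(xi) for xi in x)
best_lam = min(candidates, key=lambda lam: sure_soft_threshold(x, lam))
print(best_lam)
```

No knowledge of the true means is needed to evaluate the estimated risk, which is what makes SURE usable as a data-driven selection rule.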
Notes preparer: Yuwen Gu
Let A₁, A₂, …, Aₙ be a sequence of dependent events. For j = 1, 2, …, n, define the indicator random variables Iⱼ = 1 if the event Aⱼ occurs, and Iⱼ = 0 otherwise. In this article a new method is presented for approximating the distribution of W, the sum of the Iⱼ, j = 1, 2, …, n, by a Poisson distribution, together with a derivation of a bound on the distance between the distribution of W and the Poisson distribution with mean E(W). This new method is based on previous work by C. Stein, where a central limit theorem for a sum of dependent random variables was obtained without the use of moment generating functions. In Chen (1975), the distance between the distribution of W and the Poisson distribution is assessed based on the unique solution of an equation formulated specifically for the Poisson distribution. The focus of this work is on deriving upper and lower bounds on the distance between the distribution of W and the Poisson distribution with mean E(W). The Poisson approximation itself has been used in several areas of probability and statistics, including DNA and protein sequences, geometrical probability, spatial statistics and reliability theory. Since the Poisson approximation can be inaccurate, improvements of the Poisson approximation have been investigated. Most importantly, a compound Poisson approximation method was developed that includes a bound on the distance between the distribution of W and the approximating compound Poisson distribution. The origin of this approach is included in the Chen (1975) article.
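A small numerical check shows what "a bound on the distance" means in practice. The sketch below treats only the simplest special case, independent events with a common probability p, so that W is binomial; the dependent case studied in the paper is not covered. It compares the exact total variation distance with a Stein-Chen-type bound for independent indicators, (1 − e^(−λ))/λ · Σ pⱼ². That bound is a later refinement from the literature and an assumption of this sketch, not a formula quoted from Chen (1975). All function names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_binomial_poisson(n, p):
    """Exact total variation distance between Bin(n, p) and Poisson(n * p)."""
    lam = n * p
    diff = sum(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(n + 1))
    # Poisson mass beyond n, where the binomial has no mass at all.
    poisson_tail = max(1.0 - sum(poisson_pmf(k, lam) for k in range(n + 1)), 0.0)
    return 0.5 * (diff + poisson_tail)

n, p = 50, 0.1
lam = n * p
tv = tv_binomial_poisson(n, p)
bound = (1 - math.exp(-lam)) / lam * n * p ** 2
print(tv, bound)
```

The bound depends only on the individual probabilities, so it can be evaluated even when the exact distribution of W is out of reach.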
Chen, L.H.Y. Poisson Approximation for Dependent Trials. The Annals of Probability, 1975, Vol. 3, 534-545.
Notes Preparer: Joseph Glaz
Liang and Zeger (1986) proposed the generalized estimating equation (GEE), a multivariate extension of the generalized linear model to handle clustered data such as longitudinal data. The GEE method focuses on the regression parameters of the marginal means without specifying the multivariate dependence. A working correlation structure is used to improve efficiency. The resulting estimators of the mean parameters are consistent even if the working correlation is not correctly specified. The closer the working correlation is to the truth, the higher the efficiency. When the outcomes are multivariate normal, the GEE reduces to the score equations if the working correlation is correctly specified. The estimators are asymptotically normal with a variance that can be estimated by a sandwich estimator. This paper has been cited widely (approximately 16,000 citations to date). The method has become a standard tool in the applied statistician’s toolbox and is widely used in many fields. It is implemented in standard software packages such as SAS and R (our faculty member Jun Yan developed the R package geepack during his graduate study).
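The role of the sandwich estimator can be seen in the smallest possible case: an intercept-only model with identity link and an independence working correlation. The estimate is then just the overall mean, but its robust variance sums residuals within each cluster before squaring. This toy setting, the function name `gee_mean_independence`, and the data are assumptions for illustration, far simpler than what geepack or SAS implement.

```python
def gee_mean_independence(clusters):
    """Intercept-only GEE, identity link, independence working correlation."""
    total_n = sum(len(c) for c in clusters)
    beta = sum(sum(c) for c in clusters) / total_n
    # Sandwich: bread = total_n, meat = sum of squared cluster-level score sums.
    meat = sum(sum(yij - beta for yij in c) ** 2 for c in clusters)
    robust_var = meat / total_n ** 2
    # Naive variance that treats all observations as independent.
    resid_ss = sum((yij - beta) ** 2 for c in clusters for yij in c)
    naive_var = resid_ss / total_n ** 2
    return beta, robust_var, naive_var

# Hypothetical data: three clusters with perfectly correlated members.
clusters = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0], [5.0, 5.0, 5.0]]
beta, robust_var, naive_var = gee_mean_independence(clusters)
print(beta, robust_var, naive_var)
```

With strong within-cluster correlation the nine observations carry roughly the information of three, and the sandwich variance reflects that while the naive variance does not.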
Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.
Notes Preparer: Dipak Dey
The Expectation Maximization (EM) algorithm was introduced by Dempster, Laird, and Rubin in 1977¹. As the title of the paper suggests, the EM algorithm is a method to obtain maximum likelihood estimates in cases where the data are incomplete. It has been widely used in many applications, such as data imputation, fitting mixture models, and clustering. The EM algorithm consists of two steps. In the E-step, each missing value is replaced with its expected value, using the current estimates of the parameters in the model. In the M-step, using the available data and the imputed data for missing values, the likelihood function is maximized with respect to each of the model’s parameters in order to obtain new (better) estimates. The algorithm continues until a convergence criterion is met (for example, the improvement in the likelihood of the model is less than some user-defined threshold). One of the challenges is to identify conditions that ensure that the algorithm will converge. Jeff Wu pointed out in his 1983 Annals of Statistics paper² a flaw in the convergence proof in the Dempster, Laird, and Rubin paper, and not only offered a correct proof but also extended the applicability of the algorithm beyond the exponential family models. For our monthly gathering to discuss the paper of the month, you may start by reading Jeff Bilmes’ “gentle introduction” to the EM algorithm, available here.
1 Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B, 39 (1): 1–38.
2 Wu, C. F. Jeff (Mar 1983), “On the Convergence Properties of the EM Algorithm”, Annals of Statistics, 11 (1): 95–103.
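The two steps described above can be sketched for a two-component Gaussian mixture, where the "missing data" are the component labels. The E-step replaces each unknown label by its expected value (the probability of belonging to component 1), and the M-step updates the weight, means and standard deviations. The function name `em_two_gaussians`, the starting values, the tolerance, and the simulated data are assumptions for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, n_iter=200, tol=1e-8):
    n = len(data)
    mu1, mu2 = min(data), max(data)              # crude starting values
    s1 = s2 = (max(data) - min(data)) / 4 or 1.0
    w, loglik_path = 0.5, []
    for _ in range(n_iter):
        # E-step: expected value of each missing label (membership in component 1).
        r, loglik = [], 0.0
        for x in data:
            a = w * normal_pdf(x, mu1, s1)
            b = (1 - w) * normal_pdf(x, mu2, s2)
            r.append(a / (a + b))
            loglik += math.log(a + b)
        loglik_path.append(loglik)
        if len(loglik_path) > 1 and loglik - loglik_path[-2] < tol:
            break                                # convergence criterion met
        # M-step: weighted maximum likelihood updates.
        n1 = sum(r)
        n2 = n - n1
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1)
        s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2)
        w = n1 / n
    return w, mu1, mu2, loglik_path

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(150)] + [rng.gauss(5, 1) for _ in range(150)]
w, mu1, mu2, loglik_path = em_two_gaussians(data)
print(round(w, 2), round(mu1, 2), round(mu2, 2))
```

The recorded log-likelihood values never decrease from one iteration to the next, which is the monotonicity property at the heart of the convergence discussion.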
Notes Preparer: Haim Bar
The Paper of the Month for September 2018 is selected to be “Generalized cross-validation as a method for choosing a good ridge parameter” by Gene Golub, Michael Heath, and Grace Wahba. The paper was published in Technometrics 21, no. 2 (1979): 215-223. Dr. Wahba will be the speaker of the 2018 Pfizer/ASA/UConn Distinguished Statistician Colloquium.
Cross-validation is widely used to assess the quality of model fitting in statistics and machine learning and is often adopted in choosing tuning parameters in regularization methods such as the widely used LASSO. However, ordinary cross-validation may not perform well in certain scenarios, such as when the design matrix is close to being column-orthogonal. In the context of ridge regression, the authors proposed the method of generalized cross-validation (GCV), which is a rotation-invariant version of ordinary cross-validation. GCV improves on ordinary cross-validation: it does not require an estimate of the model error variance, and it is applicable when the number of predictors is larger than the sample size. It is worth mentioning that this paper was a joint work by researchers from computer science and statistics almost forty years ago. Modern applications of the technique of GCV go far beyond ridge regression, but the fundamental ideas are in the original paper.
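For readers who want to see the criterion itself, here is a minimal sketch of GCV for ridge regression with a single predictor and no intercept, where everything is available in closed form. The GCV score is the mean squared residual divided by (1 − tr(H)/n)², with H the ridge hat matrix. The one-predictor restriction, the penalty parametrization (λ added directly to x'x), the function name `gcv_ridge_1d`, and the data are assumptions for illustration.

```python
def gcv_ridge_1d(x, y, lam):
    """GCV score for one-predictor ridge regression without intercept."""
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    beta = sxy / (sxx + lam)                     # ridge estimate
    rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    trace_h = sxx / (sxx + lam)                  # trace of the hat matrix
    return (rss / n) / (1 - trace_h / n) ** 2

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
grid = [0.0, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: gcv_ridge_1d(x, y, lam))
print(best_lam)
```

Note that the score uses only the residuals and the trace of the hat matrix, so no estimate of the error variance enters anywhere.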
Notes Preparer: HaiYing Wang
The last paper of the month for the academic year appeared in the Journal of Irreproducible Results (JIR). The title of the paper by Bennett et al. is “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction”. The authors of the paper received the Ig Nobel prize in 2012, for neuroscience. According to the Ig Nobel website “The Ig Nobel Prizes honor achievements that first make people laugh, and then make them think.” Indeed, although the chosen paper of the month is light-hearted and has appeared in a science humor magazine, it makes a very good argument for the importance of accounting for multiple testing, especially in modern applications, where the number of simultaneous tests can be very large.
Notes Preparer: Haim Bar
Even an undergraduate student taking a regression course knows how to do model selection or variable selection through model selection criteria such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). Both criteria penalize the measure of fit (log-likelihood) with a multiple of the number of free parameters to combat overfitting: the multiplier is 2 for AIC and log n for BIC, where n is the sample size. BIC therefore penalizes the number of parameters more heavily than AIC as soon as n is at least 8.
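The two criteria are easy to compute side by side: AIC = −2 log L + 2k and BIC = −2 log L + k log n, where k is the number of free parameters. The log-likelihood values and model sizes below are made-up numbers chosen to show that the two criteria can disagree.

```python
import math

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)

n = 100
small = {"loglik": -120.0, "k": 3}    # hypothetical smaller model
large = {"loglik": -117.5, "k": 5}    # hypothetical larger model, better fit

aic_small, aic_large = aic(**small), aic(**large)
bic_small, bic_large = bic(n=n, **small), bic(n=n, **large)
print(aic_small, aic_large)
print(round(bic_small, 2), round(bic_large, 2))
```

Here AIC prefers the larger model while BIC prefers the smaller one, because with n = 100 each extra parameter costs log 100 ≈ 4.6 under BIC instead of 2 under AIC.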
AIC is an estimate of the Kullback-Leibler divergence while BIC is an approximation of the posterior probability of a candidate model. AIC is geared toward prediction, as it is asymptotically equivalent to leave-one-out cross-validation. BIC is geared toward explanation, as it consistently selects the underlying data generating process when that process is among the candidate models.
AIC and BIC should be learned together. It is tempting to wonder why the penalties on the number of parameters are scaled the way they are, a question that textbooks usually do not address. Both original papers are among the most cited statistical papers; as of now, Akaike (1974) has 39K and Schwarz (1978) has 34K Google Scholar citations. We present Schwarz (1978) simply because, despite its importance, it is short (3 pages with 4 references on the 4th page) and accessible. The justifications were made even easier to follow by Cavanaugh (1997, Stat Prob Letters) for AIC and by Neath and Cavanaugh (2012, WIREs Comp Stat) for BIC. A StackExchange discussion is also a fun read.
Notes Preparer: Jun Yan
The “Paper of the Month” for February 2018 is selected to be W. Hoeffding’s 1948 paper “A Class of Statistics with Asymptotically Normal Distribution”, where he introduced the concept of U-statistics and built much of the heavy-duty machinery. It was published in The Annals of Mathematical Statistics, Vol. 19, No. 3 (1948), pp. 293-325. It was included in “Breakthroughs in Statistics Volume 1: Foundations and Basic Theory”, which was edited by S. Kotz and N. L. Johnson (1992, Springer, New York), with a lengthy and immensely valuable introduction prepared by P. K. Sen.
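For readers meeting U-statistics for the first time, a quick example may help: a U-statistic averages a symmetric kernel over all subsets of the sample of a given size. With the kernel h(a, b) = (a − b)²/2 on pairs, the result is exactly the usual unbiased sample variance. The function name `u_statistic` and the data are hypothetical choices for illustration.

```python
from itertools import combinations
from statistics import variance

def u_statistic(data, kernel, m):
    """Average of a symmetric kernel over all size-m subsets of the data."""
    subsets = list(combinations(data, m))
    return sum(kernel(*s) for s in subsets) / len(subsets)

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3]           # hypothetical data
var_u = u_statistic(sample, lambda a, b: (a - b) ** 2 / 2, 2)
print(var_u, variance(sample))
```

Other familiar statistics arise the same way, for example the sample mean from the kernel h(a) = a with subsets of size one.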
This paper changed completely the face of nonparametric statistics – both theory and practice – by creating the fundamentals to take this field to maturity. The Hoeffding paper indeed impacted many areas of classical statistical inference by showing how one must handle the probability theory behind the CLT and SLLN for dependent sequences of random variables. It gave rise to new and challenging pathways to handle moderate and large-deviation theories in the case of dependent sequences of random variables.
I totally fell in love with Hoeffding’s 1948 paper in the early 70’s when I was a PhD student at the Indian Statistical Institute-Calcutta. I had great opportunities to use his fundamental projections, CLT, and SLLN, together with the practical relevance of martingales and reverse martingales, in my 1975 thesis work. Personally, I have gone back to Hoeffding’s paper several hundred times in my life for its groundbreaking and truly fresh ideas. The beauty of this paper remains astonishing and unmatched given its path-breaking influence on numerous big sub-fields within statistics and probability theory, including nonparametrics, large-sample theory, CLT’s for dependent processes and invariance principles, Berry-Esseen rates, theory of martingales and reverse martingales, tightness, geometric probability, and CLT’s for convex bodies.
I feel lucky that I could learn to appreciate the marvel of U-statistics theory early on in my career (beginning in 1970) from P. K. Sen, Malay Ghosh, and Bob Serfling (all of whom had close ties with Hoeffding), which helped me to build a large volume of my own significant publications to show for it. I also had the great pleasure of advising a number of PhD students (including Gaute Vik, John Judge, Mabel Moreno, and Bhargab Chattopadhyay) whose research has been significantly touched and influenced by Hoeffding’s paper. The research of two present students (Jun Hu and Chen Zhang) overwhelmingly exploits Hoeffding’s theory of U-statistics. Our own Rick Vitale (with Herman Rubin) gave a complete asymptotic theory for square-integrable U-statistics in their 1980 Annals of Stat. paper.
Nearly 70 years after its publication, W. Hoeffding’s original 1948 paper still inspires me and energizes me big time. Upon rereading the original paper for the n-th time, I continue to surprise myself by discovering something interesting, deep and new that remained hidden from me during my previous passes.
Happy reading of Hoeffding’s (1948) paper. Just live it with TLC!
Notes Preparer: Nitis Mukhopadhyay.
December 2017–January 2018
Controlling the probability of falsely rejecting the null hypothesis is critical for valid and meaningful statistical analysis. But how should this probability be defined and calculated when there are multiple, simultaneous hypotheses? For many years, this question was mostly investigated in the analysis of variance (ANOVA) setting, in which the number of comparisons is typically small or moderate. Until 1995, the common approach to this problem was to control the family-wise error rate (FWER), which ensures that the probability of falsely rejecting at least one of the hypotheses is smaller than a user-specified bound. However, the advent of high-throughput methods, such as in genetics, resulted in a much larger number of simultaneous hypotheses and rendered the FWER approach impractical, in the sense that it was too stringent and lacked power to reject any hypothesis.
In 1995, Benjamini and Hochberg published their breakthrough paper “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing”, in which they defined the False Discovery Rate (FDR) as the expected proportion of falsely rejected hypotheses out of the total number of rejected hypotheses. The paper proposed a very simple step-up procedure based on the ordered p-values, and showed that the procedure indeed controls the FDR at the desired level when the test statistics are independent. This approach enabled significant advances in many areas, and is particularly useful in this age of “Big Data”.
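The procedure is short enough to write down in full: sort the m p-values, find the largest rank k with p₍ₖ₎ ≤ k·q/m, and reject the k hypotheses with the smallest p-values. The function name `benjamini_hochberg` and the p-values below are made up for illustration.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of the hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank                  # keep the largest rank that passes
    return sorted(order[:k_max])

# Hypothetical p-values from ten simultaneous tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected_bh = benjamini_hochberg(pvals, q=0.05)
rejected_bonferroni = [i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)]
print(rejected_bh, rejected_bonferroni)
```

On these numbers the procedure rejects two hypotheses while the Bonferroni correction, which controls the FWER, rejects only one; the gap typically widens as the number of tests grows.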
The paper appeared in the Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1 (1995), pp. 289-300. For additional reading, we also suggest reading about the q-value in the paper “A direct approach to false discovery rates”, by John Storey, which appeared in 2002 in the Journal of the Royal Statistical Society, Series B, 64: 479-498.
Notes Preparer: Haim Bar
Once a month during the academic year our faculty will select a paper which we encourage our students to read and discuss. Papers featured in this section should be generally understood by graduate students, and will be selected either because of their impact, or historical value, or because they contain useful (perhaps overlooked) techniques or results.
The inaugural “Paper of the Month” (November 2017) was selected to be Brad Efron’s paper titled “Bootstrap Methods: Another Look at the Jackknife”, published in 1979 in the Annals of Statistics (Vol. 7, No. 1, pages 1-26). It is hard to overstate the impact of this paper. It allows researchers to construct confidence intervals in many settings, even when there is no closed-form expression for the standard error of an estimator.
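The basic recipe fits in a few lines: resample the data with replacement, recompute the statistic each time, and use the spread of the replicates as a standard error. The sketch below applies it to the median, for which no simple closed-form standard error exists, and to the mean, where the textbook formula is available as a sanity check. The function name `bootstrap_se`, the number of replicates, and the simulated data are assumptions for illustration.

```python
import random
import statistics

def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Bootstrap standard error of `stat` from resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)

rng = random.Random(1)
sample = [rng.gauss(10, 2) for _ in range(50)]    # hypothetical sample

se_median = bootstrap_se(sample, statistics.median)
se_mean = bootstrap_se(sample, statistics.mean)
formula_se_mean = statistics.stdev(sample) / len(sample) ** 0.5
print(round(se_median, 3), round(se_mean, 3), round(formula_se_mean, 3))
```

The same function works unchanged for any statistic that can be computed from a sample, which is a large part of the method's appeal.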
A glance at Efron’s Google Scholar page shows that this work has already been cited tens of thousands of times. In his column in the IMS Bulletin from November 17, 2016, Professor Xiao-Li Meng included this paper among his five selected “Nobel-Prize (NP) Worthy i.i.d Ideas in Statistics” and wrote: “[the bootstrap] certainly has made many researchers’ lives much easier”, and that it “has literally generated an industry of research on proving when it works, when it doesn’t, and how to make it work when its vanilla version fails.” For example, we may refer to the paper “Two Guidelines for Bootstrap Hypothesis Testing” by Peter Hall and Susan R. Wilson (Biometrics, Vol. 47, No. 2 (Jun., 1991), pp. 757-762).
Notes Preparer: Haim Bar