Paper of the Month

Please send comments for publication to

November 2018

Liang and Zeger (1986) proposed the generalized estimating equation (GEE), a multivariate extension of the generalized linear model to handle clustered data such as longitudinal data. The GEE method focuses on the regression parameters of the marginal means without specifying the multivariate dependence. A working correlation structure is used to improve efficiency. The resulting estimators of the mean parameters are consistent even if the working correlation is not correctly specified. The closer the working correlation is to the truth, the higher the efficiency. When the outcomes are multivariate normal, the GEEs reduces to the score equation if the working correlation is correctly specified. The estimators are asymptotically normal with a variance that can be estimated by a sandwich estimator. This paper has been cited widely (approximately 16,000 citations to date). The method has been a standard tool in applied statisticians’ toolbox and is widely used in many fields. It is implemented in standard software packages such as SAS and R (Our faculty Jun Yan developed the R package geepack during his graduate study).

Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Notes Preparer: Dipak Dey.

October 2018

The Expectation Maximization (EM) algorithm was introduced by Dempster, Laird, and Rubin in 19771. As the title of the paper suggests, the EM algorithm is a method to obtain maximum likelihood estimates in cases where the data are incomplete. It has been widely used in many applications, such as data imputation, fitting mixture models, and clustering. The EM algorithm consists of two steps: in the E-step, each missing value is replaced with its expected value, using the current estimates of the parameters in the model. In the M-step, using the available data and the imputed data for missing values, the likelihood function is maximized with respect to each of the model’s parameters in order to obtain new (better) estimates. The algorithm continues until a convergence criterion is met (for example, the improvement in the likelihood of the model is less than some user-defined threshold.) One of the challenges is to identify conditions that ensure that the algorithm will converge. Jeff Wu pointed out in his 1983 Annals of Statistics paper2 a flaw in the convergence proof in the Dempster, Laird, and Rubin paper, and offered not only a correct proof, but also extended the applicability of the algorithm beyond the exponential family models. For our monthly gathering to discuss the paper of the month you may start by reading Jeff Bilmes’ “gentle introduction” to EM algorithm, available here.

1 Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B. 39 (1): 1–38.
2 Wu, C. F. Jeff (Mar 1983). “On the Convergence Properties of the EM Algorithm”. Annals of Statistics. 11 (1): 95–103.

Notes Preparer: Haim Bar

September 2018

The Paper of the Month for September 2018 is selected to be “Generalized cross-validation as a method for choosing a good ridge parameter” by Gene Golub, Michael Heath, and Grace Wahba. The paper was published in Technometrics 21, no. 2 (1979): 215-223. Dr. Wahba will be the speaker of the 2018 Pfizer/ASA/UConn Distinguished Statistician Colloquium.

Cross-validation is widely used to assess the quality of model fitting in statistics and machine learning and is often adopted in choosing tuning parameters in regularization methods such as the widely used LASSO. However, the ordinary cross-validation may not perform well for certain scenarios, such as the case when the design matrix is close to being column-orthogonal. In the context of ridge regression, the authors proposed the method of generalized cross-validation (GCV) which is a rotation-invariant version of the ordinary cross-validation. The GCV improves the ordinary cross-validation; it does not require to estimate the model error variance; it is applicable when the number of predictors is larger than the sample size. It is worth mentioning that this paper was a joint work by researchers from computer science and statistics almost forty years ago. Modern applications of the technique of GCV are far beyond ridge regression, but the fundamental ideas are in the original paper.

Notes Preparer: HaiYing Wang

May 2018

The last paper of the month for the academic year appeared in the Journal of Irreproducible Results (JIR). The title of the paper by Bennett et al. is “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction”. The authors of the paper received the Ig Nobel prize in 2012, for neuroscience. According to the Ig Nobel website “The Ig Nobel Prizes honor achievements that first make people laugh, and then make them think.” Indeed, although the chosen paper of the month is light-hearted and has appeared in a science humor magazine, it makes a very good argument for the importance of accounting for multiple testing, especially in modern applications, where the number of simultaneous tests can be very large.

To read more about the Ig Nobel prize, click here. To read more about the journal, click here.

Notes Preparer: Haim Bar

March–April 2018

Even an undergraduate student taking a regression course knows how to do model selection or variable selection through model selection criteria such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). Both criteria penalize the measure of fit (log-likelihood) with a multiple of the number of free parameters to combat overfitting. BIC penalizes the number of parameters more heavily than AIC.

AIC is an estimate of the Kullback-Leibler divergence while BIC is an approximation of the posterior probability of a candidate model. AIC is best for prediction as it is asymptotically equivalent to leave-one-out validation. BIC is best for explanation as it allows consistent estimation of the underlying data generating process.

AIC and BIC should be learned together. It is tempting to wonder why the penalty scales on the number of parameters are what they are, which is usually not in textbooks. Both original papers are among the most cited statistical papers; as of now, Akaike (1974) has 39K and Schwarz (1978) has 34K Google citations. We present Schwarz (1978) simply because, despite its importance, it is short (3 pages with 4 references on the 4th page) and accessible. The justifications were made even easier to follow by Cavanugh (1997, Stat Prob Letters) for AIC and by Neath and Cavanaugh (2012, WIREs Comp Stat) for BIC. A StackExchange discussion is also a fun read.

Notes Preparer: Jun Yan

February 2018

The “Paper of the Month” for February 2018 is selected to be W. Hoeffding’s 1948 paper “A Class of Statistics with Asymptotically Normal Distribution” where he introduced the concept of U-statistics and built much of the heavy-duty machineries. It was published on The Annals of Mathematical Statistics Vol. 19, No. 3 (1948), pp. 293-325. It was included in the “Breakthroughs in Statistics Volume 1: Foundations and Basic Theory” which was edited by S. Kotz and N. L. Johnson (1992, Springer, New York) with a lengthy and immensely valuable introduction prepared by P. K. Sen.

I suggested W. Hoeffding’s 1948 paper from the Annals of Math Stat where he introduced the concept of U-statistics and built much of the heavy-duty machineries. It was included in the “Breakthroughs in Statistics Volume 1: Foundations and Basic Theory” which was edited by S. Kotz and N. L. Johnson (1992, Springer, New York) with a lengthy and immensely valuable introduction prepared by P. K. Sen.

This paper changed completely the face of nonparametric statistics – both theory and practice – by creating the fundamentals to take this field to maturity. The Hoeffding paper indeed impacted many areas of classical statistical inference by showing how one must handle the probability theory behind the CLT and SLLN for dependent sequences of random variables. It gave rise to new and challenging pathways to handle moderate and large-deviation theories in the case of dependent sequences of random variables.

I totally fell in love with Hoeffding’s 1948 paper in the early 70’s when I was a PhD student at the Indian Statistical Institute-Calcutta. I had great opportunities to use his fundamental projections, CLT, and SLLN with total practical relevance of martingales and reverse martingales in my 1975 thesis-work. Personally, I have gone back to Hoeffding’s paper several hundred times in my life for its fundamentally breakthrough and truly fresh ideas. Beauty in this paper remains astonishing and unmatched given its path-breaking influences on numerous big sub-fields within statistics and probability theory including nonparametrics, large-sample theory, CLT’s for dependent processes and invariance principles, Berry-Esseen rates, theory of martingales and reversemartingales, tightness, geometric probability, and CLT’s for convex bodies.

I feel lucky that I could learn to appreciate the marvel of U-statistics theory early on (beginning 1970) in my career from P. K. Sen, Malay Ghosh, and Bob Serfling (all had close ties with Hoeffding) which helped me to build a large volume of my own significant publications to show for it. I also had the great pleasure of advising a number of PhD students (including Gaute Vik, John Judge, Mabel Moreno, and Bhargab Chattopadhyay) whose research have been significantly touched and influenced by Hoeffding’s paper. Two present students’ (Jun Hu and Chen Zhang) research overwhelmingly exploit Hoeffding’s theory of U-statistics. Our own Rick Vitale (with Herman Rubin) gave a complete asymptotic theory for square-integrable U-statistics in their 1980 Annals of Stat. paper.

After nearly 70 years of publication, W. Hoeffding’s original 1948 paper still inspires me and energizes me big time. Upon rereading the original paper the n-th time, I continue to surprise myself by discovering something interesting, deep and new that remained hidden from me during my previous passes.

Happy reading of Hoeffding (1948) paper. Just live it with TLC!

Notes Preparer: Nitis Mukhopadhyay.

December 2017–January 2018

Controlling the probability of falsely rejecting the null hypothesis is critical for valid and meaningful statistical analysis. But how should this probability be defined and calculated when there are multiple, simultaneous hypotheses? For many years, this question was mostly investigated in the analysis of variance (ANOVA) setting, in which the number of comparisons is typically small or moderate. Until 1995, the common approach to this problem was to control the family-wise error rate (FWER) which ensures that the probability of falsely rejecting at least one of the hypotheses is smaller than a user-specified bound. However, the advent of high-throughput methods, such as in genetics, resulted in a much larger number of simultaneous hypotheses and rendered the FWER approach impractical, in the sense that it was too stringent and lacked power to reject any hypothesis.

In 1995, Benjamini and Hochberg published their breakthrough paper “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing” in which they defined the False Discovery Rate (FDR) as the expected proportion of falsely rejected hypotheses, out of the total number of rejected hypotheses. The paper proposed a very simple procedure to estimate the FDR, and showed that indeed, the procedure controls the FDR at the desired level. This approach enabled significant advances in many areas, and is particularly useful in this age of “Big Data”.

The paper appeared in the Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1. (1995), pp. 289-300. For additional reading, we also suggest reading about the q-value in the paper “A direct approach to false discovery rates“, by John Storey, which appeared in 2002, in the Journal of the Royal Statistical Society, Series B, 64: 479-498.

Notes Preparer: Haim Bar

November 2017

Once a month during the academic year our faculty will select a paper which we encourage our students to read and discuss. Papers featured in this section should be generally understood by graduate students, and will be selected either because of their impact, or historical value, or because they contain a useful (perhaps overlooked) techniques or results.

The inaugural “Paper of the Month” (November 2017) was selected to be Brad Efron’s paper titled “Bootstrap Methods: Another Look at the Jackknife”, published in 1979 in the Annals of Statistics (Vol. 7, No. 1, pages 1-26). It is hard to overstate the impact of this paper. It allows researchers to construct confidence intervals in many settings, even when there is no closed-form derivation of the standard deviation.

A glance at Efron’s Google scholar page shows that this work was already cited tens of thousands of times. In his column in the IMS Bulletin from November 17, 2016, Professor Xiao-Li Meng included this paper among his five selected “Nobel-Prize (NP) Worthy i.i.d Ideas in Statistics” and wrote: “[the bootstrap] certainly has made many researchers’ lives much easier”, and that it “has literally generated an industry of research on proving when it works, when it doesn’t, and how to make it work when its vanilla version fails.” For example, we may refer to the paper “Two Guidelines for Bootstrap Hypothesis Testing” by Peter Hall and Susan R. Wilson (Biometrics, Vol. 47, No. 2 (Jun., 1991), pp. 757-762)

Notes Preparer: Haim Bar