Paper of the Month

Please send comments for publication to

September 30, 2022, 12:00 pm, AUST 326

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.

Conventional wisdom dictates that statistical models should be flexible enough to capture the data’s underlying structure, but not so flexible that they overfit to irrelevant noise. Balancing these two demands amounts to the bias-variance tradeoff, a central theme of traditional statistical modeling.

Interestingly, the recent success of deep learning has called some of this wisdom into question. Deep learning models are routinely over-parameterized, involving more model parameters than there are data points. Standard practice is to interpolate (perfectly fit) the sample data, a clear recipe for overfitting. Nevertheless, deep learning can perform remarkably well on held-out data, seemingly breaking the rules of the bias-variance tradeoff. What is going on?

This paper addresses this question by describing the “double descent” phenomenon, which I view as one of the most intriguing statistical findings of recent years. Empirically, once a model is over-parameterized to the point of interpolation, adding even more parameters can cause out-of-sample performance to begin improving again, sometimes past the point of the best low-dimensional model. This explains why over-parameterized deep learning models have the potential to perform so well.

My plan for the discussion is to focus on the general principles and statistical implications of the “double descent” phenomenon. What does this discovery tell us about traditional statistics? Why did it go unnoticed for so long? How could one distinguish a “good” interpolating model from the sea of bad ones? Could more traditional statistical ideas like regularization, Bayesian inference, or nonparametric modeling be helpful?

Other relevant papers that cover various aspects of this problem in greater detail are:

  • Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48), 30063-30070.
  • Belkin, M. (2021). Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30, 203-248.
  • Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2), 949-986.
  • Bartlett, P. L., Montanari, A., & Rakhlin, A. (2021). Deep learning: a statistical viewpoint. Acta Numerica, 30, 87-201.

Notes Preparer: Neil Spencer

April 22, 2022, 12:00 pm

Liu, Y., & Xie, J. (2020). Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529), 393-402.

The Cauchy distribution has well-known, unique features. For example, it has no finite moments, and its sample mean has the same Cauchy distribution. Pillai and Meng (2016) give another surprising result: for two independent and identically distributed multivariate normal random vectors with mean zero and any covariance matrix, the (nonnegatively) weighted average of their component-wise ratios follows a standard Cauchy distribution. This suggests that the Cauchy distribution is highly tolerant of dependence structures. Liu and Xie (2020) show that, for a standard multivariate Cauchy distribution with any dependence structure (copula), the nonnegatively weighted average of the marginal Cauchy variables has the same tail distribution as a standard Cauchy distribution. That is, a linear combination of correlated Cauchy variables with arbitrary dependence structures has the same tail as a Cauchy variable. This property makes the Cauchy distribution well suited to combining p-values in multiple testing.

Consider a global testing problem where we have a collection of null hypotheses. Each individual hypothesis test has its own p-value. Under the global null hypothesis, these p-values are uniformly distributed, but could be dependent. Liu and Xie (2020) proposed a Cauchy combination test, where the test statistic is a linear combination of Cauchy variables obtained by transforming the p-values with the quantile function of the standard Cauchy distribution. The global null hypothesis is rejected when the statistic is overly large relative to a standard Cauchy tail. Unlike Fisher’s combination of p-values, this test does not require the p-values to be independent, can handle a large number of hypotheses with sparse signals, and allows incorporation of prior information on weighting the individual tests. The test has been picked up quickly in practice, for example in genetic studies. It has also been adapted to address some of its limitations (Chen, 2022).
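The construction just described can be sketched in a few lines. This is only an illustrative sketch: the function name `cauchy_combination` and the default of equal weights are assumptions made here, and the p-value uses the standard Cauchy tail as the approximation described above.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values via a weighted sum of Cauchy-transformed values.

    Returns (statistic, combined p-value). Weights must be nonnegative and
    sum to one; equal weights are an illustrative default, not a recommendation.
    """
    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to 1")
    # tan((0.5 - p) * pi) is the standard Cauchy quantile evaluated at 1 - p,
    # so small p-values map to large positive values.
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy at the observed statistic.
    p_combined = 0.5 - math.atan(stat) / math.pi
    return stat, p_combined


if __name__ == "__main__":
    # Hypothetical p-values: mostly null, with one strong signal.
    stat, p = cauchy_combination([0.6, 0.4, 0.8, 1e-6, 0.3])
    print(stat, p)
```

Because the combined p-value is an explicit function of the statistic, no resampling is needed, which is what makes the test attractive when the number of hypotheses is large.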


  • Chen, Z. (2022). Robust tests for combining p-values under arbitrary dependency structures. Scientific Reports, 12(1), 1-8.
  • Pillai, N. S., & Meng, X. L. (2016). An unexpected encounter with Cauchy and Lévy. The Annals of Statistics, 44(5), 2089-2097.

Notes Preparer: Jun Yan

The discussion will be held in-person in the department lounge, but can also be attended virtually for people who would prefer to attend from their offices or home (

March 2022

Paper of the Month, March 25, 2022: Chib, S. (1995). Marginal Likelihood From the Gibbs Output. Journal of the American Statistical Association, 90, 1313–1321.

The marginal likelihood is a useful Bayesian model comparison criterion. For a complex model, the marginal likelihood is often analytically intractable. Chib (1995) exploits the fact that the marginal density can be expressed as the prior times the likelihood function divided by the posterior density. This simple identity holds for any parameter value. An estimate of the posterior density is shown to be available if all full conditional densities used in the Gibbs sampler have closed-form expressions. Chib and Jeliazkov (2001) extend this method to overcome the problems associated with the presence of intractable full conditional densities. This extended method is developed in the context of MCMC chains produced by the Metropolis-Hastings algorithm, whose building blocks are used both for sampling and marginal likelihood estimation.
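To make the identity concrete, here is a minimal check in a conjugate Beta-binomial model, chosen here only because the posterior density is available in closed form. In Chib's method the posterior ordinate would instead be estimated from the Gibbs output. All function names are hypothetical.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at theta."""
    return ((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
            - log_beta_fn(a, b))


def log_choose(n, y):
    return math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)


def log_binomial_likelihood(theta, y, n):
    return log_choose(n, y) + y * math.log(theta) + (n - y) * math.log(1 - theta)


def log_marginal_via_identity(theta, y, n, a, b):
    """log m(y) = log f(y | theta) + log pi(theta) - log pi(theta | y)."""
    log_posterior = log_beta_pdf(theta, a + y, b + n - y)  # conjugate posterior
    return (log_binomial_likelihood(theta, y, n)
            + log_beta_pdf(theta, a, b)
            - log_posterior)


def log_marginal_exact(y, n, a, b):
    """Closed-form Beta-binomial marginal likelihood, for comparison."""
    return log_choose(n, y) + log_beta_fn(a + y, b + n - y) - log_beta_fn(a, b)


if __name__ == "__main__":
    # Hypothetical data: 7 successes in 20 trials with a Beta(2, 3) prior.
    for theta in (0.2, 0.5, 0.9):
        print(theta, log_marginal_via_identity(theta, 7, 20, 2.0, 3.0))
    print("exact", log_marginal_exact(7, 20, 2.0, 3.0))
```

The identity returns the same value at every parameter value. In practice a high-density point such as the posterior mean or mode is a natural choice, since the posterior ordinate is estimated most accurately there.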

Liu et al. (2019) review Monte Carlo methods for computing marginal likelihoods developed in the literature in recent years and provide a detailed account of how these methods are applied to item response theory models. As shown empirically in Liu et al. (2019), the method of Chib (1995) is efficient but requires a longer computing time. Inspired by the identity of Chib (1995), Zhang et al. (2017) develop a useful Conditional Predictive Ordinate (CPO) identity, which leads to the CPO decomposition to quantify the association between the longitudinal data and the survival data within the joint modeling framework.

Other References

  • 1. Chib, S. and Jeliazkov, I. (2001). Marginal Likelihood From the Metropolis-Hastings Output. Journal of the American Statistical Association, 96, 270–281.
  • 2. Liu, Y., Hu, G., Cao, L., Wang, X., and Chen, M.-H. (2019). A Comparison of Monte Carlo Methods for Computing Marginal Likelihoods of Item Response Theory Models (with Discussion). Journal of the Korean Statistical Society, 48, 503-512 for the main paper and 522-523 for the rejoinder.
  • 3. Zhang, D., Chen, M.-H., Ibrahim, J.G., Boye, M.E., and Wei, S. (2017). Bayesian Model Assessment in Joint Modeling of Longitudinal and Survival Data With Applications to Cancer Clinical Trials. Journal of Computational and Graphical Statistics, 26(1), 121-133.

Notes Preparer: Ming-hui Chen

Webex Link:

February 2022

Paper of the Month, February 25, 2022: Nozer D. Singpurwalla, Nicholas G. Polson & Refik Soyer (2018), From Least Squares to Signal Processing and Particle Filtering

“Signal processing is the interpolation and extrapolation of a sequence of observations viewed as a realization of a stochastic process. Its role in applied statistics ranges from scenarios in forecasting and time series analysis to image reconstruction, machine learning, and the degradation modeling for reliability assessment.” The Kalman filter algorithm is useful for Gaussian state space models (dynamic linear models), while particle filtering is useful for “big and high velocity non-Gaussian data”. The principle of conditionalization underlies filtering and prediction based on Bayesian methods.

We will follow the path laid out in this paper and discuss ideas for analyzing time course data.


  • 1. Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T. (2002), “A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking,” IEEE Transactions on Signal Processing, 50, 174–188.
  • 2. Liu, J. S., and Chen, R. (1998), “Sequential Monte Carlo Methods for Dynamic Systems,” Journal of the American Statistical Association, 93, 1032–1044.
  • 3. Meinhold, R. J., and Singpurwalla, N. D. (1983), “Understanding the Kalman Filter,” The American Statistician, 37, 123–127.

Notes Preparer: Nalini Ravishanker

November 2021

Paper of the Month, November 19, 2021 12:00-1:00 pm ET: Bradley Efron (2020) Prediction, Estimation, and Attribution, Journal of the American Statistical Association, 115:530, 636-655, DOI: 10.1080/01621459.2020.1762613

While views on what makes a science may be diverse, John W. Tukey identified three constituents as essential: “1) intellectual content, 2) organization into an understandable form, 3) reliance upon the test of experience as the ultimate standard of validity”. In the modern era of data science, much research effort and time is spent on prediction algorithms. Examples include random forests, gradient boosting, support vector machines, and deep learning. What is the relationship between prediction, estimation and attribution? How do they contribute to the three constituents of science? We will discuss a timely and thought-provoking paper by Bradley Efron on this matter.


  • 1. Efron, B., (2020) Prediction, Estimation, and Attribution, Journal of the American Statistical Association, 115:530, 636-655, DOI: 10.1080/01621459.2020.1762613
  • 2. Tukey, J.W., 1962. The Future of Data Analysis. The Annals of Mathematical Statistics, 33(1), pp.1-67.

Notes Preparer: Yuping Zhang

Webex Link:

October 2021

Paper of the Month, October 22, 2021: Bottou (1998), Online Learning and Stochastic Approximations.

Stochastic approximations as a mathematical discipline started in the 1950s with origins in computer science and engineering, partly due to the need to overcome the shortage of computing power and data storage. In the age of big data, stochastic approximations have become a dominant approach to parameter optimization for large-scale learning systems. The paper by Bottou (1998) helped popularize the approach in the modern machine learning community. However, its fundamental idea of stochastic gradient descent dates back to Robbins and Monro (1951), and its use of martingale convergence dates back to Gladyshev (1965).

Basic ideas of stochastic approximations in the context of large-scale learning will be discussed. Some attempt will be made to explain why martingale convergence is such a useful tool.
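As a toy illustration of the basic idea, the sketch below applies a Robbins-Monro style iteration to the simplest possible problem, estimating a mean by minimizing an expected squared error one observation at a time. The function name and the step-size choice 1/n are assumptions made for this example. With that choice the iterate coincides with the running sample mean.

```python
import random


def robbins_monro_mean(samples, theta0=0.0):
    """Stochastic approximation for the minimizer of E[0.5 * (theta - X)^2].

    Each observation is used once and then discarded, so memory use is
    constant no matter how long the data stream is.
    """
    theta = theta0
    for n, x in enumerate(samples, start=1):
        step = 1.0 / n                 # step sizes: sum diverges, sum of squares converges
        noisy_gradient = theta - x     # unbiased estimate of the true gradient
        theta -= step * noisy_gradient
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(3.0, 2.0) for _ in range(10_000)]
    print(robbins_monro_mean(data), sum(data) / len(data))
```

The error of each noisy gradient around the true gradient has conditional mean zero given the past, which is the structure that martingale convergence arguments exploit.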


  • 1. Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Control, 58(9):2217–2229, 2013.
  • 2. Leon Bottou. Online learning and stochastic approximations. In D Saad, editor, Online Algorithms and Stochastic Approximations. Cambridge University Press, Cambridge, U.K., 1998.
  • 3. E. G. Gladyshev. On stochastic approximation. Theory Probab. & Appl., 10:297–300, 1965.
  • 4. Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951.

Notes Preparer: Zhiyi Chi

September 2021

Paper of the Month, September 2021: Rubin, D. (1976). Inference and missing data. Biometrika, 63(3), 581-592.

Incomplete data, also referred to as missing data, are a common complication in research, and their impact is detrimental across most fields. A simple Google Scholar search for missing data or incomplete data returns more than 5,000,000 hits. The basic theoretical structure for the analysis of incomplete data was envisioned and developed by Don Rubin in the early to mid-1970s. Together with the EM algorithm (Dempster, Laird & Rubin 1977) and multiple imputation (Rubin 1977, Rubin 2004), the paper Inference and missing data has had a long-lasting impact on research to this day.

We will discuss the basic ideas and implications of incomplete data together with concepts such as Missing at Random, Missing not at Random and ignorability which were coined by Rubin in this paper.


  • 1. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581-592.
  • 2. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22.
  • 3. Rubin, D. B. (1977). The design of a general and flexible system for handling non-response in sample surveys. Consultant Report Submitted to the Social Security Administration, done as part of the 1973 CPS-IRS-SSA Exact Match Project. Also in The American Statistician, 58, 298–302.
  • 4. Rubin, D. B. (2004). The design of a general and flexible system for handling nonresponse in sample surveys. The American Statistician, 58(4), 298-302.

Notes Preparer: Ofer Harel

April 2021

Paper of the Month, April 2021: Benjamini, Y. (2020). Selective Inference: The Silent Killer of Replicability. Harvard Data Science Review, 2(4).

Following the recent attack on statistical testing and p-values in the quest for replicability, Yoav Benjamini was one of several statisticians to come forward in defense of the p-value. He argues the problem of replicability is instead likely rooted in selective inference: “Selective inference is focusing statistical inference on some findings that turned out to be of interest only after viewing the data. Without taking into consideration how selection affects the inference, the usual statistical guarantees offered by all statistical methods deteriorate.”

We will discuss the crisis of replicability, and how p-values and selective inference are involved in this crisis.

Notes Preparer: Elizabeth Schifano

February 2021

Paper of the Month, February 2021: Bickel, P.J. and Li, B. (2006) Regularization in statistics. Test 15, 271–344.

Regularization is frequently used in mathematics, statistics, computer science, and finance to solve ill-posed problems that could not otherwise be solved. In statistics alone, various regularization techniques have been proposed in different research directions, such as nonparametric statistics, high-dimensional statistics, and Bayesian statistics. Before this discussion paper by Bickel and Li (2006), these were scattered results without a unifying organization. Bickel and Li were among the first to propose a conceptual framework that brings most statistical regularizers under the same umbrella. We will go over the regularization methods mentioned in this paper and discuss the many new developments since then.

Notes Preparer: Kun Chen and Yuwen Gu

December 2020

Paper of the Month, December 2020: Dirk Eddelbuettel and Conrad Sanderson (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics and Data Analysis, 71, 1054-1063.

In statistical computing with R, the Rcpp package is a breakthrough in that it greatly simplified interfacing R with C++ (Eddelbuettel and Francois, 2011). Thousands of R packages depend on, import from, or link to the Rcpp package. Such an interface often leads to drastic improvements in efficiency by writing key R functions in C++, bringing the speed of compiled languages like C++ to the interpreted language R. The RcppArmadillo package further extends Rcpp to provide easy access to the C++ matrix library Armadillo. This is important for many statistical applications such as Markov chain Monte Carlo, vector autoregressive models, and so on. It is an important tool for students in statistics today, especially for those who need to deal with computing-intensive tasks or who want to make their methods available for others to use through R packages. An example is the splines2 package developed by my former Ph.D. student Wenjie Wang, which offers an API for C++ implementations of shape-restricted spline bases.

Notes Preparer: Jun Yan

October 2020

Paper of the Month, October 2020: Bollerslev, T. (1986) Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–327.

When Engle introduced autoregressive conditionally heteroscedastic (ARCH) models in an article published in Econometrica (Engle, 1982), the focus of the article was on macroeconomic data. At that time, one could hardly have imagined that the main field of application for these models would be finance. Since the introduction of generalized ARCH (or GARCH) models (Bollerslev, 1986), these models have become extremely popular among both academics and practitioners. GARCH models led to a fundamental change in the approaches used in finance, through an efficient modeling of the volatility (or variability) of the prices of financial assets. In 2003, Robert F. Engle shared the Nobel Prize in Economics with Clive W.J. Granger; Engle was cited ‘for methods of analyzing economic time series with time-varying volatility (ARCH)’. Since the late 1980s, numerous extensions of the initial ARCH models have been published; see, e.g., Bollerslev (2008) and Francq & Zakoïan (2010). Along with their development in econometrics and finance, GARCH models and their extensions have given rise to new directions for research in probability and statistics.

We will discuss the main idea of the (G)ARCH model, its extensions and related models, and some recent advances.


  • 1. Engle, R.F. (1982) Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation. Econometrica 50, 987–1008.
  • 2. Bollerslev, T. (1986) Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–327.
  • 3. Bollerslev, T. (2008) Glossary to ARCH (GARCH). In T. Bollerslev, J.R. Russell and M. Watson (eds), Volatility and Time Series Econometrics: Essays in Honor of Robert F. Engle. Oxford: Oxford University Press.
  • 4. Francq, C. & Zakoïan, J.-M. (2010). GARCH Models: Structure, Statistical Inference and Financial Applications. Chichester: John Wiley & Sons.

Notes Preparer: Yao Zheng

May 2020

Paper of the Month, May 2020: Prentice, R. L. (1986) A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73(1), 1-11.

Epidemiological cohort studies are typically implemented in investigating associations between disease outcomes and certain risk factors. The major effort and cost of conducting a cohort study mostly arise from the assembling of covariate measurements. The information on assessing the association mostly comes from cases, subjects experiencing the disease. But when the disease of interest is rare, most subjects in the study cohort do not experience the disease event by the end of the study period. In this case, conducting a full-scale cohort study might be too costly and might not be feasible for this purpose. To reduce the cost in such studies and achieve the same goals as a cohort study, several cohort-sampling designs have been proposed.

The case-cohort study design proposed by Prentice (1986) is the most widely used one, and is especially useful when the disease rate is low. The main idea is to sample a subset disproportionately within the study cohort, focusing on cases: a random subset of the cohort (the subcohort) and the remaining cases in the cohort. Note that this case-cohort sample is not a random sample, and a valid estimation procedure needs to take this into account. We will begin with this design feature, and discuss its variations and estimation procedures.


  • 1. For theoretical justifications of (modified) Prentice (1986)’s estimator - Self, S. G., & Prentice, R. L. (1988). Asymptotic distribution theory and efficiency results for case-cohort studies. The Annals of Statistics, 64-81.
  • 2. Variations of the design - Kulich, M., & Lin, D. Y. (2004). Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association, 99(467), 832-844.
    – Cai, J., & Zeng, D. (2007). Power calculation for case-cohort studies with nonrare events. Biometrics, 63(4), 1288-1295.
  • 3. Extension to multivariate failure time data - Kim, S., Cai, J., & Lu, W. (2013). More efficient estimators for case-cohort studies. Biometrika, 100(3), 695-708.
  • 4. More efficient estimators - Barlow, W. E., Ichikawa, L., Rosner, D., & Izumi, S. (1999). Analysis of case-cohort designs. Journal of Clinical Epidemiology, 52(12), 1165-1172.
    – Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E., & Kulich, M. (2009). Improved Horvitz–Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences, 1(1), 32-49.

Notes Preparer: Sangwook Kang

February 2020

Paper of the Month, February 2020: Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32, 407–499.

In many problems, we have available a large collection of predictors from which we hope to select a parsimonious set for accurate prediction of a response variable. Classical model selection algorithms include forward selection, backward elimination, stepwise selection, and their stagewise versions, while in the era of big data, methods under the learning scheme of “regularization + optimization”, as exemplified by Lasso, have undergone exciting development. Given that these seemingly very different approaches often produce models with striking similarity, a natural question arises: is there any intrinsic connection between them?

Efron et al. (2004) proposed Least Angle Regression (LARS), a novel model selection algorithm, to bridge Lasso and forward stagewise regression (FSR). They showed that the solution paths of either Lasso or FSR can be efficiently produced via the LARS algorithm with some simple modification. Since this work, there has been a revival of interest in so-called stagewise learning. Generally speaking, a stagewise algorithm builds a model from scratch and gradually increases the model complexity in a sequence of simple learning steps. As such, it was realized that LARS and stagewise learning also connect to various optimization and machine learning approaches such as steepest descent, boosting, and path-following algorithms. As noted in one of the paper’s discussion pieces, “the LARS–Lasso–boosting relationship opens the door for new insights on existing methods’ underlying statistical mechanisms and for the development of new and promising methodology.”
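The stagewise idea can be illustrated with incremental forward stagewise regression, where each step nudges the coefficient of the predictor most correlated with the current residual. This is a sketch of that simple procedure, not of the LARS algorithm itself. The function name, step size, and toy data are assumptions made for the example.

```python
def forward_stagewise(columns, y, step=0.01, n_steps=2000):
    """Incremental forward stagewise regression.

    `columns` is a list of predictor columns (each a list of floats),
    assumed to be on comparable scales. Starting from the empty model,
    every step changes a single coefficient by +/- `step`.
    """
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_steps):
        # Inner product of each predictor with the current residual.
        corr = [sum(xj * r for xj, r in zip(col, residual)) for col in columns]
        j = max(range(len(columns)), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-12:
            break  # residual is orthogonal to every predictor
        delta = step if corr[j] > 0 else -step
        beta[j] += delta
        residual = [r - delta * xj for r, xj in zip(residual, columns[j])]
    return beta


if __name__ == "__main__":
    # Toy orthogonal design, so the least squares fit is simply (2, -1).
    columns = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    y = [2.0, -1.0, 0.3, 0.0]
    print(forward_stagewise(columns, y))
```

Stopping the loop early yields a sparser, more shrunken model, which is where the resemblance to the Lasso path comes from.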

We will discuss the main idea of LARS, the intriguing connections between regularized estimation and stagewise learning, and some recent advances.

Notes Preparer: Kun Chen

November 2019

Paper of the Month, November 2019: Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.

Albert and Chib (1993) introduced a data-augmentation idea for Bayesian inference in regression models for binary and polychotomous response data. This paper is a seminal work on introduction of latent data into Bayesian analysis of categorical response data.

By using a continuous latent response to connect the binary response in a probit model with the normal linear model, this approach has several advantages. First, it allows us to perform exact inference for binary regression models, which is likely preferable to maximum likelihood methods for small sample sizes. Second, sampling from the joint posterior of the unknowns in the model then only requires Gibbs sampling, which allows us to draw directly from standard distributions such as the multivariate normal and thus is easy to implement. To sum up, this data-augmentation scheme leads to simple, effective methods for Bayesian posterior inference, which circumvent the need for analytic approximations, numerical integration or Metropolis-Hastings in probit models.
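A stripped-down sketch of the data-augmentation Gibbs sampler is given below for a probit model with a single coefficient, no intercept, and a flat prior. These simplifications, the function names, and the simulated data are all assumptions made to keep the example self-contained. The paper treats general regression with a multivariate normal update.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_normal(mean, positive, rng):
    """Draw from N(mean, 1) restricted to (0, inf) if positive, else (-inf, 0].

    Uses the inverse-CDF method; the clamp is a crude numerical guard that is
    adequate only when |mean| is moderate, as in the example below.
    """
    cut = STD_NORMAL.cdf(-mean)  # P(z <= 0) for z ~ N(mean, 1)
    lo, hi = (cut, 1.0) if positive else (0.0, cut)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(u)


def gibbs_probit(x, y, n_iter=800, burn_in=200, seed=0):
    """Gibbs sampler for y_i = 1{z_i > 0}, z_i ~ N(x_i * beta, 1), flat prior."""
    rng = random.Random(seed)
    sxx = sum(xi * xi for xi in x)
    beta, draws = 0.0, []
    for it in range(n_iter):
        # Step 1: latent responses given beta (truncated normals).
        z = [draw_truncated_normal(xi * beta, yi == 1, rng)
             for xi, yi in zip(x, y)]
        # Step 2: beta given latent responses (ordinary normal linear model).
        post_mean = sum(xi * zi for xi, zi in zip(x, z)) / sxx
        beta = rng.gauss(post_mean, sxx ** -0.5)
        if it >= burn_in:
            draws.append(beta)
    return draws


def simulate_probit_data(n, beta_true, seed=1):
    """Generate hypothetical data from the same latent-variable model."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1 if xi * beta_true + rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]
    return x, y


if __name__ == "__main__":
    x, y = simulate_probit_data(200, beta_true=1.0)
    draws = gibbs_probit(x, y)
    print("posterior mean of beta:", sum(draws) / len(draws))
```

Both conditional draws come from standard distributions, so no tuning or accept/reject step is needed.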

Moreover, we can easily extend this approach to model the latent data beyond the probit link. For example, our faculty members Dr. Ming-Hui Chen and Dr. Dipak K. Dey used this kind of data-augmentation idea in their paper (2008) titled “Flexible Generalized T-Link Models for Binary Response Data”. I also employed this idea in my paper (2013) titled “Bayesian Analysis of Dynamic Item Response Models in Educational Testing” for logistic regression.

However, in comparison to the probit model, Bayesian inference for the logistic regression model has long been recognized as a hard problem because of the inconvenient analytic form of the model's likelihood function. The paper “Bayesian inference for logistic models using Polya-Gamma latent variables” by Polson, Scott and Windle (2013) introduced a Pólya-Gamma latent variable as a new data-augmentation scheme for binomial likelihoods. Their approach avoids the Metropolis-Hastings steps used for logistic regression in my paper (2013), which makes their method very useful for logistic regression, negative binomial regression, nonlinear mixed-effect models, and spatial models of count data.

The article of Albert and Chib (1993) is one of the papers I always encourage my graduate students to read when they begin to study Bayesian methods. It reflects a very important idea for designing an effective Markov chain Monte Carlo algorithm in the Bayesian analysis.


  • 1. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.
  • 2. Kim, S., Chen, M. H., & Dey, D. K. (2008). Flexible generalized t-link models for binary response data. Biometrika, 95(1), 93-106.
  • 3. Wang, X., Berger, J. O., & Burdick, D. S. (2013). Bayesian analysis of dynamic item response models in educational testing. The Annals of Applied Statistics, 7(1), 126-153.
  • 4. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339-1349.

Notes Preparer: Wang Xiaojing

October 2019

Paper of the Month, October 2019: A. Azzalini and A. Dalla-Valle (1996). The multivariate skew normal distribution. Biometrika, Volume 83, Number 4, 715-726.

In this paper Azzalini and Dalla Valle introduced the multivariate skew normal (MSN) distribution, which extends the class of multivariate normal distributions by the addition of a shape parameter to regulate skewness. Several properties are discussed with emphasis on the bivariate case. In a subsequent paper, published in the JRSS B [Azzalini and Capitanio (1999). Statistical applications of the multivariate skew normal distribution. Volume 61, Number 3, 579-602], further probabilistic properties of the distribution are examined, with special emphasis on aspects of statistical relevance.

Since the introduction of the MSN, several multivariate distributions have been proposed, such as the multivariate skew-t distribution and the class of skew-elliptical distributions. Our faculty member Dipak Dey published an interesting paper in JMVA [Branco and Dey (2001). A general class of multivariate skew-elliptical distributions. Volume 79, Number 1, 99-113]. Most importantly, distributions generated from the MSN distribution have been widely used in many applications, such as mixed-effects models, finite mixture models, spatial models, and measurement error models, to mention just a few. Personally, I wrote my PhD thesis using this distribution in 2004, and recently I have written a book entitled “Finite Mixtures of Skewed Distributions” published by Springer, which is based on the multivariate skew-elliptical distributions and the R package mixsmsn.

Notes Preparer: Victor Hugo Lachos

September 2019

Paper of the Month, September 2019: Thomas S. Ferguson (1973). A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, Volume 1, Number 2 (1973), 209-230.

Bayesian statisticians were traditionally silent on nonparametric problems until this pioneering work by Tom Ferguson, published in 1973. Basically, we need to construct a workable prior on a space of distribution functions. As pointed out by the author: “There are two desirable properties of a prior distribution for nonparametric problems. (I) The support of the prior distribution should be large–with respect to some suitable topology on the space of probability distributions on the sample space. (II) Posterior distributions given a sample of observations from the true probability distribution should be manageable analytically. … This paper presents a class of prior distributions, called Dirichlet process (DP) priors, broad in the sense of (I), for which (II) is realized, and for which treatment of many nonparametric statistical problems may be carried out, yielding results that are comparable to the classical theory.” In application, the author illustrates the DP on nonparametric estimation of a distribution function, a mean, a quantile, a variance, or a covariance. He further applies the DP prior to a two-sample problem, quantal bio-assay, empirical Bayes and discrimination problems.

This article has stimulated a lot of interest, not only in statistics but also in the machine learning community. It is generally considered difficult for graduate students to read on their own for the first time. I will present the basic ideas of the article in a simple way, so that it will be easier for you to read. Looking back, I still consider the paper to be one of the best papers I have read. I am glad I read it while I was in graduate school.

Notes Preparer: Lynn Kuo

April 2019

Paper of the Month, April 2019: Shuo-Yen Robert Li (1980). A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments. Annals of Probability, 8(6), 1171-1176.

If a monkey types only capital letters, and is on every occasion equally likely to type any of the 26, how long on average will it take the monkey to produce the sequence ‘ABRACADABRA’? This seemingly childish question has many connections to important real-world problems. For instance, the distribution of scan statistics can be related to this type of waiting time. The study of the occurrence of patterns is a classical problem in probability theory. In the first volume of his famous book, Feller has several sections devoted to this topic. The distribution of the waiting time can be derived in many different ways. But my personal favorite is the martingale approach developed in Li (1980). Li’s key observation is that information on the occurrence times of patterns can be obtained from the values assumed by a specially constructed auxiliary martingale at a certain well-chosen time. Li’s method is a textbook example of a clever application of Doob’s fundamental Optional Stopping Theorem. This elegant mathematical trick did not change the world, but I like it a lot. And by the way, the expected time is 26^11 + 26^4 + 26.
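The closing formula follows a simple rule that is easy to compute: each length k at which the pattern's first k letters equal its last k letters contributes a term 26^k. The sketch below implements that rule, assuming independent and equally likely letters. The function name is hypothetical.

```python
def expected_waiting_time(pattern, alphabet_size):
    """Expected number of keystrokes until `pattern` first appears.

    Assumes keystrokes are independent and uniform over `alphabet_size`
    symbols. Sums alphabet_size**k over every k for which the length-k
    prefix of the pattern equals its length-k suffix.
    """
    n = len(pattern)
    return sum(alphabet_size ** k
               for k in range(1, n + 1)
               if pattern[:k] == pattern[n - k:])


if __name__ == "__main__":
    # Overlaps of ABRACADABRA with itself occur at lengths 1 ("A"),
    # 4 ("ABRA"), and 11 (the whole word).
    print(expected_waiting_time("ABRACADABRA", 26))
    # Fair coin: HH takes longer on average than HT.
    print(expected_waiting_time("HH", 2), expected_waiting_time("HT", 2))
```

The coin example shows why self-overlap matters: two patterns of the same length and the same probability can have different expected waiting times.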

Notes Preparer: Vladimir Pozdnyakov

March 2019

Paper of the Month, March 2019: Woodroofe, M. (1977). Second order approximations for sequential point and interval estimation. Ann. Statist., 5, 984-995.

Many crucial properties associated with first boundary crossing depend intricately on the distribution of the corresponding stopping time, the magnitude of the overshoot, and their complex relationship. Under a very broad structure and a minimal set of assumptions, Woodroofe (1977) developed nonlinear renewal theory to tackle these issues more generally than one would normally expect. In the same paper, Michael Woodroofe used his tools to precisely formulate second-order asymptotic approximations for the risk functions associated with a number of classical sequential point and interval estimation problems. The impact has been far-reaching, helping researchers to investigate second-order approximations for risk functions in a very wide range of problems. Woodroofe’s results are truly pretty, involve deep probability theory, and are difficult to prove. On the other hand, their applications are numerous. Many have provided surprising and practically useful conclusions in a large variety of inference problems: parametric, nonparametric, multivariate tests and estimation. Frequentists and Bayesians (empirical, hierarchical) have both reaped well-deserved benefits from Woodroofe’s techniques over the last four decades. I have been a direct beneficiary of Woodroofe’s fundamental approach combined with the Lai-Siegmund (1977, 1979, Annals of Statistics) approach over the past 40 years. My students, colleagues, and I, along with others, have taken those fundamentals, with requisite updates as needed, to many corners of statistical inference by handling problems in clinical trials, reliability, quality control, multiple comparisons, multiple decision theory and others. Woodroofe’s 1977 paper continues to lay Golden Eggs without “stopping” (pun totally intended!).

Notes Preparer: Nitis Mukhopadhyay

February 2019

Paper of the Month, February 2019: Stein, C. (1981), Estimation of the mean of a multivariate normal distribution, The Annals of Statistics, 9(6), 1135-1151.

In this paper, Stein (1981) derived the famous Stein’s Lemma for multivariate normal distributions. Though simple, the lemma is remarkably powerful: in the paper it is used to derive an unbiased estimator of the risk of an arbitrary almost differentiable estimator under squared error loss. This is known as Stein’s Unbiased Risk Estimate (SURE). SURE is an extremely useful tool for selecting tuning parameters or choosing between estimators so as to minimize the estimated risk, and as a theoretical tool for proving dominance results. For example, SURE can be applied to estimate the degrees of freedom for a large class of estimators, such as the various types of linear smoothers, LARS (Efron et al. 2004), the LASSO (Zou, Hastie, and Tibshirani 2007), reduced rank regression (Mukherjee et al. 2015), and many more. SURE can also be used to show the dominance of the James-Stein estimator over the standard least squares estimator when the dimension is at least three. It is a really nice tool to have in our statistician’s toolbox.

Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. Least Angle Regression. The Annals of Statistics 32 (2). Institute of Mathematical Statistics: 407-99.

Mukherjee, Ashin, Kun Chen, Naisyin Wang, and Ji Zhu. 2015. On the Degrees of Freedom of Reduced-Rank Estimators in Multivariate Regression. Biometrika 102 (2). Oxford University Press: 457-77.

Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2007. On the Degrees of Freedom of the Lasso. The Annals of Statistics 35 (5). Institute of Mathematical Statistics: 2173-92.
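As a small illustration of the dominance argument mentioned above, the sketch below evaluates SURE for the James-Stein estimator in the canonical setting where a single observation x comes from a p-variate normal distribution with identity covariance and p is at least 3. In that setting SURE equals p − (p − 2)²/‖x‖², which is always smaller than p, the risk of the unshrunken estimator. The function names are illustrative, and the unit-variance setup is an assumption of this sketch rather than the general setting of the paper.

```python
def james_stein(x):
    """James-Stein shrinkage of x toward the origin (unit variance, p >= 3)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]


def sure_james_stein(x):
    """Stein's unbiased risk estimate for the James-Stein estimator.

    Assumes x ~ N_p(theta, I) with p >= 3 and squared error loss.
    The risk of the usual estimator (x itself) is p in this model.
    """
    p = len(x)
    norm_sq = sum(v * v for v in x)
    return p - (p - 2) ** 2 / norm_sq


if __name__ == "__main__":
    x = [1.2, -0.4, 2.5, 0.3, -1.1]
    print("shrunken estimate:", james_stein(x))
    print("SURE for James-Stein:", sure_james_stein(x), "vs risk of x itself:", len(x))
```

Because the correction term (p − 2)²/‖x‖² is positive for every x, taking expectations shows that the James-Stein risk lies strictly below p, which is the dominance result in this simplified setting.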

Notes preparer: Yuwen Gu

December 2018

Let A₁, A₂, …, Aₙ be a sequence of dependent events. For j = 1, 2, …, n, define the indicator random variables: Iⱼ = 1 if the event Aⱼ occurs, and Iⱼ = 0 otherwise. In this article a new method is presented for approximating the distribution of W, the sum of Iⱼ, j = 1, 2, …, n, by a Poisson distribution, together with a derivation of a bound on the distance between the distribution of W and the Poisson distribution with mean E(W). This new method is based on previous work by C. Stein, where a central limit theorem for a sum of dependent random variables was obtained without the use of moment generating functions. In Chen (1975), the distance between the distribution of W and the Poisson distribution is assessed based on the unique solution of an equation formulated specifically for the Poisson distribution. The focus of this work is on deriving upper and lower bounds on the distance between the distribution of W and the Poisson distribution with mean E(W). The Poisson approximation itself has been used in several areas of probability and statistics, including DNA and protein sequences, geometrical probability, spatial statistics and reliability theory. Since the Poisson approximation can be inaccurate, improvements of the Poisson approximation have been investigated. Most importantly, a compound Poisson approximation method was developed that includes a bound on the distance between the distribution of W and the approximating compound Poisson distribution. The origin of this approach is included in the Chen (1975) article.

Chen, L.H.Y. Poisson Approximation for Dependent Trials. The Annals of Probability, 1975, Vol. 3, 534-545.
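To get a feel for what “distance between the distribution of W and the Poisson distribution” means, the sketch below computes the total variation distance in the simplest special case, where the indicators are independent with a common success probability p, so that W is binomial. This is not Chen’s bound for dependent trials; it only compares the exact distance with the classical Le Cam bound, which for independent indicators is the sum of the squared success probabilities (here n·p²). Function names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(W = k) for W ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(Z = k) for Z ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def total_variation_binomial_poisson(n, p):
    """Total variation distance between Binomial(n, p) and Poisson(n * p)."""
    lam = n * p
    diff = sum(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(n + 1)
    )
    # The binomial puts no mass above n, so the Poisson tail counts in full.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(k, lam) for k in range(n + 1)))
    return 0.5 * (diff + poisson_tail)


if __name__ == "__main__":
    n, p = 50, 0.02
    print("exact distance:", total_variation_binomial_poisson(n, p))
    print("Le Cam bound  :", n * p * p)
```

With dependent events the same distance is what Chen’s method bounds, but the bound then also involves terms that measure the strength of the dependence.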

Notes Preparer: Joseph Glaz

November 2018

Liang and Zeger (1986) proposed the generalized estimating equation (GEE), a multivariate extension of the generalized linear model to handle clustered data such as longitudinal data. The GEE method focuses on the regression parameters of the marginal means without specifying the multivariate dependence. A working correlation structure is used to improve efficiency. The resulting estimators of the mean parameters are consistent even if the working correlation is not correctly specified. The closer the working correlation is to the truth, the higher the efficiency. When the outcomes are multivariate normal, the GEE reduces to the score equations if the working correlation is correctly specified. The estimators are asymptotically normal with a variance that can be estimated by a sandwich estimator. This paper has been cited widely (approximately 16,000 citations to date). The method has become a standard tool in the applied statistician’s toolbox and is widely used in many fields. It is implemented in standard software packages such as SAS and R (our faculty member Jun Yan developed the R package geepack during his graduate study).

Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Notes Preparer: Dipak Dey

October 2018

The Expectation Maximization (EM) algorithm was introduced by Dempster, Laird, and Rubin in 1977 [1]. As the title of the paper suggests, the EM algorithm is a method to obtain maximum likelihood estimates in cases where the data are incomplete. It has been widely used in many applications, such as data imputation, fitting mixture models, and clustering. The EM algorithm consists of two steps. In the E-step, the missing data are replaced with their conditional expectations given the observed data and the current parameter estimates (more precisely, the expected complete-data log-likelihood is computed). In the M-step, using the available data and the imputed values, the likelihood function is maximized with respect to the model’s parameters in order to obtain new (better) estimates. The algorithm continues until a convergence criterion is met (for example, the improvement in the likelihood of the model is less than some user-defined threshold). One of the challenges is to identify conditions that ensure that the algorithm will converge. In his 1983 Annals of Statistics paper [2], Jeff Wu pointed out a flaw in the convergence proof in the Dempster, Laird, and Rubin paper, and not only offered a correct proof but also extended the applicability of the algorithm beyond the exponential family models. For our monthly gathering to discuss the paper of the month you may start by reading Jeff Bilmes’ “gentle introduction” to the EM algorithm, available here.

[1] Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B, 39 (1): 1–38.
[2] Wu, C. F. Jeff (Mar 1983), “On the Convergence Properties of the EM Algorithm”, Annals of Statistics, 11 (1): 95–103.
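The two steps described above can be sketched for one of the applications mentioned, a mixture model. The example below is a minimal illustration, not the general algorithm of the paper: it assumes a mixture of two normal components with known unit variance, treats the unobserved component labels as the missing data, and estimates only the two means and the mixing weight. All names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, mean_a, mean_b, weight_a=0.5, n_iter=50):
    """EM for a two-component normal mixture with unit variances (assumed known).

    Returns the fitted means, the weight of component A, and the observed-data
    log-likelihood recorded at the start of every iteration.
    """
    history = []
    for _ in range(n_iter):
        # E-step: expected value of the missing label "x came from component A".
        dens_a = [weight_a * normal_pdf(x, mean_a) for x in data]
        dens_b = [(1.0 - weight_a) * normal_pdf(x, mean_b) for x in data]
        resp_a = [a / (a + b) for a, b in zip(dens_a, dens_b)]
        history.append(sum(math.log(a + b) for a, b in zip(dens_a, dens_b)))

        # M-step: weighted maximum likelihood updates given the expected labels.
        total_a = sum(resp_a)
        total_b = len(data) - total_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mean_b = sum((1.0 - r) * x for r, x in zip(resp_a, data)) / total_b
        weight_a = total_a / len(data)
    return mean_a, mean_b, weight_a, history


if __name__ == "__main__":
    data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
    print(em_two_gaussians(data, mean_a=-1.0, mean_b=1.0)[:3])
```

The recorded log-likelihood values never decrease from one iteration to the next, which is the monotonicity property at the heart of the convergence discussion; a fixed number of iterations is used here in place of a convergence threshold to keep the sketch short.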

Notes Preparer: Haim Bar

September 2018

The Paper of the Month for September 2018 is selected to be “Generalized cross-validation as a method for choosing a good ridge parameter” by Gene Golub, Michael Heath, and Grace Wahba. The paper was published in Technometrics 21, no. 2 (1979): 215-223. Dr. Wahba will be the speaker of the 2018 Pfizer/ASA/UConn Distinguished Statistician Colloquium.

Cross-validation is widely used to assess the quality of model fit in statistics and machine learning, and is often adopted for choosing tuning parameters in regularization methods such as the widely used LASSO. However, ordinary cross-validation may not perform well in certain scenarios, such as the case when the design matrix is close to being column-orthogonal. In the context of ridge regression, the authors proposed the method of generalized cross-validation (GCV), which is a rotation-invariant version of ordinary cross-validation. GCV improves on ordinary cross-validation: it does not require an estimate of the model error variance, and it is applicable when the number of predictors is larger than the sample size. It is worth mentioning that this paper was a joint work by researchers from computer science and statistics almost forty years ago. Modern applications of GCV go far beyond ridge regression, but the fundamental ideas are in the original paper.
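The GCV criterion itself is short enough to write down. It divides the average squared residual by the squared average of the diagonal of I − A(λ), where A(λ) is the ridge “hat” matrix. The sketch below evaluates it in the simplest possible case, a single predictor with no intercept, so that no matrix library is needed. The function name, the single-predictor restriction, and the convention that `lam` is the unscaled penalty added to the sum of squares of x are assumptions of this illustration.

```python
def gcv_ridge_single_predictor(x, y, lam):
    """GCV score for ridge regression of y on one predictor x (no intercept).

    GCV(lam) = (RSS / n) / (1 - trace(A) / n) ** 2, with A the hat matrix.
    For a single predictor, trace(A) = sum(x^2) / (sum(x^2) + lam).
    """
    n = len(x)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    beta = sxy / (sxx + lam)                      # ridge coefficient
    rss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    trace_hat = sxx / (sxx + lam)                 # effective degrees of freedom
    return (rss / n) / (1.0 - trace_hat / n) ** 2


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.2, 1.9, 3.2, 3.8, 5.1]
    grid = [0.0, 0.1, 0.5, 1.0, 5.0, 10.0]
    best = min(grid, key=lambda lam: gcv_ridge_single_predictor(x, y, lam))
    print("penalty with the smallest GCV score on this grid:", best)
```

Note that only the residuals and the trace of the hat matrix enter the score; no estimate of the error variance is needed, which is one of the advantages highlighted above.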

Notes Preparer: HaiYing Wang

May 2018

The last paper of the month for the academic year appeared in the Journal of Irreproducible Results (JIR). The title of the paper by Bennett et al. is “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction”. The authors of the paper received the Ig Nobel prize in 2012, for neuroscience. According to the Ig Nobel website “The Ig Nobel Prizes honor achievements that first make people laugh, and then make them think.” Indeed, although the chosen paper of the month is light-hearted and has appeared in a science humor magazine, it makes a very good argument for the importance of accounting for multiple testing, especially in modern applications, where the number of simultaneous tests can be very large.

To read more about the Ig Nobel prize, click here. To read more about the journal, click here.

Notes Preparer: Haim Bar

March–April 2018

Even an undergraduate student taking a regression course knows how to do model selection or variable selection through model selection criteria such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). Both criteria penalize the measure of fit (log-likelihood) with a multiple of the number of free parameters to combat overfitting. BIC penalizes the number of parameters more heavily than AIC whenever the sample size is at least 8, because its per-parameter penalty is log(n) rather than 2.

AIC estimates, up to a constant, the expected Kullback-Leibler divergence between the fitted model and the truth, while BIC is based on an approximation of the posterior probability of a candidate model. AIC is best for prediction as it is asymptotically equivalent to leave-one-out cross-validation. BIC is best for explanation as it allows consistent estimation of the underlying data generating process.
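The two criteria differ only in the penalty, which the following sketch makes explicit. It uses the usual definitions AIC = 2k − 2·log L and BIC = k·log(n) − 2·log L, where k is the number of free parameters, n the sample size and log L the maximized log-likelihood. The log-likelihood values in the usage example are made-up numbers chosen for illustration, not results from any data set.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian (Schwarz) information criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


if __name__ == "__main__":
    n_obs = 100
    # Hypothetical maximized log-likelihoods for a small and a large model.
    small = {"log_likelihood": -105.0, "n_params": 2}
    large = {"log_likelihood": -101.5, "n_params": 4}
    for name, model in (("small", small), ("large", large)):
        print(name, aic(**model), bic(**model, n_obs=n_obs))
```

In this made-up example AIC prefers the larger model while BIC prefers the smaller one, which is the typical way the heavier BIC penalty shows up in practice.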

AIC and BIC should be learned together. It is tempting to wonder why the penalties on the number of parameters are scaled the way they are, a question that textbooks usually do not answer. Both original papers are among the most cited statistical papers; as of now, Akaike (1974) has 39K and Schwarz (1978) has 34K Google Scholar citations. We present Schwarz (1978) simply because, despite its importance, it is short (3 pages with 4 references on the 4th page) and accessible. The justifications were made even easier to follow by Cavanaugh (1997, Stat Prob Letters) for AIC and by Neath and Cavanaugh (2012, WIREs Comp Stat) for BIC. A StackExchange discussion is also a fun read.

Notes Preparer: Jun Yan

February 2018

The “Paper of the Month” for February 2018 is selected to be W. Hoeffding’s 1948 paper “A Class of Statistics with Asymptotically Normal Distribution” where he introduced the concept of U-statistics and built much of the heavy-duty machinery. It was published in The Annals of Mathematical Statistics Vol. 19, No. 3 (1948), pp. 293-325. It was included in the “Breakthroughs in Statistics Volume 1: Foundations and Basic Theory” which was edited by S. Kotz and N. L. Johnson (1992, Springer, New York) with a lengthy and immensely valuable introduction prepared by P. K. Sen.

This paper completely changed the face of nonparametric statistics – both theory and practice – by creating the fundamentals to take this field to maturity. The Hoeffding paper indeed impacted many areas of classical statistical inference by showing how one must handle the probability theory behind the CLT and SLLN for dependent sequences of random variables. It gave rise to new and challenging pathways to handle moderate and large-deviation theories in the case of dependent sequences of random variables.

I totally fell in love with Hoeffding’s 1948 paper in the early 70’s when I was a PhD student at the Indian Statistical Institute-Calcutta. I had great opportunities to use his fundamental projections, CLT, and SLLN with total practical relevance of martingales and reverse martingales in my 1975 thesis-work. Personally, I have gone back to Hoeffding’s paper several hundred times in my life for its fundamentally breakthrough and truly fresh ideas. The beauty of this paper remains astonishing and unmatched given its path-breaking influence on numerous big sub-fields within statistics and probability theory including nonparametrics, large-sample theory, CLT’s for dependent processes and invariance principles, Berry-Esseen rates, theory of martingales and reverse martingales, tightness, geometric probability, and CLT’s for convex bodies.

I feel lucky that I could learn to appreciate the marvel of U-statistics theory early on (beginning 1970) in my career from P. K. Sen, Malay Ghosh, and Bob Serfling (all had close ties with Hoeffding), which helped me to build a large volume of my own significant publications to show for it. I also had the great pleasure of advising a number of PhD students (including Gaute Vik, John Judge, Mabel Moreno, and Bhargab Chattopadhyay) whose research has been significantly touched and influenced by Hoeffding’s paper. The research of two present students (Jun Hu and Chen Zhang) overwhelmingly exploits Hoeffding’s theory of U-statistics. Our own Rick Vitale (with Herman Rubin) gave a complete asymptotic theory for square-integrable U-statistics in their 1980 Annals of Stat. paper.

Nearly 70 years after its publication, W. Hoeffding’s original 1948 paper still inspires me and energizes me big time. Upon rereading the original paper for the n-th time, I continue to surprise myself by discovering something interesting, deep and new that remained hidden from me during my previous passes.

Happy reading of Hoeffding’s (1948) paper. Just live it with TLC!

Notes Preparer: Nitis Mukhopadhyay.

December 2017–January 2018

Controlling the probability of falsely rejecting the null hypothesis is critical for valid and meaningful statistical analysis. But how should this probability be defined and calculated when there are multiple, simultaneous hypotheses? For many years, this question was mostly investigated in the analysis of variance (ANOVA) setting, in which the number of comparisons is typically small or moderate. Until 1995, the common approach to this problem was to control the family-wise error rate (FWER) which ensures that the probability of falsely rejecting at least one of the hypotheses is smaller than a user-specified bound. However, the advent of high-throughput methods, such as in genetics, resulted in a much larger number of simultaneous hypotheses and rendered the FWER approach impractical, in the sense that it was too stringent and lacked power to reject any hypothesis.

In 1995, Benjamini and Hochberg published their breakthrough paper “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing” in which they defined the False Discovery Rate (FDR) as the expected proportion of falsely rejected hypotheses out of the total number of rejected hypotheses (with the proportion taken to be zero when nothing is rejected). The paper proposed a very simple step-up procedure for deciding which hypotheses to reject, and showed that, for independent test statistics, the procedure indeed controls the FDR at the desired level. This approach enabled significant advances in many areas, and is particularly useful in this age of “Big Data”.
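The procedure is simple enough to state in a few lines: sort the m p-values, find the largest rank k whose p-value is at most k·α/m, and reject the hypotheses with the k smallest p-values. The sketch below is a minimal implementation of that rule; the function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: with sorted p-values p_(1) <= ... <= p_(m), find the largest
    rank k such that p_(k) <= k * alpha / m and reject ranks 1, ..., k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    print(benjamini_hochberg([0.02, 0.021, 0.5], alpha=0.05))
```

In the usage example the smallest p-value (0.02) misses its own threshold of 0.05/3, so a Bonferroni correction at the same level would reject nothing, yet the second p-value passes its threshold and the step-up rule rejects both. This is the gain in power that motivated the FDR approach.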

The paper appeared in the Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1. (1995), pp. 289-300. For additional reading, we also suggest reading about the q-value in the paper “A direct approach to false discovery rates”, by John Storey, which appeared in 2002, in the Journal of the Royal Statistical Society, Series B, 64: 479-498.

Notes Preparer: Haim Bar

November 2017

Once a month during the academic year our faculty will select a paper which we encourage our students to read and discuss. Papers featured in this section should be generally understood by graduate students, and will be selected either because of their impact, or historical value, or because they contain useful (perhaps overlooked) techniques or results.

The inaugural “Paper of the Month” (November 2017) was selected to be Brad Efron’s paper titled “Bootstrap Methods: Another Look at the Jackknife”, published in 1979 in the Annals of Statistics (Vol. 7, No. 1, pages 1-26). It is hard to overstate the impact of this paper. It allows researchers to construct confidence intervals in many settings, even when there is no closed-form expression for the standard error.
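The sample median is a classic case where no simple closed-form standard error is available. The sketch below shows the basic nonparametric bootstrap for it: resample the data with replacement many times, recompute the median each time, and use the spread of the recomputed values. The function name, the number of resamples, and the simple percentile interval are choices made for this illustration and are only one of several bootstrap interval constructions.

```python
import random
import statistics


def bootstrap_median(data, n_boot=2000, seed=0):
    """Bootstrap standard error and rough 95% percentile interval for the median."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        statistics.median(rng.choices(data, k=n))  # resample with replacement
        for _ in range(n_boot)
    )
    std_error = statistics.stdev(estimates)
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return std_error, (lower, upper)


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0, 6.2]
    se, interval = bootstrap_median(sample)
    print("bootstrap standard error:", se)
    print("percentile interval:", interval)
```

The same recipe works for almost any statistic by swapping out `statistics.median`, which is a large part of why the method spread so widely.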

A glance at Efron’s Google Scholar page shows that this work has already been cited tens of thousands of times. In his column in the IMS Bulletin from November 17, 2016, Professor Xiao-Li Meng included this paper among his five selected “Nobel-Prize (NP) Worthy i.i.d Ideas in Statistics” and wrote: “[the bootstrap] certainly has made many researchers’ lives much easier”, and that it “has literally generated an industry of research on proving when it works, when it doesn’t, and how to make it work when its vanilla version fails.” For example, we may refer to the paper “Two Guidelines for Bootstrap Hypothesis Testing” by Peter Hall and Susan R. Wilson (Biometrics, Vol. 47, No. 2 (Jun., 1991), pp. 757-762).

Notes Preparer: Haim Bar