Least Absolute Shrinkage and Selection Operator (LASSO)

Below is a list of some of the important LASSO theory and implementation papers that I have come across. Although this list is by no means complete, I have endeavoured to include what I consider the most interesting and important papers on the topic. For each paper, I have provided URLs to facilitate easy download, though in some cases the download is not free unless you or your institution has access to the particular repository. I have also included a brief summary of each paper along with the reference. If you believe I have omitted a paper that should be on this list, please let me know.

LASSO Theory

The seminal paper which introduced the LASSO idea. Note that Tibshirani credits Breiman’s paper on the non-negative garrote as inspiration for the LASSO.

[1] R. Tibshirani
Regression Shrinkage and Selection via the Lasso
Journal of the Royal Statistical Society (Series B), 1996, 58, 267-288

[2] K. Knight and W. Fu
Asymptotics for lasso-type estimators
The Annals of Statistics, 2000, 28, 1356-1378

[3] C. Leng, Y. Lin and G. Wahba
A Note On The Lasso And Related Procedures In Model Selection
Statistica Sinica, 2006, 16, 1273-1284

Zou et al. show that the number of non-zero coefficients in a linear model is an unbiased estimate of the degrees of freedom of the LASSO. This implies that model selection criteria, such as AIC and BIC, can be used for determining the optimal shrinkage parameter.

[4] H. Zou, T. Hastie and R. Tibshirani,
On the “degrees of freedom” of the lasso
The Annals of Statistics, 2007, 35, 2173-2192
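
As a quick illustration of how this result gets used in practice, here is a minimal Python sketch (mine, not code from the paper) that computes a LASSO path with scikit-learn, takes the number of non-zero coefficients as the degrees of freedom, and picks the shrinkage parameter by BIC. The synthetic data and variable names are purely illustrative.

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                    # sparse truth
y = X @ beta_true + rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)    # coefs has shape (p, n_alphas)
bic = []
for k in range(len(alphas)):
    resid = y - X @ coefs[:, k]
    df = np.count_nonzero(coefs[:, k])              # df estimate = number of non-zeros
    bic.append(n * np.log(resid @ resid / n) + df * np.log(n))

best = int(np.argmin(bic))
print("BIC choice:", alphas[best], "non-zeros:", np.count_nonzero(coefs[:, best]))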

Zhao and Yu examine the asymptotics of the LASSO procedure and define the condition (termed the “Irrepresentable Condition”) under which the LASSO can select the true model consistently. The Irrepresentable Condition depends only on the (unknown) variance-covariance matrix of the predictors and the signs of the true coefficients. Finite sample performance of the LASSO is not discussed.

[5] P. Zhao and B. Yu
On Model Selection Consistency of Lasso
Journal of Machine Learning Research, 2006, 7, 2541-2563
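
For readers who want to see the condition concretely, the following small numpy sketch (my own notation, not the paper's) checks the strong Irrepresentable Condition for a given predictor covariance matrix and the sign pattern of the true active coefficients; a positive margin means the condition holds.

import numpy as np

def irrepresentable_margin(C, active, signs):
    # Margin of the strong Irrepresentable Condition:
    # 1 - max_{j not in S} |C[j, S] @ inv(C[S, S]) @ sign(beta_S)|; positive => holds.
    S = np.asarray(active)
    rest = np.setdiff1d(np.arange(C.shape[0]), S)
    inner = np.linalg.solve(C[np.ix_(S, S)], signs)
    return 1.0 - np.abs(C[np.ix_(rest, S)] @ inner).max()

p, rho = 5, 0.4
C = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)   # equicorrelated toy design
print(irrepresentable_margin(C, active=[0, 1], signs=np.array([1.0, -1.0])))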

[6] C.-H. Zhang and J. Huang
The sparsity and bias of the Lasso selection in high-dimensional linear regression
The Annals of Statistics, 2008, 36, 1567-1594

Hall et al. show how the m-out-of-n bootstrap method can be used to estimate the LASSO shrinkage parameters such that the resulting model has the oracle property; here it is assumed that there is one shrinkage parameter for each coefficient. Unfortunately, the technique involves solving p constrained quadratic optimisation problems, where p is the number of regressor coefficients. Thus, it is impractical when p is moderate to large.

[7] P. Hall, E. R. Lee and B. U. Park
Bootstrap-based Penalty Choice For The Lasso, Achieving Oracle Performance
Statistica Sinica, 2009, 19, 449-471

Chatterjee and Lahiri examine the LASSO assuming that the number of regressors is fixed and the sample size goes to infinity. Strong convergence of the LASSO is shown provided the shrinkage parameter grows at a specified rate. Convergence rates of the LASSO and least squares estimates are also compared.

[8] A. Chatterjee and S. N. Lahiri
Strong consistency of Lasso estimators
Sankhya, the Indian Journal of Statistics, 2011 (accepted)

Solving the LASSO optimisation problem

[9] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani
Least Angle Regression
The Annals of Statistics, 2004, 32, 407-451

The algorithm in Genkin et al. is based on cyclic coordinate descent with Newton moves at each step and solves the LASSO problem in logistic models. The same idea could be used to generate ridge regression solutions. My implementation of the LASSO (see the Software page) is based on this algorithm.

[10] A. Genkin, D. D. Lewis and D. Madigan,
Large-Scale Bayesian Logistic Regression for Text Categorization
Technometrics, 2007, 49, 291-304
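
To give a flavour of this style of algorithm, here is a stripped-down Python sketch of cyclic coordinate descent for L1-penalised logistic regression: each coordinate takes a one-dimensional Newton step, with the usual care at the kink at zero. The trust-region damping that Genkin et al. use is omitted, so treat this as a sketch of the idea rather than their algorithm; all names are mine.

import numpy as np

def lasso_logistic_cd(X, y, lam, n_sweeps=100):
    # X: (n, p) design, y: (n,) labels in {0, 1}, lam: L1 penalty weight.
    n, p = X.shape
    beta = np.zeros(p)
    eta = np.zeros(n)                       # current linear predictor X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            pr = 1.0 / (1.0 + np.exp(-eta))
            g = X[:, j] @ (pr - y)          # d(neg log-lik)/d beta_j
            h = X[:, j] ** 2 @ (pr * (1 - pr)) + 1e-12
            bj = beta[j]
            if bj > 0:
                new = max(bj - (g + lam) / h, 0.0)   # clamp if the step crosses zero
            elif bj < 0:
                new = min(bj - (g - lam) / h, 0.0)
            else:
                if g + lam < 0:             # moving positive decreases the objective
                    new = -(g + lam) / h
                elif g - lam > 0:           # moving negative decreases the objective
                    new = -(g - lam) / h
                else:
                    new = 0.0
            eta += X[:, j] * (new - bj)     # keep the linear predictor in sync
            beta[j] = new
    return beta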

[11] M. Y. Park and T. Hastie
L1-regularization path algorithm for generalized linear models
Journal of the Royal Statistical Society (Series B), 2007, 69, 659-677

[12] K. Koh, S.-J. Kim and S. Boyd
An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Journal of Machine Learning Research, 2007, 8, 1519-1555

[13] T. T. Wu and K. Lange
Coordinate descent algorithms for lasso penalized regression
The Annals of Applied Statistics, 2008, 2, 224-244

[14] J. Shi, W. Yin, S. Osher and P. Sajda
A Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression
Journal of Machine Learning Research, 2010, 11, 713-741

The coordinate descent algorithm by Friedman et al. is very fast and generates entire paths for regularized regression with convex penalties. It can be used for finding ridge estimates as well as LASSO estimates in generalized linear models.

[15] J. Friedman, T. Hastie and R. Tibshirani
Regularized Paths for Generalized Linear Models via Coordinate Descent
Journal of Statistical Software, 2010, 33
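
The heart of the method is a soft-thresholding coordinate update applied along a decreasing grid of shrinkage parameters with warm starts. The Python sketch below shows that core for the Gaussian case only; the standardisation, elastic-net mixing parameter, active-set tricks and convergence checks of the paper are all left out, and the variable names are my own.

import numpy as np

def soft(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_gaussian_path(X, y, n_lambdas=50, eps=1e-3, n_sweeps=100):
    # Objective: (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1
    n, p = X.shape
    xx = (X ** 2).sum(axis=0) / n                 # per-coordinate curvature
    lam_max = np.max(np.abs(X.T @ y)) / n         # smallest lam giving an all-zero fit
    lambdas = lam_max * np.logspace(0, np.log10(eps), n_lambdas)
    beta = np.zeros(p)
    r = y.copy()                                  # residual y - X @ beta
    path = []
    for lam in lambdas:                           # warm start from the previous lambda
        for _ in range(n_sweeps):
            for j in range(p):
                zj = X[:, j] @ r / n + xx[j] * beta[j]
                new = soft(zj, lam) / xx[j]
                r += X[:, j] * (beta[j] - new)
                beta[j] = new
        path.append(beta.copy())
    return lambdas, np.array(path)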

Feature elimination

This is a really neat idea that allows fast removal of regressors before running the LASSO optimisation procedure. The procedure is called SAFE-LASSO and is applicable in regularized linear and logistic regression. Interestingly, the regressors that are removed by SAFE-LASSO are guaranteed to be absent from the LASSO solution!

[16] L. El Ghaoui, V. Viallon and T. Rabbani
Safe Feature Elimination in Sparse Supervised Learning
Technical Report No. UCB/EECS-2010-126, University of California, Berkeley, 2010
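
In the simplest (unscaled) LASSO setting, min 0.5*||y - X beta||^2 + lam*||beta||_1, the SAFE test has a closed form, shown in the short sketch below (my code, not the authors'): predictor j can be dropped whenever |x_j'y| falls below lam - ||x_j||*||y||*(lam_max - lam)/lam_max, where lam_max = max_k |x_k'y|.

import numpy as np

def safe_screen(X, y, lam):
    # Boolean mask of predictors that survive the SAFE test; dropped predictors are
    # guaranteed to have zero coefficients in the LASSO solution at this lam.
    corr = np.abs(X.T @ y)                         # |x_j' y|
    lam_max = corr.max()
    bound = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr >= bound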

This paper proposes a new feature elimination method (STRONG-LASSO) following on from the ideas in El Ghaoui et al. As STRONG-LASSO does not guarantee safe feature removal, the authors recommend checking the Karush-Kuhn-Tucker (KKT) conditions to verify that no discarded regressor actually belongs in the LASSO solution. Extensions to general L1 regularisation problems and connections to the LASSO Irrepresentable Condition are also discussed.

[17] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor and R. J. Tibshirani
Strong rules for discarding predictors in lasso-type problems
Unpublished, 2011
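
A minimal sketch of the sequential strong rule and the accompanying KKT check is given below (my own code and names, using the unscaled objective 0.5*||y - X beta||^2 + lam*||beta||_1); any LASSO solver can be used to fit on the surviving predictors, after which discarded predictors that violate the KKT conditions are added back and the fit repeated.

import numpy as np

def strong_rule_mask(X, resid_prev, lam, lam_prev):
    # Sequential strong rule: keep predictor j when |x_j' r(lam_prev)| >= 2*lam - lam_prev,
    # where r(lam_prev) is the residual at the previous (larger) penalty on the grid.
    return np.abs(X.T @ resid_prev) >= 2.0 * lam - lam_prev

def kkt_violations(X, y, beta, lam, discarded):
    # Discarding is not guaranteed safe: any discarded j with |x_j' (y - X beta)| > lam
    # violates the KKT conditions and must be put back before re-fitting.
    resid = y - X @ beta
    return discarded & (np.abs(X.T @ resid) > lam)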

Bayesian interpretation of the LASSO

The LASSO can also be interpreted from a Bayesian point of view as a particular form of prior density on the regression coefficients. Park and Casella were the first to show how the Gibbs sampler can be used to obtain samples from a posterior distribution with a LASSO-type prior. A nice feature of the Bayesian approach is that we automatically obtain an estimate of the shrinkage parameter. My MATLAB implementation of this sampler can be found in the Software section. An R package is also available here. Kyung et al. extend the Bayesian LASSO idea from Park and Casella to various other Bayesian regularisation methods.

[18] T. Park and G. Casella
The Bayesian Lasso
Journal of the American Statistical Association, 2008, 103, 681-686
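
For the curious, here is a compact Python sketch in the spirit of the Park and Casella Gibbs sampler (not their code, and the exact prior and centring conventions may differ slightly from the paper): normal-exponential mixture priors on the coefficients, with a gamma hyperprior on lambda^2 so that the shrinkage parameter is sampled along with everything else. It assumes y is centred and the columns of X are standardised; all names are mine.

import numpy as np

def bayesian_lasso_gibbs(X, y, n_iter=2000, r=1.0, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, tau2 = np.zeros(p), np.ones(p)
    sigma2, lam2 = 1.0, 1.0
    draws = []
    for _ in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma^2 A^{-1}),  A = X'X + diag(1/tau^2)
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        # sigma^2 | rest ~ Inverse-Gamma
        resid = y - X @ beta
        shape = 0.5 * (n - 1 + p)
        scale = 0.5 * (resid @ resid + beta @ (beta / tau2))
        sigma2 = scale / rng.gamma(shape)
        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lambda^2 sigma^2 / beta_j^2), lambda^2)
        mu = np.sqrt(lam2 * sigma2 / np.maximum(beta ** 2, 1e-12))
        tau2 = 1.0 / rng.wald(mu, lam2)
        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau^2)/2 + delta)
        lam2 = rng.gamma(p + r, 1.0 / (tau2.sum() / 2.0 + delta))
        draws.append(beta.copy())
    return np.array(draws)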

Hans details an alternative sampler for the Bayesian LASSO that, unlike the Park and Casella sampler, does not require any latent variables. An R package implementing the method is available here.

[19] C. Hans
Bayesian lasso regression
Biometrika, 2009, 96, 835-845

[20] M. Kyung, J. Gill, M. Ghosh and G. Casella
Penalized Regression, Standard Errors, and Bayesian Lassos
Bayesian Analysis, 2010, 5, 369-412