Yannick Van Huele, Data Science Intern, Health Catalyst

Behind the Scenes with Lasso


In a previous blog post, Which Algorithms are in healthcare.ai, we gave a broad overview of the various machine learning algorithms available in the healthcareai package. This week, we’ll delve into the details of one of these models: the lasso.

The lasso is an elegant generalization of classical linear models, such as linear regression and logistic regression, that reduces overfitting and automatically performs feature selection.

In particular, the lasso provides a way to fit a linear model to data when there are more variables than data points (for example, consider studies in genetics where we might be measuring gene expression for thousands of genes). For the simplest case of this phenomenon, consider the following example. It’s well known that two points determine a line; that is, given two distinct points, it’s always possible to find exactly one line passing through both of them. Given more than two points, it is not always possible to find a line that passes through all of them. But what if you only have one point? There are (infinitely) many lines passing through a single point. Which line gives the best fit to a single data point? The lasso provides an answer to this question.

Linear Regression

Before tackling the lasso, we first need to discuss classical linear models. We will focus on linear regression, but note that logistic regression can be modified in much the same way.

Ordinary least squares linear regression is one of the oldest statistical techniques for modeling trends, going back two hundred years to the work of Legendre and Gauss. The basic idea is simple: given a bunch of data points

(x_1,\;y_1),\;(x_2,\;y_2),\;\ldots,\;(x_n,\;y_n),

can we find a line that passes through all of them? Usually, the answer is no (unless the bunch consists of just two points). So instead we try to find the next best thing: the line that passes closest to all the points. Of course, we must first decide what we mean by closest.

Least squares regression defines the best-fitting line using the sum of the squared distances between the predictions and the true output values. Specifically, it considers all possible lines

f(x) = mx + b

and selects the line that minimizes the quantity

\sum_{i=1}^n \left( \text{TrueValue} - \text{Prediction} \right)^2 = \sum_{i = 1}^n \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^n \left( y_i - (m x_i + b) \right)^2

If we have more than one predictor variable, we are considering functions of the form

f(\mathbf{x}) = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_p x_p

and our goal is to minimize

\sum_{i=1}^n \left( \text{TrueValue} - \text{Prediction} \right)^2 = \sum_{i = 1}^n \left( y_i - f(\mathbf{x}_i) \right)^2 = \sum_{i=1}^n \left( y_i - \left( c_0 + \sum_{j = 1}^p c_{j} x_{ij} \right) \right)^2
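To make this concrete, here is a minimal sketch in R using the built-in lm() function (the simulated data and variable names below are just for illustration, not from the original post):

    # Simulate a small data set with two predictors
    set.seed(42)
    n <- 100
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(n)  # true relationship plus noise

    # Ordinary least squares: lm() finds the coefficients that minimize
    # the sum of squared residuals
    fit <- lm(y ~ x1 + x2)
    coef(fit)              # estimated c_0, c_1, c_2
    sum(residuals(fit)^2)  # the minimized sum of squares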

This method works quite well if p, the number of predictors, is much smaller than n, the number of data points. On the other hand, if p is greater than n, then linear regression does not have a (unique) solution. Even when p is smaller than n, linear regression can suffer from overfitting if p is only a little smaller than n.
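As a quick illustration of this breakdown (again with made-up data), asking lm() for a fit with more predictors than observations leaves some coefficients undetermined:

    # More predictors (p = 10) than data points (n = 5)
    set.seed(1)
    n <- 5
    p <- 10
    x <- matrix(rnorm(n * p), nrow = n)
    y <- rnorm(n)

    fit <- lm(y ~ x)
    coef(fit)  # several coefficients come back NA: there is no unique solution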

Ridge Regression

One way to deal with this problem is ridge regression (also called Tikhonov regularization). Ridge regression subtly modifies the optimization goal of linear regression: it tries to minimize the sum of squares error while also keeping the model coefficients small. Specifically, ridge regression looks at all linear functions of the form

f(\mathbf{x}) = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_p x_p

and selects the one that minimizes the quantity

\sum_{i=1}^n \left( y_i - \left( c_0 + \sum_{j = 1}^p c_j x_{ij} \right) \right)^2 + \lambda \sum_{k=1}^p c_k^2

The term on the left is the sum of squares error we saw in linear regression. The new term on the right penalizes large coefficients. The constant λ is a model parameter that needs to be tuned. The larger we make λ, the smaller the coefficients in the resulting model. This type of regularization can help reduce overfitting. One advantage of ridge regression over ordinary least squares is that it will always yield a unique solution as long as λ is greater than zero (even when there are more predictors than data points).
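Here is a small sketch of ridge regression in R using the glmnet package, which the remaining examples also use (glmnet is a common choice for penalized regression, though the original post does not prescribe it; alpha = 0 selects the ridge penalty):

    library(glmnet)

    # Same kind of p > n setup as before
    set.seed(1)
    n <- 20
    p <- 50
    x <- matrix(rnorm(n * p), nrow = n)
    y <- rnorm(n)

    # alpha = 0 gives the ridge (L2) penalty; lambda controls its strength
    ridge_fit <- glmnet(x, y, alpha = 0, lambda = 0.5)
    coef(ridge_fit)  # coefficients are shrunk toward zero, but none are exactly zero

Unlike the lm() fit above, this fit has a unique solution even though p is larger than n.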

The Lasso

The lasso is very similar to ridge regression, with one small but important difference: we still penalize large coefficients, but we measure their size in a different way. The lasso minimizes the quantity

\sum_{i=1}^n \left( y_i - \left( c_0 + \sum_{j = 1}^p c_j x_{ij} \right) \right)^2 + \lambda \sum_{k=1}^p |c_k|

Note that the squares of the coefficients have been replaced by the absolute values of the coefficients. (The lasso penalty is called an L1 penalty, while the ridge penalty is called an L2 penalty.) This subtle change has some interesting consequences. As with ridge regression, the lasso shrinks coefficients.  However, unlike ridge regression, the lasso can shrink individual coefficients to zero. In this way, the lasso can decide not to use variables that are not very predictive.
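Continuing the glmnet sketch, switching from ridge to the lasso just means setting alpha = 1. With simulated data in which only a couple of predictors actually matter, most of the lasso coefficients come out exactly zero:

    library(glmnet)

    set.seed(2)
    n <- 100
    p <- 20
    x <- matrix(rnorm(n * p), nrow = n)
    y <- 2 * x[, 1] - 3 * x[, 5] + rnorm(n)  # only predictors 1 and 5 matter

    # alpha = 1 gives the lasso (L1) penalty
    lasso_fit <- glmnet(x, y, alpha = 1, lambda = 0.1)
    coef(lasso_fit)  # most coefficients are exactly zero; predictors 1 and 5 survive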

To get an idea of how this works, consider the following figure.

The circular contours correspond to coefficient values yielding the same sum of squares error. The ordinary least squares solution is the black dot at the center of the circles; circles farther from the center correspond to larger and larger sum of squares error. The lasso penalty forces the coefficients to lie within the diamond-shaped regions, and the larger λ is, the smaller the diamond-shaped region will be. If λ is small enough, the region will contain the ordinary least squares solution, so the lasso and ordinary least squares solutions will be the same (the black dot lies within the black diamond). A larger value of λ will shrink the coefficients (the blue dot). Increasing λ some more shrinks the diamond to the point where the circles intersect it at a corner (the red dot). In this case, we see that c_1 is set to zero. Increasing λ further will continue to shrink c_2.

(If we made a similar plot for ridge regression, the diamond shaped regions would be replaced by circular regions and none of the dots corresponding to the ridge solutions would lie along the axes.)

The algorithm in healthcareai tries several values of λ and automatically selects the one that yields the best performance.
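The details of how healthcareai does this internally are not shown here, but the general idea can be sketched with cv.glmnet, which fits the lasso over a grid of λ values and picks the best one by cross-validation:

    library(glmnet)

    set.seed(3)
    n <- 100
    p <- 20
    x <- matrix(rnorm(n * p), nrow = n)
    y <- 2 * x[, 1] - 3 * x[, 5] + rnorm(n)

    # Cross-validation over an automatically chosen grid of lambda values
    cv_fit <- cv.glmnet(x, y, alpha = 1)
    cv_fit$lambda.min               # lambda with the lowest cross-validated error
    coef(cv_fit, s = "lambda.min")  # coefficients at that lambda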

Some Additional Remarks

  • Examples of using the lasso in healthcareai can be found in the healthcareai documentation (to access these examples, type “?LassoDeployment” after loading healthcareai into R).
  • There are further generalizations of ridge regression and the lasso. Some examples include the elastic net, which combines the ridge and lasso penalties, and the group lasso, which uses additional information about how certain variables are grouped when performing feature selection. The healthcareai package uses the group lasso. (A short elastic net sketch appears after this list.)
  • As mentioned earlier, linear regression dates back to the early 1800s. Ridge regression was introduced in statistics in 1970, the lasso was introduced in 1996, and the elastic net and group lasso were introduced in 2005 and 2006, respectively. This shows how a small change can be the difference between ancient history and cutting-edge technology.
  • A good reference for learning more about the lasso (as well as many other machine learning methods) is chapter 6 of An Introduction to Statistical Learning. For a more advanced treatment, see chapter 3 of The Elements of Statistical Learning (PDFs of both are freely available on their websites).
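For the curious, the elastic net mentioned above can also be sketched with glmnet: setting alpha strictly between 0 and 1 mixes the lasso (L1) and ridge (L2) penalties. The value 0.5 below is just an illustrative choice, not anything used by healthcareai:

    library(glmnet)

    set.seed(4)
    n <- 100
    p <- 20
    x <- matrix(rnorm(n * p), nrow = n)
    y <- 2 * x[, 1] - 3 * x[, 5] + rnorm(n)

    # alpha = 0.5 blends the L1 and L2 penalties: the elastic net
    enet_fit <- cv.glmnet(x, y, alpha = 0.5)
    coef(enet_fit, s = "lambda.min")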