A Visual Tour of Lasso and Random Forest
The two main algorithms used for binary classification in healthcareai are logistic regression with a Lasso penalty (from now on, simply the Lasso) and random forests. In this post, we’ll visually explore the behavior of the Lasso and random forest models on some artificial 2-dimensional datasets to help build intuition about how the algorithms work and what types of datasets each algorithm performs well on. All the Lasso and random forest models plotted below were built using the healthcareai R package (the plots were built in R using the output from the models). Throughout, we will see two features at play:
- The Lasso is a linear method (in the classification setting, the Lasso is based on logistic regression) and does a good job when the true decision boundary is linear (when the classes can be separated by a line, plane, or hyperplane). Being restricted to linear boundaries is one of the reasons the Lasso does a good job of avoiding overfitting. On the other hand, the Lasso has a harder time dealing with highly nonlinear relationships in the data.
- Random forests use an ensemble of decision trees to make decisions. Trees iteratively split data into two pieces, using a single variable for each split. In 2-dimensions, each split corresponds to a vertical or horizontal line. In this way, decision trees group data points into different rectangles. Random forests are built by averaging lots and lots of trees and tend to classify the data through rectangular chunks in the same way. Random forests are good at modeling complicated relations in the data as long as there’s enough data to work with.
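To make the "one variable per split" idea concrete, here is a minimal sketch in Python of how a single tree node might pick its split. It is an illustration only: the Gini impurity criterion and the function names `gini` and `best_split` are assumptions for this example, not a description of how healthcareai builds its trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(points, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best split.

    Every candidate split uses a single variable, so in two dimensions it is
    a vertical line (feature 0) or a horizontal line (feature 1).
    """
    n = len(points)
    best = None
    for feature in range(len(points[0])):
        values = sorted(set(p[feature] for p in points))
        # Candidate thresholds sit halfway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for p, y in zip(points, labels) if p[feature] <= threshold]
            right = [y for p, y in zip(points, labels) if p[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

# Toy data: the class depends only on the x-coordinate (feature 0).
points = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.0), (8.0, 1.0)]
labels = ["blue", "blue", "blue", "red", "red", "red"]
feature, threshold, impurity = best_split(points, labels)
print(feature, threshold, impurity)
```

On this toy data the chosen split is a vertical line between x = 3 and x = 6. A full tree would repeat the search inside each of the two resulting pieces, which is how the rectangles arise.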
Real datasets for machine learning in healthcare tend to be more complicated than the ones we’ll consider in this post. There will often be more variables, a mixture of numeric and categorical variables, and even missing data. Though too much simplification can be problematic (e.g. spherical cows are a simplification too far), understanding simple problems is often the key to solving more difficult problems.
We’ll be looking at several toy datasets which we’ll organize by the shape of the decision boundary. In classification problems, the decision boundary is a curve (in 2-dimensions; for higher-dimensional data sets, the decision boundary will be a hypersurface) which traces out the boundary between the two classes. If we’re classifying plots of land as belonging to the state of Utah or the state of Colorado, the decision boundary is a nice straight line. If instead we’re classifying land around the Great Salt Lake as lake or land, the decision boundary will be a lot more complicated and may include several different pieces to deal with the islands in the lake.
Linear Decision Boundary
Let’s start with a linear decision boundary. In this case, the two classes are separated by a line (in 2-dimensions; in higher dimensions, the classes will be separated by a hyperplane). A linear decision boundary can occur if changing variables by a fixed amount always has the same effect, regardless of the starting value. Consider the following example: suppose every additional hour of daily exercise decreases an individual’s odds of suffering a heart attack by 5 percent and each additional cigarette smoked daily increases the odds by 8 percent (the odds is the probability of suffering a heart attack divided by the probability of not suffering a heart attack). Here, we’re assuming the difference between no exercise and 1 hour of exercise is the same as the difference between 15 hours of exercise and 16 hours of exercise. Such a scenario would lead to a linear decision boundary. In a situation like this, we would expect a good model to make strong predictions when the variables are working together (e.g., a heavy smoker who never exercises) and weak predictions when the variables counteract each other (e.g., a smoker who exercises a lot).
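The arithmetic behind this example can be sketched in a few lines of Python. The 5 percent and 8 percent effects come from the example above. The baseline odds of 0.10 is a made-up number used only so the code can run. Because each variable multiplies the odds by a fixed factor, the log of the odds is a linear function of the two variables, which is why the boundary is a line.

```python
import math

BASELINE_ODDS = 0.10      # hypothetical baseline odds, chosen only for illustration
EXERCISE_FACTOR = 0.95    # each extra hour of daily exercise lowers the odds by 5 percent
CIGARETTE_FACTOR = 1.08   # each extra daily cigarette raises the odds by 8 percent

def odds(hours, cigarettes):
    """Odds of a heart attack: every unit change multiplies by a fixed factor."""
    return BASELINE_ODDS * EXERCISE_FACTOR ** hours * CIGARETTE_FACTOR ** cigarettes

def log_odds(hours, cigarettes):
    """The same quantity on the log scale: linear in hours and cigarettes."""
    return (math.log(BASELINE_ODDS)
            + hours * math.log(EXERCISE_FACTOR)
            + cigarettes * math.log(CIGARETTE_FACTOR))

def probability(hours, cigarettes):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    o = odds(hours, cigarettes)
    return o / (1 + o)

# The first and the sixteenth hour of exercise change the odds by the same factor.
print(odds(1, 0) / odds(0, 0), odds(16, 0) / odds(15, 0))
```

Any fixed probability cutoff corresponds to a fixed value of the log-odds, and the set of points where a linear function equals a constant is a straight line in the (hours, cigarettes) plane.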
While it’s rare to run into a problem where the decision boundary is truly linear (for example, the first hour and sixteenth hour of exercise probably won’t affect heart attack risk in the exact same way), there are many cases where the true decision boundary is approximately linear. Thus, even though this is a simplified problem, it is also a very useful one.
The true decision boundary is represented by the black line. We added some noise to the data so the line does not perfectly separate out the two colors, but overall the red points are concentrated in the upper-right and the blue points in the lower-left. Let’s examine a Lasso model built using these points.
The training set is plotted on the left. The middle plot shows the Lasso model’s classification boundary. The two plots on the right were generated by selecting several hundred evenly spaced points in the plane and giving these to the Lasso model to make predictions on. Note that the middle plot doesn’t quite match up with the true decision boundary – for example, there’s at least one red dot below the line – but comes very close.
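The grid-and-predict procedure can be sketched as follows. This is Python rather than the R code used for the actual plots, and the coefficients are hypothetical stand-ins for a fitted model (they put the boundary on the line x + y = 1), so treat it as an outline of the idea rather than the code behind the figures.

```python
import math

def make_grid(x_min, x_max, y_min, y_max, steps):
    """Return steps * steps evenly spaced (x, y) points covering the rectangle."""
    xs = [x_min + i * (x_max - x_min) / (steps - 1) for i in range(steps)]
    ys = [y_min + j * (y_max - y_min) / (steps - 1) for j in range(steps)]
    return [(x, y) for y in ys for x in xs]

def predict_probability(point, intercept, coef_x, coef_y):
    """Logistic model: predicted probability of the red class at a point."""
    x, y = point
    z = intercept + coef_x * x + coef_y * y
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients standing in for a fitted Lasso model.
INTERCEPT, COEF_X, COEF_Y = -4.0, 4.0, 4.0

grid = make_grid(0.0, 1.0, 0.0, 1.0, steps=20)  # 400 evenly spaced points
probabilities = [predict_probability(p, INTERCEPT, COEF_X, COEF_Y) for p in grid]

# Middle plot: hard class labels. Right plot: the probabilities themselves,
# where values near 0.5 would be drawn in pale colors.
hard_labels = ["red" if p >= 0.5 else "blue" for p in probabilities]
```

Coloring each grid point by `hard_labels` gives a picture like the middle plot, and shading by `probabilities` gives a picture like the one on the right.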
Next, let’s look at a random forest trained on the same data.
The random forest does a good job of classifying the data in the area where most of the data is concentrated, but does a poor job toward the edges of the plot, where there is little training data. The graph on the right shows some nuance in the upper-left and lower-right corners: data points in the right half of the training data tend to be red, but data points in the bottom half of the training data tend to be blue, and the random forest makes weak predictions where these conditions overlap. We can also see rectangular patterns in the two graphs on the right, reflecting the way in which random forests are built.
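The weak predictions in the corners can be illustrated with a deliberately crude caricature of a forest. In this Python sketch, half of the "trees" are one-split stumps on x and half are one-split stumps on y. Real random forests grow much deeper trees on resampled data, and the 0.5 thresholds here are invented for the illustration.

```python
def stump_on_x(point):
    """Votes red (1) for points in the right half, blue (0) otherwise."""
    return 1 if point[0] > 0.5 else 0

def stump_on_y(point):
    """Votes red (1) for points in the top half, blue (0) otherwise."""
    return 1 if point[1] > 0.5 else 0

# A toy "forest": 50 trees that split on x and 50 trees that split on y.
forest = [stump_on_x] * 50 + [stump_on_y] * 50

def forest_probability(point):
    """Fraction of trees voting red, used as the predicted probability of red."""
    return sum(tree(point) for tree in forest) / len(forest)

for corner in [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]:
    print(corner, forest_probability(corner))
```

In the upper-right and lower-left corners every tree agrees, so the prediction is confident. In the other two corners the trees that looked at x and the trees that looked at y disagree, and the averaged prediction lands at 0.5.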
Quadratic Decision Boundary
As mentioned above, the true decision boundary is rarely linear, but might still be approximately linear. To see how the Lasso and random forest algorithms handle such cases, consider the following data set, split in two by a quadratic curve.
As expected, the Lasso tries to separate the data using a line and, except for the corners of the graph, does a pretty good job of capturing the true boundary.
In fact, the Lasso seems to do a better job of approximating the true decision boundary than the random forest.
Oscillating Decision Boundary
In the examples we’ve considered so far, the Lasso seems to come out ahead. Let’s look at some situations the Lasso has a harder time dealing with.
Let’s start by taking a linear boundary and adding periodic nonlinear behavior in the form of regular oscillations. Such periodic behavior can occur when one of your variables is time. For example, suppose you’re out backpacking in the Wasatch (in which case, we’re a little jealous). The more time you spend outdoors, the more likely you are to get sunburned, but your chance of getting sunburned doesn’t increase at a constant rate: time spent outside in the early afternoon greatly increases your chance of sunburn while time spent outside after sunset doesn’t increase it at all. Tracking your risk over several days, you’d expect to see periodic behavior. In healthcare, you might get periodic behavior from hospital staffing: If Dr. Awesome-Outcomes works on Mondays and Dr. Not-So-Stellar works on Tuesdays, you might see regular bumps and dips in outcomes. Similarly, you might expect different behavior on weekdays and weekends.
The Lasso can’t deal with each individual oscillation and instead averages out their behavior.
Notice how the pale band between the dark blue and dark red points is much wider than in the first example. The Lasso recognizes there is some subtle behavior going on that it’s not fully equipped to deal with and returns weak predictions in that region.
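One way to see why the uncertain band widens is to compare an oscillating boundary with the straight line that averages it out. In the Python sketch below, the boundary y = x + 0.1 sin(10πx) is a hypothetical stand-in, since the formula used to generate the dataset in this post is not given.

```python
import math

def true_label(x, y):
    """Hypothetical oscillating boundary: the line y = x plus a sine wave."""
    return 1 if y > x + 0.1 * math.sin(10 * math.pi * x) else 0

def line_label(x, y):
    """The averaged-out straight boundary y = x."""
    return 1 if y > x else 0

steps = 50
grid = [(i / (steps - 1), j / (steps - 1)) for i in range(steps) for j in range(steps)]

# Fraction of grid points where the straight line agrees with the true boundary.
agreement = sum(true_label(x, y) == line_label(x, y) for x, y in grid) / len(grid)

# Every point the line gets wrong lies within the oscillation amplitude of the line.
disagreements = [(x, y) for x, y in grid if true_label(x, y) != line_label(x, y)]
max_distance = max(abs(y - x) for x, y in disagreements)
print(agreement, max_distance)
```

The straight line is right for most of the plane, but all of its mistakes are packed into a band around the line whose width is set by the size of the oscillations. A linear model trained on such data sees both colors mixed throughout that band, which is consistent with the pale predictions there.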
The random forest is better equipped to deal with the individual oscillations. As before, we see that the decision boundary is built out of rectangles and the predictions are weak far away from the bulk of the data.
Circular Decision Boundary
Next, let’s look at a circular boundary. Such a boundary might arise if one of the classes is characterized by extreme values. For example, eating too little or too much can both be unhealthy. Similarly, there are health problems associated with both low blood pressure and high blood pressure.
This type of dataset really throws off the Lasso: how do you split this data in two with a line?
If you think the Lasso’s answer to this question – the graph in the middle – is unconvincing, it may be comforting to know that the Lasso thinks so too: the points in the graph on the right are not very dark. On the other hand, it’s quite easy to approximate circles with rectangles and squares. And, indeed, a random forest is able to handle this dataset pretty well.
It should be noted that we can improve the Lasso’s performance by expanding our feature space: if we train our Lasso using not just the x- and y-coordinates, but higher order terms, we can get a nonlinear decision boundary (technically, the decision boundary is still linear, but in a higher-dimensional space). For example, incorporating quadratic terms (x^2, y^2, and xy), we get the following picture:
The model is still underfitting the data. The Lasso performs automatic feature selection (as described in this blog post): in this example, it has chosen to discard both y^2 and xy. The variable xy is indeed not needed, but y^2 is. Nonetheless, the picture has improved from the original Lasso fit.
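The remark that the boundary is "still linear, but in a higher-dimensional space" can be made concrete with a short Python sketch. The weights below are picked by hand to describe a circle of radius 1 around the origin. They are not the coefficients the Lasso found, and the function names are invented for this example.

```python
def expand_quadratic(x, y):
    """Map a 2-dimensional point to the five features x, y, x^2, y^2, xy."""
    return [x, y, x * x, y * y, x * y]

def linear_score(features, intercept, weights):
    """An ordinary linear function of the expanded features."""
    return intercept + sum(w * f for w, f in zip(weights, features))

# Hand-picked weights for x, y, x^2, y^2, xy: the score is x^2 + y^2 - 1,
# which is zero exactly on the circle of radius 1 centered at the origin.
INTERCEPT = -1.0
WEIGHTS = [0.0, 0.0, 1.0, 1.0, 0.0]

def outside_circle(x, y):
    """Classify a point using a linear rule in the expanded feature space."""
    return linear_score(expand_quadratic(x, y), INTERCEPT, WEIGHTS) > 0

print(outside_circle(2.0, 0.0), outside_circle(0.0, 0.0))
```

Note that the weight on xy is zero while the weights on both x^2 and y^2 are not. A circular boundary needs both squared terms, which is why a fit that drops y^2 still underfits.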
UCI Pima Indians Diabetes Dataset
For one last example, let’s consider a real-world dataset: the Pima Indians Diabetes Dataset from the UCI Machine Learning Repository. The full dataset contains 8 different variables, but we will only use 2 of them, BMI and age. After removing rows with missing BMI values, we get the following picture:
Here, red corresponds to individuals with diabetes and blue to individuals without diabetes. Only individuals 21 or older are included in the dataset, which is why the points in the bottom of the graph are not as spread out as those in the top.
Can you guess how the Lasso and random forest will deal with this dataset? Where do you expect the models to do a good job classifying the data? Where do you expect them to do poorly?
Here’s a Lasso:
And here’s a random forest:
Do the graphs look the way you expected they would? It turns out that, despite the differences in the graphs, the two models have approximately the same performance metrics (e.g., the area under the ROC curve is 0.74 for Lasso and 0.75 for random forest).
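For readers who have not met the area under the ROC curve before, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The Python sketch below computes it that way on a made-up four-row example. It is not the code or the data behind the 0.74 and 0.75 figures.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example gets the higher score. Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = diabetes) and predicted probabilities.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(example_labels, example_scores))
```

Because this metric depends only on how the cases are ranked, two models whose prediction maps look quite different can still end up with nearly the same value.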
Hopefully you now have some intuition for how the Lasso and random forests behave that will be helpful when working with more complex datasets that can’t be so nicely represented in 2-dimensions: The Lasso uses a linear decision boundary which can help to avoid overfitting, but can lead to poor performance if the relationship between the predictors and the response variable is highly nonlinear. Random forests can deal well with linear and nonlinear boundaries where there is enough data, but may not give very good predictions in regions where the training data is sparse.
Let us know what interesting things you find, what you like, and what you would like to see added or changed.
Thanks for reading.