Mike Levy, Data Scientist, Health Catalyst

Machine Learning Versus Statistics: When to use each

Which is better: machine learning or statistics?

Hopefully the way that question is phrased highlights its ridiculousness, but with all the hype around machine learning these days you’d be forgiven for thinking that machine learning is the answer to all data-related questions. However, statistics departments aren’t shuttering or transitioning wholesale to machine learning, and old-school statistical tests definitely still have a place in healthcare analytics. The two are highly related and share some underlying machinery, but they have different purposes, use cases, and caveats. In this post, we’ll discuss what statistical models and machine learning models respectively excel at and when you should deploy one versus the other.

Prediction vs. Explanation

The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables. Many statistical models can make predictions, but predictive accuracy is not their strength. Likewise, machine learning models provide various degrees of interpretability, from the highly interpretable lasso regression to impenetrable neural networks, but they generally sacrifice interpretability for predictive power.
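To make the two purposes concrete, here is a minimal sketch in Python using only NumPy. The data are synthetic and the variable names (x1, x2, test_mse, and so on) are invented for illustration, not taken from any real analysis. The same fitted regression is read in two ways: first for inference (coefficient estimates with approximate 95% confidence intervals), then for prediction (error on held-out observations).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends on x1 but not x2.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x1, x2])  # intercept, x1, x2

# Hold out the last 50 rows to measure predictive accuracy.
n_train = 150
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Fit ordinary least squares on the training rows.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Explanation: standard errors and approximate 95% confidence intervals.
residuals = y_train - X_train @ coef
dof = n_train - X_train.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_train.T @ X_train)))
ci_low = coef - 1.96 * std_err   # normal approximation to the t quantile
ci_high = coef + 1.96 * std_err

# Prediction: mean squared error on rows the model never saw.
test_mse = np.mean((y_test - X_test @ coef) ** 2)

for name, b, lo, hi in zip(["intercept", "x1", "x2"], coef, ci_low, ci_high):
    print(f"{name:>9}: {b:6.3f}  (95% CI {lo:6.3f} to {hi:6.3f})")
print(f"held-out MSE: {test_mse:.3f}")
```

The confidence intervals answer the explanation question (which variables are related to the outcome, and how strongly), while the held-out error answers the prediction question (how far off the model is on new cases). A model can do well on one and poorly on the other.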

Figure: Interpretability vs. flexibility of machine learning models. The tradeoff between explanatory power (Interpretability) and predictive power (Flexibility) of models is illustrated by the negative relationship in this figure from An Introduction to Statistical Learning.

Forward vs. Rearward Looking

This distinction is highly related to the prediction-explanation distinction. Prediction is obviously forward looking. We train machine learning models on past data to make predictions on current and future data. In contrast, we train statistical models to quantify the relationship between variables, and that relationship only exists in data that was generated in the past. We might hope or assume that those relationships will hold into the future (and indeed machine learning models must make this assumption), but statistical models are used to describe patterns as they were during the period of data collection, whereas machine learning models are used to project patterns into the future.

Big vs. Small Data

Machine learning models need more data than statistical models to perform well. Again, caveats apply: inference from statistical models can be problematic on very small datasets (N ≲ 30; see asymptotic theory), and sometimes machine learning models can make good predictions on little data. But in general, the accuracy of the most powerful predictive models, such as neural networks and random forests, continues to improve with additional thousands or millions of observations. In contrast, statistical models often allow inference and make decent predictions on dozens or hundreds of observations and improve little with the addition of more observations.
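A small simulation can illustrate the pattern. This is a sketch under stated assumptions, not a benchmark: the data are synthetic with a deliberately nonlinear signal, a straight-line fit stands in for a simple statistical model, and a k-nearest-neighbors average (written by hand in NumPy) stands in for a flexible machine learning model. The training sizes and the rule for choosing k are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Draw n observations from a nonlinear signal plus noise (synthetic)."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def linear_predict(x_train, y_train, x_test):
    """Straight-line least squares fit: rigid, but needs little data."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return intercept + slope * x_test

def knn_predict(x_train, y_train, x_test, k):
    """Average the outcomes of the k nearest training points: flexible."""
    dist = np.abs(x_test[:, None] - x_train[None, :])
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# One fixed test set, so errors are comparable across training sizes.
x_test, y_test = simulate(500)

sizes = [30, 300, 3000]
linear_mse, knn_mse = [], []
for n in sizes:
    x_train, y_train = simulate(n)
    k = max(3, int(np.sqrt(n)))  # let the neighborhood grow slowly with n
    linear_mse.append(np.mean((y_test - linear_predict(x_train, y_train, x_test)) ** 2))
    knn_mse.append(np.mean((y_test - knn_predict(x_train, y_train, x_test, k)) ** 2))

for n, lin, knn in zip(sizes, linear_mse, knn_mse):
    print(f"n={n:5d}  linear MSE={lin:.3f}  kNN MSE={knn:.3f}")
```

In this setup the straight line is limited by its own rigidity, so extra observations have little left to fix, while the nearest-neighbors model can use each additional batch of data to trace the curve more closely.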

Many vs. Few Variables

Many machine learning models have mechanisms to sort out which variables contain information relevant to the outcome and which variables would just add noise to the predictions. Statistical models generally don’t have these mechanisms built in. In the extreme, when there are more predictor variables than observations (for example, when using many genes’ status as predictors), conventional statistical models such as ordinary least squares regression fail completely because their coefficients can no longer be uniquely estimated, while machine learning models proceed unfazed. In fact, the lasso is a conventional regression model with some added machinery to automatically choose which variables help make better predictions and which should be ignored. For this reason, the lasso offers a nice combination of the predictive power of machine learning with the interpretability of statistics.
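The more-predictors-than-observations case can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data are synthetic, only three of 200 predictors truly matter, the penalty value of 0.4 is picked by hand rather than tuned, and the lasso is fit with a bare-bones coordinate descent loop written in NumPy rather than with any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 50 observations, 200 predictors,
# and only the first three predictors affect the outcome.
n, p = 50, 200
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n)

# Center the data so no intercept term is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

# Ordinary least squares needs X'X to be invertible. With p > n its rank
# is at most n, so the coefficients are not uniquely determined.
gram_rank = np.linalg.matrix_rank(X.T @ X)

def soft_threshold(value, threshold):
    """Shrink value toward zero, setting it to exactly zero if small."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimize (1/2n)*||y - Xb||^2 + penalty*||b||_1 one coefficient at a time."""
    n_obs, n_features = X.shape
    coef = np.zeros(n_features)
    residual = y.copy()                      # equals y - X @ coef
    col_scale = (X ** 2).sum(axis=0) / n_obs
    for _ in range(n_sweeps):
        for j in range(n_features):
            residual += X[:, j] * coef[j]    # remove predictor j's contribution
            rho = X[:, j] @ residual / n_obs
            coef[j] = soft_threshold(rho, penalty) / col_scale[j]
            residual -= X[:, j] * coef[j]    # add the updated contribution back
    return coef

coef = lasso_coordinate_descent(X, y, penalty=0.4)
selected = np.flatnonzero(coef)

print(f"rank of X'X: {gram_rank} (needs {p} for a unique OLS solution)")
print(f"predictors kept by the lasso: {selected.tolist()}")
```

The soft-thresholding step is the "added machinery": any predictor whose association with the current residual is weaker than the penalty gets a coefficient of exactly zero and drops out of the model. In practice the penalty is usually chosen by cross-validation rather than fixed in advance.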

Summary

If you want predictive accuracy, have many observations, and/or have many variables in your dataset, machine learning models are the way to go. On the other hand, if your primary purpose is explanation rather than prediction, a statistical model may be more appropriate. Lasso regression is a nice intermediate between conventional regression models and black-box machine learning models: it allows inference and makes powerful predictions. To develop a lasso or other machine learning model for your own purposes, check out the examples at the healthcare.ai website.