Machine Learning Versus Statistics: When to Use Each

Which is better: machine learning or statistics?

Hopefully the way that question is phrased highlights its ridiculousness, but with all the hype around machine learning these days you’d be forgiven for thinking that machine learning is the answer to all data-related questions. However, statistics departments aren’t shuttering or transitioning wholesale to machine learning, and old-school statistical tests definitely still have a place in healthcare analytics. The two are highly related and share some underlying machinery, but they have different purposes, use cases, and caveats. In this post, we’ll discuss what statistical models and machine learning models respectively excel at and when you should deploy one versus the other.

Prediction vs. Explanation

The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables. Many statistical models can make predictions, but predictive accuracy is not their strength. Likewise, machine learning models provide various degrees of interpretability, from the highly interpretable lasso regression to impenetrable neural networks, but they generally sacrifice interpretability for predictive power.
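To make the distinction concrete, here is a minimal sketch in plain Python that asks both kinds of question of the same simple regression. The inferential question is "how strong is the relationship, and how sure are we?", answered by the slope, its standard error, and a t-statistic. The predictive question is "how wrong are we on data the model has not seen?", answered by held-out error. The data and the function names `fit_simple_regression` and `mean_squared_error` are hypothetical and chosen only for illustration.

```python
import math


def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns the slope, the intercept, and the standard error of the slope.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt((sse / (n - 2)) / sxx)
    return slope, intercept, slope_se


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy data: four past observations and two later ones.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 2.0, 4.0]
new_x = [4.0, 5.0]
new_y = [5.0, 5.0]

slope, intercept, slope_se = fit_simple_regression(train_x, train_y)

# Explanation: describe the relationship and its uncertainty.
t_statistic = slope / slope_se
print(f"slope={slope:.3f}  se={slope_se:.3f}  t={t_statistic:.3f}")

# Prediction: judge the model only by its error on unseen data.
predictions = [intercept + slope * x for x in new_x]
holdout_mse = mean_squared_error(new_y, predictions)
print(f"held-out MSE={holdout_mse:.3f}")
```

The model is the same in both halves; what changes is which output we care about. A statistical analysis would stop at the slope and its uncertainty, while a machine learning workflow would mostly ignore them and tune the model against the held-out error.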

Forward vs. Rearward Looking

This distinction is highly related to the prediction-explanation distinction. Prediction is obviously forward looking. We train machine learning models on past data to make predictions on current and future data. In contrast, we train statistical models to quantify the relationship between variables, and that relationship only exists in data that was generated in the past. We might hope or assume that those relationships will hold into the future (and indeed machine learning models must make this assumption), but statistical models are used to describe patterns as they were during the period of data collection, whereas machine learning models are used to project patterns into the future.

Big vs. Small Data

Machine learning models generally need more data than statistical models to perform well. Again, caveats apply: inference from statistical models can be problematic on very small datasets (N ≲ 30; see asymptotic theory), and sometimes machine learning models can make good predictions on little data. But in general, the accuracy of the most powerful predictive models, such as neural networks and random forests, continues to improve with additional thousands or millions of observations. In contrast, statistical models often allow inference and make decent predictions on dozens or hundreds of observations and improve little with the addition of more observations.
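A toy learning curve illustrates the pattern. The sketch below is an assumption-laden illustration, not a benchmark: the "true" relationship is a noiseless curve, y = x², a straight-line fit stands in for a simple statistical model, and a one-nearest-neighbor predictor stands in for a flexible machine learning model. All names (`fit_line`, `predict_nearest`, `learning_curve`) are hypothetical.

```python
def target(x):
    """Assumed true relationship: a smooth curve that no straight line matches."""
    return x * x


def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_nearest(train_x, train_y, x_new):
    """One-nearest-neighbor: copy the outcome of the closest training point."""
    closest = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x_new))
    return train_y[closest]


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Fixed evaluation points, separate from any training grid.
test_x = [0.05 + 0.1 * k for k in range(10)]
test_y = [target(x) for x in test_x]


def learning_curve(sizes):
    """Held-out error of both models at each training-set size."""
    rows = []
    for n in sizes:
        train_x = [k / (n - 1) for k in range(n)]  # n evenly spaced points on [0, 1]
        train_y = [target(x) for x in train_x]
        slope, intercept = fit_line(train_x, train_y)
        line_mse = mean_squared_error(
            test_y, [intercept + slope * x for x in test_x])
        nearest_mse = mean_squared_error(
            test_y, [predict_nearest(train_x, train_y, x) for x in test_x])
        rows.append((n, line_mse, nearest_mse))
    return rows


curve = learning_curve([3, 11, 101])
for n, line_mse, nearest_mse in curve:
    print(f"n={n:4d}  line MSE={line_mse:.5f}  nearest-neighbor MSE={nearest_mse:.5f}")
```

With three observations the straight line is the better predictor. As the training set grows, the line's error levels off because its form is fixed, while the nearest-neighbor error keeps shrinking. Real data are noisy and real models are more elaborate, so the crossover point will differ, but the shape of the trade-off is the same.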

Many vs. Few Variables

Many machine learning models have mechanisms to sort out which variables contain information relevant to the outcome and which variables would just add noise to the predictions. Conventional statistical models generally don’t have these mechanisms built in. In the extreme, when there are more predictor variables than observations (for example, when using many genes’ status as predictors), conventional regression models break down because their coefficients can no longer be uniquely estimated, while machine learning models that select or shrink variables proceed unfazed. The lasso is a good example: it is a conventional regression model with some added machinery to automatically choose which variables help make better predictions and which should be ignored. For this reason, the lasso offers a nice combination of the predictive power of machine learning with the interpretability of statistics.
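The "added machinery" in the lasso is an L1 penalty on the coefficients, which pushes the coefficients of unhelpful variables to exactly zero. The sketch below fits a lasso by cyclic coordinate descent in plain Python, assuming centered predictors and outcome (so no intercept is fitted) and the objective (1/(2n))·||y − Xb||² + λ·||b||₁. It is a teaching sketch with hypothetical names and made-up data, not the implementation used by healthcare.ai or any particular library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Lasso coefficients for rows X and outcomes y with penalty weight lam.

    Assumes the columns of X and the outcome y are already centered.
    """
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta starting at zero
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue  # a constant (all-zero) column carries no information
            # Correlation of column j with the residual that ignores column j.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            updated = soft_threshold(rho, lam) / scale
            change = updated - beta[j]
            if change != 0.0:
                residual = [r - v * change for v, r in zip(col, residual)]
                beta[j] = updated
    return beta


# Made-up data with more predictors (5) than observations (3).
# The outcome is driven by the first predictor only.
X = [
    [1.0, 0.5, -0.2, 0.1, 0.3],
    [-1.0, 0.4, 0.3, -0.2, 0.1],
    [0.0, -0.9, -0.1, 0.1, -0.4],
]
y = [2.0, -2.0, 0.0]

coefficients = lasso_coordinate_descent(X, y, lam=0.1)
kept = [j for j, b in enumerate(coefficients) if b != 0.0]
print("coefficients:", [round(b, 3) for b in coefficients])
print("variables kept:", kept)
```

Ordinary least squares has no unique answer for these data, because three observations cannot pin down five coefficients. The penalty resolves the ambiguity by preferring solutions with few nonzero coefficients, and raising `lam` drops more variables. The coefficients that survive can still be read like regression coefficients, which is where the lasso's interpretability comes from.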

Summary

If you want predictive accuracy, have many observations, and/or have many variables in your dataset, machine learning models are the way to go. On the other hand, if your primary purpose is explanation rather than prediction, a statistical model may be more appropriate. Lasso regression is a nice middle ground between conventional regression models and black-box machine learning models: it allows inference and makes powerful predictions. To develop a lasso or other machine learning model for your own purposes, check out the examples at the healthcare.ai website.

This project was started by and receives ongoing support from Health Catalyst.