Machine Learning Versus Statistics: When to Use Each


Which is better: machine learning or statistics?

Hopefully the way that question is phrased highlights its ridiculousness, but with all the hype around machine learning these days you’d be forgiven for thinking that machine learning is the answer to all data-related questions. However, statistics departments aren’t shuttering or transitioning wholesale to machine learning, and old-school statistical tests definitely still have a place in healthcare analytics. The two are highly related and share some underlying machinery, but they have different purposes, use cases, and caveats. In this post, we’ll discuss what statistical models and machine learning models respectively excel at and when you should deploy one versus the other.

Prediction vs. Explanation

The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables. Many statistical models can make predictions, but predictive accuracy is not their strength. Likewise, machine learning models provide various degrees of interpretability, from the highly interpretable lasso regression to impenetrable neural networks, but they generally sacrifice interpretability for predictive power.
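To make the distinction concrete, here is a minimal Python sketch. It is not code from healthcareai: the simulated data, the true slope of 2.0, and the helper names `simulate` and `fit_simple_ols` are all made up for illustration. It fits a one-variable linear regression and then uses the same fit two ways: for inference (the slope, its standard error, and an approximate 95% interval based on the normal multiplier 1.96) and for prediction (mean squared error on held-out data).

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simulate(n):
    """Hypothetical data: y = 1 + 2x + noise, with noise ~ N(0, 1)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_simple_ols(xs, ys):
    """Return (intercept, slope, standard error of the slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope_se


train_x, train_y = simulate(200)
test_x, test_y = simulate(100)
intercept, slope, slope_se = fit_simple_ols(train_x, train_y)

# Explanation: how strongly is x related to y, and how sure are we?
ci_low = slope - 1.96 * slope_se
ci_high = slope + 1.96 * slope_se
print(f"slope = {slope:.3f}, approx. 95% interval = ({ci_low:.3f}, {ci_high:.3f})")

# Prediction: how far off are we on data the model has not seen?
test_mse = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(test_x, test_y)
) / len(test_x)
print(f"held-out mean squared error = {test_mse:.3f}")
```

The first printed line is what a statistician would typically report; the second is what a machine learning practitioner would typically optimize.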

Forward vs. Rearward Looking

This distinction is highly related to the prediction-explanation distinction. Prediction is obviously forward looking. We train machine learning models on past data to make predictions on current and future data. In contrast, we train statistical models to quantify the relationship between variables, and that relationship only exists in data that was generated in the past. We might hope or assume that those relationships will hold into the future (and indeed machine learning models must make this assumption), but statistical models are used to describe patterns as they were during the period of data collection, whereas machine learning models are used to project patterns into the future.

Big vs. Small Data

Machine learning models need more data than statistical models to perform well. Again, caveats apply: inference from statistical models can be problematic on very small datasets (N ≲ 30; see asymptotic theory), and sometimes machine learning models can make good predictions on little data. But in general, the accuracy of the most powerful predictive models, such as neural networks and random forests, continues to improve with additional thousands or millions of observations. In contrast, statistical models often allow inference and make decent predictions on dozens or hundreds of observations and improve little with the addition of more observations.
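The toy simulation below sketches this pattern under some loudly stated assumptions. The data are simulated from a sine curve, a straight-line fit stands in for a fixed-form statistical model, and k-nearest-neighbours averaging stands in for a flexible learner (it is not one of the models named above, just something short enough to write in plain Python). All function names are hypothetical.

```python
import heapq
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def simulate(n):
    """Hypothetical nonlinear data: y = sin(x) + noise, with noise ~ N(0, 0.3)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_knn(xs, ys):
    """Average of the k nearest training points, with k about sqrt(n)."""
    k = max(1, round(math.sqrt(len(xs))))
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = heapq.nsmallest(k, pairs, key=lambda pair: abs(pair[0] - x))
        return sum(y for _, y in nearest) / k

    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


test_x, test_y = simulate(500)
results = {}
for n in (30, 300, 3000):
    train_x, train_y = simulate(n)
    line_mse = mse(fit_line(train_x, train_y), test_x, test_y)
    knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
    results[n] = (line_mse, knn_mse)
    print(f"n = {n:5d}  line MSE = {line_mse:.3f}  k-NN MSE = {knn_mse:.3f}")
```

In a run like this, the straight line's error should level off quickly because its form is fixed, while the flexible model's error should keep dropping as the training set grows. Real datasets will not always behave this cleanly.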

Many vs. Few Variables

Many machine learning models have mechanisms to sort out which variables contain information relevant to the outcome and which variables would just add noise to the predictions. Conventional statistical models generally don’t have these mechanisms built in. In the extreme, when there are more predictor variables than observations (for example, when using many genes’ status as predictors), conventional regression models fail completely because their coefficients no longer have a unique solution, while many machine learning models proceed unfazed. The lasso illustrates the difference: it is a conventional regression model with some added machinery (a penalty on the size of the coefficients) that automatically chooses which variables help make better predictions and which should be ignored. For this reason, the lasso offers a nice combination of the predictive power of machine learning with the interpretability of statistics.
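For readers curious what that added machinery looks like, here is a minimal sketch of the lasso fitted by coordinate descent on simulated data with more predictors (60) than observations (30). This is an illustration under our own assumptions, not the implementation used by healthcareai: the data, the penalty value of 0.5, and the function names are hypothetical, and there is no intercept because the simulated variables are centered at zero. The objective minimized is (1/2n) times the sum of squared residuals plus the penalty times the sum of absolute coefficients.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

n_obs, n_features = 30, 60            # more predictors than observations
true_coefs = [4.0, -3.0, 2.0] + [0.0] * (n_features - 3)

# Hypothetical data: only the first three predictors affect the outcome.
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_obs)]
y = [
    sum(c * x for c, x in zip(true_coefs, row)) + rng.gauss(0.0, 0.5)
    for row in X
]


def soft_threshold(value, threshold):
    """Shrink value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Update one coefficient at a time, holding the others fixed."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    residual = list(y)  # equals y - X @ coefs while all coefs are zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Association of column j with the residual, adding back
            # column j's own current contribution.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_scale[j] * coefs[j]
            new_coef = soft_threshold(rho, penalty) / col_scale[j]
            delta = new_coef - coefs[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                coefs[j] = new_coef
    return coefs


coefs = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, c in enumerate(coefs) if c != 0.0]
print(f"kept {len(selected)} of {n_features} predictors: {selected}")
print("first three coefficients:", [round(c, 2) for c in coefs[:3]])
```

The soft-thresholding step is what sets the coefficients of unhelpful variables to exactly zero, and the coefficients that survive can still be read like ordinary regression coefficients, although the penalty shrinks them toward zero. In practice the penalty is usually tuned by cross-validation rather than fixed by hand.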

Summary

If you want predictive accuracy, have many observations, and/or have many variables in your dataset, machine learning models are the way to go. On the other hand, if your primary purpose is explanation rather than prediction, a statistical model may be more appropriate. The lasso regression is a nice intermediate between conventional regression models and black-box machine learning models that allows inference and makes powerful predictions. To develop a lasso or other machine learning model for your own purposes, check out the examples at the healthcare.ai website.


This project was started by and receives ongoing support from Health Catalyst.