Mike Levy, Data Scientist, Health Catalyst

What’s New in healthcare.ai Version 2.0

We are thrilled to announce the release of version 2.0 of our R package, healthcare.ai. The goal of the software is to make it as easy and fast as possible to put machine learning models to work for health systems. We overhauled the code for this release to make the package even easier to use, to automatically avert problems that commonly arise in machine learning deployments, and to boost models’ predictive power. This post describes how the package does that. If you’re more of a hands-on type, hop over to the package’s brand-new website, which provides full walk-through examples of machine learning projects and documentation of all the key functions, then download the package and play with it yourself.

Intuitive Functions

healthcare.ai 2.0 provides a remarkably simple interface to machine learning. With a single line of code, you can clean your data, get it ready for modeling, and train a bunch of models to find the one that performs best. All healthcare.ai needs to know is where the data is and what variable should be predicted: models = machine_learn(data, outcome = target). Then, you can make predictions just as easily: predictions = predict(models, new_data).

Once you have your models and/or your predictions, you can inspect them in consistent and intuitive ways. Simply typing the name of an object gives you the “need to know” information, and functions like plot, summary, and evaluate work consistently on different kinds of objects. So, for example, you can get information about model performance by calling evaluate(models) or evaluate(predictions), and you can get informative plots of a variety of types of objects by simply calling plot(object). Our hope is that as you use the package, you’ll quickly develop a “feel” for how things work and won’t even need to read the documentation to do something new. That said, we’ve put a lot of effort into making the documentation as easy to use as possible…
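Here is a minimal sketch of that workflow from start to finish. It assumes the package is installed and uses pima_diabetes, the example dataset that ships with it. The patient_id column is passed as an identifier to ignore, and split_train_test holds out some rows to play the role of new patients. Argument names and defaults may differ between versions, so treat this as an illustration and check the documentation for the release you have.

```r
library(healthcareai)

# Hold out part of the bundled example data to stand in for "new" patients.
d_split <- split_train_test(d = pima_diabetes, outcome = diabetes, seed = 84105)

# Clean and prepare the data, then train and tune models in one call.
# patient_id is an identifier, so it is listed as a column to ignore.
models <- machine_learn(d_split$train, patient_id, outcome = diabetes)

# Printing the object gives the "need to know" summary.
models

# Predict on the held-out rows; data preparation is reapplied automatically.
predictions <- predict(models, d_split$test)

# The same verbs work on models and on predictions.
evaluate(models)
evaluate(predictions)
plot(predictions)
```

Because the held-out rows include the true outcome, evaluate(predictions) can report performance on data the models never saw.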

Easy-to-Read Documentation and Examples

One of our main goals for the package is to provide documentation that is easy to read and understand while also being comprehensive and accurate. Every important function is documented on the package website with straightforward examples, and there are a growing number of vignettes that walk through common uses of the package. Additionally, we use the package in our own machine learning work, and we’ve made enough mistakes to know where the common “gotchas” are, so we’ve written warning and error messages that should make it easy to identify what’s gone wrong and figure out how to correct it.

Machine Learning Quickly and Safely

There are literally hundreds of packages for R that do machine learning in some form or another. What sets healthcare.ai apart? In addition to a growing set of healthcare-specific functions, it brings machine learning within reach of any analyst who knows the basics of R. You don’t need a statistics PhD or an understanding of ROC curves to fit a good model — the package takes care of all of that for you. Multiple algorithms are tuned to maximize performance, and performance is evaluated in such a way that you don’t have to worry about overfitting: The performance you get in deployment should be at least as good as the performance you saw in training.
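To make the overfitting point concrete, here is a base-R sketch of the general idea behind out-of-fold evaluation, in which every row is scored by a model that never saw it during fitting. This is not healthcare.ai's implementation. The mtcars data and the simple linear model are stand-ins chosen only so the example runs anywhere.

```r
set.seed(2018)
d <- mtcars
k <- 5

# Randomly assign each row to one of k folds of roughly equal size.
fold <- sample(rep(seq_len(k), length.out = nrow(d)))

# Score every row with a model trained on the other folds only.
out_of_fold <- numeric(nrow(d))
for (i in seq_len(k)) {
  held_out <- fold == i
  fit <- lm(mpg ~ wt + hp, data = d[!held_out, ])
  out_of_fold[held_out] <- predict(fit, newdata = d[held_out, ])
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Error measured on the rows used for fitting is optimistic;
# error on held-out rows is the more honest estimate.
in_sample <- rmse(d$mpg, fitted(lm(mpg ~ wt + hp, data = d)))
cross_validated <- rmse(d$mpg, out_of_fold)
c(in_sample = in_sample, cross_validated = cross_validated)
```

Reporting the held-out number rather than the in-sample one is what keeps a performance estimate from flattering an overfit model.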

Perhaps more importantly, our years of doing applied machine learning have taught us where the potential pitfalls are, and we built the package to help avoid them. Missingness in your data? No problem, we’ll impute sensible values. Got a class to make a prediction on that the model never saw in training? We built in a safeguard for that. Want to plot predictions against realized outcomes? It’s as easy as plot(predictions).

Not only that, but when you prepare your data with healthcare.ai, a data transformation recipe gets built so that when you want to make predictions on a new dataset, the new data is transformed exactly the same way as the training data was, whether you’re making a single prediction or millions of them. No need to go back and remember how you prepared the data; no need even to understand what a recipe is or where it lives. Everything just works.
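For the curious, the idea behind a recipe can be sketched in a few lines of base R: learn the preparation parameters from the training data once, store them, and replay them on anything new. The helper names learn_prep and apply_prep are hypothetical and the data is made up; this is a conceptual illustration, not the package's actual recipe machinery.

```r
# Learn preparation parameters from the training data only.
learn_prep <- function(train) {
  num_cols <- names(train)[vapply(train, is.numeric, logical(1))]
  list(
    num_cols = num_cols,
    means = vapply(train[num_cols], mean, numeric(1), na.rm = TRUE)
  )
}

# Replay the stored steps on any data frame with the same columns.
apply_prep <- function(prep, d) {
  for (col in prep$num_cols) {
    is_missing <- is.na(d[[col]])
    d[[col]][is_missing] <- prep$means[[col]]
  }
  d
}

train <- data.frame(age = c(40, 60, NA, 50), weight = c(70, NA, 90, 80))
new_patient <- data.frame(age = NA_real_, weight = 85)

prep <- learn_prep(train)
prepped_train <- apply_prep(prep, train)

# The single new row is imputed with the training mean, not its own.
prepped_new <- apply_prep(prep, new_patient)
prepped_new
```

The point of the design is that a single new patient cannot supply a mean of its own, so the values learned in training have to travel with the model.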

Power Under the Hood

We’ve put a lot of thought into defaults so that “everything just works” in a way that works well for a variety of use cases. We’ve seen default settings produce models that outperform the best results reported in the literature. We do this by transforming data and tuning models in ways that are known to perform well on healthcare data. For example, dates are automatically converted to informative features such as day of the week, and missing category values are replaced with a special new category level. On the model tuning side, a growing list of high-performance models is tuned over randomly generated hyperparameter grids using cross-validation to identify the model specification that will provide the most predictive power in deployment.
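The two data transformations mentioned above are easy to picture with a small base-R sketch. The data and column names are made up, and the code shows the general technique rather than what the package does internally.

```r
# Toy encounter data (made up for illustration).
visits <- data.frame(
  admit_date = as.Date(c("2018-03-05", "2018-03-10", "2018-03-11")),
  insurance  = factor(c("medicare", NA, "private"))
)

# Turn the raw date into features a model can use.
# as.POSIXlt() exposes wday (0 = Sunday) and mon (0 = January).
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
date_parts <- as.POSIXlt(visits$admit_date)
visits$admit_dow   <- factor(day_labels[date_parts$wday + 1], levels = day_labels)
visits$admit_month <- date_parts$mon + 1
visits$admit_date  <- NULL

# Treat a missing category as its own level rather than dropping the row.
insurance <- as.character(visits$insurance)
insurance[is.na(insurance)] <- "missing"
visits$insurance <- factor(insurance)

visits
```

Keeping "missing" as a level lets a model learn from the fact that a value was not recorded, which can itself be informative in clinical data.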

If default settings aren’t for you, you can take the reins: There are various levels of detail at which you can interact with healthcare.ai. At the highest level, you can automate data preparation and model training in a single line with the machine_learn function. To take control of how data is transformed, go deeper with prep_data, or to specify the details of model training, use tune_models. You can go deeper still, for example, by building your own recipe for data preparation and attaching it to the recipe attribute of a model_list, or by passing a hyperparameter grid to tune_models. You can even train models using any of the hundreds of algorithms in the caret ecosystem and bring them into the healthcare.ai pipeline with as.model_list. But we think the need for such detailed customization will be rare.
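As a rough sketch of that middle level of control, the two steps might look like the following. It assumes the package is installed and again uses the bundled pima_diabetes data. The particular settings (centering and scaling, a single random forest, five hyperparameter combinations, four folds) are arbitrary choices for illustration, and argument names may vary by version.

```r
library(healthcareai)

# Step 1: prepare the data yourself, overriding a couple of defaults.
prepped <- prep_data(pima_diabetes, patient_id, outcome = diabetes,
                     center = TRUE, scale = TRUE)

# Step 2: tune only a random forest, trying 5 randomly chosen
# hyperparameter combinations, each scored with 4-fold cross-validation.
models <- tune_models(prepped, outcome = diabetes, models = "rf",
                      tune_depth = 5, n_folds = 4)

# With no new data supplied, predict() returns predictions for the training rows.
predictions <- predict(models)
evaluate(models)
```

Splitting the work this way lets you inspect the prepared data before any model is trained.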

It is often said that 80% of a data scientist’s work is data manipulation and cleaning. healthcare.ai takes care of a lot of that. Another big part of day-to-day data science is dealing with discrepancies between training and deployment data, and healthcare.ai takes care of that too. If you want some advice from the field, here it is: Build out a model pipeline first using defaults, and when all the stakeholders are on board and you know how predictions will be used and the infrastructure is in place, that’s the time to ask whether you need to eke out those last few percentage points of performance and put the time and effort into doing so. Premature optimization is a common pitfall, and healthcare.ai helps you avoid it by delivering reliable, high-performance models with minimal effort.

We hope you’ve enjoyed reading about healthcare.ai 2.0 and hope even more that you’ll try using it. There’s a getting started guide with lots of examples on the website, and you can ask questions and get support in our Slack group. If you have ideas for how we could make healthcare.ai more useful, please file an issue on GitHub. And if you’d like in-person training on R, machine learning in healthcare, and using healthcare.ai, register for the Machine Learning course at the Healthcare Analytics Summit.