Machine learning for healthcare just got a whole lot easier

The packages are designed to streamline healthcare machine learning. They do this by including functionality specific to healthcare, as well as simplifying the workflow of creating and deploying models.

Learn more about machine learning through the community: read and subscribe to our weekly blogs, watch our weekly YouTube live broadcasts, and bring your questions to our data science team by email or during live events.

Next Live Broadcast
Hosted by
Levi Thatcher and Mike Levy

ML #25 – The How and Why of R for Data Work, with Xiao Liu

Most Recent Broadcast Replay
Hosted by
Mike Mastanduno

ML #24 – Training for Healthcare Machine Learning with Rick Wolf of Insight Data Science

Everything you need to get started

What can I do with the packages?

  • Create and compare models based on your data.
  • Save and deploy a model.
  • Perform risk-adjusted comparisons.
  • Do trend analysis following Nelson rules.
  • Improve sparse data via longitudinal imputation.
  • Fill in missing data via imputation.
  • Deploy a model to produce daily predictions.
  • Write predictions back to a database.
  • Learn what factors drive each prediction.
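As a rough illustration of the trend-analysis item above, here is a minimal Python sketch of the first two Nelson rules (a point more than three standard deviations from the center line, and nine or more consecutive points on the same side of it). This is not the packages' own implementation; the function names, the monthly rates, and the baseline center and sigma are all hypothetical.

```python
def nelson_rule_1(values, center, sigma):
    """Indices of points more than 3 standard deviations from the center line."""
    return [i for i, v in enumerate(values) if abs(v - center) > 3 * sigma]


def nelson_rule_2(values, center, run_length=9):
    """Indices at which a run of `run_length` or more consecutive points
    on the same side of the center line is present."""
    flagged = []
    run = 0
    prev_side = 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == prev_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        prev_side = side
        if run >= run_length:
            flagged.append(i)
    return flagged


# Hypothetical monthly readmission rates (%); center and sigma are assumed
# to come from a stable baseline period.
rates = [10.2, 9.8, 10.5, 13.4, 9.9, 10.1, 10.3,
         10.2, 10.4, 10.1, 10.6, 10.2, 10.3, 10.5]
outliers = nelson_rule_1(rates, center=10.0, sigma=1.0)
shifts = nelson_rule_2(rates, center=10.0)
```

With these made-up numbers, rule 1 flags the spike in the fourth month, and rule 2 flags the last month, where nine consecutive rates sit above the center line.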

How is it tailored to healthcare?

  • Longitudinal machine learning via mixed models.
  • Longitudinal imputation.
  • Risk-adjusted comparisons.
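One common reading of longitudinal imputation is to fill a patient's missing value from that same patient's most recent earlier observation, rather than from a population-wide average. The Python sketch below assumes that reading (last observation carried forward within each patient); the function name, column names, and visit records are hypothetical, and the packages may implement this differently.

```python
def impute_locf_by_patient(rows, patient_key, time_key, value_key):
    """Return copies of rows, ordered by patient and time, with each missing
    value filled from the same patient's most recent earlier observation."""
    last_seen = {}
    filled = []
    for row in sorted(rows, key=lambda r: (r[patient_key], r[time_key])):
        new_row = dict(row)
        pid = row[patient_key]
        if new_row[value_key] is None:
            # Stays None when the patient has no earlier value to carry forward.
            new_row[value_key] = last_seen.get(pid)
        else:
            last_seen[pid] = new_row[value_key]
        filled.append(new_row)
    return filled


# Hypothetical visit records with gaps in a lab value.
visits = [
    {"patient_id": 1, "visit": 1, "a1c": 7.1},
    {"patient_id": 1, "visit": 2, "a1c": None},
    {"patient_id": 2, "visit": 1, "a1c": None},
    {"patient_id": 2, "visit": 2, "a1c": 6.4},
    {"patient_id": 1, "visit": 3, "a1c": 7.4},
]
imputed = impute_locf_by_patient(visits, "patient_id", "visit", "a1c")
imputed_values = [row["a1c"] for row in imputed]
```

Patient 1's second visit inherits the value from the first visit, while patient 2's first visit stays empty because there is nothing earlier to carry forward.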


Our goal with this project is to expedite the adoption of ML in healthcare by building pragmatic, world-class tools that help anyone with access to healthcare data.

You can help in many ways:

  • Try out the packages and let us know what needs improvement!
  • Check out our GitHub repos.

How do I get started?

The packages are available for both R and Python, two of the most common languages used by data scientists. If you don’t have previous experience with either language, we recommend the R package, as it currently has more features and R is more newbie-friendly.

Let's do this!

Access documentation, installation instructions, feature references, as well as hints and tips.

How do the packages focus on healthcare?

Both packages differ from other machine learning packages in that they focus on data issues specific to healthcare. This means that we pay attention to longitudinal questions, offer an easy way to do risk-adjusted comparisons, and provide easy connections and deployment to databases.

Who are the packages designed for?

While data scientists in healthcare will likely find these packages valuable, the target audience is analysts, BI developers, and SQL developers who would love to create appropriate and accurate models with healthcare data.

Learn about machine learning in healthcare

Learn from our team of data scientists

Levi Thatcher, Director of Data Science, Health Catalyst
Mike Mastanduno, Data Scientist, Health Catalyst
Taylor Miller, Data Scientist, Health Catalyst
Taylor Larsen, Data Science Engineer, Health Catalyst

YouTube Live

Hands-On Healthcare Machine Learning Weekly Broadcasts

Join Levi Thatcher and the data science team every Thursday at 3:00 PM EST as they discuss machine learning topics and take open Q&A.

ML #24 – Training for Healthcare Machine Learning with Rick Wolf of Insight Data Science

ML #23 – A Survey of the Opioid Epidemic

ML #22 – Machine Learning 101

ML #21 – Central Line Infection Prevention at IU Health, with Kristen Kelley

ML #20 – Exploratory Data Analysis in R

ML #18 – Healthcare Analytics and Open Source with Josh O’Rourke

ML #17 – Healthcare Text Analytics and NLP with Mike Dow

ML #16 – Data Science at an Academic Medical Center with Risa Myers

ML #15 – Multiclass Machine Learning in Using XGBoost

ML #14 – A Day In the Life of A Data Scientist

ML #13 – Basic Feature Engineering in Healthcare

ML #12 – Deep Dive into Heart Failure Readmissions with Joe Smith

ML #11 – How Do You Evaluate Model Performance?

ML #9 – From Zero to Your First Open Source Contribution: It Happens Today!

ML #8 – Open Healthcare Datasets

ML #6 – for Predicting Extended Length of Stay

ML #5 – Open Source Tools for Data Science

ML #1 – Getting Started in R and RStudio

Read the latest from our Data Science Blog

View weekly blogs for tips and advice on machine learning in healthcare.
Subscribe to receive posts via email.

Ethan Taft, August 18, 2017

A few weeks ago, our blog featured a post about k-means clustering, an unsupervised machine learning method. We use unsupervised methods when we don’t have an explicit idea of what patterns exist in a dataset. Clustering can help us surface insights about groups that exist in the data that we may not know about. To separate data into clusters, k-means first needs to calculate the distance between each data point. That distance is used to help define the “similarity” between two points and is normally calculated using some continuous technique…
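To make the distance step concrete, here is a minimal Python sketch that measures Euclidean distance between numeric points and assigns each point to its nearest cluster center, which is the core comparison k-means repeats. The points, centroids, and function names are hypothetical, and Euclidean distance is assumed here as one common choice for continuous data.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: euclidean(point, centroids[i]))


# Hypothetical 2-dimensional points and two candidate cluster centers.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
assignments = [nearest_centroid(p, centroids) for p in points]
```

The first two points land in the first cluster and the third in the second, because "similar" here simply means "close" under the chosen distance.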

Yannick Van Huele, August 01, 2017

The two main algorithms used for binary classification in healthcareai are logistic regression with a Lasso penalty (from now on, simply the Lasso) and random forests. In this post, we’ll visually explore the behavior of the Lasso and random forest models by working with some artificial 2-dimensional datasets to help build intuition about how the algorithms work and which types of datasets each algorithm performs well on. All the Lasso and random forest models plotted below were built using the healthcareai R package (the plots were built in…

Why did Health Catalyst open source this?

We believe that everyone benefits when healthcare is made more efficient and outcomes are improved. Machine learning is, surprisingly, still fairly new to healthcare, and we want to help the industry move quickly down the adoption path. We believe that making helpful, simple tools widely available is one small way to help healthcare organizations transform their data into actionable insight that can be used to improve outcomes.