Taylor Miller Data Scientist, Health Catalyst

Announcing the 1.0 Release of healthcareai for Python

Share this content:

Goals of the Rewrite

healthcareai is intended to serve a wide range of users, from the least technical to the most technical. As we worked with users from this entire spectrum, we found that there were some significant gaps and unnecessary pain points. We also took this opportunity to increase the quality and maintainability of the code.

Paying down some of our technical debts will allow us (and already has) to add features more quickly, with less friction, and create a better experience for our team, our contributors in the healthcare machine learning community, and our users.

What This Means for Users

Our users need to revisit our documentation, since the API (the way you interact with healthcareai) has almost completely changed. Enjoy them at our documentation site at ReadTheDocs.

Because the overall code complexity decreased, we have increased the quality of the software tools through unit and integration testing, and improved continuous integration tooling.

We broke our users down into three groupings of increasing technical sophistication:

1. BI Developers & Analysts

Because our goal is to democratize machine learning in healthcare, this is our largest audience, so we spent the most amount of time optimizing our tools for this group.

This group’s main interface to the toolset is the class SupervisedModelTrainer, which abstracts away most of the challenging parts of training machine learning models. When using it to train models, you need to know very little data science and this class will provide all the bells and whistles for you:

  • Easy data loading from csv files, MSSQL, MySQL and SQLite databases
  • Data cleaning
  • Automatic graphs for classification tasks
  • Easy model saving
  • Better metrics
  • Better PR/ROC plots w/ ideal cutoffs marked
  • Simple prediction methods from a saved model that automatically uses the same data preparation steps used to train the model
  • Robust database connection tools for making predictions on production data
  • An easy way to automatically train a handful of different classification algorithms and have the best surfaced for you is to use the SupervisedModelTrainer.ensemble() method.
  • Examples split into training and prediction
  • Database connection helpers for MSSQL, MySQL, SQLite
  • Five new combinations of prediction dataframes to make deployment easy and flexible.
  • Nice console outputs, including much more helpful error messages
  • Percentage of nulls that were imputed is printed to the console
  • New algorithms
  • A built in sample dataset using load_diabetes() (rather than hunting around for the .csv files)

The best place to start is by reading the Getting Started document and working through these four example files:

2. Advanced BI Developers & Analysts

Advanced users may want to use different data preparation pipelines, dial in custom hyperparameters, or even create custom ensemble methods. They should use AdvancedSupervisedModelTrainer, which does not modify your data upon creation of a trainer. See the example_advanced.py script for more details.

It is also important to note that an advanced user can start out with the simpler SupervisedModelTrainer class and easily access more advanced functionality by using the SupervisedModelTrainer.advanced_features property which returns the underlying AdvancedSupervisedModelTrainer class.

Advanced users should start by using and understanding the example scripts mentioned above.

3. Data Scientists & Machine Learning Engineers

There is a small segment of our users who want to leverage some of the tool’s helper methods, data pipeline chunks, and other utilities without directly using either of the Trainer classes.

To facilitate this, we factored out as much functionality as possible into common modules that the Trainer classes access, so that you advanced users can reach inside and use all the sharp tools directly. This did have a nice side effect of making the code easier to understand by breaking up the massive functions into smaller, more testable modules.

We also hope that this new architecture will allow advanced users to be able to contribute more algorithms, data pipelines and other great ideas to the tools so we can continue our dream of improving healthcare outcomes with machine learning.

Other improvements of note:

  • table_archiver() utility function to easily create a timestamped history from any table.
  • File I/O utilities for pickling objects or JSON
  • Randomized hyperparameter search by default on SupervisedModelTrainer
  • Feature scaling transformer
    -Many new dataframe filters and transformers that are used in the SupervisedModelTrainer or individually.
  • AzureBlogStorageHelper class to make storing text and pickled objects on Microsoft Azure Blob Storage simple.

What This Release Means for Developers and Engineers

Engineers know that smaller functions make for better code for a lot of reasons:

  • They make it easier to understand the software architecture and design.
  • It becomes easier to find and fix bugs, not to mention harder to introduce them if you are writing good tests. (We have 10x more coverage now and aim for more!)
  • This lowers the barrier to entry for community members looking to help out with issues and enhancements.
  • This results in faster CI builds that help you tighten your feedback loops.

An easy way to get up to speed on our new architecture design is to skim our concise Architecture Overview For Developers document before jumping into the source code.

The Bright Future of healthcareai

With this rewrite under our belts, we are stoked about the future of this toolset. The next major features on our roadmap are adding in multi label classification and a few simple neural nets for tabular data.

We invite all of you to install the latest version and try the new examples. Please join our Slack channel to get help if you are stuck and to discuss ideas for using healthcare.ai to improve healthcare outcomes!