Mike Mastanduno, Data Scientist, Health Catalyst

Requirements for ML: Reliable Data Pipelines


Our team is often asked why machine learning (ML) isn’t more prevalent in healthcare. In this first post of a series on barriers to healthcare ML, we discuss one of the biggest hurdles, data reliability.

Most CTOs in the health space are excited about using ML throughout their healthcare systems, if they aren’t using it already. We hear stories of neural nets outperforming seasoned clinicians in prominent research studies, such as predicting mortality based solely on when lab tests were ordered, or predicting patient characteristics from retinal images. That can create the perception that ML is being used broadly, but in reality, ML and AI are not running healthcare. The lack of integrated AI systems is often blamed on a shortage of data scientists in the field, or on the complexity of the data. However, healthcare has great people who are capable of learning data science skills. Data analysts are well positioned to build models because they know healthcare data intimately, and that knowledge is often the crux of model-building. Open source ML libraries such as scikit-learn for Python, caret for R, and healthcare.ai reduce how much ML or software engineering an analyst needs to know to get started.

But I would argue that a more fundamental barrier is the real culprit. Amateur data scientists (like those you might see in Kaggle contests) can take a dataset and generate predictions with high performance. But performance is only a small part of the process. One must understand the business objectives, gather and clean data, build the model, assess its performance and utility, and fit the results into an actionable format. A good data scientist should be able to do these things, and a good data analyst should be able to leverage what they already know to bridge the gaps in their knowledge. Even when all of that goes well, the same fundamental barrier still confines healthcare ML mostly to research.

Production-ready, predictive ML is what’s needed to positively impact patients. And that relies on rock-solid data. Every night, every hour, always available, automatically. So before you go and fire up TensorFlow, take an honest look at whether your data warehouse can provide that basic functionality.

Is your data ready for production ML?

I think many health systems or payers would argue that they have reliable data feeds today. But I’m not talking about generating weekly reports and emailing them out to relevant parties. Nor am I talking about having Bob the SQL guy load CSVs every couple of days to fill in tables and patch holes in the data. That’s fine for traditional retrospective analysis, but that’s not where ML thrives. Predictive ML should be leveraged to look forward, in real time, as the data changes. Providing the data engineering infrastructure to support that is no small feat. However, it is a prerequisite to using ML and predictive modeling in production. Most hospitals’ data systems are not ready for this. They were designed for storage and analysis, not predictions.

Consider the following three gotchas about trying to use a predictive model in production:

  • Consistent availability: Many health systems depend on files being moved and loaded by individuals or teams. If someone misses a day or there’s a hiccup that causes a delay, a model can’t use that predictor. It’s impossible to scale ML models when the underlying data requires manual processes to load or has changing content. These things must be done automatically and reliably.
  • Availability by time: Training the model only requires retrospective data. If a patient’s severity is documented but then not loaded into the EDW until their discharge, that’s fine for training. But in production, the model might need that information before the patient’s discharge to be actionable. Data that populates with a one-week lag cannot help ICU patients today.
  • Real-time feeds: Nurses on the floor are constantly monitoring patients’ vital signs, like temperature. For an ML system to be able to use that information, it needs to be flowing from the digital thermometer into the EDW in close to real time. As a predictor, max_temperature_last_24hr is nowhere near as good as temperature_4hr_trend at catching the early phases of a disease. Management and storage of this data feed is complicated but critical. Using a digital thermometer does not help ML if the data isn’t captured and stored in an accessible place.
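
To make the last two gotchas concrete, here is a minimal Python sketch of how the two temperature predictors might be computed so that they only use data a production model could actually see. Everything about the schema is an assumption for illustration: the Reading record, its measured_at and loaded_at fields, and the choice of a least-squares slope for the trend are hypothetical, not how any particular EDW or healthcare.ai does it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Reading:
    """One temperature measurement (hypothetical schema, for illustration)."""
    measured_at: datetime  # when the reading was taken at the bedside
    loaded_at: datetime    # when the row became queryable in the warehouse
    temp_c: float


def visible_readings(readings: List[Reading], as_of: datetime,
                     window: timedelta) -> List[Reading]:
    """Keep readings taken inside the window AND already loaded by as_of."""
    start = as_of - window
    return [r for r in readings
            if start <= r.measured_at <= as_of and r.loaded_at <= as_of]


def max_temperature_last_24hr(readings: List[Reading],
                              as_of: datetime) -> Optional[float]:
    """Highest visible temperature in the last 24 hours, or None."""
    recent = visible_readings(readings, as_of, timedelta(hours=24))
    return max((r.temp_c for r in recent), default=None)


def temperature_4hr_trend(readings: List[Reading],
                          as_of: datetime) -> Optional[float]:
    """Least-squares slope in degrees C per hour over the last 4 hours.

    Returns None when fewer than two visible readings exist, which is
    exactly what happens when the feed lags behind the prediction time.
    """
    recent = visible_readings(readings, as_of, timedelta(hours=4))
    if len(recent) < 2:
        return None
    hours = [(r.measured_at - as_of).total_seconds() / 3600 for r in recent]
    temps = [r.temp_c for r in recent]
    mean_h = sum(hours) / len(hours)
    mean_t = sum(temps) / len(temps)
    denom = sum((h - mean_h) ** 2 for h in hours)
    if denom == 0:  # all readings share one timestamp; slope is undefined
        return None
    numer = sum((h - mean_h) * (t - mean_t) for h, t in zip(hours, temps))
    return numer / denom
```

The loaded_at filter is the important part. Readings that arrive in the warehouse minutes after they are taken produce a usable trend, while the same readings loaded a week late produce None at prediction time, even though they would look perfectly fine in a retrospective training set.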

What can we do?

Production ML requires that production data is reliably and automatically available, ideally in real time. Predictive ML will be restricted to one-off engagements and research until the underlying healthcare data infrastructure catches up. Using daily random forest predictions to prevent readmissions in a real health system is unfortunately less exciting and more difficult than publishing a deep learning research paper. But that’s what healthcare really needs. These challenging and less flashy production tasks can be tackled in small, manageable parts. Validate your data carefully with respect to time and missingness, profile it often to look for changes, and automate as much as possible. Data issues can be solved by healthcare’s great people, and widespread ML will follow.
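
As a rough idea of what that profiling could look like, here is a small Python sketch that checks a batch of rows for staleness and for missingness that has drifted from a baseline. The row layout (a list of dicts with a loaded_at timestamp), the 24-hour lag limit, and the 10-point tolerance are all assumptions chosen for illustration; real thresholds would depend on the feed and the model.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

Row = Dict[str, object]  # hypothetical row format: column name -> value


def missing_rates(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Fraction of rows where each column is None (1.0 if there are no rows)."""
    total = len(rows)
    return {
        col: (sum(1 for r in rows if r.get(col) is None) / total) if total else 1.0
        for col in columns
    }


def freshness_lag(rows: List[Row], as_of: datetime,
                  loaded_col: str = "loaded_at") -> Optional[timedelta]:
    """Time elapsed since the most recent load, or None if nothing loaded."""
    stamps = [r[loaded_col] for r in rows if r.get(loaded_col) is not None]
    return (as_of - max(stamps)) if stamps else None


def profile_alerts(rows: List[Row], columns: List[str],
                   baseline: Dict[str, float], as_of: datetime,
                   max_lag: timedelta = timedelta(hours=24),
                   tolerance: float = 0.10) -> List[str]:
    """Return human-readable alerts; an empty list means the batch looks OK."""
    alerts = []
    lag = freshness_lag(rows, as_of)
    if lag is None or lag > max_lag:
        alerts.append(f"stale feed: time since last load is {lag}")
    for col, rate in missing_rates(rows, columns).items():
        expected = baseline.get(col, 0.0)
        if rate - expected > tolerance:
            alerts.append(
                f"{col}: missing rate {rate:.2f} exceeds baseline {expected:.2f}"
            )
    return alerts
```

Run on a schedule before each scoring job, a check like this turns "someone missed a day" from a silent model failure into an explicit alert.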

On another note, if you haven’t checked out healthcare.ai v2.0, the R package got a major makeover and is better than ever. Thanks for reading and feel free to drop any questions or discussion in the healthcare.ai Slack channel!