Data Science Blog
Google’s Latest Efforts to Help Healthcare
tl;dr: Healthcare needs practical machine learning tools; the focus on deep learning and GPUs doesn’t help the average health system.
Google just released a paper called “Scalable and accurate deep learning for electronic health records” that has received deserved acclaim in both the machine learning (ML) and healthcare communities. This research comes from the Google Brain group and isn’t their first foray into healthcare. See, for example, their impressive work in diabetic retinopathy. In fact, it’s now common for tech giants to wade into healthcare. See similar efforts by Microsoft, Amazon, and IBM. The fact that these companies are interested in healthcare is a wonderful development, as patient safety and clinician satisfaction are suffering partly because of outdated technology.
Why has this research by Google caused such buzz? First, Google is at the forefront of research in deep learning (which is a type of ML) and much of what they do simply hasn’t been done before. In this case, what they did was to use ML to understand patient risk of negative outcomes by looking at their corresponding attributes in the EMR. Well, what’s so novel about that? Isn’t that what healthcare.ai enables? What’s novel here is that Google combined text and structured data to predict patient risk of several negative outcomes without any manual feature engineering. What does that mean exactly?
Let’s imagine you’re predicting patient risk of 30-day readmission. If you’re training a model, you have to prepare the independent variables (or patient attributes) by tying these columns to a particular grain and preparing them for the algorithm. In healthcare, the grain is often at the patient encounter (or visit) level. This means that all patient attributes are summarized into one row per visit. How does that tie to feature engineering? After choosing a particular grain, you standardize the columns by (for example) taking an average of a patient’s last five weight measurements or using only their most recent smoking status. This type of work is often done via SQL in a subject-specific data mart that feeds the ML tools (like scikit-learn, healthcare.ai, or TensorFlow).
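To make the grain idea concrete, here is a minimal sketch in plain Python of the kind of manual feature engineering described above. In practice this usually lives in SQL; the function name, field names, and sample data below are all hypothetical and only illustrate the "last five weights" and "most recent smoking status" examples.

```python
from datetime import date
from statistics import mean


def encounter_features(encounter, weights, smoking, n_weights=5):
    """Summarize a patient's history into one row at the encounter grain.

    encounter: dict with 'encounter_id', 'patient_id', 'discharge_date'
    weights:   list of (patient_id, measured_on, kg) tuples
    smoking:   list of (patient_id, recorded_on, status) tuples
    All of these shapes are assumptions made for this illustration.
    """
    pid = encounter["patient_id"]
    cutoff = encounter["discharge_date"]

    # Only use data recorded on or before discharge, so the row reflects
    # what was actually known at prediction time.
    prior_weights = sorted(
        (d, kg) for p, d, kg in weights if p == pid and d <= cutoff
    )
    prior_smoking = sorted(
        (d, s) for p, d, s in smoking if p == pid and d <= cutoff
    )

    last_n = [kg for _, kg in prior_weights[-n_weights:]]
    return {
        "encounter_id": encounter["encounter_id"],
        "avg_weight_last_n_kg": mean(last_n) if last_n else None,
        "latest_smoking_status": prior_smoking[-1][1] if prior_smoking else None,
    }


# Made-up example data for a single patient and a single visit.
encounter = {
    "encounter_id": "E1",
    "patient_id": "P1",
    "discharge_date": date(2018, 3, 1),
}
weights = [("P1", date(2018, 1, d), 80.0 + d) for d in range(1, 8)]
smoking = [
    ("P1", date(2017, 6, 1), "current"),
    ("P1", date(2018, 2, 1), "former"),
]
row = encounter_features(encounter, weights, smoking)
print(row)
```

Each visit ends up as exactly one row, and every column is a deliberate, hand-written summary of the underlying history. That hand-written step is what the Google approach aims to remove.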
How does this relate to Google’s work? Essentially, they don’t do any of that column manipulation. Instead, they “take raw EHR data as input, and produce FHIR outputs without manual feature harmonization.” It’s an impressive achievement, especially considering the fact that they also ingest clinician notes and produce respectable model performance. It’s also a great place to focus attention, since ~80% of ML work is feature engineering.
In this retrospective study, they used 216k hospitalizations to predict inpatient mortality, long length of stay, diagnoses, and 30-day readmissions, which are all of practical interest to health systems. To learn these patterns, they used three architectures: one based on recurrent neural networks (LSTM), one on “an attention-based time-aware neural network model (TANN), and one on a neural network with boosted time-based decision stumps.” For a given prediction the risk is based on an “ensemble of predictions from the three underlying model architectures.” In case those details didn’t make it clear, these are researchers—not SQL-based healthcare analysts.
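As a rough illustration of what an ensemble of three models means for a single prediction, here is a small sketch. The passage quoted above doesn't say how the three outputs are combined, so the unweighted average below is purely our assumption, and the function name and numbers are hypothetical.

```python
def ensemble_risk(model_risks):
    """Combine per-model risk estimates for one patient.

    Assumption: a simple unweighted mean. The actual combination rule
    used in the paper may differ.
    """
    if not model_risks:
        raise ValueError("need at least one model prediction")
    return sum(model_risks) / len(model_risks)


# Hypothetical readmission risks from the LSTM, TANN, and boosted-stump models.
risk = ensemble_risk([0.20, 0.30, 0.40])
print(round(risk, 2))
```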
For simplicity, let’s focus on their readmission predictions at discharge, which achieved an AUROC (or c-statistic) of 0.75 and 0.76 for hospitals A and B, respectively. Here hospital A is the University of California, San Francisco and hospital B is University of Chicago Medicine. To establish a baseline, Google appropriately constructs a logistic regression model specific to these same two hospitals using variables from a previously published paper and finds AUROC scores of 0.70 for hospital A and 0.68 for hospital B, which they beat with their deep learning architecture. For context, in our ML field work we’ve found that most old-guard risk models like LACE and those from the EMR vendors (which aren’t tuned to a specific hospital) come in between 0.60 and 0.70 AUROC. Note that a perfect model has an AUROC of 1.0, while 0.5 means no predictive power.
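If AUROC is unfamiliar, one way to read it is as the probability that a randomly chosen readmitted patient gets a higher risk score than a randomly chosen patient who wasn’t readmitted. The sketch below computes it that way by brute force over all pairs. The function name and toy data are ours, and a real project would normally use a library routine instead.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the outcome (e.g. readmitted), 0 otherwise
    scores: predicted risk for each patient
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Perfect separation: every readmitted patient outranks every other patient.
print(auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))

# Identical scores for everyone: the model can't rank patients at all.
print(auroc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))

# Three of the four positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

On this reading, moving from 0.70 to 0.75 means the model orders roughly five more out of every hundred such pairs correctly.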
Lessons for healthcare?
Interestingly, even though Google Brain is a research organization, they were attempting to provide something that’s broadly useful in healthcare:
[W]e report a generic data processing pipeline that can take raw EHR data as input, and produce FHIR outputs without manual feature harmonization. This makes it relatively easy to deploy our system to a new hospital.
At first blush, it’d be easy for a healthcare CIO to see Google’s headline and think, “cool, this makes it easier for us to leverage machine learning.” But if they or their data scientists dug deeper, they’d quickly hit an insurmountable wall. Let’s say your health system had an extraordinary data scientist who has:
- Properly identified a relevant business problem that can be improved via risk stratification / decision support
- Found a clinical champion to give both feedback and get team buy-in for the new ML-based workflow
- Built the LSTM/TANN/boosting models using TensorFlow/C++
That’s an impressive accomplishment right there. But you still have to deal with the compute: Google used >201k GPU hours for this. They did build four models, so if the readmission model takes one-fourth of those hours on the Tesla P100 GPUs that Google used, that’s ~50k hours at $0.73/hr, or ~$37k. For. One. Model.
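Here is that back-of-the-envelope estimate written out so you can plug in your own numbers. The GPU hours and hourly price come from the paragraph above. The even four-way split across models is our simplifying assumption, and the function name is hypothetical.

```python
TOTAL_GPU_HOURS = 201_000    # lower bound on Google's reported total
N_MODELS = 4                 # assumption: hours split evenly across models
PRICE_PER_GPU_HOUR = 0.73    # USD per Tesla P100 hour, as quoted above


def estimate_training_cost(total_gpu_hours, n_models, price_per_hour):
    """Return (GPU hours, USD cost) for one model under an even split."""
    hours_per_model = total_gpu_hours / n_models
    return hours_per_model, hours_per_model * price_per_hour


hours, cost = estimate_training_cost(
    TOTAL_GPU_HOURS, N_MODELS, PRICE_PER_GPU_HOUR
)
print(f"{hours:,.0f} GPU hours, about ${cost:,.0f} for one model")
```

Since 201k is a lower bound on the hours, the dollar figure is a lower bound too.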
State of healthcare
What the tech giants seem to miss is that healthcare significantly lags other industries on the technology front:
- Even though EHRs have been around for 10+ years, it’s a significant achievement to have a functioning data warehouse for analytics.
- Most health systems are extremely excited when they start using any of their own data to improve patient care.
- Workstation CPU VMs with 16GB RAM are standard—it’s impractical to expect GPU access.
The practical ML route
We built healthcare.ai with these facts in mind. It allows data scientists and analysts to quickly train and deploy models in R or Python with minimal time investment by leveraging the experience of data scientists across dozens of healthcare model deployments. Overall, it lets you focus on working with clinicians to a) identify what’s driving your outcome of interest and b) establish buy-in for a new data-driven workflow.
Recently the Bon Secours Charity Health System, a member of the Westchester Medical Center Health Network, found that retrospective analyses and point-based risk scores weren’t enough to help lower their readmission rate. Bon Secours Charity, which is located in the Hudson Valley region of New York State, determined that optimizing readmission interventions would be best served by a machine learning model that would, on a daily basis, help determine which patients in their general population were likely to be readmitted within 30 days of an inpatient discharge. They engaged Health Catalyst. Taylor Larsen, from the data science team, installed pre-built data pipelines using standardized clinical/predictive data marts and trained the model that enabled this critical piece of decision support.
Why do I bring this up? Because it provides a helpful contrast to the Google paper discussed above. Here are some details for readmission prediction at discharge:
| |Bon Secours Charity HS|Google A (UCSF)|Google B (UCM)|
|---|---|---|---|
|Training rows|54k hospitalizations|86k hospitalizations|109k hospitalizations|
|Setup/train time|10 hrs|>20k GPU hrs|>20k GPU hrs|
The underlying raw EHR data is likely fairly similar in the Bon Secours and Google projects—what’s different is Google took the deep learning route whereas Bon Secours used standard clinical data marts and more practical ML. The quick results seen here at Bon Secours are similar to decision support engagements at other health systems, and can be obtained for any system using DOS from Health Catalyst.
Also, it’s important to keep in mind that quickly standing up an accurate model is just one step in operationalizing this type of decision support. Bon Secours showed that partnership, transparency, and careful thought must be demonstrated in order to gain the trust and adoption of end-users that will carry out the interventions.
Healthcare desperately needs Google and its talent, engineering, ML, and user-centric focus, and we’re thrilled that Google is getting involved. The average health system is riddled with antiquated software, manual processes ripe for automation, and typically zero ML-based clinical decision support.
For the sake of both patients and clinicians, we hope that Google focuses on the average health system and works towards practical solutions that reduce medical errors, lower the burden on physicians, and give patients the customer-focused experience they deserve.