Patient impact predictor–the new suggestive guidance in healthcare.ai
Last summer we discussed the simplified interface of the 1.0 CRAN release of healthcare.ai-R, and we’re now thrilled to demo new features related to clinician guidance in the 1.2 version. We’re calling this Patient Impact Predictor (PIP).
Understanding an ML model
This week we’d like to highlight new functionality that allows one to go a step beyond surfacing predictions to also surface targeted interventions. Risk scores are a great first step, but prescriptive guidance is where the results of machine learning (ML) may actually catch up to the hype. For example, it’s very useful to know that Eddy Exampleton has a 57% risk of heart failure. In past versions of healthcare.ai, we offered not only that but also a list of the top three variables responsible for that patient’s high-risk score.
But what if
- those variables are unmodifiable (like age, race, etc.)?
- the clinician wants data-driven guidance on appropriate action to take?
For example, we might want to know how Eddy’s risk changes if a certain medical procedure is performed, if a different medication is prescribed, or if Eddy works to change his blood pressure. As with most ML in healthcare, to get the most out of this functionality, one should leverage subject-matter expertise: in our case, subject-matter expertise is critical for carefully selecting the variables on which to make recommendations.
How do we do this? As is often the case with healthcare.ai, we try to incorporate the most practical techniques from ML and apply them to healthcare decision support. En route to this prescriptive breakthrough, we were inspired by LIME, which is used for model interpretability.
Let’s illustrate how this works using the UCI Machine Learning Repository’s Fertility Dataset, a 100 row dataset capturing information about how various factors are related to male fertility. After doing some slight feature engineering, here’s a sample of the pertinent variables:
id: an ID column to use as the grain: this is just the row number in the dataset.
age: the age of the patient
trauma: whether the patient was in an accident or experienced serious trauma
intervention: whether a surgical intervention was performed
alcohol: alcohol consumption habits, grouped into three categories (
smoking: smoking habits (
hours_sitting: the average daily number of hours spent sitting
altered_fertility: whether the patient has altered fertility
You can download our ML-ready version of this simple dataset here. After choosing infertility as our label or predicted column, we train and deploy a random forest model on separate parts of the data (see here for the code). Now, after generating these predictions, we can use the getProcessVariablesDf function to generate recommendations.
The simplest way to use this function is to pass a vector of names of categorical variables that represent modifiable risk factors. Some of the variables are outside of our control (the person’s age, whether they experienced serious trauma, etc.), but other variables are more amenable to recommendations. For example, to see how smoking habits, alcohol consumption, and surgical interventions might affect fertility in the model, we would run the following command:
pvdf1 <- rfD$getProcessVariablesDf(modifiableVariables = c("smoking", "intervention", "alcohol"))
This returns a dataframe (i.e., a table) of prescriptive guidance–we present the top recommended intervention for three patients below. See if you can spot the potential issue that arises from letting ML loose without SME input.
Going left to right, we have the patient identifier, the predicted probability for the original row of data, and then the top recommendation for each patient.
Modify1TXT gives the name of the variable on which we’re making a recommendation, followed by the current value (Modify1Current) and the alternate baseline (Modify1AltValue) for that variable. Modify1AltPrediction provides the patient’s risk if that recommendation is acted upon, and Modify1Delta shows how much of a risk reduction that action could provide for that patient.
But did you catch the error? The guidance for patients 5 and 6 is reasonable: lowering alcohol consumption is a clinically sensible route to increased fertility. For patient 7, however, why would this ML guidance suggest they start smoking? Broadly, that’s what happens when you don’t pair ML with subject-matter expertise. Occasionally there are local patterns in the model that don’t translate to reality. For example, due to the curse of dimensionality, in local regions without much data you’ll occasionally find recommendations to smoke or to increase one’s alcohol consumption.
To mitigate this, the user of healthcare.ai (hopefully with guidance from an SME) inputs the baseline that makes sense for a given modifiable risk factor. This helps the ML output avoid picking up potentially noisy local patterns in data-sparse regions. To do this, we simply specify the healthy baseline for the smoking and alcohol risk features (never and hardly_ever, respectively):
pvdf2 <- rfD$getProcessVariablesDf(modifiableVariables = c("smoking", "intervention", "alcohol"), variableLevels = list(smoking = c("never"), alcohol = c("hardly_ever")))
And here is the first set of recommendations for patients 5 through 7:
Note that now the ML advice is restricted to what makes sense clinically and intervention priorities are established in a data-driven and actionable way.
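To make the effect of SME baselines concrete, here is a minimal Python sketch. It does not use the healthcare.ai API: `quirky_risk` is a hypothetical stand-in for a model that has learned a noisy local pattern, and the level names other than never and hardly_ever are made up for illustration.

```python
def quirky_risk(patient):
    """Hypothetical model with a noisy local pattern: smoking looks protective."""
    risk = 0.30
    risk += {"hardly_ever": 0.00, "weekly": 0.05, "daily": 0.10}[patient["alcohol"]]
    risk += {"never": 0.00, "occasional": -0.04, "daily": -0.02}[patient["smoking"]]
    return risk


def top_recommendation(patient, candidate_levels, predict):
    """Return (variable, level, delta) for the candidate with the lowest predicted risk."""
    base = predict(patient)
    best = None
    for variable, levels in candidate_levels.items():
        for level in levels:
            # Copy the patient and change exactly one variable.
            alt = predict(dict(patient, **{variable: level}))
            delta = alt - base
            if best is None or delta < best[2]:
                best = (variable, level, delta)
    return best


# A patient who already has good habits (illustrative values).
patient = {"smoking": "never", "alcohol": "hardly_ever"}

# Unrestricted: every level of every modifiable variable is a candidate.
all_levels = {
    "smoking": ["never", "occasional", "daily"],
    "alcohol": ["hardly_ever", "weekly", "daily"],
}
# Restricted: only the SME-chosen healthy baselines are candidates.
sme_baselines = {"smoking": ["never"], "alcohol": ["hardly_ever"]}

unrestricted = top_recommendation(patient, all_levels, quirky_risk)
restricted = top_recommendation(patient, sme_baselines, quirky_risk)
```

In this toy setup the unrestricted search recommends that a non-smoker start smoking, because the model's local quirk is taken at face value. With the baselines in place, the best available "change" is to keep the current value, which is a non-recommendation with a delta of zero.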
PIP recommendations are also customized to each patient: the top recommendation for patient 6 is different from the top recommendation for patient 7. Looking at the delta column (on the far right), we see that some recommendations are more effective than others. Patient risk impactability isn’t the same for everyone. For example, modifying alcohol consumption for patient 5 leads to a noticeable drop in risk, while the effect is much smaller if patient 6 lowers their alcohol consumption.
Finally, the results for patient 7 may seem odd at first glance: the top recommendation is not to change their smoking habits. Looking closer, we see that this patient already has good habits, so it makes sense that there may not be anything actionable to lower their risk of infertility. Note that having fewer modifiable variables and fewer rows in your dataset will make it such that the model has a more difficult time producing recommendations for folks who are either healthy or not impactable. Simple logic could exclude such non-recommendations when pulling this output from SQL Server (for example) to Qlik or the EMR.
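The "simple logic" for excluding non-recommendations could look something like the Python sketch below. The column names come from the output described above, but the rows are made-up illustrative values (not output from the fertility model), and `is_actionable` and its `min_reduction` threshold are hypothetical helpers, not part of healthcare.ai.

```python
# Made-up rows shaped like the PIP output; the values are illustrative only.
rows = [
    {"id": 5, "Modify1TXT": "alcohol", "Modify1Current": "weekly",
     "Modify1AltValue": "hardly_ever", "Modify1Delta": -0.12},
    {"id": 6, "Modify1TXT": "alcohol", "Modify1Current": "weekly",
     "Modify1AltValue": "hardly_ever", "Modify1Delta": -0.02},
    {"id": 7, "Modify1TXT": "smoking", "Modify1Current": "never",
     "Modify1AltValue": "never", "Modify1Delta": 0.0},
]


def is_actionable(row, min_reduction=0.0):
    """True if the row proposes a real change that lowers risk by more than min_reduction."""
    changed = row["Modify1Current"] != row["Modify1AltValue"]
    return changed and row["Modify1Delta"] < -min_reduction


# Drop rows where the "recommendation" is to keep the current value.
actionable = [row for row in rows if is_actionable(row)]

# Optionally require a minimum risk reduction (here 5 percentage points).
worthwhile = [row for row in rows if is_actionable(row, min_reduction=0.05)]
```

The same condition translates directly into a WHERE clause if the filtering is done in the database rather than in application code.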
Because of the issues that arise when assuming actual causality from relationships found in the data, we should emphasize that, before implementation, a clinician should carefully review the recommendations coming from a clinical model.
Let’s look at the second set of recommendations for these same patients:
Note that some folks (like patient 5) are shown to be quite impactable, in that changes to either of two risk factors could lead to significant reductions in risk: 1) getting them to stop drinking (which we discussed above) and 2) canceling that planned surgery, if clinicians aren’t convinced of its necessity.
By default, this PIP-based suggestive guidance contains the top three interventions for each patient, though for brevity, we’ve only shown two.
Focusing on impactable patients
Note that one of the main benefits of this new functionality is that the nurse manager or bed-side clinician is able to focus resources on the most impactable patients. This differs from most healthcare decision support and early versions of healthcare.ai, where the common framework is to stratify patients based on risk, which is a less actionable way of managing your cohort. Focusing on impactability is best and can be done now by simply sorting the patient list by the Modify1Delta column above, either when discussing a CSV file or surfacing this guidance in a visualization.
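Sorting by impactability takes very little code. The Python sketch below assumes the PIP output has been loaded as a list of records. The probabilities and deltas are made up for illustration, and `PredictedProb` and `most_impactable` are hypothetical names, not part of healthcare.ai.

```python
# Made-up records: predicted risk plus the delta of each patient's top recommendation.
patients = [
    {"id": 6, "PredictedProb": 0.41, "Modify1Delta": -0.02},
    {"id": 7, "PredictedProb": 0.08, "Modify1Delta": 0.00},
    {"id": 5, "PredictedProb": 0.35, "Modify1Delta": -0.12},
]


def most_impactable(records, top_n=None):
    """Order patients by largest predicted risk reduction (most negative delta first)."""
    ordered = sorted(records, key=lambda record: record["Modify1Delta"])
    return ordered if top_n is None else ordered[:top_n]


by_impact = most_impactable(patients)
# For comparison: the traditional risk-stratified ordering.
by_risk = sorted(patients, key=lambda record: record["PredictedProb"], reverse=True)
```

In this toy cohort the two orderings disagree: patient 6 has the highest risk, but patient 5 is the one for whom an intervention is predicted to matter most.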
Motivation: ML model interpretability
ML models are very good at learning from historical data to make accurate predictions on new data. To make such accurate predictions, we often must sacrifice interpretability.
In many fields such as healthcare, we would like to make use of the strong predictive power of ML, but we can run into trouble if the model is too opaque. This problem has led to renewed focus on ML interpretability and tools such as LIME, FairML, and many others.
Some model interpretability is already included in the healthcare.ai package: random forest provides an ordered list of important variables and we can directly understand the Lasso’s predictions by working with the model coefficients. We also provided some row-level model interpretability in the top factors that accompany the predicted risk scores.
Understanding how a model works is useful for determining how to act on a prediction (e.g., if we know that high blood pressure contributed strongly to a high risk prediction, we can take steps to reduce blood pressure). But we wanted to take this a step further and understand what the model was thinking on a per-patient basis.
The basic idea
The idea behind the recommendations is fairly simple. Recommendations are made using counterfactual predictions: that is, we take the true attributes of a patient, change values in a controlled way, and use the model to make new predictions for the modified data.
For example, let’s say the patient is 30 years old, did not have a surgical intervention, smokes occasionally, drinks several alcoholic beverages each week, and spends 8.5 hours sitting per day (on average). The model predicts that this patient has a 27.5% chance of altered fertility. Now, we copy this patient’s data, modifying only their smoking behavior to get a patient who is 30 years old, did not have a surgical intervention, doesn’t smoke, drinks several alcoholic beverages each week, and spends 8.5 hours sitting per day. Feeding this fictional modified data to the model, we get back a 26.6% risk, down 0.9 percentage points. Next, we copy the original patient’s data again, this time modifying the alcohol consumption to get a patient who is 30 years old, did not have a surgical intervention, smokes occasionally, rarely drinks alcohol, and spends 8.5 hours sitting per day. Now the model returns a probability of 15.9%, down 11.6 percentage points from the original prediction.
We repeat this process for several different values of several different variables (for example, we might want to check the effect of reducing alcohol consumption to just one beverage per week). At the end of this process, we find that greatly reducing alcohol consumption led to the biggest predicted reduction in risk (from 27.5% down to 15.9%), followed by the presence of a surgical intervention (from 27.5% down to 23.1%), and then smoking cessation (from 27.5% down to 26.6%).
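The procedure described above can be sketched in a few lines of Python. This is an illustration of the general counterfactual idea, not the healthcare.ai implementation: `toy_risk` is a hypothetical stand-in for a trained model, its coefficients are invented, and the level names other than never and hardly_ever are assumptions.

```python
import math


def toy_risk(patient):
    """Hypothetical stand-in for a trained model: returns a probability in (0, 1)."""
    score = -2.0
    score += {"never": 0.0, "occasional": 0.2, "daily": 0.5}[patient["smoking"]]
    score += {"hardly_ever": 0.0, "weekly": 0.6, "daily": 1.1}[patient["alcohol"]]
    score += 0.05 * patient["hours_sitting"]
    return 1.0 / (1.0 + math.exp(-score))


def counterfactuals(patient, alternatives, predict):
    """Change one variable at a time and record how the prediction moves."""
    base = predict(patient)
    rows = []
    for variable, values in alternatives.items():
        for value in values:
            if value == patient[variable]:
                continue  # same as the current value, so not a change
            modified = dict(patient, **{variable: value})  # copy; original is untouched
            alt = predict(modified)
            rows.append({
                "variable": variable,
                "current": patient[variable],
                "alt_value": value,
                "alt_prediction": alt,
                "delta": alt - base,
            })
    # Largest predicted risk reduction (most negative delta) first.
    rows.sort(key=lambda row: row["delta"])
    return rows


patient = {"smoking": "occasional", "alcohol": "weekly", "hours_sitting": 8.5}
alternatives = {"smoking": ["never"], "alcohol": ["hardly_ever"]}
recommendations = counterfactuals(patient, alternatives, toy_risk)
```

Each row holds the same kind of information as the Modify1 columns: the variable, its current and alternate values, the alternate prediction, and the delta. Because only one variable changes per copy, the deltas describe the model's behavior around this patient, not the combined effect of several changes.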
Flexibility and pitfalls
getProcessVariablesDf takes some optional parameters about your modifiable feature variables. We describe some of these in this section. More details and examples can be found in the documentation via either
Customizing the guidance
You may have noticed that there were no positive delta values in the example above. Because the positive (yes) class usually corresponds to an undesirable outcome in healthcare (readmission, infection, etc.), the default behavior is to surface recommendations that reduce the predicted probability. This can be reversed in cases where the positive class represents a desirable outcome. It’s also possible to include as many recommendations as you’d like or to restrict to fewer than 3.
Finally, by default we surface only one recommendation for each variable, but this can also be toggled: for example, the top recommendation might be to heavily reduce alcohol consumption but we might still want to know the effects of a smaller reduction.
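The three options above (direction, number of recommendations, and one recommendation per variable) amount to a small ranking step over the counterfactual results. The Python sketch below is one way that step might look. The function and its parameter names are hypothetical and are not the healthcare.ai argument names. The deltas for the two alcohol levels' neighbors are derived from the worked example earlier, except the once_a_week value, which is made up.

```python
def rank_recommendations(candidates, top_n=3, lower_is_better=True, one_per_variable=True):
    """Pick the top_n candidates, optionally keeping only the best one per variable."""
    # Most negative delta first by default; flipped when a higher probability is desirable.
    ordered = sorted(candidates, key=lambda c: c["delta"], reverse=not lower_is_better)
    picked, seen = [], set()
    for candidate in ordered:
        if len(picked) >= top_n:
            break
        if one_per_variable and candidate["variable"] in seen:
            continue
        picked.append(candidate)
        seen.add(candidate["variable"])
    return picked


candidates = [
    {"variable": "smoking", "alt_value": "never", "delta": -0.009},
    {"variable": "alcohol", "alt_value": "hardly_ever", "delta": -0.116},
    {"variable": "alcohol", "alt_value": "once_a_week", "delta": -0.060},  # made-up value
    {"variable": "intervention", "alt_value": "yes", "delta": -0.044},
]

default_top = rank_recommendations(candidates)
both_alcohol = rank_recommendations(candidates, top_n=2, one_per_variable=False)
```

With the defaults, the smaller alcohol reduction is hidden behind the larger one. Turning off the one-per-variable rule surfaces both, which is useful when the top recommendation may be unrealistic for a given patient.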
Continuous variables and baselines
In the fertility example, all of our modifiable variables were categorical variables. To get recommendations for continuous variables, you have to do a little more work. This is because continuous variables are tricky: to make good recommendations, we need some extra information which can be difficult to automatically extract from the data. We can get an idea of the complications by studying a model that predicts which patients are most likely to pay their medical bills. The patient’s balance seems like a potential variable to use in making recommendations. For example, it may be worth it to reduce a patient’s balance if it significantly increases the likelihood that they will pay their bill. Here are the main difficulties we run into:
- Useful recommendations on continuous variables need to vary in magnitude. In our fertility model, there were only three values for the smoking variable, but a continuous variable can have many more values. For example, in the model to predict likelihood of bill payment, a $50 discount might make a difference for a patient who owes $200, but is unlikely to affect a patient who owes $100,000.
- Often, there are important restrictions we need to impose on our recommendations. We probably don’t want to reduce a patient’s balance all the way down to $0. Even worse, we don’t want to increase the patient’s balance just because the model thinks that would improve their chance of payment. Anomalies like this are especially a concern with smaller datasets where real-world relationships aren’t always well reflected in the data.
To address these issues, we allow for recommendations on continuous variables, but only if comparison baselines are explicitly specified by the user. For example, in our fertility model we can make recommendations about the variable hours_sitting, but we have to specify levels. To use a different illustration, if the initial health benefits of LDL reduction can largely be seen by getting down to 140 mg/dL, with a subsequent boost by getting down to 120 mg/dL, we can use the healthcare.ai PIP framework to handle this type of multiple baseline setup.
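As a rough Python sketch of the multiple-baseline idea, the helper below keeps only the user-specified baselines that would be an improvement for a given patient. It assumes that lower values are healthier (as with LDL) and is a hypothetical illustration, not how healthcare.ai handles baselines internally.

```python
def candidate_baselines(current_value, baselines):
    """Keep the user-specified baselines that lie strictly below the current value."""
    # Assumption: lower is healthier, so a baseline at or above the current value
    # is never offered as a recommendation.
    return sorted((b for b in baselines if b < current_value), reverse=True)


ldl_baselines = [140, 120]  # mg/dL, from the LDL illustration above

high = candidate_baselines(155, ldl_baselines)      # both baselines are improvements
moderate = candidate_baselines(130, ldl_baselines)  # only 120 is an improvement
already_low = candidate_baselines(110, ldl_baselines)  # nothing to recommend
```

Each surviving baseline would then be scored with a counterfactual prediction, exactly as for a categorical level. Restricting candidates this way rules out recommendations that move the variable in the wrong direction, such as raising a patient's balance in the bill-payment example.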
Caveats and which variables to use
Keep in mind that this tool is a model explainer: it helps you understand how your model works. As with any data work, one has to be very careful when attributing causality to relationships found in the data.
Overall, selecting the right variables to use as modifiable variables can greatly affect the usefulness of the recommendations. Here are a few issues to keep in mind:
- Beware of correlation: The counterfactual predictions involve only modifying one variable at a time. If you have several predictor variables that are strongly correlated then the recommendations may not accurately reflect reality.
- Consider the relationship to other factors not obviously included in the data: ML models have an uncanny knack of discovering subtly encoded information which is why they make such good predictions, but also why they are susceptible to problems such as data leakage. If a variable is acting as a surrogate for an unseen variable, then that variable may not be useful for recommendations even if it is very useful when making predictions.
- For best guidance, find more variables over which you have direct control: for example, a practitioner can control whether a patient is referred to a smoking cessation program, but cannot directly control the patient’s smoking habits.
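For the correlation issue in particular, a quick screen before choosing modifiable variables can help. The Python sketch below flags strongly correlated pairs of numeric predictors using Pearson correlation. The column names, the values, and the 0.8 threshold are all made up for illustration, and a screen like this only catches linear relationships between numeric columns.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Made-up predictor columns for six patients.
columns = {
    "systolic_bp": [118, 125, 132, 140, 151, 160],
    "diastolic_bp": [76, 80, 85, 90, 96, 102],
    "hours_sitting": [9, 4, 7, 5, 8, 6],
}

flagged = correlated_pairs(columns)
```

Here the two blood pressure columns are flagged. Changing one of them while holding the other fixed would describe a patient who rarely exists in practice, so a recommendation on either one alone deserves extra scrutiny.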
To deal with these issues, carefully study the data when building the model, consult with subject-matter experts when selecting which variables to work with, and then check the recommendations on new data. Note that my team is thrilled to be using PIP functionality as part of a CLABSI risk project at a large Midwest health system (with c-diff, readmissions, and ED risk engagements similarly on the docket).
Note: The Patient Impact Predictor project (codename: LIMONE) was started and driven by the esteemed Yannick Van Huele, while a data science intern at Health Catalyst.