Taylor Larsen, Data Scientist
Health Catalyst

Defining and Choosing the Right Outcome Variable for Your Healthcare Machine Learning Model

In previous posts, we’ve touched on feature engineering and the importance of understanding the use case your machine learning (ML) model is meant to serve; however, we haven’t said much about defining and choosing the right outcome (dependent) variable. The task has both a critical thinking component and a technical component, and both are important. If you spend enough time working through the complexities and nuances of the outcome variable you are considering, the technical piece becomes much clearer. Investing time up front can save significant time and frustration on the back end because, as you know, the model will do its best to predict whatever it is trained to predict.

Let’s dive right into a few examples that illustrate how we think through some of the complexities and nuances inherent to healthcare when selecting and preparing an outcome variable for an ML model.

Length of stay

The definition of hospital length of stay seems straightforward: the number of days a patient was hospitalized. However, we need to consider whether length of stay should be based on hours or days, whether hours should be rounded up or down into days, or whether we should instead count the number of midnights that passed. We also need to decide whether length of stay includes time spent in the emergency department or in observation status, or only the time spent as an inpatient. Furthermore, we need to consider the use case. If the business question calls for a precise prediction of length of stay (like 4.5 days), then we’d choose or build a continuous outcome variable to support a regression model. If the business question instead asks whether the length of stay will be above or below a certain number of days, we’d need a binary outcome variable (Y/N) to support a classification model.
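To make these choices concrete, here is a minimal sketch of how the same stay can produce different outcome values depending on the definition. It uses only Python’s standard library; the function names, the example timestamps, and the two-day threshold are illustrative assumptions, not part of any standard definition.

```python
import math
from datetime import datetime


def los_hours(admit: datetime, discharge: datetime) -> float:
    """Elapsed time between admission and discharge, in hours."""
    return (discharge - admit).total_seconds() / 3600


def los_days_exact(admit: datetime, discharge: datetime) -> float:
    """Elapsed time in fractional days (a continuous outcome)."""
    return los_hours(admit, discharge) / 24


def los_midnights(admit: datetime, discharge: datetime) -> int:
    """Number of midnights that passed during the stay."""
    return (discharge.date() - admit.date()).days


def is_long_stay(los_days: float, threshold_days: float) -> bool:
    """Binary outcome: did the stay exceed the threshold?"""
    return los_days > threshold_days


# Hypothetical stay: admitted late on March 1, discharged mid-morning March 4.
admit = datetime(2024, 3, 1, 22, 0)
discharge = datetime(2024, 3, 4, 10, 0)

exact = los_days_exact(admit, discharge)     # 60 hours -> 2.5 days
rounded_down = math.floor(exact)             # 2
rounded_up = math.ceil(exact)                # 3
midnights = los_midnights(admit, discharge)  # 3

# Hypothetical business threshold for a "long stay" classifier.
THRESHOLD_DAYS = 2
labels = {
    "exact": is_long_stay(exact, THRESHOLD_DAYS),
    "rounded_down": is_long_stay(rounded_down, THRESHOLD_DAYS),
    "rounded_up": is_long_stay(rounded_up, THRESHOLD_DAYS),
    "midnights": is_long_stay(midnights, THRESHOLD_DAYS),
}
print(exact, rounded_down, rounded_up, midnights)
print(labels)
```

In this example the rounded-down definition labels the stay as not long while the other three label it as long, so the rounding rule alone can flip the binary outcome for stays near the threshold.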


Readmission

Hospital readmission also seems like a self-explanatory outcome variable: readmission occurs when a patient comes back to the hospital within some specific time frame following a prior hospital stay. Unfortunately, it’s not that simple. We need to consider whether the outcome variable is (or should be) based on the CMS definition of readmissions or follows some other definition. If we go down the CMS definition route, there are a lot of complexities to consider. We are ultimately predicting a series of events wrapped into one variable, such as whether the patient will meet the various inclusion/exclusion criteria for the index admission and the readmission, not just whether the patient will come back to the hospital. Even if we use a different or custom definition, we still need to think about the use case: are we trying to predict strictly unplanned inpatient-to-inpatient readmissions, or is overall healthcare utilization the more appropriate outcome to predict? What are the implications if we exclude patients who come back to the emergency department after a hospital stay?
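The sketch below shows how much a custom readmission label can shift with these choices. It is deliberately simplified and is not the CMS definition: the encounter records, field names, 30-day window, and the `label_index_stays` helper are all hypothetical, and real inclusion/exclusion criteria are far more involved.

```python
from datetime import date

# Hypothetical encounter records; field names are assumptions for illustration.
encounters = [
    {"patient": "A", "type": "inpatient", "admit": date(2024, 1, 1),
     "discharge": date(2024, 1, 5), "planned": False},
    {"patient": "A", "type": "emergency", "admit": date(2024, 1, 20),
     "discharge": date(2024, 1, 20), "planned": False},
    {"patient": "A", "type": "inpatient", "admit": date(2024, 3, 1),
     "discharge": date(2024, 3, 3), "planned": False},
    {"patient": "B", "type": "inpatient", "admit": date(2024, 2, 1),
     "discharge": date(2024, 2, 4), "planned": False},
    {"patient": "B", "type": "inpatient", "admit": date(2024, 2, 20),
     "discharge": date(2024, 2, 22), "planned": True},
]


def label_index_stays(encounters, window_days=30,
                      return_types=("inpatient",), include_planned=False):
    """Label each inpatient stay with whether a qualifying return visit
    started within `window_days` of its discharge date."""
    labels = []
    for index_stay in encounters:
        if index_stay["type"] != "inpatient":
            continue  # only inpatient stays can be index stays here
        came_back = False
        for other in encounters:
            if other is index_stay or other["patient"] != index_stay["patient"]:
                continue
            if other["type"] not in return_types:
                continue
            if other["planned"] and not include_planned:
                continue
            gap = (other["admit"] - index_stay["discharge"]).days
            # Simplification: a same-day return (gap of 0) counts as a return.
            if 0 <= gap <= window_days:
                came_back = True
                break
        labels.append((index_stay["patient"], index_stay["discharge"], came_back))
    return labels


# Strict: unplanned inpatient-to-inpatient returns only.
strict = [flag for _, _, flag in label_index_stays(encounters)]
# Broader utilization: emergency department visits also count as a return.
with_ed = [flag for _, _, flag in label_index_stays(
    encounters, return_types=("inpatient", "emergency"))]
# Planned admissions also count as a return.
with_planned = [flag for _, _, flag in label_index_stays(
    encounters, include_planned=True)]
print(strict, with_ed, with_planned)
```

With these made-up records, the strict definition yields no positive labels at all, while counting emergency department visits or planned admissions each turns a different index stay positive. The model would be trained on a different target in each case.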


Diagnosis

When we think about diagnosis of a disease or condition, we often think of administrative codes like ICD diagnosis codes. Defining an outcome variable using a group of administrative codes seems pretty clean: will the patient be assigned one of these specific ICD diagnosis codes? However, we’ve heard many stories from clinicians (and confirmed with analysis) that administrative diagnosis codes are often not reliable reflections of a patient’s clinical diagnosis. (Keep in mind that their purpose is just that: administrative.) So, when we are predicting disease or building datasets to make predictions on a specific cohort of patients, we need to take some important considerations into account. We need to consider whether administrative codes accurately reflect the health condition and whether they are available early enough to use once the ML model is moved to a production data environment. We also need to consider the alternatives: is there a clinical test or value that might better reflect a specific condition? A specific medication or procedure might even be a strong indicator. It’s also important to again consider the use case: do we care more about the diagnosis itself or about some sort of clinical event, like a patient deteriorating to the point where they might require transfer to the ICU?
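One way to check how well a code-based definition matches a clinical one is to build both labels and look at where they disagree. The sketch below uses type 2 diabetes purely as an illustration, comparing an ICD-10 code prefix (E11) against an HbA1c threshold of 6.5%. The patient records, field names, and helper functions are hypothetical, and the right clinical definition for a real project should come from clinicians.

```python
# Hypothetical patient records; field names are assumptions for illustration.
patients = [
    {"id": 1, "dx_codes": ["E11.9"], "hba1c": 7.2},
    {"id": 2, "dx_codes": [], "hba1c": 6.9},
    {"id": 3, "dx_codes": ["E11.9"], "hba1c": 5.4},
    {"id": 4, "dx_codes": ["I10"], "hba1c": None},
]


def label_from_codes(patient, prefixes=("E11",)):
    """Administrative definition: any diagnosis code with a matching prefix."""
    return any(code.startswith(prefixes) for code in patient["dx_codes"])


def label_from_lab(patient, threshold=6.5):
    """Clinical definition: lab value at or above the threshold.
    Returns None when no lab result is available."""
    value = patient["hba1c"]
    if value is None:
        return None
    return value >= threshold


# Patients where both labels exist but do not agree.
disagreements = [
    p["id"] for p in patients
    if label_from_lab(p) is not None
    and label_from_codes(p) != label_from_lab(p)
]
# Patients the clinical definition cannot label at all.
unlabeled_by_lab = [p["id"] for p in patients if label_from_lab(p) is None]
print(disagreements, unlabeled_by_lab)
```

Reviewing the disagreeing and unlabeled patients with a clinician is a practical way to decide which definition, or which combination, best fits the use case.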


This critical thinking process around how data elements are defined is likely familiar to you, and this type of challenge is probably why you’re drawn to data architecture and data science. It’s important to consider the implications of each assumption and decision you make and how each might affect the model. The good news is that if you can ask the right questions, you can both understand the true definition of an existing outcome variable and ensure that the outcome variable you’re including in your ML model fits the use case. We’d love to hear about the interesting challenges you are taking on and the problems you’re solving with machine learning, so please reach out to us!