Clustering is an unsupervised machine learning technique, with several valuable applications in healthcare.

Consider the situation where a model is trying to predict a patient’s likelihood of having diabetes. In most cases, a set of outputs (labels indicating whether a person has diabetes or not) is associated with inputs (attributes like age, weight, blood pressure and so on). One can train a supervised classification model, such as a random forest, to learn the patterns associated with the output. Then, if a new patient comes in, the trained classifier can predict whether the patient has diabetes. This is an example of a supervised learning problem because outputs, or labels, are provided with the training examples.

But what if there are no outputs provided, or they are unreliable? Data is often gathered and divided for model training based on ICD-10 codes, the “ground truth” for things like who has diabetes. However, it’s possible that there are many missing ICD-10 codes for diabetes. If the code can’t be trusted for model training, the diabetic/non-diabetic group structure must be discovered from the input values alone. When the label is unreliable or absent, we must turn to unsupervised learning.

Clustering is exactly that: finding structure in a collection of data when a reliable label output is not provided.

Clustering via K-means

Among all the unsupervised learning algorithms, clustering via k-means might be one of the simplest and most widely used. Briefly, k-means clustering aims to find a set of k cluster centers such that every data point is assigned to its closest center, and the sum of the squared distances of all such assignments is minimized.
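That objective can be written as a small function. This is a minimal sketch in plain Python; the helper names (`sq_dist`, `kmeans_objective`) are illustrative, not from any library.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_objective(points, centers, assignments):
    """Sum of squared distances from each point to its assigned center.

    This is the quantity k-means tries to minimize.
    """
    return sum(sq_dist(p, centers[c]) for p, c in zip(points, assignments))

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centers = [(0.5, 0.0), (10.0, 10.0)]
assignments = [0, 0, 1]
print(kmeans_objective(points, centers, assignments))  # 0.25 + 0.25 + 0.0 = 0.5
```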

Let’s walk through a simple 2D example to better understand the idea. Imagine we have the gray points in the following figure and want to assign them to three clusters. K-means follows the four steps listed below.

Step one: Initialize cluster centers

We randomly pick three points C1, C2 and C3, and label them blue, green and red, respectively, to represent the cluster centers.
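A simple way to sketch this initialization is to sample k of the data points themselves as starting centers (the function name `init_centers` is illustrative):

```python
import random

def init_centers(points, k, seed=0):
    """Pick k distinct data points at random to serve as initial centers."""
    rng = random.Random(seed)
    return rng.sample(points, k)

points = [(0, 0), (1, 1), (9, 9), (10, 10), (0, 1), (9, 10)]
centers = init_centers(points, k=3)
print(centers)
```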

Step two: Assign observations to the closest cluster center

Once we have these cluster centers, we can assign each point to a cluster based on the minimum distance to a center. For the gray point A, we compute its distance to C1, C2 and C3, respectively. After comparing the lengths d1, d2 and d3, we find that d1 is the smallest, so we assign point A to the blue cluster and label it blue. We then move to point B and follow the same procedure. Repeating this for every point leads to the following figure.
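The assignment step for point A generalizes to every point: take the index of the nearest center. A minimal sketch (helper names are illustrative):

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """For each point, return the index of its closest cluster center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]

points = [(0, 0), (1, 0), (10, 10)]
centers = [(0, 0), (9, 9)]
print(assign(points, centers))  # [0, 0, 1]
```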

Step three: Revise cluster centers as mean of assigned observations

Now we’ve assigned all the points based on which cluster center they were closest to. Next, we need to update the cluster centers based on the points assigned to them. For instance, we can find the center of mass of the blue cluster by summing over all the blue points and dividing by the total number of points, which is four here. The resulting center of mass C1’, represented by a blue diamond, is our new center for the blue cluster. Similarly, we can find the new centers C2’ and C3’ for the green and red clusters.
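The update step is just a component-wise mean per cluster. A sketch (assuming every cluster has at least one assigned point; `update_centers` is an illustrative name):

```python
def update_centers(points, assignments, k):
    """New center i = component-wise mean of the points assigned to cluster i.

    Assumes each of the k clusters has at least one member.
    """
    centers = []
    for i in range(k):
        members = [p for p, a in zip(points, assignments) if a == i]
        centers.append(tuple(sum(coord) / len(members)
                             for coord in zip(*members)))
    return centers

points = [(0, 0), (2, 0), (10, 10)]
assignments = [0, 0, 1]
print(update_centers(points, assignments, k=2))  # [(1.0, 0.0), (10.0, 10.0)]
```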

Step four: Repeat step 2 and step 3 until convergence

The last step of k-means is simply to repeat the two steps above. For example, once C1’, C2’ and C3’ become the new cluster centers, point D is now closer to C3’ and is reassigned to the red cluster. We keep iterating between assigning points to cluster centers and updating the cluster centers until convergence. Finally, we arrive at a solution like the following figure. Well done!
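Putting the four steps together gives a complete, minimal k-means loop. This is a self-contained sketch of the algorithm described above, not a production implementation (for real work, a library such as scikit-learn is the usual choice):

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: random init, then alternate assign/update until
    no point changes cluster (or n_iter is reached)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: initialize
    assignments = None
    for _ in range(n_iter):
        # step 2: assign each point to its nearest center
        new = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
               for p in points]
        if new == assignments:               # step 4: converged
            break
        assignments = new
        # step 3: move each center to the mean of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:                      # keep old center if cluster empties
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignments

points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 10), (9, 10)]
centers, assignments = kmeans(points, k=2)
print(assignments)
```

With two well-separated blobs like these, the loop recovers the obvious grouping regardless of which points are drawn as initial centers.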

Some Additional Remarks about K-means

The k-means algorithm converges to a local optimum. Therefore, the result found by k-means is not necessarily the global optimum.

The initialization of the centers is critical to the quality of the solution found. A smarter initialization method called k-means++ spreads the initial centers apart and typically yields a more reliable clustering.
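The core idea of k-means++ seeding can be sketched as follows: each new center is sampled with probability proportional to its squared distance from the nearest center already chosen, so far-apart points are favored. (The helper names are illustrative; scikit-learn's `KMeans` uses this seeding by default via `init='k-means++'`.)

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: sample each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # points far from every current center get the largest weights
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
centers = kmeans_pp_init(points, k=3)
print(centers)
```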

The user has to select the number of clusters, k, ahead of time.

K-means clustering can be applied to many use cases in healthcare and can help us better characterize subpopulations and diseases by medical conditions. Some examples include:

Finding diabetic/non-diabetic group structure without an ICD-10 code present.

Identifying similar patients based on their attributes to explore costs, treatments, or outcomes.

Clustering treatment options within a cohort to make data-driven decisions.

We’d love to hear about what you’re clustering in the world of healthcare data science and machine learning, so please reach out to us or chat with the community on Slack.

This project was started by and receives ongoing support from Health Catalyst.