A good data scientist will have command of a large breadth of knowledge, from machine learning and statistics to business instinct or software engineering. Part of what makes this job exciting is the possibility of driving insights or improvements from any one of those skills. A data scientist may or may not know all the skills ahead of time, but they are able to step back, understand where there might be a high return on investment, and learn the skills necessary to take advantage. Recently, our team announced the release…
Data Science Blog
Step by Step to K-Means Clustering
Clustering is an unsupervised machine learning technique, with several valuable applications in healthcare.
Consider the situation where a model is trying to predict a patient’s likelihood of having diabetes. In most cases, a set of outputs (label of whether a person has diabetes or not) is associated with inputs (attributes like age, weight, blood pressure and so on). One can train a supervised classification model, like random forest to learn the patterns associated with the output. Then, if a new patient comes in, the trained classifier can predict if the patient has diabetes or not. This is an example of a supervised learning problem because outputs, or labels, are provided with the training examples.
But what if there are no outputs provided, or they are unreliable? Data is often gathered and divided for model training based on ICD-10 codes, the “ground truth” for things like who has diabetes. However, it’s possible that there are many missing ICD-10 codes for diabetes. If the code can’t be trusted for model training, the diabetic/non-diabetic group structure must be discovered from the input values alone. When the label is unreliable or absent, we must turn to unsupervised learning.
Clustering is exactly that. We are trying to find a structure in a collection of data assuming that a reliable label output is not provided.
Clustering via K-means
Among all the unsupervised learning algorithms, clustering via k-means might be one of the simplest and most widely used algorithms. Briefly speaking, k-means clustering aims to find the set of k clusters such that every data point is assigned to the closest center, and the sum of the distances of all such assignments is minimized.
Let’s walk through a simple 2D example to better understand the idea. Imaging we have these gray points in the following figure and want to assign them into three clusters. K-means follows the four steps listed below.
Step one: Initialize cluster centers
We randomly pick three points C1, C2 and C3, and label them with blue, green and red color separately to represent the cluster centers.
Step two: Assign observations to the closest cluster center
Once we have these cluster centers, we can assign each point to the clusters based on the minimum distance to the cluster center. For the gray point A, compute its distance to C1, C2 and C3, respectively. And after comparing the lengths of d1, d2 and d3, we figure out that d1 is the smallest, therefore, we assign point A to the blue cluster and label it with blue. We then move to point B and follow the same procedure. This process can assign all the points and leads to the following figure.
Step three: Revise cluster centers as mean of assigned observations
Now we’ve assigned all the points based on which cluster center they were closest to. Next, we need to update the cluster centers based on the points assigned to them. For instance, we can find the center mass of the blue cluster by summing over all the blue points and dividing by the total number of points, which is four here. And the resulted center mass C1’, represented by a blue diamond, is our new center for the blue cluster. Similarly, we can find the new centers C2’ and C3’ for the green and red clusters.
Step four: Repeat step 2 and step 3 until convergence
The last step of k-means is just to repeat the above two steps. For example, in this case, once C1’, C2’ and C3’ are assigned as the new cluster centers, point D becomes closer to C3’ and thus can be assigned to the red cluster. We keep on iterating between assigning points to cluster centers, and updating the cluster centers until convergence. Finally, we may get a solution like the following figure. Well done!
Some Additional Remarks about K-means
- The k-means algorithm converges to local optimum. Therefore, the result found by K-means is not necessarily the most optimal one.
- The initialization of the centers is critical to the quality of the solution found. There is a smarter initialization method called K-means++ that provides a more reliable solution for clustering.
- The user has to select the number of clusters ahead of time.
More details can be found in Foundations of Data Science if you are interested.
Applications in Healthcare
K-means clustering can be applied to many use cases in healthcare and help us to better characterize subpopulations and diseases by medical conditions. Some examples include:
- Finding diabetic/non-diabetic group structure without an ICD-10 code present.
- Identifying similar patients based on their attributes to explore costs, treatments, or outcomes.
- Clustering treatment options within a cohort to make data-driven decisions.