Introduction
In this article, we will look at the K-means clustering algorithm and walk through its iterative steps with a simple example. So, what exactly is clustering? It is a broad set of techniques for grouping observations together, for example, creating subgroups within a particular data set. A clustering algorithm seeks to partition the observations into distinct groups so that, in theory, observations in the same group are similar to each other, while observations in different groups are dissimilar.
K-means Clustering
K-means clustering is a well-known method in machine learning, and it belongs to the branch of unsupervised learning, which essentially means drawing inferences from data that has no labelled responses.
Given a set of observations, K-means clustering will attempt to group the observations into a prespecified number, k, of distinct, non-overlapping clusters. To perform K-means clustering, we must first decide on the number of clusters, k, that we want. The algorithm then assigns each observation to exactly one of the k clusters.
Let's have a look at how the K-means clustering iterative algorithm works step by step:
1. Define the number of clusters, k, that you want, and randomly generate their respective centre points within the data domain.
2. Compute the distance between each observation and each centre point, then assign each observation to the group whose centre is closest to it.
3. Based on these assignments, re-compute each group's centre by taking the mean of all the vectors in the group.
4. Repeat steps 2 and 3 for a set number of iterations or until convergence is reached (a quick library-based sketch of the whole procedure follows this list).
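If you don't want to implement these steps yourself, libraries such as scikit-learn bundle them into a single estimator. Below is a minimal sketch, assuming scikit-learn and NumPy are installed and using a made-up random data set purely as a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: 300 two-dimensional observations (purely illustrative).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))

# Step 1: prespecify the number of clusters, k.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Steps 2-4 (assign observations, re-compute centres, repeat until
# convergence) all happen inside fit_predict().
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centre points, one row per cluster
print(labels[:10])              # cluster index assigned to the first ten observations
```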
Let's look at an example to better understand this concept.
Where Should I Set up My New Coffee Shops?
Let's say we want to open three new coffee shops in a town called George - this is all hypothetical, of course. However, we need to scope out the prime locations for our coffee shops to thrive.
Outlined below, we have a data set of all the people who regularly make use of coffee shops in the surrounding area:
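If you'd like to follow along in code, here is one hypothetical way to generate a comparable data set with NumPy - three loose groups of customer coordinates around different parts of town. The group centres and spreads below are made-up values, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical customer locations (x, y), loosely clustered around three
# parts of town. The centres and spread are invented for this demonstration.
town_centres = np.array([[2.0, 8.0], [8.0, 7.5], [5.0, 2.0]])
customers = np.vstack([
    rng.normal(loc=centre, scale=0.8, size=(100, 2)) for centre in town_centres
])

print(customers.shape)  # (300, 2) -> 300 customers, each with an (x, y) position
```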
Referring back to step 1, it would make sense to set k = 3 because we plan on opening three new coffee shops. We can already see from the data that there is a definite pattern with three defined subgroups.
Although this is relatively easy to spot with the human eye, the computer cannot see it, and the data set will not always be this simple. Therefore, we randomly plot the three centre points on the data set (represented by the stars).
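In code, step 1 might look something like the snippet below - continuing from the hypothetical `customers` array generated above, the three "stars" are placed at random positions within the range of the data:

```python
# Step 1: randomly place k centre points within the data domain.
k = 3
low, high = customers.min(axis=0), customers.max(axis=0)
centres = rng.uniform(low, high, size=(k, 2))

print(centres)  # initial (x, y) positions of the three "stars"
```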
Now that we have plotted the centre points, we can move on to step 2. In step 2, we need to assign each observation to a centre point, and we do this by computing the distance between each observation and each centre point. Once this is done, we classify each observation into the group whose centre is closest to it. By doing this, we end up with something like the following:
*NB: Please note that this might not be completely accurate and that it's only for demonstration purposes.
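In code, this assignment step boils down to computing a distance from every observation to every centre and taking the closest one. A minimal NumPy sketch, continuing from the `customers` and `centres` arrays above:

```python
# Step 2: assign each observation to the group whose centre is closest.
# distances[i, j] is the Euclidean distance from customer i to centre j.
distances = np.linalg.norm(customers[:, None, :] - centres[None, :, :], axis=2)
assignments = distances.argmin(axis=1)  # index of the nearest centre for each customer

print(np.bincount(assignments, minlength=k))  # how many customers fell into each group
```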
From the image above, we can see that these aren't exactly the groupings we had in mind. Based on these classified groups, we can now move on to step 3 and re-compute each group's centre by taking the mean of all the vectors in the group. In other words, we move our stars (centre points) right to the middle of our current groupings.
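In code, step 3 is just a per-group mean. A short sketch, reusing `customers`, `assignments`, and `centres` from above (keeping a centre in place when its group happens to be empty is an assumption of this sketch, not something spelled out above):

```python
# Step 3: move each centre to the mean of the observations assigned to it.
new_centres = np.array([
    customers[assignments == j].mean(axis=0) if np.any(assignments == j) else centres[j]
    for j in range(k)
])

print(new_centres)  # the "stars" after being moved to the middle of their groups
```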
Looking at our results above, some of the observations are now closer to a different centre point than the one they were assigned to. For example, let's home in on some of the blue observations - we can see clearly that they now belong to the yellow star, and some of the red observations clearly belong to the blue star.
Now, we repeat steps 2 and 3 for a set number of iterations or until convergence is reached.
Starting at step 2 again, we assign each observation to a centre point by computing the distance between each observation and each centre point, and classifying the observation into the group whose centre is closest to it (please refer back to the second image). Then, in step 3, we re-compute each group's centre by taking the mean of all the vectors in the group and move the stars accordingly (please refer back to the third image).
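Putting the assignment and update steps into a loop gives the whole iterative procedure. Here is one way it might look, continuing from the arrays above and stopping either after a fixed number of iterations or once the centres stop moving (the iteration cap and tolerance are arbitrary choices for this sketch):

```python
# Repeat steps 2 and 3 until the centres stop moving (or max_iters is hit).
max_iters, tol = 100, 1e-6
for _ in range(max_iters):
    # Step 2: assign every customer to its nearest centre.
    distances = np.linalg.norm(customers[:, None, :] - centres[None, :, :], axis=2)
    assignments = distances.argmin(axis=1)

    # Step 3: re-compute each centre as the mean of its assigned customers.
    new_centres = np.array([
        customers[assignments == j].mean(axis=0) if np.any(assignments == j) else centres[j]
        for j in range(k)
    ])

    # Convergence check: stop once the centres barely move between iterations.
    moved = np.linalg.norm(new_centres - centres)
    centres = new_centres
    if moved < tol:
        break

print(centres)  # final centre points -> candidate coffee shop locations
```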
We've gone through this process once more (see the image above), and we can see that we're not quite there yet, so we have to repeat steps 2 and 3 again.
Finally, we can see that the final grouping of the observations makes more sense than our initial starting point, so we can confidently say that we have reached the end of the K-means algorithm. From this, we have successfully identified three well-defined locations for our three new coffee shops, located in coffee centrale.
Conclusion
To wrap it up, we have learned that K-means clustering is a particularly useful method for finding groups that haven't been explicitly labelled in the data set. I hope that you now have a much better understanding of what K-means clustering is all about and how its iterative algorithm works.