K-means clustering is often used as a lead-in to classification. It is primarily an exploratory technique for discovering structure in the data that you might not have noticed before, and it serves as a prelude to more focused analysis or decision processes. Some examples of the sets of measurements on which clustering can be performed are detailed in the slide. In a patient record, for example, we have measures such as BMI, HbA1c, and HDL, with which we could cluster patients into groups that define varying degrees of risk of heart disease. In classification the labels are known, whereas in clustering they are not. Hence clustering can be used to determine the structure in the data and to summarize the properties of each cluster in terms of its centroid, the mean of the measurements for the group. The clusters can suggest what the initial classes could be. In low dimensions we can visualize the clusters; this becomes very hard as the number of dimensions increases. K-means clustering has many applications, including pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision. In principle, whenever you have several objects, each described by several attributes, and you want to group the objects based on those attributes, you can apply this algorithm. For data scientists, K-means is an excellent tool for understanding the structure of data and for validating some of the assumptions about the data provided by domain experts.
Step 1 – K-means clustering begins with the data set segmented into K clusters.
Step 2 – Observations are moved from cluster to cluster to reduce the distance from each observation to its cluster centroid.
Step 3 – When observations are moved to a new cluster, the centroids of the affected clusters need to be recalculated.
Step 4 – This movement and recalculation are repeated until movement no longer results in an improvement.
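The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `centroid`, `members`, and `kmeans`, the shuffled round-robin initial split, the `max_iter` cap, and the rule that an emptied cluster keeps its previous centroid are all assumptions made for the example.

```python
import math
import random


def centroid(points):
    """Mean of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def members(data, assign, c):
    """Observations currently assigned to cluster c."""
    return [p for p, a in zip(data, assign) if a == c]


def kmeans(data, k, max_iter=100, seed=0):
    """Return (centers, assignments) for a list of numeric tuples."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of observations")

    # Step 1: segment the data into k clusters. With no domain knowledge
    # available, this sketch uses a shuffled round-robin split (an assumption).
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    assign = [0] * len(data)
    for rank, idx in enumerate(order):
        assign[idx] = rank % k
    centers = [centroid(members(data, assign, c)) for c in range(k)]

    for _ in range(max_iter):
        # Step 2: move each observation to the cluster with the nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in data
        ]
        # Step 4: stop once no observation changes cluster.
        if new_assign == assign:
            break
        assign = new_assign
        # Step 3: recalculate centroids. An emptied cluster keeps its
        # previous centroid (an assumption of this sketch).
        centers = [
            centroid(m) if (m := members(data, assign, c)) else centers[c]
            for c in range(k)
        ]
    return centers, assign


if __name__ == "__main__":
    points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
    centers, labels = kmeans(points, k=2)
    print(centers)
    print(labels)
```

On the toy data in the usage section, the first three points and the last three points end up in separate clusters; which cluster receives label 0 depends on the random initial split.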
The model output is the final cluster centers and the final cluster assignments for the data. The number of clusters, K, can be selected up front if you have some knowledge of what the right number may be. Alternatively, you can run the exercise with different values of K and decide which set of clusters best suits your needs. Since the appropriate number of clusters in a dataset is rarely known, it is good practice to select a few values of K and compare the results. The initial partitioning should draw on the same knowledge used to select K, for example domain knowledge about the market or industries. If K was selected without external knowledge, the initial partitioning can be done without any such input. Once all observations are assigned to their closest cluster, the clusters can be evaluated for their “in-cluster dispersion,” the average distance from the observations to their cluster centroid. Clusters with the smallest average distance are the most homogeneous. We can also examine the distances between clusters and decide whether it makes sense to combine clusters that lie close together. The same distances help assess how successful the clustering exercise has been: ideally, the clusters should be well separated rather than located close together.
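The two evaluation measures described above, in-cluster dispersion and the distance between clusters, can be computed directly from the model output. The sketch below assumes Euclidean distance and uses hypothetical helper names (`in_cluster_dispersion`, `centroid_distances`) and hand-made toy data; it is one plausible way to compute these quantities, not a prescribed method.

```python
import math
from itertools import combinations


def in_cluster_dispersion(data, assign, centers):
    """Average distance from each cluster's members to its centroid."""
    totals = [0.0] * len(centers)
    counts = [0] * len(centers)
    for point, label in zip(data, assign):
        totals[label] += math.dist(point, centers[label])
        counts[label] += 1
    # An empty cluster has no defined dispersion, so report NaN for it.
    return [t / n if n else float("nan") for t, n in zip(totals, counts)]


def centroid_distances(centers):
    """Distance between every pair of centroids, keyed by (i, j) with i < j."""
    return {
        (i, j): math.dist(centers[i], centers[j])
        for i, j in combinations(range(len(centers)), 2)
    }


if __name__ == "__main__":
    # Toy model output: four observations, two clusters.
    data = [(0, 0), (0, 2), (10, 0), (10, 4)]
    assign = [0, 0, 1, 1]
    centers = [(0, 1), (10, 2)]
    print(in_cluster_dispersion(data, assign, centers))  # [1.0, 2.0]
    print(centroid_distances(centers))
```

Here cluster 0 is the more homogeneous of the two (average distance 1.0 versus 2.0), and the centroids are far apart relative to either dispersion, which is the well-separated situation the text describes. Repeating the calculation for several values of K gives a simple basis for comparing the results.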
In practice, a value for K is picked based on domain knowledge and the centroids are computed. Then a different K is chosen and the modeling is repeated to see whether it improves the cohesiveness of the data points within each cluster. If there is no apparent structure in the data, we may have to try multiple values of K. It is an exploratory process.
K-means clustering is easy to implement and produces concise output. It is easy to assign new data to the existing clusters by determining which centroid the new data point is closest to. However, K-means works only on numerical data and does not handle categorical variables. It is sensitive to the initial guess for the centroids. It is important that all variables be measured on similar or compatible scales. For example, if you measure the living space of a house in square feet and its cost in thousands of dollars (that is, 1 unit is $1000), and then change the cost to dollars (so 1 unit is $1), the clusters may change. K must be decided before the modeling process starts, and a wrong guess for K may lead to improper clustering. K-means tends to produce rounded, equal-sized clusters. If the true clusters are elongated or crescent-shaped, K-means may not find them appropriately, and the data may have to be transformed before modeling.
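The points about assigning new data and about measurement scales can be illustrated together. The sketch below assigns a new house to its nearest centroid, first with cost in thousands of dollars and then with cost in dollars. The centroid values, the new observation, and the helper names `nearest_centroid` and `rescale` are invented for the illustration.

```python
import math


def nearest_centroid(point, centers):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def rescale(point, factors):
    """Multiply each coordinate by the matching unit-conversion factor."""
    return tuple(x * f for x, f in zip(point, factors))


# Hypothetical centroids: (living space in square feet, cost in $1000s).
centers_thousands = [(1000.0, 300.0), (2000.0, 200.0)]
new_house = (1400.0, 210.0)

# The same values with cost expressed in dollars instead.
to_dollars = (1, 1000)
centers_dollars = [rescale(c, to_dollars) for c in centers_thousands]
new_house_dollars = rescale(new_house, to_dollars)

print(nearest_centroid(new_house, centers_thousands))        # 0
print(nearest_centroid(new_house_dollars, centers_dollars))  # 1
```

With cost in thousands of dollars, the difference in living space dominates the distance and the house joins cluster 0. With cost in dollars, the cost difference dominates and the same house joins cluster 1. A common remedy, which goes beyond what these notes cover, is to standardize each variable before clustering.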