# Clustering

In machine learning, “unsupervised” refers to the problem of finding a hidden structure within unlabeled data. In this lesson and the following lesson we will be discussing two unsupervised learning methods clustering and Association Rules. Clustering is a popular method used to form homogenous groups within a data set based on their internal structure. Clustering is a method often used for exploratory analysis of the data. There are no ”predictions” of any values done with clustering just finding the similarity between the data and grouping them into clusters

The notion of similarities can be explained with the following examples: Consider questions such as

1. How do I group these documents by topic?

1. How do I perform customer segmentation to allow for targeted or special marketing programs.

The definition of “similarity” is specific to the problem domain. We are defining similarity as those data points with the same “topic” tag or customers who can be profiled in to a same “age group/income/gender” or a “purchase pattern”. If we have a vector of measurements of an attribute of the data, the data points that are grouped into a cluster will have values for the measurement close to each other than to those data points grouped in a different cluster. In other words the distance, (an inverse of similarity) between the points within a cluster are always lower than the distance between points in a different cluster. In a cluster we end up with a tight group (homogeneous) of data points that are far apart from those data points that end up in a different cluster. There are many clustering techniques and we are going to discuss one of the most popular clustering method known as “K-means clustering”.

K-means clustering is used to cluster numerical data.

In K-means we define two measures of distances, between two data points(records) and the distance between two clusters. Distance can be measured (calculated) in a number of ways but four principles tend to hold true.

1. Distance is not negative (it is stated as an absolute value)

2. Distance from one record to itself is zero.

3. Distance from record I to record J is the same as the distance from record J to record I, again since the distance is stated as an absolute value, the starting and end points can be reversed.

4. Distance between two records can not be greater than the sum of the distance between each record and a third record.

Euclidean distance is the most popular method for calculating distance. Euclidian distance is a “ordinary” distance that one could measure with a ruler. In a single dimension the Euclidian distance is the absolute value of the differences between two points. The straight line distance between two points. In a plane with p1 at (x1 , y1 ) and p2 at (x2 , y2 ), it is √((x1 – x2 )² + (y1 – y2 )²). In N dimensions, the Euclidean distance between two points p and q is √(∑i=1 N (pi -qi )²) where pi (or qi ) is the coordinate of p (or q) in dimension i. Though there are many other distance measures, the Euclidian distance is the most commonly used distance measure and many packages use this measure.The Euclidian distance is influenced by the scale of the variables. Changing the scale (for example from feet to inches) can significantly influence the results.Second, the equation ignores the relationship between variables. Lastly, the clustering algorithm is sensitive to outliers. If the data has outliers and removal of them is not possible, the results of the clustering can be substantially distorted. The centroid is the center of the discovered cluster. K-means clustering provides this as an output. When the number of clusters is fixed to k, K-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.