Clustering – As the word is self explanatory that here we have group of similar points. Clustering is the method of unsupervised learning and is used for statistical data analysis in many fields.

Clustering is the process of collection of objects on the basis of similarity and dissimilarity between them.

Look at below graph plotting various data points from a dataset.

As we can observe that similar data points are clubbed together. This is the basic idea behind clustering.

Clusters have no definite shape.Clustering is useful as it determines the intrinsic grouping among the unlabeled data present.

Clustering algorithms are widely used in retail, banking, healthcare, manufacturing industries etc.

They are helpful to organize unstructured data and can be used on tabular data, images, text data also.

We can understand clusters with the help of our vary favourite app ‘Amazon’.

Amazon app has various categories for its customers like electronics, clothing, footwears, household things etc. Within electronics category, we have sub-categories like TV, Mobiles, Referigerators, Clothing etc. So, the items with similar features lie in the same category or sub-category.

Types of clustering

Hard Clustering – Data points either belongs to a cluster completely or not. For instance, in amzon app example, any dress will definitely belong to the cluster ‘Clothing’.

Soft Clustering – Here, each data point is assigned a probability or likelihood to be in clusters. For instance, in amazon app, TV may belong to categories ‘Electronics’ or ‘Household items’. Depending upon its probability to be in both the categories, it will be assigned a cluster in which chances of its relation be more than other.

Types of clustering algorithms

Connectivity Models – These are based on the fact that data points plotted closer to each other are similar in behaviour than the points lying farther. This is distance based method. We can choose either of below approaches.

• Start with classifying all data points into separate clusters and then aggregating them as distance decreases.
• Classify all the data points as a single cluster and then keep on partitioning it as the distance increases.

It is also known as hierarchical clustering as there is hierarchy of clusters that merge with each other at certain distances.

Noise and outliers can cause problems for this models since they show up as additional clusters and can make model more complex or may result into inappropriate results. They are easy to intepret but are not highly scalable.

• Centroid Models – Here, data points are clustered together on the basis of their distance from the centroid of clusters. Centroid is chosen such that the distance of data points from center is minimum. It requires the prior knowledge of dataset since we need to mention the number of clusters required at the end. K-Means is one such type of clustering algorithm.
• Distribution Models – Here, data points are clustered together on the basis of the probability to belong to the same distribution (normal or guassian). They may suffer from overfitting. Ex – Expectation-Maximization Algorithm that uses multivariate normal distributions.
• Density Models – Here, cluster is formed on the basis of varied density of data points in space. Data points in the same density region are supposed to belong to same cluster. Ex – DBSCAN and OPTICS.

Distance calculation for clustering

Suppose we have 2D dataset with variables xi and xj given below:-

xi = (xi1, xi2, . . . , xip)

 xj = (xj1, xj2, . . . , xjp)

We can calculate various distance as follows:-

1. Euclidean Distance – Sum of the squares of the difference between the corresponding points. It is used to calculate the distance between numerical values. d(xi , xj ) = (|xi1 − xj1|² + |xi2 − xj2|² + . . . + |xip − xjp|²)
1. Manhattan Distance – Sum of the absolute values of the difference in the given coordinates. It is used to calculate the distance between numerical values. It is given as d(xi , xj ) = (|xi1 − xj1| + |xi2 − xj2| + . . . + |xip − xjp|

We will explain K-Means using below example.

Suppose we have a dataset containing below data points.

{3,5,8,2,19,16,25,34,55,4,39}

As we already discussed that K-Means is centroid based clustering algorithm which require information of number of clusters to be formed before applying the algorithm.

Let’s say we want 2 clusters to be formed. Usually, this information of number of clusters is denoted by K here.

So, K =2

Randomly, choose any data points

Let’s say we choose 8 and 25. These points will behave as centroid.

Now, make clusters of points closer to 8 as one cluster and those closer to 25 as second cluster.

Cluster1 Cluster2

K = 1 K = 2

{3,5,8,2,16,4} {19,34,55,39}

1. Iteration 1

Take mean of the values in each cluster

Mean1 = 6.3 = 6 Mean2 = 36.5 = 37

Above calculated mean values will become centroid for next iteration. Reform the cluster using above two mean values as centroid and assign other points from the original dataset to the cluster where centroid is closer to them.

Cluster1 Cluster2

K = 1 K = 2

{3,5,8,2,19,16,4} {34,55,39}

1. Iteration 2

Take mean of the values in each cluster

Mean1 = 8.14 = 8 Mean2 = 42.6 = 43

Use new mean values as centroid and reshuffle the cluster points. New cluster will be

Cluster1 Cluster2

K = 1 K = 2

{3,5,8,2,19,16,4} {34,55,39}

As we can observe that new clusters are same as in the previous iteration, so we can stop here and these would become our final clusters. These iterations need to be repeated in the same way until we get the same clusters as in the previous step.

How to select the best value of K????

The most important parameter to decide in K-Means clustering is ‘k’ i.e number of clusters. There is not any rule which fit all the situations to find the value of ‘k’. It actually depends on the shape and distribution of the observations in a dataset.

Best value of ‘k’ will help to maintain the balance between the clusters and the average variation within a cluster.

Below are few methods to find the best value of ‘k’.

1. Elbow Method
• Apply clustering algorithm for different values of k (say 1 to 15).
• For each value of k, calculate the sum of squared errors(SSE).
• Plot a graph between SSE and values of k.
• If everything is done perfectly, then graph will look like a curved line similar to bend arm. And the bend point (known as elbow) will be the best value of k.
• The basic idea behind every algo is to minimize the SSE. In the plt drawn in elbow method, SSE is ideally minimum (or 0) when k is equal to the number of data points on the plot since in that case, every data point would lie in its own cluster and SSE would no doubt be 0. But we can’t go by this ideal situation in real scenario and have to choose k such that SSE is minimum instead of 0. The elbow of arm in above plot satisfies this criteria very well.

$${}$$