In the previous post of the ‘Clustering in ML Series’, we looked at what clustering is and some of the different clustering methods in ML. In this part, we will discuss the very first type of clustering method, known as Centroid-based clustering.

As explained in the previous post, Centroid-based clustering methods partition the data points/objects into ‘k’ number of clusters. You can think of it as a basic facility location problem, where the task is to find the best warehouse locations to optimally service a given set of consumers. “Warehouses” can be thought of as cluster centroids and “consumer locations” as the clustered data, as shown in the above picture. These clustering methods iteratively measure the distance between each data point and its nearest cluster’s centroid using various distance metrics.

Centroid-based clustering organizes the data into non-hierarchical clusters. In any of the centroid-based algorithms, the main underlying theme is the aspect of calculating the distance measure between the objects of the data set considered. The basic aspect of distance measure, generally, is derived using one among Euclidean, Minkowski, or Manhattan distance measuring mechanisms. In general, the mean is used in the Euclidean distance measure, the median in Manhattan, and the steepest descent method for calculating the distance measures. These are iterative clustering algorithms during which the notion of similarity is derived by the closeness of a data observation to the centroid of the clusters.

In these models, the no. of clusters required have to be mentioned beforehand, which makes it important to possess prior knowledge of the dataset. These models run iteratively to seek out the local optima. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.

The optimization problem itself is NP-hard, and thus the common approach is to look just for approximate solutions like Lloyd’s algorithm, often just known as “k-means algorithm”. It tries to find a local optimum and usually runs multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of the multiple runs, but also restricting the centroids to observations of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++), etc.

## Popular Centroid-based Clustering ML algorithms

#### 1. k-means Clustering

K-means algorithm is widely utilized in pattern recognition, classifications of documents, image processing, etc. This approach starts with assigning some objects as pivot objects. The value of k, the number of clusters that one wishes to possess, is given beforehand. Now the centroid of each such cluster is calculated. Each data point is then assigned to a cluster whose centroid is the nearest to it. The arithmetic mean is being calculated separately on par with its dimensionality.

K-means algorithm can be treated as a two-phase approach where:

1. In the first phase, k centroids are identified, as per the k value that has been given.  In general, space/distance measure is calculated using the Euclidean distance metric.
2. The second phase involves determining the new centroids such that the dissimilarity measures between the data points clustered together are very less. The process loops continuously, finding new centroids until convergence is achieved.

1. It is the fastest centroid-based algorithm.

2. It can work for large data sets.

3. It helps in reducing the intra-cluster variance measure.

1. Performance is affected when there is more noise in the data.

2. Outliers can never be studied.

3. Even though it reduces intra-cluster variance, it can’t affect or deal with the global minimum variance of measure.

4. It is very sensitive at clustering data sets of non-convex shaped clusters.

#### 2. k-Medoids

In k-medoids, each cluster is represented by the nearest object towards the center. The data points chosen as centers are termed as exemplars and the formed clusters are termed as priors. The silhouette method is usually utilized in order to determine the value of k. The process deals with applying the improved combination of k-medoids and the ‘Partitioning Around Medoids’ (PAM) algorithm on the data.

1. It is very robust to noisy data.

2. It is not sensitive to outliers.

3. Pairwise dissimilarity measure comes into play in the case of squared Euclidean distance measures.

1. Different initial sets of medoids affect the shape and effectiveness of the final cluster.

2. Clustering depends on the units of measurement, the difference in nature of objects differs in the efficiency.

3. It is also sensitive at clustering non-convex shaped clusters.

#### 3. CLARANS

CLARANS stands for Clustering Large Applications based on  RANdomized Search. CLARANS at times is thought of as an enhanced version of primitive CLARA (which is an algorithm used in reducing the computational efforts that one comes across using the k-medoid algorithm) and thus termed as Randomized “CLARA”. It does not have restrictions on search terms as with CLARA for any subset of objects. The basic CLARANS method starts with selecting a few pairs, say (i, h), instead of working on the whole dataset. It starts with randomly selected medoids and then checks for minimal dissimilarity measure i.e. it looks for medoid which is extremely near, termed as “Max-Neighbour” pair for swapping. If the cost is negative, it just updates the medoid set and the process continues. The process ends after reaching the optimal medoid set (termed as “Num-Local”) is achieved.

1. It is thought to be better than most medoid-based algorithms like k-medoids and CLARA.

2. It is more flexible, efficient, and scalable.

3. It covers major aspects of outliers.

4. It is robust to noisy data.