In the previous posts of the ‘Clustering in ML Series’, we learned about Connectivity-Based Clustering methods in ML. In this part, we will discuss two more families of clustering methods: Distribution-Based Clustering and Fuzzy Clustering.
Until now, we have learned about clustering methods that are based on similarity/distance or density. This family of clustering algorithms takes a totally different criterion into consideration: probability. Distribution-Based Clustering is a clustering model in which we group data points by the probability that they belong to the same distribution. This approach assumes the data is composed of a mixture of distributions, such as Gaussian, binomial, etc. The Gaussian distribution is the most common choice: we fix the number of distributions in advance and fit their parameters so that the likelihood of the observed data is maximized.
As you can see in the above image, the data is modeled as 3 Gaussian distributions, and as the distance from a distribution’s center increases, the probability that a point belongs to that distribution decreases. The band colors show this decrease in probability. Distribution-based clustering models are the ones most closely related to statistics, and they closely mirror the way datasets are often generated: by randomly sampling data points from some distribution. Clusters can then easily be defined as objects that are most likely to belong to the same distribution. The Gaussian mixture model, fitted with the expectation-maximization (EM) algorithm, is one of the popular examples of distribution-based clustering; it models each cluster as a multivariate normal distribution.
Distribution-based clustering has a major advantage over proximity-based and centroid-based clustering methods in terms of flexibility, correctness, and the shape of the clusters formed. Those algorithms make assumptions about cluster shape that are controlled by one or more hyperparameters, and if we set these hyperparameter values incorrectly, they may lead to unwanted results. One disadvantage of distribution-based clustering is that we can only use it when we have information about the type of distribution our data follows.
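To make distribution-based clustering a bit more concrete, here is a minimal sketch that fits a mixture of three Gaussians with EM. It assumes scikit-learn's `GaussianMixture` is available, and the synthetic data, blob centers, and variable names are made up purely for illustration; they are not from the series.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): three 2-D Gaussian blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in blob_centers])

# Fit a mixture of three multivariate normal distributions using EM.
# The number of components must be chosen up front.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

# Hard assignment: the most probable component for each point.
labels = gmm.predict(X)

# Soft assignment: the probability of each point under each component.
# Each row sums to 1, and the values drop as a point moves away
# from a component's center.
probs = gmm.predict_proba(X)

print(gmm.means_)          # estimated component centers
print(probs[:3].round(3))  # membership probabilities of the first 3 points
```

Note that `n_components=3` plays the role of the "fixed number of distributions" mentioned earlier; if the data does not really look Gaussian, the fitted clusters may not be meaningful.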
Most of the standard clustering algorithms, such as k-means, PAM, etc., are thought of as hard clustering, since they produce partitions in which each observation belongs to exactly one cluster. In contrast, Fuzzy Clustering is considered a soft clustering method, where each item can be a member of more than one cluster, and each item has a set of membership coefficients corresponding to its degree of belonging to each cluster. The fuzzy c-means algorithm (FCM) is one of the most popular fuzzy clustering algorithms; in it, the centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster.
In fuzzy clustering, points close to the center of a cluster are given a higher degree of membership than points on the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value between 0 and 1.
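The fuzzy c-means idea described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than a reference implementation: the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the stopping tolerance, and the synthetic two-blob data are all assumptions chosen for the example.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Simplified fuzzy c-means; m > 1 controls how fuzzy the clusters are."""
    rng = np.random.default_rng(seed)
    # Start from random membership degrees; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centroid = mean of all points, weighted by membership**m.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every centroid.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Closer centroids get higher membership; rows are renormalized to 1.
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Synthetic data (an assumption for illustration): two 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

centers, U = fuzzy_c_means(X, n_clusters=2)
print(centers)            # estimated centroids
print(U[:3].round(3))     # membership degrees of the first 3 points
```

Every row of `U` holds one point's membership coefficients, each between 0 and 1. In this variant they sum to 1 across clusters, which is how standard FCM normalizes them.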
Fuzzy clustering can be used with datasets where the variables have a high level of overlap. It is preferred for image segmentation applications and especially in bioinformatics. In these domains, items often overlap: genes can belong to more than one group, and pixels near a boundary can plausibly belong to more than one region. Generic hard clustering algorithms struggle to separate such cases and fail to produce a proper clustering.
This was a quick introduction to Distribution-Based Clustering and Fuzzy Clustering. This concludes our Clustering in ML Series. I hope you enjoyed it and learned new concepts 🙂