Let’s start with an example. Suppose you work at a supermarket and your manager has asked you to understand your customers better so that you can market your products to them more effectively. You may have no clue where to begin, because the problem is very broad. You are not looking for a specific insight into a specific case; you are looking for structures or patterns within consumer behavior data. You may want to know which groups of people bought which items, how much they spent in a single visit, or which items are often bought together. This method of identifying groups of similar data points in a dataset is called clustering.
Clustering is the task of dividing data points into groups such that the points within a group are similar to one another and dissimilar to the points in other groups.
Clustering is important because it reveals the inherent grouping in unlabeled data. There is no single criterion for a good clustering; which criterion to use depends on the problem at hand. For instance, we could be interested in finding representatives for homogeneous groups that can be used for data reduction, in finding clusters and describing their unknown properties, in finding useful and suitable groupings, or in detecting outliers.
In the above image, you can see that the different fruits are clustered into groups whose members share similar or identical features. In this example, these features could be taste, color, shape, etc.
Different Types of Clustering Algorithms
The various types of clustering algorithms are:
- Centroid-based Clustering
- Connectivity-based Clustering
- Distribution-based Clustering
- Density-based Clustering
- Fuzzy Clustering
1. Centroid-based Clustering
These methods partition the data points into a chosen number of clusters, ‘k’. They iteratively assign each data point to the cluster with the nearest centroid, measured with a distance metric such as Euclidean or Manhattan distance, and then update the centroids so that the total distance between points and their centroids decreases. This repeats until the assignments stop changing. Example – K-means, CLARANS (which uses medoids, i.e. actual data points, as cluster centers), etc.
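The assign-then-update loop described above can be sketched in a few lines. This is a minimal illustration of the standard K-means iteration (Lloyd's algorithm) using only the Python standard library; the function name `kmeans`, the toy points, and the choice of initial centroids are assumptions made for this example, not part of any particular library.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate an assignment step and a centroid-update step until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (made up for illustration): two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

On this toy data the first three points end up in cluster 0 and the last three in cluster 1. In practice the result depends on the initial centroids, which is why real implementations usually try several initializations.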
2. Connectivity-based Clustering
Also known as Hierarchical Clustering, this method builds a tree-like hierarchy of clusters instead of a single flat partition. The clusters are then obtained by cutting this hierarchy at a chosen level. It is divided into two categories: Agglomerative (a bottom-up approach that starts with every data point as its own cluster and repeatedly merges the closest clusters) and Divisive (a top-down approach that starts with one cluster containing all the data and repeatedly splits it). Example – CURE, BIRCH, etc.
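To make the bottom-up (agglomerative) approach concrete, here is a small sketch using only the Python standard library. It assumes single linkage (the distance between two clusters is the distance between their closest pair of points), which is only one of several possible linkage rules. The function name `agglomerative` and the toy points are hypothetical, and this is not how CURE or BIRCH work internally.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start: every point is its own cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of points across a and b.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy data (made up for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
result = agglomerative(points, n_clusters=3)
print(result)
```

Stopping the merging at `n_clusters` plays the role of cutting the hierarchy at a given level. This brute-force version recomputes all pairwise distances on every merge, so it is only suitable for very small datasets.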
3. Distribution-based Clustering
Distribution-based clustering groups data points based on their likelihood of belonging to the same probability distribution, such as a Gaussian or Binomial distribution. This closely mirrors how many datasets are generated in the first place, namely by random sampling from underlying distributions. A cluster is then defined as the set of objects that most likely belong to the same distribution. Example – DBCLASD.
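As an illustration of the idea, the sketch below fits a mixture of one-dimensional Gaussians with the expectation-maximization (EM) algorithm and assigns each point to the component most likely to have generated it. Note that this is a Gaussian mixture model, swapped in here because it is simple to show; it is not the DBCLASD algorithm mentioned above. The function names, the initial means, and the toy data are assumptions made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, means, n_iter=50):
    """EM for a 1-D Gaussian mixture; `means` holds the initial guesses."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing
    return weights, means, variances, resp

# Toy data (made up for illustration): values around 1 and values around 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
weights, means, variances, resp = fit_gmm_1d(data, means=[0.0, 6.0])
# Hard assignment: pick the component with the highest responsibility.
labels = [max(range(len(r)), key=lambda j: r[j]) for r in resp]
print(labels)
print([round(m, 3) for m in means])
```

The responsibilities in `resp` are probabilities, so each point also carries a measure of how confidently it belongs to its cluster.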
4. Density-based Clustering
Most clustering methods make two assumptions:
1- The data is free of noise
2- The clusters have a regular geometric shape (circular or elliptical)
In practice, data almost always contains some noise that cannot be ignored, and we should not limit ourselves to a fixed shape; clusters of arbitrary shape should be allowed so that no data points are forced into the wrong group. These are the cases where density-based algorithms are used. These methods treat clusters as dense regions of the data space, separated from one another by regions of lower density, and points lying in sparse regions can be treated as noise. They can find clusters of arbitrary shape and can merge neighboring dense regions into a single cluster. Example – DBSCAN, OPTICS, etc.
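The following is a compact sketch of the DBSCAN idea using only the Python standard library. A point with at least `min_pts` neighbors within distance `eps` is a core point, clusters grow outward from core points, and anything unreachable is labeled as noise. The parameter names, the `NOISE` marker, and the toy data are choices made for this illustration; a production implementation would also use a spatial index instead of scanning all points.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already handled
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    return labels

# Toy data (made up for illustration): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Here the two squares become clusters 0 and 1, and the isolated point at (20, 20) is marked as noise. Unlike the centroid-based approach, the number of clusters is not specified in advance; it follows from `eps` and `min_pts`.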
5. Fuzzy Clustering
Fuzzy clustering methods differ from the other clustering algorithms in that they can assign a data point to multiple clusters, each with a quantified degree of membership. Data points close to the center of a cluster belong to it to a higher degree than points at its edge. The degree to which an element belongs to a given cluster is measured by a membership coefficient that takes a value between 0 and 1. Fuzzy clustering is useful for datasets where there is a high level of overlap between groups.
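To show what membership coefficients look like, here is a minimal sketch of fuzzy c-means, a widely used fuzzy clustering algorithm, written with only the Python standard library. The fuzzifier `m=2.0`, the fixed iteration count, the function name, and the toy data are assumptions made for this illustration.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Return (memberships, centers); memberships[i][j] is point i's degree in cluster j."""
    centers = [tuple(c) for c in centers]
    memberships = []
    for _ in range(n_iter):
        # Membership update: closer centers get a larger share; each row sums to 1.
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]  # avoid division by zero
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of all points, weighted by membership ** m.
        new_centers = []
        for j in range(len(centers)):
            w = [u[j] ** m for u in memberships]
            new_centers.append(tuple(
                sum(wi * p[d] for wi, p in zip(w, points)) / sum(w)
                for d in range(len(points[0]))
            ))
        centers = new_centers
    return memberships, centers

# Toy data (made up for illustration): two groups plus one point in between.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (5.0, 5.0)]
memberships, centers = fuzzy_c_means(points, centers=[points[0], points[3]])
for p, u in zip(points, memberships):
    print(p, [round(x, 3) for x in u])
```

Points deep inside a group get a membership close to 1 for that group, while the in-between point at (5.0, 5.0) has its membership split between both clusters instead of being forced into one.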
This was a quick introduction to clustering and its different types. In the subsequent parts of this series, we will study one clustering algorithm from each of these types. See you in the next part of this series!! 🙂