Data Analytics in Bioinformatics. Группа авторов
categories
Agglomerative clustering—This follows a bottom up approach as depicted in Figure 2.4(a). Initially, every other object in the dataset is assumed as a single cluster. Distance measures are computed upon each object and based upon the similarities they are merged to form clusters.
Figure 2.4 (a) Agglomerative clustering, (b) divisive clustering.
Divisive Clustering—This follows top down approach as depicted in Figure 2.4(b). Initially, it considers all the data points as single cluster or model. The model separation is performed recursively until required criteria is met based upon the model.
Various algorithms of hierarchical clustering are described below.
2.3.4.1 AGNES (Agglomerative Nesting)
In this algorithm subsequent merging operations are performed on the dataset to arrange them into clusters. Dissimilarity coefficients are acquired using either Euclidean or Manhattan distance metrics [17]. These measured dissimilarities forms a matrix using unweighted pair group method with arithmetic mean [18].
Drawbacks of this method are that the objects once merged cannot be separated and different computation metrics leads to varying results.
2.3.4.2 DIANA (Divisive Analysis)
This follows divisive clustering approach in which the whole dataset is divided into 2 sub parts (clusters) and these clusters are further separated into sub clusters until no splitting can be performed further [17].
Drawback—once separated clusters cannot be merged, this cannot handle huge dataset [19]. This may not be applicable for gene expression data as it cannot handle missing data [17].
2.3.4.3 CURE (Clustering Using Representatives)
This algorithm operates on defining stable scatter points which are able to detect the shape and extent of a cluster. These scatter points based upon the similarities moves towards centroid in order to form a cluster. The scatter points capture the shape and extent of clusters which enables to detect non-spherical clusters. This algorithm is applied to gene expression data which produces the clusters efficiently [20].
2.3.4.4 CHAMELEON
This algorithm is based on dynamic modeling which is executed in two stages. In the first stage graph partitioning algorithm is applied resulting into smaller clusters. As graph partitioning technique is applied the elements in the clusters are strongly connected. This establishes a quality separation among the clusters by discarding the noisy data. The later stage is executed with the application of agglomerative hierarchical clustering combining all the smaller clusters iteratively. This algorithm doesn’t fall short of incorrect merging because of its 2 stage execution. And also CHAMELEON overcomes all the shortcomings of AGNES and DIANA in which the data once split cannot be merged [21].
2.3.4.5 BRICH (Balanced Iterative Reducing and Clustering Using Hierarchies)
In this algorithmic approach entire data in the dataset is maintained in the form of a tree structure called as CF (clustering Feature) which is based on clustering feature. The CF tree also holds the information of clustering. As the outliers are eliminated the sub clusters data is made as leaf nodes. With iterative updating of CF BRICH provides the potential to deal with huge datasets. This works effectively for gene expression data [22].
Drawback—This algorithm gets skewed with the presence of nonspherical clusters [22].
2.3.5 Density-Based Approach
Density based clustering algorithmic approach is composed by splitting the low point density regions and high point density regions in a given data space. The low point density region is observed as outliers [26].
2.3.5.1 DBSCAN
The DBSCAN algorithm proposed by Ref. [27] is based on the following criterion:
Eps(€)—this depicts the space around the data point, if the difference of distance between 2 data points is greater than or equal to € then they are declared as neighbors, if the (€) value is very small then they are termed as outliers, if the (€) value is high then that data point is merged to form cluster.
Minpts—this represents the minimum number of data points within the radius (€) for the larger datasets like microarrays Minpts are obtained from the dimensions of the dataset.
From the following definition of data points clusters and outliers are identified
Core point—presence of higher Minpts within radius of (€)
Border point—presence of less Minpts within radius of (€) but close to core point
Outlier—point which is neither core nor border point.
2.3.6 Model-Based Approach
This clustering algorithm handles the shortcomings of K-Means and Fuzzy K-Means. In this approach a probabilistic model is developed to which the data from the dataset is fit into [23]. Self-Organizing Maps implement the model based approach technique.
2.3.6.1 SOM (Self-Organizing Maps)
This algorithm is popularly known in gene expression data clustering in which the expression values are closely related. SOM adapts a neural network template with a single layer in which data is split over neurons as shown in Figure 2.5. As every neuron is connected to every other neuron and is associated with a reference vector, the data objects are plotted to the nearest reference vectors to form quality clusters [24].
SOM’s are highly effective in mapping high dimensional data. The representation of data in the form of map provides quick visualization and interpretation [24].
2.3.7 Grid-Based Clustering
In grid based algorithm rather than working with the data points this operates on space around the data points. The grid structure is made up of cells which segregate the data into number of closely connected cells. Density of each cell is calculated and sorted accordingly [25]. The computational complexity of grid based algorithm is less when compared with other algorithms this makes this approach more adaptable in dealing with large datasets.
Figure 2.5 Self Organizing Map (SOM).
2.3.7.1 STING (Statistical Information Grid-Based Algorithm)
The working principle of this algorithm is based on the users input density value with respect to the cluster areas, space with the low density is referred as non-relevant which is discarded as noisy data. Computationally STING requires fewer resources. The grid structure along with the statistical information associated with the cells gives a graphical representation of cluster formed [26].
2.3.8 Soft Clustering
In this approach of soft clustering the data points in the dataset can belong to any cluster this is also defined as a probabilistic model. In simpler terms a single data