Data Analytics in Bioinformatics. Группа авторов

Data Analytics in Bioinformatics - Группа авторов


Скачать книгу
categories

Schematic illustration of the agglomerative clustering and divisive clustering.

      Figure 2.4 (a) Agglomerative clustering, (b) divisive clustering.

      Divisive Clustering—This follows top down approach as depicted in Figure 2.4(b). Initially, it considers all the data points as single cluster or model. The model separation is performed recursively until required criteria is met based upon the model.

      Various algorithms of hierarchical clustering are described below.

       2.3.4.1 AGNES (Agglomerative Nesting)

      In this algorithm subsequent merging operations are performed on the dataset to arrange them into clusters. Dissimilarity coefficients are acquired using either Euclidean or Manhattan distance metrics [17]. These measured dissimilarities forms a matrix using unweighted pair group method with arithmetic mean [18].

      Drawbacks of this method are that the objects once merged cannot be separated and different computation metrics leads to varying results.

       2.3.4.2 DIANA (Divisive Analysis)

      This follows divisive clustering approach in which the whole dataset is divided into 2 sub parts (clusters) and these clusters are further separated into sub clusters until no splitting can be performed further [17].

      Drawback—once separated clusters cannot be merged, this cannot handle huge dataset [19]. This may not be applicable for gene expression data as it cannot handle missing data [17].

       2.3.4.3 CURE (Clustering Using Representatives)

      This algorithm operates on defining stable scatter points which are able to detect the shape and extent of a cluster. These scatter points based upon the similarities moves towards centroid in order to form a cluster. The scatter points capture the shape and extent of clusters which enables to detect non-spherical clusters. This algorithm is applied to gene expression data which produces the clusters efficiently [20].

       2.3.4.4 CHAMELEON

       2.3.4.5 BRICH (Balanced Iterative Reducing and Clustering Using Hierarchies)

      In this algorithmic approach entire data in the dataset is maintained in the form of a tree structure called as CF (clustering Feature) which is based on clustering feature. The CF tree also holds the information of clustering. As the outliers are eliminated the sub clusters data is made as leaf nodes. With iterative updating of CF BRICH provides the potential to deal with huge datasets. This works effectively for gene expression data [22].

      Drawback—This algorithm gets skewed with the presence of nonspherical clusters [22].

       2.3.5 Density-Based Approach

      Density based clustering algorithmic approach is composed by splitting the low point density regions and high point density regions in a given data space. The low point density region is observed as outliers [26].

       2.3.5.1 DBSCAN

      The DBSCAN algorithm proposed by Ref. [27] is based on the following criterion:

      Eps(€)—this depicts the space around the data point, if the difference of distance between 2 data points is greater than or equal to € then they are declared as neighbors, if the (€) value is very small then they are termed as outliers, if the (€) value is high then that data point is merged to form cluster.

      Minpts—this represents the minimum number of data points within the radius (€) for the larger datasets like microarrays Minpts are obtained from the dimensions of the dataset.

      From the following definition of data points clusters and outliers are identified

      Core point—presence of higher Minpts within radius of (€)

      Outlier—point which is neither core nor border point.

       2.3.6 Model-Based Approach

      This clustering algorithm handles the shortcomings of K-Means and Fuzzy K-Means. In this approach a probabilistic model is developed to which the data from the dataset is fit into [23]. Self-Organizing Maps implement the model based approach technique.

       2.3.6.1 SOM (Self-Organizing Maps)

      This algorithm is popularly known in gene expression data clustering in which the expression values are closely related. SOM adapts a neural network template with a single layer in which data is split over neurons as shown in Figure 2.5. As every neuron is connected to every other neuron and is associated with a reference vector, the data objects are plotted to the nearest reference vectors to form quality clusters [24].

      SOM’s are highly effective in mapping high dimensional data. The representation of data in the form of map provides quick visualization and interpretation [24].

       2.3.7 Grid-Based Clustering

Schematic illustration of the self organizing map.

      Figure 2.5 Self Organizing Map (SOM).

       2.3.7.1 STING (Statistical Information Grid-Based Algorithm)

      The working principle of this algorithm is based on the users input density value with respect to the cluster areas, space with the low density is referred as non-relevant which is discarded as noisy data. Computationally STING requires fewer resources. The grid structure along with the statistical information associated with the cells gives a graphical representation of cluster formed [26].

       2.3.8 Soft Clustering

      In this approach of soft clustering the data points in the dataset can belong to any cluster this is also defined as a probabilistic model. In simpler terms a single data


Скачать книгу