• Serial: The three algorithms are applied one after the other (in order of individual accuracy).
• Parallel: All three algorithms are applied in parallel and majority voting is used.
• Prob 60 SP: If the probability calculated by Naive Bayes is greater than 60%, the serial method is applied; otherwise, the parallel method is used (a sketch of this dispatch follows the list).
• PLS: The parallel method is applied first, then the serial method is applied to the wrongly classified records.
• SKmeans: Combination of the serial method with K-means.
• PKmeans: Combination of the parallel method with K-means.
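The Prob 60 SP dispatch can be written compactly. The following is a minimal sketch, assuming a fitted scikit-learn GaussianNB model; apply_serial and apply_parallel are hypothetical stand-ins for the serial and parallel ensembles described above, and using the maximum class probability as the Naive Bayes confidence is our interpretation.

    # Minimal sketch of the Prob 60 SP dispatch; apply_serial and
    # apply_parallel are hypothetical stand-ins for the methods above.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def prob_60_sp(nb: GaussianNB, x: np.ndarray, apply_serial, apply_parallel):
        """Route one record: serial when Naive Bayes is >60% confident, else parallel."""
        confidence = nb.predict_proba(x.reshape(1, -1)).max()
        return apply_serial(x) if confidence > 0.60 else apply_parallel(x)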
From this analysis, we found the PKmeans method to be the most efficient. Although SKmeans achieves the best accuracy on the training data, it is not feasible for real data, where the target column is not present. No single algorithm can be relied on to classify all the records correctly; hence, we use a more suitable ensemble method that draws on the wisdom of the crowd. It uses majority voting, which aggregates the crisp class-label decisions of the different models and predicts the class with the most votes.
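As an illustration, the following is a minimal sketch of such a majority-voting ensemble (the PKmeans variant), assuming scikit-learn, a binary target, and NumPy arrays X_train, y_train, and X_test; mapping each K-means cluster to its majority training class is our assumption for turning the unsupervised output into crisp labels.

    # Minimal sketch of a four-model majority vote (PKmeans variant).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    def pkmeans_vote(X_train, y_train, X_test):
        models = [GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier()]
        votes = [m.fit(X_train, y_train).predict(X_test) for m in models]

        # K-means is unsupervised: map each cluster to its majority training class
        km = KMeans(n_clusters=2, n_init=10).fit(X_train)
        cluster_to_class = {c: np.bincount(y_train[km.labels_ == c]).argmax()
                            for c in (0, 1)}
        votes.append(np.vectorize(cluster_to_class.get)(km.predict(X_test)))

        # Majority vote over the four crisp label vectors (2-2 ties fall to class 0)
        return (np.mean(votes, axis=0) > 0.5).astype(int)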
Our goal is to achieve the best possible accuracy, surpassing that achieved by the individual methods. Figures 1.8 to 1.11 show the confusion matrices plotted for Naive Bayes, Random Forest, and Decision Tree individually, as well as their ROC curves.
Figure 1.8 NB confusion matrix.
Figure 1.9 RF confusion matrix.
Figure 1.10 DT confusion matrix.
Figure 1.11 ROC curve analysis.
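For reference, confusion matrices and ROC curves like those in Figures 1.8 to 1.11 can be produced with scikit-learn's display helpers; the sketch below assumes a fitted model and held-out X_test and y_test, which are placeholders.

    # Minimal sketch for producing per-model evaluation figures.
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

    def plot_evaluation(model, X_test, y_test, name):
        # Confusion matrix (cf. Figures 1.8-1.10)
        ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
        plt.title(name + " confusion matrix")
        # ROC curve (cf. Figure 1.11)
        RocCurveDisplay.from_estimator(model, X_test, y_test)
        plt.title("ROC curve analysis")
        plt.show()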
1.4.2 Method
We observed that by applying an ensemble method of the majority-voting type to the algorithms Decision Tree, Random Forest, Naive Bayes, and K-means, we could achieve an accuracy of 91.56%. To improve the accuracy further, we propose the following algorithm. The design of the proposed method is given in Figure 1.12.
Figure 1.12 Proposed architecture.
Algorithm 1.1 Probabilistic optimization.
initialization
d ← dataset
a1 ← Naive_Bayes_output ← ApplyNaiveBayes(d)
a2 ← Decision_tree_output ← ApplyDecisionTree(d)
a3 ← Random_forest_output ← ApplyRandomForest(d)
a4 ← K_means_output ← ApplyKmeans(d)
winner(0, 1) ← Voting(a1, a2, a3, a4)
op ← winner_of_max_count(0, 1)
if op ≠ desired_output then
    calculate the probability of each column with output 0 or 1
end
for each value in ci do
    count ← ci / 2
    for k ← 1 to count do
        add the probability (find the column whose probability matches best)
    end
end
ti ← number of columns selected
wi ← weightage of the selected columns
αi ← input data with the weightage appended
compute the mean square error against the training data and keep the lowest-MSE parameter
compute the Euclidean distance ED(x, y) = √Σi (xi − yi)² and find the minimum distance
if probability of data > 0.5 and MSE < 0.5 and ED < 0.2 then
    classify as 1
else
    classify as 0
end
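The final decision rule of Algorithm 1.1 can be sketched as follows; the thresholds (0.5, 0.5, 0.2) are taken from the pseudocode, while the fitted Naive Bayes model nb and the lowest-MSE training reference vector ref are our assumptions about the surrounding setup.

    # Minimal sketch of the final decision rule of Algorithm 1.1.
    import numpy as np

    def classify_record(nb, x, ref):
        # nb: fitted Naive Bayes model; x: weighted record;
        # ref: lowest-MSE training reference vector (assumed precomputed)
        p = nb.predict_proba(x.reshape(1, -1))[0, 1]   # probability of class 1
        mse = np.mean((x - ref) ** 2)                  # mean square error vs. reference
        ed = np.sqrt(np.sum((x - ref) ** 2))           # Euclidean distance
        return 1 if (p > 0.5 and mse < 0.5 and ed < 0.2) else 0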
The block diagram in Figure 1.12 illustrates the flow of Algorithm 1.1.
The working of the algorithm is explained briefly as follows.
1. The ensemble of the four algorithms (Decision Tree, Random Forest, Naive Bayes, and K-means) is applied with majority voting, and a classification into presence or absence of cardiopathy is obtained.
2. The wrongly classified records are stored in a separate dataset.
3. The probability of each column with respect to the output is calculated and stored. For example, for age, the probability of heart disease is higher for ages greater than 45 than for ages below it.
4. The columns for which the probability is maximum are determined.
5. Only these columns are selected for further analysis.
6. For linear data, the weights of these columns are calculated with multiple linear regression using the formula y = mx + c (see the sketch after this list).
7. For non-linear data, where the chances of misclassification are higher, more complex functions such as tanh, sigmoid, and ReLU are used to calculate the weights.
8. The weights are appended to the columns at the time of classification.
9. The mean square error and the Euclidean distance are calculated.
10. Finally, based on the probability, the mean square error, and the Euclidean distance, the records are classified as 1 or 0, indicating the presence or absence of heart disease.
11. Hence, the accuracy achieved is higher than with the classical ensemble method.
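Steps 6 to 8 can be sketched as follows. Using the LinearRegression coefficients as column weights, and tanh as the non-linear transform, is our reading of the description, not the authors' exact code; sigmoid or ReLU could be substituted in the same place, and X_sel (the selected columns) and y are placeholders.

    # Minimal sketch of the column-weighting step (our interpretation).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def column_weights(X_sel, y, linear=True):
        # Step 6: fit y = mx + c over the selected columns; the regression
        # coefficients serve as the column weights
        w = LinearRegression().fit(X_sel, y).coef_
        if not linear:
            # Step 7: transform the weights for non-linear data (tanh shown;
            # sigmoid and ReLU are the alternatives named in the text)
            w = np.tanh(w)
        return w

    def append_weights(X_sel, w):
        # Step 8: apply the weights to the selected columns before classification
        return X_sel * w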
Hence, our proposed methodology achieves an accuracy that not only surpasses the individual methods but also exceeds that of the plain ensemble combination, and the accuracy thus achieved is quite competitive.
1.5 Conclusion
An ensemble of classifiers is a collection of classification models whose individual predictions are combined, by means of weighted or unweighted voting, to assign a class label to each new pattern. There is no single best method of creating successful ensembles.