Data Analytics in Bioinformatics. Group of authors
Figure 1.13 ROC curve for random forest.
1.6 K-Nearest Neighbor
K-Nearest Neighbor (KNN) is a supervised learning algorithm and therefore needs labeled data for training [77, 78]. The user chooses the value of K. KNN can be used for both classification and regression, but the attribute values of every data point must be known. To handle a new data point, KNN finds the K closest training points and assigns the majority class among them (or, for regression, the average of their values).
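The voting step can be illustrated with a minimal sketch in plain Python. It is not the implementation used for the results in this chapter. The function name `knn_predict`, the two features (age and resting blood pressure), and the toy values are assumptions made only for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    neighbors = sorted(
        (dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k nearest points and take a majority vote.
    top_labels = [label for _, label in neighbors[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical toy data: (age, resting blood pressure); 1 = disease, 0 = no disease.
train_X = [(63, 145), (67, 160), (58, 150), (41, 120), (35, 118), (45, 125)]
train_y = [1, 1, 1, 0, 0, 0]

prediction = knn_predict(train_X, train_y, query=(60, 148), k=3)
print(prediction)
```

In practice the features would usually be scaled before distances are computed, since an attribute with a large numeric range otherwise dominates the distance.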
The Area under the ROC Curve (AUC) has also been used on the heart disease dataset. It is one of the most basic tools for judging a classifier’s performance in medical decision making [79–81]. The ROC curve is a graphical plot of the diagnostic ability of a binary classifier as its decision threshold varies, and the AUC summarizes that curve in a single number. The ROC curve generated for KNN on the heart disease dataset [41] is presented in Figure 1.14.
In Figure 1.14, the true positive rate (probability of detection) is plotted on the y-axis and the false positive rate (probability of false alarm) on the x-axis. The false positive rate is the proportion of units with a known negative condition for which the predicted condition is positive.
The AUC of K-nearest neighbor was computed on the heart disease dataset [41] in Python (Google Colab) and is shown in Table 1.5.
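How the two rates trace out a ROC curve, and how the area under it is obtained, can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the helper names `roc_points` and `auc` and the six labels and scores are hypothetical and are not taken from the heart disease dataset.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Each lower threshold marks more samples as predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical labels (1 = positive condition) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve)
print(round(area, 4))
```

In this toy example one negative sample is scored above one positive sample, so the area is 8/9 rather than 1.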
Figure 1.14 ROC curve for k-nearest neighbor.
Table 1.5 AUC: K-nearest neighbor.
Parameter | Data | Value | Result |
The area under the ROC Curve (AUC) | Training Data | 1.0000000 | Outstanding |
The area under the ROC Curve (AUC) | Test Data | 1.0000000 | Outstanding |
Index: 0.5: No discrimination, 0.6–0.8: Acceptable, 0.8–0.9: Excellent, >0.9: Outstanding |
The AUC is 1.0000000 on the training data and 1.0000000 on the test data, both of which fall in the outstanding range of the index. The result indicates that KNN performs outstandingly on the dataset.
1.7 Decision Trees
A decision tree is a form of supervised machine learning, invented by William Belson in 1959 [82]. It predicts response values by learning decision rules derived from the features [83–84]. Decision trees are good for evaluating options and are used in operations research and decision analysis. An example of a decision tree that determines whether a person has heart disease is presented in Figure 1.15.
Figure 1.15 answers the question “Does a person have heart disease?” by checking several conditions in turn to reach a conclusion. First, it checks whether the person has chest pain. If yes, it then checks whether the person has high blood pressure. Whether the blood pressure is high or low, the person is classified as suffering from heart disease. If the person does not have chest pain, the person is classified as not suffering from heart disease. After the decision tree was implemented on the heart disease dataset [41], the AUC values were generated and are presented in Table 1.6. The implementation was done in Python (Google Colab).
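The walk from the root of the tree to a conclusion can be sketched in plain Python. The nested-dictionary layout, the key names, and the `classify` helper are assumptions chosen for illustration. The rules encode only the example tree described in the text, not the model trained on the dataset.

```python
# Each internal node asks a yes/no question; each leaf is a final label.
# Both blood-pressure branches lead to the same leaf, as in the text's example.
tree = {
    "question": "chest_pain",
    "yes": {
        "question": "high_blood_pressure",
        "yes": "heart disease",
        "no": "heart disease",
    },
    "no": "no heart disease",
}

def classify(node, patient):
    """Walk from the root to a leaf, following the patient's answers."""
    while isinstance(node, dict):
        branch = "yes" if patient[node["question"]] else "no"
        node = node[branch]
    return node

# Hypothetical patients described by the two conditions used in the example tree.
patient_a = {"chest_pain": True, "high_blood_pressure": False}
patient_b = {"chest_pain": False, "high_blood_pressure": True}

print(classify(tree, patient_a))
print(classify(tree, patient_b))
```

A learned tree has the same shape, but its questions and thresholds are chosen from the training data rather than written by hand.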
Figure 1.15 Decision tree.
Table 1.6 AUC: Decision trees.
Parameter | Data | Value | Result |
The area under the ROC Curve (AUC) | Training Data | 0.9588996 | Outstanding |
The area under the ROC Curve (AUC) | Test Data | 0.9773333 | Outstanding |
Index: 0.5: No discrimination, 0.6–0.8: Acceptable, 0.8–0.9: Excellent, >0.9: Outstanding |
The AUC is 0.9588996 on the training data and 0.9773333 on the test data, both of which fall in the outstanding range of the index. The result indicates that the decision tree model performs outstandingly on the heart disease dataset.
1.8 Support Vector Machines
The original Support Vector Machine (SVM) algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963 [85]. In machine learning, the Support Vector Classifier fits the data that the user provides and returns the best-fit hyperplane that separates the classes. Once the hyperplane is obtained, the user can feed features to the classifier to get the predicted class [86–87]. SVM can be used for both regression and classification. The same example of separating persons who suffer from heart disease from those who do not is depicted in more detail in Figure 1.16.
Figure 1.16 illustrates a Support Vector Machine by bringing together the hyperplane, support vectors, maximum margin, and data points, where each data point belongs either to a person suffering from heart disease or to a person who is not. Support vectors are the data points that lie closest to the hyperplane, and they determine its position and orientation. If they are removed, the position and orientation of the hyperplane change and the maximum margin is affected as well [88–90]. The maximum margin is the distance between the nearest points of the two classes, measured perpendicular to the hyperplane. Here, Class 1 contains the persons suffering from heart disease and Class 2 contains the persons who are not. After SVM was implemented on the heart disease dataset [41] in Python (Google Colab), the generated AUC values presented in Table 1.7 show that the model performs outstandingly.
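The relationship between the hyperplane, the support vectors, and the margin can be illustrated with a small geometric sketch in plain Python. It does not train an SVM. The hyperplane `w`, `b` is picked by hand for a hypothetical two-feature toy set, with label +1 standing for Class 1 and -1 for Class 2. All names and values are assumptions.

```python
from math import sqrt

def signed_distance(w, b, x):
    """Signed perpendicular distance from point x to the hyperplane w . x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hand-picked hyperplane x1 + x2 - 4 = 0 (an assumption, not a fitted model).
w, b = (1.0, 1.0), -4.0

# Hypothetical points with labels: +1 = Class 1, -1 = Class 2.
points = [((3, 3), 1), ((4, 4), 1), ((4, 2), 1),
          ((1, 1), -1), ((0, 0), -1), ((0, 2), -1)]

# The side of the hyperplane a point falls on gives its predicted class.
predictions = [1 if signed_distance(w, b, x) >= 0 else -1 for x, _ in points]

# Support vectors are the points closest to the hyperplane.
distances = [abs(signed_distance(w, b, x)) for x, _ in points]
closest = min(distances)
support_vectors = [x for (x, _), d in zip(points, distances)
                   if abs(d - closest) < 1e-9]

# The margin spans from the nearest point of one class to that of the other.
margin_width = 2 * closest

print(predictions)
print(support_vectors)
print(round(margin_width, 4))
```

Removing the point (4, 4) or (0, 0) leaves the margin unchanged, whereas removing a support vector would allow a different hyperplane with a wider margin.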
Figure 1.16 Support vector machine.
Table 1.7 AUC: Support vector machines.
Parameter | Data | Value | Result |
The area under the ROC |