Bioinformatics and Medical Applications. Группа авторов

Bioinformatics and Medical Applications

of input attributes.

Figures 1.2, 1.3, 1.4, and 1.5 display the distribution of some of the input values such as age, gender, presence of cardiovascular disease, and cholesterol type.

1.3.2 Machine Learning Algorithm

Post analysis of the data, it was broken up into training (80%) and testing (20%) sets, respectively. This is necessary to accept the power of the model to summarize new details. A few classifier models have been tested which have been explained as follows.

A bar graph depicts the age distribution.

Figure 1.2 Age distribution.

A bar graph depicts the presence of cardiovascular disease.

Figure 1.3 Presence of cardiovascular disease.

A bar graph depicts the cholesterol type distribution.

Figure 1.4 Cholesterol type distribution.

A bar graph depicts the gender distribution.

Figure 1.5 Gender distribution.

1.3.3 Decision Tree

Decision Trees are amazing and well-known devices which are used for classification and forecasting. It is a tree based classifier wherein nodes represent a test on one attribute, leaves indicate the worth of the target attribute, edge represents split of 1 attribute and path is a dis junction of test to form the ultimate decision.

The current implementation offers two stages of impurity (Gini impurity and entropy) and one impurity measure for regression (variability). Gini’s impurity refers to the probability of a misdiagnosis of a replacement variate, if that condition is new organized randomly in accordance with the distribution of class labels from the information set. Bound by 0 occurs when data contains only one category. Gini Index is defined by the formula

Entropy is defined as

where p_j is the proportion of samples that belong to class c for a specific node.

Gini impurity and entropy are used as selection criterion for decision trees. Basically, they assist us with figuring out what is a decent split point for root/decision nodes on classification/regression trees. Decision trees utilizes the split point to split on the feature resulting in the highest information gain (IG) for a given criteria which is referred to as Gini or entropy. It is based on the decrease in entropy after a dataset is split on an attribute. A number of the benefits of decision tree are as follows:

• It requires less effort to process data while it is done in advance.

• It does not require standardization and data scaling.

• Intuitive and simple to clarify.

However, it has some disadvantages too, as follows:

• Minor changes in the data can cause major structural changes leading to instability.

• Sometimes math can be very difficult in some algorithms.

• It usually involves more time for training.

• It is very expensive as the complexity and time taken is too much.

• Not adequate on regression and predicting continuous values.

1.3.4 Random Forest

The Random Forest, just as its name infers, increases the number of individual decision trees that work in conjunction. The main idea behind a random forest is the wisdom of the masses. An enormous number of moderately unrelated trees functioning as a council will surpass any existing models. Random Forest allows us to change the contributions by tuning the boundaries like basis, depth of tree, and maximum and minimum leaf. It is a supervised machine learning algorithm, used for both classification and regression. It makes use of bagging and feature randomness while assembling each singular tree to try to make an uncorrelated forest whose expectation is to be more precise than that of any individual tree. The numerical clarification of the model is as given:

1 1. Let D be a collection of dataset used for purpose of training D = (x1, y1) … (xn, yn).

2 2. Let w = w1(x); w2(x) … wk(x) be an ensemble of weak classifiers.

3 3. If every wk is a decision tree, then the parameters of the tree are described as

1 4. Output of each decision tree is a classifier wk(x) = w(x|θk).

2 5. Hence, Final Classification f(x) = Majority Voting of wk(X).

Figure 1.6 gives a pictorial representation of the working of random forest.

Some of the advantages of Random Forest algorithm are as follows:

• Reduces overfitting problem.

• Solves both clasification and regression problems.

• Handles missing values automatically.

• Stable and robust to outliers.

Schematic illustration of the random forest algorithm.

Figure 1.6 Random forest algorithm.

Some of the disadvantages are as follows:

• Complexity

• Longer training period.

1.3.5 Naive Bayes Algorithm

Naive Bayes is a fantastic AI calculation utilized for prediction which depends on Bayes Theorem. Bayes Theorem expresses that given a theory H and proof E, the relationship between possibility of pre-proof likelihood P(H) and the possibility of the following theoretical evidence P (H|E) is

Assumption behind Naive Bayes classifiers is that the estimation of a unique element is not dependent on the estimation of some different element taking the class variable

Скачать книгу