Artificial Intelligence and Quantum Computing for Advanced Wireless Networks. Savo G. Glisic

Pruning: One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples. A small tree might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, and then use pruning to remove nodes that do not provide additional information.

      Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross‐validation set. There are many techniques for tree pruning that differ in the measurement that is used to optimize performance. Pruning can occur in a top‐down or bottom‐up fashion. A top‐down pruning will traverse nodes and trim subtrees starting at the root, while a bottom‐up pruning will start at the leaf nodes. We now describe two popular pruning algorithms.

       Reduced error pruning: One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, then the change is retained. Although somewhat naive, reduced error pruning has the advantage of simplicity and speed.
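
       As an illustration, a minimal sketch of reduced error pruning in Python is given below; the dict-based node format, the toy tree, and the toy validation data are assumptions introduced only for this example, not part of the algorithms described in this book.

def predict(node, x):
    # Walk down the tree until a leaf (a bare class label) is reached.
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def reduced_error_prune(node, X_val, y_val):
    """Bottom-up: replace a subtree by its majority-class leaf whenever that
    does not increase the error on the validation samples reaching the node."""
    if not isinstance(node, dict) or not X_val:
        return node
    go_left = [x[node["feature"]] <= node["threshold"] for x in X_val]
    Xl = [x for x, g in zip(X_val, go_left) if g]
    yl = [t for t, g in zip(y_val, go_left) if g]
    Xr = [x for x, g in zip(X_val, go_left) if not g]
    yr = [t for t, g in zip(y_val, go_left) if not g]
    node["left"] = reduced_error_prune(node["left"], Xl, yl)
    node["right"] = reduced_error_prune(node["right"], Xr, yr)
    subtree_errors = sum(predict(node, x) != t for x, t in zip(X_val, y_val))
    leaf_errors = sum(t != node["majority"] for t in y_val)
    return node["majority"] if leaf_errors <= subtree_errors else node

# Toy usage: the right-hand subtree does not help on the validation data,
# so it is replaced by its majority-class leaf.
tree = {"feature": 0, "threshold": 0.5, "majority": 0,
        "left": 0,
        "right": {"feature": 1, "threshold": 0.5, "majority": 1, "left": 1, "right": 0}}
X_val, y_val = [[0.2, 0.1], [0.8, 0.3], [0.9, 0.7]], [0, 1, 1]
pruned = reduced_error_prune(tree, X_val, y_val)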

       Cost‐complexity pruning: Cost‐complexity pruning generates a series of trees T0, T1, …, Tm, where T0 is the initial tree and Tm is the root alone. At step i, tree Ti is created by removing a subtree from tree Ti−1 and replacing it with a leaf node whose value is chosen as in the tree‐building algorithm. The subtree to be removed is chosen as follows: (a) Define the error rate of tree T over dataset S as e(T, S). (b) The subtree chosen for removal is the one that minimizes

       [e(prune(T, t), S) − e(T, S)] / [|leaves(T)| − |leaves(prune(T, t))|]

       The function prune(T, t) denotes the tree obtained by pruning the subtree t from the tree T, and leaves(T) denotes the set of leaf nodes of T. Once the series of trees has been created, the best tree is chosen by generalized accuracy as measured by a training set or cross‐validation.
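
      For a concrete counterpart, scikit-learn implements the closely related minimal cost-complexity pruning of Breiman et al. through the ccp_alpha parameter; the sketch below is only an illustration, and the toy dataset and held-out split are assumptions added to make it runnable.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data, only to make the sketch self-contained.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The effective alphas along the pruning path index the series of trees T0, ..., Tm.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one with the best held-out accuracy.
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), best.score(X_val, y_val))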

      The choice of strategic splits heavily affects a tree's accuracy, and the splitting criterion differs between classification and regression trees. The criteria most commonly used in practice are Gini, Chi‐Square, and Reduction in Variance; for details, see Section 2.2.2.

      Constraints on tree‐based algorithms can be set through the various parameters that define a tree, such as the following (a scikit-learn sketch follows the list):

      Minimum samples for a node split defines the minimum number of samples (or observations) that are required in a node to be considered for splitting. It is used to control overfitting. Higher values prevent a model from learning relations that might be highly specific to the particular sample selected for a tree. Excessively high values can lead to underfitting.

      Minimum samples for a terminal node (leaf) defines the minimum number of samples (or observations) required in a terminal node or leaf. It is used to control overfitting in a way similar to min_samples_split [51]. In general, lower values should be chosen for imbalanced class problems, because the regions in which the minority class forms the majority will be very small.

      Maximum depth of tree (vertical depth) is used to control overfitting as a higher depth will allow the model to learn relations that are very specific to a particular sample.

      Maximum number of terminal nodes can be defined in place of max_depth [52]. Since binary trees are created, a depth of n produces a maximum of 2^n leaves.

      Maximum features to consider for split is the number of features to consider while searching for the best split. These will be randomly selected. As a rule of thumb, the square root of the total number of features works well, but up to 30–40% of the total number of features should be checked. Higher values can lead to overfitting, but this is case specific.
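
      These constraints map onto keyword arguments of scikit-learn's DecisionTreeClassifier; the values in the sketch below are hypothetical and would normally be tuned, for example by cross-validation, and the toy data is included only to make the snippet runnable.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=12, random_state=0)  # toy data

model = DecisionTreeClassifier(
    min_samples_split=20,   # minimum samples required to split a node
    min_samples_leaf=5,     # minimum samples required in a terminal node (leaf)
    max_depth=8,            # maximum (vertical) depth of the tree
    max_leaf_nodes=64,      # maximum number of terminal nodes, usable in place of max_depth
    max_features="sqrt",    # features considered per split; a fraction such as 0.3 also works
).fit(X, y)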

      2.1.4 Trees in R and Python

      There are multiple packages available in R to implement decision trees, such as ctree, rpart, and tree. Here is an example using rpart:

> library(rpart)
> x <- cbind(x_train, y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x, method = "class")
> summary(fit)
# Predict output
> predicted <- predict(fit, x_test)

      In the code above, y_train and x_train represent dependent and independent variables, respectively, and x represents training data. Similarly, in Python we have the following:

# Import library
from sklearn import tree
# Import other necessary libraries like pandas, numpy...
# Assumed you have X (predictor) and y (target) for the training dataset
# and x_test (predictor) for the test dataset
# Create tree object; for classification the split criterion can be set to
# 'gini' or 'entropy' (information gain); the default is 'gini'
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

      2.1.5 Bagging and Random Forest

      Bagging is a technique used to reduce the variance of our predictions by combining the results of multiple classifiers modeled on different subsamples of the same dataset. The steps followed in bagging are as follows (see the sketch after the list):

       Form multiple datasets: Sampling is done with replacement on the original data, and new datasets are formed. These new datasets can have a fraction of the columns as well as rows, which are generally hyperparameters in a bagging model. Taking row and column fractions less than one helps in making robust models that are less prone to overfitting.

       Develop multiple classifiers: Classifiers are built on each dataset. In general, the same classifier is modeled on each dataset, and predictions are made.

       Integrate classifiers: The predictions of all the classifiers are combined using a mean, median, or mode value, depending on the problem at hand. The combined values are generally more robust than those from a single model. It can be shown theoretically that, when the individual predictions are uncorrelated, the variance of the combined predictions is reduced to 1/n (n: number of classifiers) of the original variance.
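
       A minimal scikit-learn sketch of these three steps is given below; the toy data and parameter values are illustrative assumptions. BaggingClassifier resamples rows and columns for each base learner (a decision tree by default) and aggregates their predictions at prediction time.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # toy data

# 25 bagged trees, each trained on 80% of the rows (drawn with replacement)
# and 70% of the columns; their predictions are aggregated when predicting.
bag = BaggingClassifier(
    n_estimators=25,
    max_samples=0.8,     # row fraction per base learner
    max_features=0.7,    # column fraction per base learner
    bootstrap=True,      # sample rows with replacement
    random_state=0,
).fit(X, y)
print(bag.predict(X[:5]))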

       There are various implementations of bagging models. Random forest is one of them, and we will discuss it next.

      In random forest, we grow multiple trees as opposed to a single tree. To classify a new object based on attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest), and in case of regression, it takes the average of outputs from different trees.

      The R package randomForest provides a simple implementation. Here is an example:

> library(randomForest)
> x <- cbind(x_train, y_train)
# Fitting model
> fit <- randomForest(y_train ~ ., data = x, ntree = 500)
> summary(fit)
# Predict output
> predicted <- predict(fit, x_test)
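
      A rough Python counterpart is sketched below, assuming the same X (predictors), y (target), and x_test arrays as in the Python example of Section 2.1.4.

from sklearn.ensemble import RandomForestClassifier

# Grow 500 trees; each split considers a random subset of the features,
# and the forest aggregates the trees' votes (or averages, for regression).
model = RandomForestClassifier(n_estimators=500)
model.fit(X, y)
predicted = model.predict(x_test)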

      2.1.6 Boosting GBM and XGBoost

      By definition, “boosting” refers to a family of algorithms that convert weak learners into strong learners. To convert weak learners into a strong learner, we combine the prediction of each weak learner using methods such as an average, a weighted average, or a majority vote.
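
      As a preview of the implementations discussed in this section, a minimal gradient boosting sketch in scikit-learn is given below; the parameter values are illustrative only, and the same X, y, and x_test arrays as before are assumed. XGBoost exposes an analogous scikit-learn-style estimator (XGBClassifier).

from sklearn.ensemble import GradientBoostingClassifier
# from xgboost import XGBClassifier  # XGBoost offers an analogous estimator

# Fit 100 shallow trees sequentially, each one correcting its predecessors' errors.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X, y)
predicted = gbm.predict(x_test)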

