Machine Learning Techniques and Analytics for Cloud Security. Group of authors

and is passed to the sigmoid function, which produces a curve popularly known as the sigmoid curve (Figure 3.1). The sigmoid function, also known as the logistic function, generates an S-shaped curve and maps any real value into the range between 0 and 1. By convention, when the output of the sigmoid function is greater than 0.5, the input is classified as 1, and when the output is lower than 0.5, it is classified as 0. In other words, if the curve moves toward the negative direction, the predicted value of y is taken as 0, and if it moves toward the positive direction, y is taken as 1.
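
The thresholding rule described above can be illustrated with a minimal Python sketch. The function names `sigmoid` and `classify` are hypothetical and chosen only for illustration. The text does not say how an output of exactly 0.5 is handled, so assigning it to class 0 here is an assumption.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real number z to a value between 0 and 1."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 when sigmoid(z) is greater than the threshold, otherwise 0.

    Assumption: an output exactly equal to the threshold is assigned to class 0.
    """
    return 1 if sigmoid(z) > threshold else 0

# Negative inputs fall below 0.5 (class 0); positive inputs rise above it (class 1).
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  class={classify(z)}")
```

Because sigmoid(0) = 0.5, thresholding the output at 0.5 is equivalent to checking the sign of the input, which matches the remark that values on the negative side are predicted as 0.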

      Making predictions with LR is quite simple: it amounts to plugging numbers into the LR equation and calculating the result. In operation, LR and linear regression make assumptions about the relationships and distributions in the dataset in the same way. Ultimately, when an ML project is built around a predictive model, prediction accuracy is generally given preference over interpretability of the result. Hence, if a model works well enough and is consistent, violating a few assumptions can be considered acceptable. Here we work with gene expression data belonging to both normal and cancerous states, and we want to identify the candidate genes whose expression level changes beyond a threshold. This group of genes is taken to be correlated with cancer, and the remaining genes are automatically excluded from the list of candidates. The whole task thus becomes a binary classification, for which LR is well suited, and it has been used in the present work.

      When learning patterns from data, higher dimensionality makes the process more complex. In ML, there are two main reasons why a greater number of features does not always work in our favor. First, we may fall victim to the “curse of dimensionality”, in which sample density decreases exponentially as dimensions are added, i.e., the feature space becomes sparser. Second, the more features we have, the more storage space and computation time we require. An excess of features can therefore hurt, because lower model quality and higher computational cost make the model hard to fit. If the data has a very large number of dimensions, we should find a way to reduce it, but in a manner that retains the significant information found in the original data. In this work, we use an algorithm that serves this particular task: Principal Component Analysis (PCA), a prominent algorithm that has been used extensively in different domains. PCA is primarily used to detect the directions of highest variance in the data and project the data onto a lower-dimensional space. This is done in such a way that the required information is retained, so that ML algorithms using the reduced data lose little accuracy.
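
The idea of keeping the highest-variance directions can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in this work. The function name `pca_reduce`, the synthetic data, and the choice of k = 2 are all hypothetical. Rows are assumed to be observations and columns features.

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int):
    """Project the rows of X (observations x features) onto the top-k principal components."""
    # Center each feature so that variance is measured around the mean.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value (i.e., decreasing variance).
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    # The variance along each direction is proportional to its squared singular value.
    variances = S ** 2 / (X.shape[0] - 1)
    ratio = variances[:k] / variances.sum()
    return Xc @ components.T, components, ratio

# Synthetic data (an assumption for illustration): 50 observations, 5 features,
# with most of the variance lying along a single latent direction plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[2.0, 1.0, 0.5, 0.0, -1.0]]) + 0.1 * rng.normal(size=(50, 5))

Z, components, ratio = pca_reduce(X, k=2)
print("reduced shape:", Z.shape)
print("explained variance ratio:", ratio)
```

The explained variance ratio gives a rough check on how much information is lost: if the first few components account for most of the variance, the reduced data is likely to support a downstream classifier almost as well as the full data.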

[Figure 3.1: Plot of the sigmoid curve.]

      3.3 Methodology

      3.3.1 Description

      The implementation of our work starts with gene expression data, which is viewed mathematically as a set G = {g1, g2, g3, …, ga}. Each member gi of G can be further expressed as gi = {gi1, gi2, gi3, …, gib}. Thus, G can be considered a vector of vectors, where each gi is a vector (gene) comprising features. More specifically, the entire dataset is represented as a matrix of dimension a × b, where a is the number of genes, b is the number of samples, and a >> b. So, the number of samples is very small compared to the number of genes, which are treated as features. Two separate sets of data, belonging to two different states, normal and carcinogenic, are used to generate results from the proposed method. The whole dataset is represented mathematically as a set G = {GN, GC}, where GN represents the normal (non-cancerous) state and GC the cancerous state. Two such datasets, viz., lung and colon, each covering non-cancerous and cancerous states, are studied to obtain the experimental result, i.e., the set of genes whose mutations have been observed.

      In the present work, PCA is applied to obtain a reduced set of variables that preserves as much of the information in the complete data as possible, and also to speed up the ML algorithm. The gene expression data is denoted G = {g1, g2, g3, …, ga}. The dataset covers two states, normal and cancerous, and is used here to determine the genes associated with cancer. The proposed algorithm works by reducing the number of features using PCA and then applying the LR model to both datasets.

      Here, we have applied LR because the dependent variable (target) is categorical. A threshold value is used to predict the class to which a data point belongs: the predicted probability is compared with this threshold for classification. If the predicted value is ≥ the threshold, the gene is said to be cancerous; otherwise, it is non-cancerous. Taking x as the independent variable and y as the dependent variable in our LR model, the hypothesis function h(x) ranges between 0 and 1. As the model works as a binary classifier, the prediction is either y = 0 or y = 1. A linear hypothesis function h(x) can take values < 0 or > 1, whereas logistic classification as used in this method requires 0 ≤ h(x) ≤ 1.
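
The threshold-based classification described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. The names `fit_logistic` and `predict`, the learning rate, the number of epochs, the default threshold of 0.5, and the toy data are all assumptions made for the example. The model is fitted by plain batch gradient descent on the log-loss.

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise; its output lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the mean log-loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # h(x) for every row of X
        w -= lr * (X.T @ (p - y)) / n   # gradient of the mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)        # gradient of the mean log-loss w.r.t. b
    return w, b

def predict(X, w, b, threshold=0.5):
    """Label a row 1 when h(x) >= threshold, otherwise 0."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Toy data (an assumption for illustration): one feature, larger values belong to class 1.
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)
labels = predict(X, w, b)
print("h(x):  ", np.round(probs, 3))
print("labels:", labels)
```

Raising or lowering `threshold` trades one kind of error for the other, so in practice its value would be chosen to suit the data rather than fixed in advance.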

      Since our Logistic Regression model requires 0 ≤ h(x) ≤ 1, the hypothesis function can be expressed as

      (3.1)