Bioinformatics and Medical Applications. Group of authors
need to mature further.
1.2.1 Comparative Analysis
Please refer to Table 1.1 for a comparative study of the methods and the strengths and weaknesses of each. This helped us immensely in designing our prototype.
1.2.2 Survey Analysis
From our analysis of the literature, we learned the scope and limitations of current prediction techniques. In recent years, the rate of heart disease has increased significantly, and it is a leading cause of death in the United States. The National Heart, Lung, and Blood Institute states that cardiovascular failure is a problem in the normal electrical circuitry of the heart and its pumping power.
Methods for data augmentation and model variability have been incorporated into the training and testing of machine learning (ML) models. The Cleveland dataset from the UCI repository is used very frequently because it is a verified dataset and is widely used for training and testing ML models. It has 303 tuples and 14 attributes, which are based on factors believed to be associated with an increased risk of cardiovascular illness. Additionally, the Kaggle heart disease dataset, containing 70,000 records and 12 patient attributes, is also used for training and evaluation.
Table 1.1 Comparative analysis of prediction techniques.
Experimental studies and applications of machine learning indicate that a given supervised learning algorithm may outperform another for a particular problem or for a specific subset of the input dataset; however, it is unusual to find a single classifier that achieves the best performance across the overall problem domain.
Ensembles of classifiers are therefore produced using techniques such as training a single algorithm on different subsets of the training dataset, training a single algorithm with different training parameters, or using multiple training algorithms. We learned about the various techniques employed in ensemble methods, such as bagging, boosting, stacking, and majority voting, and their effect on performance.
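Of the techniques named above, majority voting is the simplest to illustrate. The following is a minimal sketch, not the prototype's implementation: the three rule-based "classifiers" and their thresholds are hypothetical stand-ins for trained models, and the dictionary keys are assumed lowercase forms of the dataset's variable names.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    # most_common(1) yields [(label, count)] for the most frequent label.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical base classifiers; the thresholds are illustrative only.
def by_pressure(patient):
    return 1 if patient["ap_hi"] >= 140 else 0

def by_cholesterol(patient):
    return 1 if patient["cholesterol"] >= 2 else 0

def by_activity(patient):
    return 1 if patient["active"] == 0 else 0

def ensemble_predict(classifiers, patient):
    """Collect one vote per classifier and return the majority label."""
    return majority_vote([clf(patient) for clf in classifiers])

classifiers = [by_pressure, by_cholesterol, by_activity]
patient = {"ap_hi": 150, "cholesterol": 1, "active": 0}
prediction = ensemble_predict(classifiers, patient)  # votes are 1, 0, 1
```

Using an odd number of binary classifiers avoids tied votes.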
We also learned about the Hoeffding Tree, described as the first distributed algorithm for learning decision trees. It incorporates a novel way of partitioning decision trees using vertical parallelism. The development of effective ensemble methods is an active research field in machine learning. Classifier ensembles are generally more accurate than the individual underlying classifiers. One reason is that several learning algorithms use local optimization methods that can get stuck in local optima.
Some approaches select features by correlation, which can help produce successful predictions. Used in combination with ensemble techniques, this achieves the best results. Various combinations have been tried and tested, and none has emerged as the standard or best approach. Each technique tries to achieve better accuracy than the previous one, and the race continues.
1.3 Tools and Techniques
Machine learning and data mining use ensembles of one or more learning algorithms to obtain a diverse set of classifiers that can improve performance. Experimental studies have repeatedly shown that it is unusual for one classifier to perform best across the general problem domain. Hence, an ensemble of classifiers is often produced using any of the following methods.
• Splitting the data and using different chunks of the training data for a single machine learning algorithm.
• Training one learning algorithm using multiple training parameters.
• Using multiple learning algorithms.
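The first method in the list can be sketched as bagging: one learning algorithm is trained on several bootstrap samples of the training data, and the resulting models vote. The code below is a minimal illustration under stated assumptions, not the chapter's implementation. The one-feature threshold learner (`train_stump`), the toy systolic pressure readings, and the function names are all hypothetical.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-feature rule: predict 1 when x >= threshold.

    Picks the threshold with the fewest training errors.
    """
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in samples}):
        errors = sum((1 if x >= threshold else 0) != y for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(samples, n_models=5, seed=0):
    """Train the same learner on n_models bootstrap samples."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Draw len(samples) items with replacement.
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))
    return models

def bagging_predict(thresholds, x):
    """Majority vote over the threshold rules."""
    votes = [1 if x >= t else 0 for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

# Toy data: (systolic pressure, label). Values are made up for illustration.
data = [(110, 0), (115, 0), (120, 0), (125, 0),
        (145, 1), (150, 1), (160, 1), (170, 1)]
models = bagging_fit(data)
```

The second and third methods follow the same pattern: keep the data fixed and vary the learner's parameters, or replace `train_stump` with several different learning algorithms.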
Key ideas such as the data setup, data classification, data mining models, and techniques are described below.
1.3.1 Description of Dataset
The source of data is the Kaggle dataset for cardiovascular diseases, which contains 70,000 records of patient information. The attributes include objective information, subjective information, and results of medical examination. Table 1.2 enumerates the 12 attributes.
A heatmap is a graphical representation of data in which values are represented as colors. It is used to get a clear view of the relationships between the features. The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables, with values ranging between −1.0 and 1.0. A computed value greater than 1.0 or less than −1.0 indicates an error in the correlation calculation. Figure 1.1 shows the heatmap for the input parameters of the dataset.
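The values behind such a heatmap can be computed as follows. This is a minimal sketch that assumes the Pearson correlation coefficient, which the text does not name explicitly. The column values are made-up examples, and `pearson` and `correlation_matrix` are illustrative names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Made-up sample values for three of the dataset's attributes.
columns = {
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
    "ap_hi": [110, 140, 130, 150, 100],
    "ap_lo": [80, 90, 70, 100, 60],
}
matrix = correlation_matrix(columns)
```

Each cell of `matrix` lies between −1.0 and 1.0, up to floating-point rounding, and the diagonal is 1.0 because every column correlates perfectly with itself.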
Table 1.2 Dataset attributes.
Feature name | Variable name | Value type |
Age | Age | No. of days |
Height | Height | Centimeters |
Weight | Weight | Kilograms |
Gender | Gender | Categories |
Systolic blood pressure | Ap_hi | Integer |
Diastolic blood pressure | Ap_lo | Integer |
Cholesterol | Cholesterol | 1: Normal; 2: Above normal; 3: Well above normal |
Glucose | Glu | 1: Normal; 2: Above normal; 3: Well above normal |
Smoking | Smoke | Binary |
Alcohol intake | Alco | Binary |
Physical activity | Active | Binary |
Presence or absence of CVDs | cardio | Binary |
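One possible in-memory representation of a record with these 12 attributes is sketched below. It is an illustration only. The lowercase field names, the 0/1 coding of the two-valued attributes, the integer gender code, and the `validate` checks are assumptions rather than part of the dataset description.

```python
from dataclasses import dataclass

@dataclass
class PatientRecord:
    age: int          # age in days, as listed in Table 1.2
    height: int       # centimeters
    weight: float     # kilograms
    gender: int       # categorical code (the coding itself is an assumption)
    ap_hi: int        # systolic blood pressure
    ap_lo: int        # diastolic blood pressure
    cholesterol: int  # ordinal level 1..3
    glu: int          # ordinal level 1..3
    smoke: int        # two-valued, assumed coded 0/1
    alco: int         # two-valued, assumed coded 0/1
    active: int       # two-valued, assumed coded 0/1
    cardio: int       # target: presence (1) or absence (0) of CVD

    def age_years(self):
        """Convert the age from days to approximate years."""
        return self.age / 365.25

    def validate(self):
        """Return a list of fields whose values fall outside the expected codes."""
        problems = []
        for name in ("cholesterol", "glu"):
            if getattr(self, name) not in (1, 2, 3):
                problems.append(name)
        for name in ("smoke", "alco", "active", "cardio"):
            if getattr(self, name) not in (0, 1):
                problems.append(name)
        return problems

# A made-up record for illustration.
record = PatientRecord(age=18250, height=168, weight=62.0, gender=1,
                       ap_hi=110, ap_lo=80, cholesterol=1, glu=1,
                       smoke=0, alco=0, active=1, cardio=0)
```

Converting the age to years and checking the coded values before training makes the later correlation and classification steps easier to interpret.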