Data Mining and Machine Learning Applications. Group of authors
A review and description of the learning methods in human-computer interaction,
Implementation strategies and future research directions that can meet the design and application requirements of modern, real-time applications over the long term,
The scope and implementation of a majority of data mining and machine learning strategies, and
A discussion of real-time problems.
Most other books available on the market were published a long time ago and therefore seldom address the current needs of data mining and machine learning, which makes this book a better choice. It is our hope that this book will promote mutual understanding among researchers in different disciplines and facilitate future research development and collaboration.
We want to express our appreciation to all of the contributing authors who helped us tremendously with their contributions, time, critical thoughts, and suggestions to put together this peer-reviewed edited volume. The editors are also thankful to Scrivener Publishing and its team members for the opportunity to publish this volume. Lastly, we thank our family members for their love, support, encouragement, and patience during the entire period of this work.
Rohit Raja
Kapil Kumar Nagwanshi
Sandeep Kumar
K. Ramya Laxmi
November 2021
1
Introduction to Data Mining
Santosh R. Durugkar1, Rohit Raja2, Kapil Kumar Nagwanshi3* and Sandeep Kumar4
1 Amity University Rajasthan, Jaipur, India
2 IT Department, GGV Bilaspur Central University, Bilaspur, India
3 ASET, Amity University Rajasthan, Jaipur, India
4 Computer Science and Engineering Department, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India
Abstract
Data mining, as the word “mining” suggests, is the extraction of desired, meaningful information from datasets. Its methods and algorithms help researchers and students develop numerous applications for end-users. Its presence in the healthcare industry, marketing, scientific applications, etc., enables end-users to extract the required information from a data collection. In this chapter, we give a brief introduction to data mining. In the initial section, we discuss knowledge discovery in databases (KDD) and its phases: data cleaning, data integration, data selection and transformation, and representation. A comparative discussion of classification and clustering helps the end-user distinguish between these techniques. We also discuss the applications and algorithms of data mining. An introduction to basic clustering algorithms (K-means clustering, hierarchical clustering, fuzzy clustering, and density-based clustering) will help the end-user select a specific algorithm to suit the application. In the last section of this chapter, we introduce various data mining tools, such as Python, RapidMiner, and KNIME, that the user can employ to extract the required information.
Keywords: Data mining, KDD, clustering, classification, Python, KNIME
1.1 Introduction
1.1.1 Data Mining
‘Mining’ means extracting meaningful information from databases. This method helps researchers, students, and other IT professionals extract exactly the significant details they need and develop the desired applications [1, 2]. It is also known as knowledge discovery in databases (KDD). Applications of KDD include medicine and hospitals, marketing, educational systems, scientific applications, e-commerce, retail, biological analysis, counterterrorism, data warehousing, decision making in the energy sector, spatial data mining, and logistics [4–6].
1.2 Knowledge Discovery in Database (KDD)
KDD helps detect new, previously unknown patterns, i.e., it extracts hidden patterns and data from massive volumes of datasets [3, 6]. Figure 1.1 gives an overview of knowledge discovery in databases (KDD), which consists of the following phases:
Data cleaning: This step removes irrelevant data, i.e., unwanted data and records. The collected data may also contain missing values, which must either be removed or imputed [7].
Figure 1.1 Knowledge discovery in Database (KDD).
Data integration: Data is collected from heterogeneous sources and integrated into a common store such as a data warehouse (DW). A very common technique, Extract-Transform-Load (ETL), is beneficial in this regard. Integrating data from multiple sources requires proper synchronization between the systems [2].
Data selection & transformation: Once the required data is selected, the next task is data transformation. As the name suggests, transformation converts the selected data into the form required by the intended mining procedure [8, 9].
Pattern evaluation: Evaluation is based on defined measures; once these measures are applied, the retrieved results are strictly compared against the stored patterns [9–11].
Knowledge representation: This phase presents the processed data in the required formats, such as tables and reports. Knowledge representation can be said to generate the rules and to make accurate visualization possible [10].
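The cleaning, integration, transformation, and representation phases listed above can be illustrated with a minimal sketch in plain Python. The record layout, the field names (`id`, `age`, `spend`), and the choices of mean imputation and min-max scaling are assumptions made for illustration only; the chapter does not prescribe them, and pattern evaluation is not covered here.

```python
from statistics import mean

# Hypothetical records from two sources; None marks a missing value.
store_a = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 51, "spend": None},
]
store_b = [
    {"id": 4, "age": 29, "spend": 200.0},
    {"id": 5, "age": 42, "spend": 40.0},
]

def integrate(*sources):
    """Data integration: combine records from several sources into one list."""
    return [dict(record) for source in sources for record in source]

def clean(records, field_to_impute, required_field):
    """Data cleaning: drop records missing required_field,
    then fill gaps in field_to_impute with the mean of the known values."""
    kept = [r for r in records if r[required_field] is not None]
    known = [r[field_to_impute] for r in kept if r[field_to_impute] is not None]
    fill = mean(known)
    for r in kept:
        if r[field_to_impute] is None:
            r[field_to_impute] = fill
    return kept

def transform(records, field):
    """Transformation: min-max scale one field into the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    for r in records:
        r[field + "_scaled"] = (r[field] - lo) / span
    return records

def represent(records, columns):
    """Knowledge representation: render the records as a plain-text table."""
    lines = ["\t".join(columns)]
    for r in records:
        cells = (f"{r[c]:.2f}" if isinstance(r[c], float) else str(r[c])
                 for c in columns)
        lines.append("\t".join(cells))
    return "\n".join(lines)

merged = integrate(store_a, store_b)
cleaned = clean(merged, field_to_impute="age", required_field="spend")
prepared = transform(cleaned, "spend")
print(represent(prepared, ["id", "age", "spend_scaled"]))
```

In this sketch, the record with no `spend` value is removed while the missing `age` is imputed, which mirrors the two options for missing values mentioned under data cleaning.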
1.2.1 Importance of Data Mining
◦ Useful in predictive analysis.
◦ Storing and managing data in multidimensional systems.
◦ Identifying hidden patterns.
◦ Knowledge representation in desired formats, etc. [11].
1.2.2 Applications of Data Mining
Fraud Detection
◦ Data mining identifies patterns, i.e., user-specific patterns, and builds a model based on valid and invalid states. Using data mining techniques, one can classify records as fraudulent or non-fraudulent [14].
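As a rough illustration of classifying records into valid and fraudulent states, the sketch below uses a nearest-centroid classifier. This technique is chosen here only for brevity and is not the method of the cited work; the transactions, the two features (amount and hour of day), and the labels are all hypothetical.

```python
from math import dist

# Hypothetical labeled transactions: (amount, hour of day) and a known state.
labeled = [
    ((25.0, 14), "valid"),
    ((40.0, 10), "valid"),
    ((32.0, 18), "valid"),
    ((900.0, 3), "fraud"),
    ((1200.0, 2), "fraud"),
]

def fit_centroids(samples):
    """Build the model: average the feature vectors of each class."""
    groups = {}
    for features, label in samples:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }

def classify(features, centroids):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

centroids = fit_centroids(labeled)
print(classify((30.0, 12), centroids))
print(classify((1000.0, 4), centroids))
```

Because the features are not rescaled, the amount dominates the distance in this sketch; a practical system would normally scale the features and use far more data.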
Marketing Analysis
◦ It is based on association mining, i.e., identifying users’ preferences. With such techniques, one can identify the purchasing habits of users and compare different items, their pricing, etc. [13].
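The idea behind association mining can be sketched with the two standard measures, support and confidence, computed over a handful of shopping baskets. The baskets and item names below are hypothetical, and the sketch only counts item pairs rather than implementing a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def pair_support(transactions):
    """Support of each item pair: fraction of baskets containing both items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    among baskets with the antecedent, the share that also has the consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "milk")])
print(confidence(baskets, "butter", "bread"))
```

Pairs are stored as alphabetically sorted tuples, so a lookup such as `support[("bread", "milk")]` must list the items in that order.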
Customer Relationship Management
◦ Every organization keenly observes and maintains this segment, popularly known as CRM. In this segment, one can distinguish users/customers based on their loyalty towards the organization. Customer data can be collected and analyzed to get the desired results [13].
Banking and Finance
◦ The banking and finance sector holds huge volumes of client data. Banking and financial software systems help managers identify the correct client segment and loyal clients. These software systems process ‘n’ transactions, a volume that a person cannot handle manually. Such software systems store and process a large volume of data and produce the desired results in less time [13].
Healthcare Industries
◦ Everyone is concerned about health. Different parameters and values help health care professionals diagnose disease. Data on the number of patients, diseases, and symptoms can be processed to get an accurate prediction. Software systems used in the health care industry