Data Science For Dummies. Lillian Pierson
In-memory refers to processing data within the computer’s memory, without actually reading and writing its computational results onto the disk. In-memory computing provides results a lot faster but cannot process much data per processing interval.
Apache Spark is an in-memory computing application that you can use to query, explore, analyze, and even run machine learning algorithms on incoming streaming data in near-real-time. Its power lies in its processing speed: The ability to process and make predictions from streaming big data sources in three seconds flat is no laughing matter.
Part 2
Using Data Science to Extract Meaning from Your Data
IN THIS PART …
Master the basics behind machine learning approaches.
Explore the importance of math and statistics for data science.
Work with clustering and instance-based learning algorithms.
Chapter 3
Machine Learning Means … Using a Machine to Learn from Data
IN THIS CHAPTER
If you’ve been watching any news for the past decade, you’ve no doubt heard of a concept called machine learning — often referenced when reporters are covering stories on the newest amazing invention from artificial intelligence. In this chapter, you dip your toes into the area called machine learning, and in Part 3 you see how machine learning and data science are used to increase business profits.
Defining Machine Learning and Its Processes
Machine learning is the practice of applying algorithmic models to data over and over again so that your computer discovers hidden patterns or trends that you can use to make predictions. It’s also called algorithmic learning. Machine learning has a vast and ever-expanding assortment of use cases, including
Real-time Internet advertising
Internet marketing personalization
Internet search
Spam filtering
Recommendation engines
Natural language processing and sentiment analysis
Automatic facial recognition
Customer churn prediction
Credit score modeling
Survival analysis for mechanical equipment
Walking through the steps of the machine learning process
Three main steps are involved in machine learning: setup, learning, and application. Setup involves acquiring data, preprocessing it, selecting the most appropriate variables for the task at hand (called feature selection), and breaking the data into training and test datasets. You use the training data to train the model, and the test data to test the accuracy of the model’s predictions. The learning step involves model experimentation, training, building, and testing. The application step involves model deployment and prediction.
FIGURE 3-1: An example of a simple random sample
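To make the three steps concrete, here is a minimal sketch in plain Python, not a production workflow. It assumes a made-up dataset that follows y = 2x + 1, a 25 percent test fraction, and a least-squares line as the model. The function names (train_test_split, fit_line, mean_absolute_error) are hypothetical helpers written for this illustration, not part of any library.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Setup step: split rows into training and test sets by simple random sampling."""
    shuffled = rows[:]                     # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Learning step: least-squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(model, rows):
    """Average absolute gap between predicted and actual target values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)

# Setup: acquire (here, invent) data and split it.
data = [(x, 2 * x + 1) for x in range(20)]  # assumed toy data
train, test = train_test_split(data)

# Learning: train on the training set, then test on the held-out set.
model = fit_line(train)
test_error = mean_absolute_error(model, test)

# Application: use the trained model to predict for a new input.
prediction = model[0] * 25 + model[1]
```

The model never sees the test rows during training, which is why the error measured on them is a fair check of how well its predictions hold up on new data.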
Becoming familiar with machine learning terms
Before diving too deeply into a discussion of machine learning methods, you need to know about the (sometimes confusing) vocabulary associated with the field. Because machine learning is an offshoot of both traditional statistics and computer science, it has adopted terms from both fields and added a few of its own. Here is what you need to know:
Instance: The same as a row (in a data table), an observation (in statistics), and a data point. Machine learning practitioners are also known to call an instance a case.
Feature: The same as a column or field (in a data table) and a variable (in statistics). In regression methods, a feature is also called an independent variable (IV).
Target variable: The same as a predictand or dependent variable (DV) in statistics.
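The three terms above map onto a data table as in the following small sketch. The customer churn table, its column names, and its values are all invented for illustration.

```python
# Hypothetical customer churn table; every name and value here is made up.
rows = [
    {"monthly_fee": 20.0, "support_calls": 1, "churned": 0},  # one instance (row)
    {"monthly_fee": 55.0, "support_calls": 4, "churned": 1},
    {"monthly_fee": 30.0, "support_calls": 0, "churned": 0},
]

feature_names = ["monthly_fee", "support_calls"]  # features (columns, variables)
target_name = "churned"                           # target variable (dependent variable)

# Separate the features from the target, one entry per instance.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]
```

Each inner list in X holds the feature values for one instance, and the matching entry in y holds that instance's target value.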