Data Science in Theory and Practice. Maria Cristina Mariani




      In recent times, the term “big data” (both stored and real‐time) tends to refer to the use of user behavior analytics (UBA), predictive analytics, or certain other advanced data analytics methods that extract value from data. UBA solutions look at patterns of human behavior and then apply algorithms and statistical analysis to detect meaningful anomalies in those patterns, that is, anomalies that indicate potential threats. Examples include the detection of hackers, insider threats, targeted attacks, and financial fraud.
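
      As a minimal sketch of the idea of detecting anomalies in a behavior pattern, the following Python example flags observations whose z‐score exceeds a threshold. The function name find_anomalies, the login counts, and the threshold of 2.5 are assumptions made for illustration only; actual UBA solutions rely on far richer models.

```python
from statistics import mean, stdev


def find_anomalies(counts, threshold=2.5):
    """Return the indices of observations whose absolute z-score exceeds threshold."""
    mu = mean(counts)
    sigma = stdev(counts)  # sample standard deviation
    if sigma == 0:
        return []  # all observations are identical, so nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Hypothetical daily login counts for one user; the last day is unusual.
daily_logins = [12, 15, 11, 14, 13, 12, 16, 14, 13, 95]
anomalous_days = find_anomalies(daily_logins)
print(anomalous_days)
```

      Note that a single extreme value inflates the standard deviation itself, which is why a modest threshold is assumed here for such a short series.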

      Predictive analytics deals with the process of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Generally, predictive analytics does not tell you what will happen in the future. Rather, it forecasts what might happen in the future with some degree of certainty. Predictive analytics goes hand in hand with big data: businesses and organizations collect large amounts of real‐time customer data, and predictive analytics uses this historical data, combined with customer insight, to forecast future events. Predictive analytics helps organizations use big data to move from a historical view to a forward‐looking perspective of the customer. In this book, we will discuss several methods for analyzing big data.
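
      To make the notion of forecasting from historical data concrete, the sketch below fits a least‐squares line to a short series and extrapolates one step ahead. The monthly sales figures and the helper names fit_line and residual_sd are hypothetical, and the residual standard deviation is reported only as a rough indication of uncertainty, not as a formal prediction interval.

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def residual_sd(xs, ys, slope, intercept):
    """Residual standard deviation with n - 2 degrees of freedom."""
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - 2))


# Hypothetical historical data: sales for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
sales = [102, 108, 123, 128, 141, 148]

slope, intercept = fit_line(months, sales)
forecast = intercept + slope * 7  # extrapolate to month 7
spread = residual_sd(months, sales, slope, intercept)
print(f"forecast for month 7: {forecast:.1f} (residual sd {spread:.1f})")
```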

      1.4.1 Characteristics of Big Data

      1.4.2 Big Data Architectures

      Big data architectures are designed to handle the ingestion, processing, and analysis of data that is too large or complex for classical data‐processing application tools. Some popular big data architectures are the Lambda architecture, the Kappa architecture, and the Internet of Things (IoT) architecture. We refer the reader to the Microsoft technical documentation on big data architectures for a detailed discussion of the different architectures. Almost all big data architectures include all or some of the following components:

       Data sources: All big data solutions begin with one or more data sources. Some common data sources include the following: application data stores such as relational databases, static files produced by applications such as web server log files, and real‐time data sources such as Internet of Things (IoT) devices.

       Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. A data lake is a storage repository that allows one to store structured and unstructured data at any scale until it is needed.

       Batch processing: Since data sets are enormous, a big data solution must often process data files using long‐running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Normally, these jobs involve reading source files, processing them, and writing the output to new files. Options include running U‐SQL jobs or using Java, Scala, R, or Python programs. U‐SQL is a data processing language that merges the benefits of SQL with the expressive power of one's own code.

       Real‐time message ingestion: If the solution includes real‐time sources, the architecture must include a way to capture and store real‐time messages for stream processing. This might be a simple data store, where incoming messages are stored into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages and to support scale‐out processing, reliable delivery, and other message queuing semantics.

       Stream processing: After obtaining real‐time messages, the solution must process them by filtering, aggregating, and preparing the data for analysis. The processed stream data is then written to an output sink.

       Analytical data store: Several big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball‐style relational data warehouse, as observed in most classical business intelligence (BI) solutions. Alternatively, the data could be presented through a low‐latency NoSQL technology, such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store.

       Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. Users can analyze the data using mathematical and statistical models as well as data visualization techniques. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts.

       Orchestration: Several big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or move the results to a report or dashboard.
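
      The way these components fit together can be sketched with a toy, in‐memory pipeline. The example below is an illustration under simplifying assumptions, not a real big data implementation: a list of hypothetical web server log lines stands in for the data source, a dictionary stands in for the analytical data store, and the function names ingest, batch_process, load, and run_pipeline are invented for this sketch.

```python
from collections import Counter

# Hypothetical web server log lines acting as the data source.
RAW_LOG_LINES = [
    "2024-01-01T10:00:00 GET /home 200",
    "2024-01-01T10:00:05 GET /missing 404",
    "2024-01-01T10:01:00 POST /login 200",
    "malformed line",
    "2024-01-01T10:02:00 GET /home 200",
]


def ingest(lines):
    """Parse raw lines into records, skipping lines without four fields."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        timestamp, method, path, status = parts
        yield {"timestamp": timestamp, "method": method,
               "path": path, "status": int(status)}


def batch_process(records):
    """Filter successful requests and aggregate the number of hits per path."""
    return Counter(r["path"] for r in records if r["status"] == 200)


def load(store, hits):
    """Write the aggregated result into the (toy) analytical data store."""
    store["hits_per_path"] = dict(hits)


def run_pipeline(lines):
    """Orchestration: run ingestion, batch processing, and loading in order."""
    store = {}
    load(store, batch_process(ingest(lines)))
    return store


analytical_store = run_pipeline(RAW_LOG_LINES)
print(analytical_store)
```

      In a production setting each step would be replaced by a distributed counterpart, for example a data lake for storage and a scheduled workflow for orchestration.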

      2.1 Introduction

      The matrix algebra and random vectors presented in this chapter will enable us to state statistical models precisely. We begin by discussing some basic concepts that will be essential throughout this chapter. For more details on matrix algebra, please consult Axler (2015).

      

      2.2.1 Vectors

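      Definition 2.1 (Vector) A vector $\mathbf{x}$ is an array of real numbers $x_1, x_2, \ldots, x_n$, and it is written as:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$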
