Probability with R. Jane M. Horgan

which yields Fig. 3.13.

The decimal point is 1 digit(s) to the right of the |

  1 | 2344
  1 | 59
  2 | 11
  2 | 5556777889999
  3 | 0113
  3 | 6
  4 | 00000000
  4 | 6779
  5 | 12223344
  5 | 56679
  6 | 0011123444
  6 | 566777888999
  7 | 0112344
  7 | 5666666899
  8 | 001112222334
  8 | 5678899
  9 | 0122
  9 | 7778

Figure 3.13 A Stem and Leaf Diagram

From Fig. 3.13, we are able to see the individual observations, as well as the shape of the data as a whole. Notice that there are many marks of exactly 40, whereas just one student obtains a mark between 35 and 40. One wonders if this has anything to do with the fact that 40 is a pass, and that the examiner has been generous to borderline students. This point would go unnoticed with a histogram.
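As a minimal sketch of this point, the following compares what stem and hist reveal for a small set of invented marks. The vector marks and its values are assumptions made purely for illustration; they are not the data behind Fig. 3.13.

```r
# Hypothetical marks, invented for illustration (not the book's data)
marks <- c(12, 25, 27, 31, 36, 40, 40, 40, 40, 46,
           52, 58, 63, 67, 71, 78, 85, 92)

# Every observation contributes its own leaf, so repeated marks stay visible
stem(marks)

# A histogram only reports how many marks fall in each interval
counts <- hist(marks, breaks = seq(0, 100, by = 10), plot = FALSE)$counts
print(counts)

# Number of marks that are exactly 40
print(sum(marks == 40))
```

Here the histogram counts show only that six marks lie in the interval (30, 40]; they cannot show that four of those six are exactly 40, which the individual leaves do.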

3.4 SCATTER PLOTS

Plots of data are useful to investigate relationships between variables. To examine, for example, the relationship between the performance of students in Programming in Semesters 1 and 2, we could write

plot(prog1, prog2, xlab = "Programming Semester 1", ylab = "Programming Semester 2")

to obtain Fig. 3.14.

Figure 3.14 A Scatter Plot

When more than two variables are involved, R provides a facility for producing scatter plots of all possible pairs.

To do this, first create a data frame of all the variables that you want to compare.

courses <- results[2:5]

This creates a data frame courses containing the second to the fifth variables in results, that is, arch1, prog1, arch2, and prog2. Writing

pairs(courses)

or equivalently

pairs(results[2:5])

will generate Fig. 3.15, which, as you can see, gives scatter plots for all possible pairs.

Figure 3.15 Use of the pairs Function
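The steps above can be tried without the original data file. The sketch below builds a stand-in results data frame from random numbers; the column names, the number of rows, and every value in it are assumptions for illustration, not the book's data.

```r
set.seed(1)
n <- 30

# Hypothetical stand-in for the results data frame: one grouping
# column followed by four columns of marks (all values simulated)
results <- data.frame(
  gender = sample(c("f", "m"), n, replace = TRUE),
  arch1  = round(runif(n, 20, 95)),
  prog1  = round(runif(n, 20, 95)),
  arch2  = round(runif(n, 20, 95)),
  prog2  = round(runif(n, 20, 95))
)

# Keep the second to the fifth variables only
courses <- results[2:5]
print(names(courses))

# One scatter plot for each pair of the four variables
pairs(courses)
```

Indexing a data frame with a single vector, as in results[2:5], selects columns, so it should give the same data frame as results[, 2:5].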

3.5 THE LINE OF BEST FIT

Returning to Fig. 3.14, we can see that there is a linear trend in these data. One variable increases with the other; not surprisingly, students doing well in Programming in Semester 1 are likely to do well also in Programming in Semester 2, and those doing badly in Semester 1 will tend to do badly in Semester 2. We might ask whether it is possible to estimate the Semester 2 results from those obtained in Semester 1.

In the case of the Programming subjects, we have a set of points (x_i, y_i), and having established, from the scatter plot, that a linear trend exists, we attempt to find the line that best fits the data. In R

lm(prog2 ~ prog1)

calculates what is referred to as the linear model (lm) of prog2 on prog1, or simply the line

prog2 = a + b × prog1

that best fits the data, where a is the intercept and b is the slope.

The output is

Call:
lm(formula = prog2 ~ prog1)

Coefficients:
(Intercept)        prog1
     -5.455        0.960

Therefore, the line that best fits these data is

prog2 = -5.455 + 0.960 × prog1

To draw this line on the scatter diagram, write

plot(prog1, prog2)
abline(lm(prog2 ~ prog1))

which gives Fig. 3.16.

Figure 3.16 The Line of Best Fit
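The following self-contained sketch repeats the fit and the plot on simulated marks. The data are invented for illustration, so the coefficients will differ from those reported above. It also checks the fitted coefficients against the usual least squares formulas, since lm fits the line by ordinary least squares.

```r
set.seed(42)

# Simulated marks, invented for illustration (not the book's data)
prog1 <- round(runif(40, 20, 95))
prog2 <- -5 + 0.95 * prog1 + rnorm(40, sd = 8)

# Linear model of prog2 on prog1
fit <- lm(prog2 ~ prog1)
print(coef(fit))   # named vector: (Intercept) and prog1 (the slope)

# Least squares formulas for the slope and the intercept
slope     <- cov(prog1, prog2) / var(prog1)
intercept <- mean(prog2) - slope * mean(prog1)
print(c(intercept, slope))

# Scatter plot with the explanatory variable on the horizontal axis,
# then the fitted line drawn on top of it
plot(prog1, prog2,
     xlab = "Programming Semester 1", ylab = "Programming Semester 2")
abline(fit)
```

Note that the order of the arguments in plot must match the model: the variable on the right of the formula goes on the horizontal axis.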

The line of best fit may be used to make predictions. For example, we might be able to predict how students will do in Semester 2 from the results that they obtained in Semester 1. If the mark on Programming 1 for a particular student is 70, that student would be expected to do well also in Programming 2, with an estimated mark of -5.455 + 0.960 × 70 ≈ 61.7. A student doing badly in Programming 1, with a mark of 30 say, would also be expected to do badly in Programming 2, with an estimated mark of -5.455 + 0.960 × 30 ≈ 23.3. These predictions may not be exact but, if the linear trend is strong and past trends continue, they will be reasonably close.
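The arithmetic behind these two predictions can be sketched directly from the coefficients reported in the lm output. The function name predict_prog2 is hypothetical and introduced only for this illustration.

```r
# Coefficients as reported in the lm output above
intercept <- -5.455
slope     <- 0.960

# Hypothetical helper: estimated Semester 2 mark for a Semester 1 mark
predict_prog2 <- function(sem1_mark) {
  intercept + slope * sem1_mark
}

# Estimates for Semester 1 marks of 70 and 30
estimates <- predict_prog2(c(70, 30))
print(estimates)
```

When the fitted model object itself is available, predict(fit, newdata = data.frame(prog1 = c(70, 30))) performs the same calculation with the unrounded coefficients, assuming the fit was created by lm(prog2 ~ prog1).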

A word of warning is appropriate here. The estimated values are based on the assumption that the past trend continues. This may not always be the case. For example, students who do badly in Semester 1 may get such a shock that they work harder in Semester 2, and change the pattern. Similarly, students getting high marks in Semester 1 may be lulled into a false sense of security and take it easy in Semester 2. Consequently, they may not do as well as expected. Hence, the Semester 1 trends may not continue, and the model may no longer be valid.

3.6 MACHINE LEARNING AND THE LINE OF BEST FIT

Machine learning is the science of getting computer systems to use algorithms and statistical models to study patterns and learn from data. Supervised learning is the machine learning task of using past data to learn a function in order to predict a future output.
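A rough sketch of this idea, on simulated data, is to learn a line from one set of "past" observations and then use it to predict outputs for observations that were not used in the fitting. The data, the split into 40 past and 20 new cases, and the use of mean absolute error as a summary are all assumptions made for illustration.

```r
set.seed(7)
n <- 60

# Simulated input x and output y, invented for illustration
x <- runif(n, 20, 95)
y <- -5 + 0.95 * x + rnorm(n, sd = 8)
marks <- data.frame(x = x, y = y)

# "Past" data used for learning, and "new" data held back
past <- marks[1:40, ]
new  <- marks[41:60, ]

# Learn the function (here, a straight line) from the past data only
fit <- lm(y ~ x, data = past)

# Predict the outputs for the new inputs and compare with what occurred
pred <- predict(fit, newdata = new)
mae  <- mean(abs(pred - new$y))
print(mae)
```

The size of the error on the held-back cases gives an indication of how well the learned line might perform on future data, provided the past trend continues.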

The line of best fit is one of the many techniques that machine learning has borrowed from the field of Probability and Statistics to “train” the machine to make predictions. In this case, also known in statistics as simple linear regression, a set of pairs (x_i, y_i) of data is obtained; x is referred to as the independent variable, and y is the dependent variable. The objective is to estimate