Probability with R. Jane M. Horgan
which yields Fig. 3.13.
The decimal point is 1 digit(s) to the right of the |

  1 | 2344
  1 | 59
  2 | 11
  2 | 5556777889999
  3 | 0113
  3 | 6
  4 | 00000000
  4 | 6779
  5 | 12223344
  5 | 56679
  6 | 0011123444
  6 | 566777888999
  7 | 0112344
  7 | 5666666899
  8 | 001112222334
  8 | 5678899
  9 | 0122
  9 | 7778
Figure 3.13 A Stem and Leaf Diagram
From Fig. 3.13, we are able to see the individual observations, as well as the shape of the data as a whole. Notice that there are many marks of exactly 40, whereas just one student obtains a mark between 35 and 40. One wonders if this has anything to do with the fact that 40 is a pass, and that the examiner has been generous to borderline students. This point would go unnoticed with a histogram.
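As a minimal sketch of how such a display is produced and read, the code below applies stem() to a small vector of marks. The vector marks is invented for illustration and is not the data behind Fig. 3.13; the names at_pass and just_below are likewise hypothetical.

```r
# Hypothetical marks, invented for illustration only
marks <- c(12, 25, 36, 40, 40, 40, 40, 46, 52, 58, 61, 67, 73, 78, 85, 91)

# Stem and leaf display; scale controls the length of the display
stem(marks, scale = 2)

# Count the marks of exactly 40 and those from 35 up to, but not including, 40
at_pass <- sum(marks == 40)
just_below <- sum(marks >= 35 & marks < 40)
at_pass
just_below
```

Because each leaf is an individual observation, a cluster at exactly 40 shows up as a run of zeros on the 4 stem, which a histogram bin covering 40 to 50 would hide.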
3.4 SCATTER PLOTS
Plots of data are useful to investigate relationships between variables. To examine, for example, the relationship between the performance of students in Programming in Semesters 1 and 2, we could write
plot(prog1, prog2, xlab = "Programming Semester 1", ylab = "Programming Semester 2")
to obtain Fig. 3.14.
Figure 3.14 A Scatter Plot
When more than two variables are involved, R provides a facility for producing scatter plots of all possible pairs.
To do this, first create a data frame of all the variables that you want to compare.
courses <- results[2:5]
This creates a data frame, courses, containing columns 2 to 5 of results. Writing
pairs(courses)
or equivalently
pairs(results[2:5])
will generate Fig. 3.15, which, as you can see, gives scatter plots for all possible pairs.
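The following self-contained sketch shows the same steps on a small made-up data frame. The column names other than prog1 and prog2, and all of the values, are assumptions made for illustration; the real results data frame may be laid out differently.

```r
# Hypothetical results data frame; column names and values are assumptions
results <- data.frame(
  gender = c("f", "m", "m", "f", "m", "f"),
  arch1  = c(53, 60, 47, 72, 81, 39),
  prog1  = c(35, 48, 56, 64, 72, 88),
  arch2  = c(50, 58, 45, 70, 77, 42),
  prog2  = c(30, 40, 51, 55, 66, 79)
)

# Keep columns 2 to 5, which drops the non-numeric first column
courses <- results[2:5]
names(courses)

# One scatter plot for every pair of the four variables
pairs(courses)
```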
Figure 3.15 Use of the pairs Function
3.5 THE LINE OF BEST FIT
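Returning to Fig. 3.14, we can see that there is a roughly linear trend in the data.
In the case of the Programming subjects, we have a set of points $(x_i, y_i)$, where $x_i$ is a student's mark in Programming 1 (prog1) and $y_i$ is the same student's mark in Programming 2 (prog2). In R,
lm(prog2 ~ prog1)
calculates what is referred to as the linear model (lm) of prog2 on prog1, that is, the line $\text{prog2} = a + b \times \text{prog1}$
that best fits the data.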
The output is
Call:
lm(formula = prog2 ~ prog1)

Coefficients:
(Intercept)        prog1
     -5.455        0.960
Therefore, the line that best fits these data is
$$\text{prog2} = -5.455 + 0.96 \times \text{prog1}$$
To draw this line on the scatter diagram, write
plot(prog1, prog2)
abline(lm(prog2 ~ prog1))
which gives Fig. 3.16.
Figure 3.16 The Line of Best Fit
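As a hedged illustration of what "best fit" means here, lm() chooses the intercept and slope by least squares, and for a single predictor these can be computed directly from the sample means, variance and covariance. The vectors sem1 and sem2 below are made-up marks, not the prog1 and prog2 data used in the text.

```r
# Hypothetical marks for six students, invented for illustration
sem1 <- c(35, 48, 56, 64, 72, 88)
sem2 <- c(30, 40, 51, 55, 66, 79)

# Line of best fit as computed by lm()
fit <- lm(sem2 ~ sem1)
coef(fit)

# Least-squares slope and intercept computed directly
b <- cov(sem1, sem2) / var(sem1)
a <- mean(sem2) - b * mean(sem1)

# Both routes should agree up to floating-point rounding
all.equal(unname(coef(fit)), c(a, b))
```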
The line of best fit may be used to make predictions. For example, we might be able to predict how students will do in Semester 2 from the results that they obtained in Semester 1. If the mark on Programming 1 for a particular student is 70, that student would be expected to do well also in Programming 2, estimated to obtain a mark of $-5.455 + 0.96 \times 70 \approx 62$.
A word of warning is appropriate here. The estimated values are based on the assumption that the past trend continues. This may not always be the case. For example, students who do badly in Semester 1 may get such a shock that they work harder in Semester 2, and change the pattern. Similarly, students getting high marks in Semester 1 may be lulled into a false sense of security and take it easy in Semester 2. Consequently, they may not do as well as expected. Hence, the Semester 1 trends may not continue, and the model may no longer be valid.
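The prediction for a Semester 1 mark of 70 can be reproduced from the coefficients reported in the lm output above. The second half of the sketch shows how predict() does the same job for a fitted model; it uses the made-up vectors sem1 and sem2, which are assumptions for illustration and not the data from the text.

```r
# Coefficients reported in the lm output in the text
intercept <- -5.455
slope <- 0.960

# Predicted Programming 2 mark for a Programming 1 mark of 70
predicted <- intercept + slope * 70
predicted

# The same idea with predict(), using hypothetical marks
sem1 <- c(35, 48, 56, 64, 72, 88)
sem2 <- c(30, 40, 51, 55, 66, 79)
fit <- lm(sem2 ~ sem1)

# newdata must use the same variable name as the predictor in the model
pred70 <- predict(fit, newdata = data.frame(sem1 = 70))
pred70
```

The first calculation gives 61.745, or about 62. As the warning above suggests, such a value is only as trustworthy as the assumption that the fitted trend still holds for the students being predicted.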
3.6 MACHINE LEARNING AND THE LINE OF BEST FIT
Machine learning is the science of getting computer systems to use algorithms and statistical models to study patterns and learn from data. Supervised learning is the machine learning task of using past data to learn a function in order to predict a future output.
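One way to picture supervised learning with a straight line is to hold back part of the past data, fit the line to the rest, and check how well it predicts the held-back outputs. The sketch below does this on simulated data; the data, the split of 80 training and 20 test cases, and all variable names are assumptions made for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated (x, y) pairs standing in for past data; not from the text
n <- 100
x <- runif(n, 0, 100)
y <- -5 + 0.95 * x + rnorm(n, sd = 8)
past <- data.frame(x = x, y = y)

# Hold back 20 cases to play the role of "future" observations
test_rows <- sample(n, 20)
train <- past[-test_rows, ]
test <- past[test_rows, ]

# "Train" on the remaining data, then predict the held-back outputs
fit <- lm(y ~ x, data = train)
test_pred <- predict(fit, newdata = test)

# Root mean squared prediction error on the held-back cases
rmse <- sqrt(mean((test$y - test_pred)^2))
rmse
```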
The line of best fit is one of the many techniques that machine learning has borrowed from the field of Probability and Statistics to “train” the machine to make predictions. In this case of what is also known as the simple linear regression line in statistics, a set of pairs