Probability with R. Jane M. Horgan
this and alerts us to the possibility of an error.
To compare the performance of females and males in Architecture in Semester 1, write
gender <- factor(gender, levels = c("f", "m"), labels = c("Female", "Male"))
which changes the labels from “f” and “m” to “Female” and “Male,” respectively. Then
boxplot(arch1 ~ gender, ylab = "Marks (%)", main = "Architecture Semester 1", font.main = 1)
outputs Fig. 3.6.
Figure 3.6 A Gender Comparison
Notice the effect of main = "Architecture Semester 1", which puts the title on the diagram. Also, font.main = 1 ensures that the main title is in plain font.
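The relabelling step can be tried on a small scale. The following is a minimal sketch using six invented marks rather than the data set used in the text; the names gender_demo, marks_demo and medians_demo are hypothetical. It suggests that factor changes only the labels, so the grouping used by boxplot stays the same.

```r
# Invented data for illustration only (not the results data used in the text)
gender_demo <- c("f", "m", "m", "f", "m", "f")
marks_demo  <- c(62, 55, 71, 48, 66, 59)

# Relabel the codes "f" and "m" as "Female" and "Male"
gender_demo <- factor(gender_demo, levels = c("f", "m"),
                      labels = c("Female", "Male"))
levels(gender_demo)

# Median mark in each group, i.e. the centre line of each box
medians_demo <- tapply(marks_demo, gender_demo, median)
medians_demo

# Boxplot with a plain-font title, as in the text
boxplot(marks_demo ~ gender_demo, ylab = "Marks (%)",
        main = "Invented marks", font.main = 1)
```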
We can display plots as a matrix using the par function:
par(mfrow = c(2,2))
causes the outputs to be displayed in a 2 × 2 array. For example,
par(mfrow = c(2,2))
boxplot(arch1 ~ gender, main = "Architecture Semester 1", font.main = 1)
boxplot(arch2 ~ gender, main = "Architecture Semester 2", font.main = 1)
boxplot(prog1 ~ gender, main = "Programming Semester 1", font.main = 1)
boxplot(prog2 ~ gender, main = "Programming Semester 2", font.main = 1)
produces Fig. 3.7.
Figure 3.7 A Lattice of Boxplots
We see from Fig. 3.7 that female students seem to do less well than their male counterparts in Programming in Semester 1, where the median mark of the females is considerably lower than that of the males: it is lower even than the first quartile of the male marks. In the other subjects, there do not appear to be any substantial differences.
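A comparison of this kind, read off the boxplots by eye, can also be checked numerically. The sketch below uses invented marks (female_demo and male_demo are hypothetical names, not the data behind Fig. 3.7) and fivenum, which returns the minimum, lower hinge, median, upper hinge, and maximum; the hinges are the edges of the box that boxplot draws.

```r
# Invented marks for illustration only
female_demo <- c(30, 35, 40, 42, 45, 50, 55)
male_demo   <- c(48, 52, 58, 60, 65, 70, 75)

# Five-number summaries: min, lower hinge, median, upper hinge, max
female_five <- fivenum(female_demo)
male_five   <- fivenum(male_demo)

female_median    <- female_five[3]  # centre line of the female box
male_lower_hinge <- male_five[2]    # bottom edge of the male box

# TRUE when the female median lies below the male first quartile
female_median < male_lower_hinge
```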
To undo a matrix‐type output, write
par(mfrow = c(1,1))
which restores the graphics output to the full screen.
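An alternative that is not taken from the text: when par changes a setting it returns the previous values, so these can be stored and restored later instead of being reset by hand. The name old_par is an assumption made for illustration.

```r
# Switch to a 2 x 2 layout and keep the previous settings
old_par <- par(mfrow = c(2, 2))
par("mfrow")   # layout now in force

# ... draw up to four plots here ...

# Restore whatever layout was in force before the change
par(old_par)
par("mfrow")
```

This approach may be preferable inside functions or scripts, where the layout in force beforehand is not necessarily c(1,1).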
3.2 HISTOGRAMS
A histogram is a graphical display of frequencies in categories of a variable and is the traditional way of examining the “shape” of the data.
hist(prog1, xlab = "Marks (%)", main = "Programming Semester 1")
yields Fig. 3.8.
Figure 3.8 A Histogram with Default Breaks
As we can see from Fig. 3.8, hist gives the count of the observations that fall within the categories, or “bins” as they are sometimes called. R chooses a “suitable” number of categories unless otherwise specified. To control the number of categories, use breaks as an argument in hist. For example, to ask for five categories of equal width, include breaks = 5 as an argument. R treats a single number as a suggestion and rounds the break‐points to convenient values, so the number of bins drawn may differ from the number requested.
hist(prog1, xlab = "Marks (%)", main = "Programming Semester 1", breaks = 5)
gives Fig. 3.9.
Figure 3.9 A Histogram with Five Breaks of Equal Width
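The break‐points that R actually uses can be inspected by calling hist with plot = FALSE. The sketch below works on twelve invented marks (marks_demo is a hypothetical name) and contrasts a single number for breaks with an explicit vector of break‐points, which R uses exactly as given.

```r
# Invented marks for illustration only
marks_demo <- c(12, 25, 33, 41, 47, 52, 58, 63, 69, 74, 81, 95)

# A single number: R picks convenient break-points near the requested count
h_suggested <- hist(marks_demo, breaks = 5, plot = FALSE)
h_suggested$breaks
h_suggested$counts

# A vector of six break-points: exactly five bins of width 20
h_exact <- hist(marks_demo, breaks = seq(0, 100, length.out = 6), plot = FALSE)
h_exact$breaks
h_exact$counts
```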
Recall that par can be used to represent all the subjects in one diagram. Type
par(mfrow = c(2,2))
hist(arch1, xlab = "Architecture", main = "Semester 1", ylim = c(0, 35))
hist(arch2, xlab = "Architecture", main = "Semester 2", ylim = c(0, 35))
hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35))
hist(prog2, xlab = "Programming", main = " ", ylim = c(0, 35))
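to get Fig. 3.10. The argument ylim = c(0, 35) ensures that the vertical axis runs from 0 to 35 in every panel, so that the four histograms are drawn on the same scale.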
Figure 3.10 Histogram of Each Subject in Each Semester
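A shared upper limit for the vertical axis need not be chosen by eye. One possible approach, sketched here on invented marks for two hypothetical subjects (subject_a, subject_b and y_max are names assumed for illustration), is to compute the bin counts first with plot = FALSE and take the largest.

```r
# Invented marks for two hypothetical subjects
subject_a <- c(35, 42, 48, 51, 55, 58, 62, 66, 71, 84)
subject_b <- c(22, 45, 47, 52, 53, 56, 57, 59, 61, 78)

# Compute the bin counts without drawing anything
counts_a <- hist(subject_a, plot = FALSE)$counts
counts_b <- hist(subject_b, plot = FALSE)$counts

# The tallest bar over both subjects gives a common upper limit
y_max <- max(counts_a, counts_b)

# Draw both histograms side by side on the same vertical scale
old_par <- par(mfrow = c(1, 2))
hist(subject_a, xlab = "Subject A", main = "", ylim = c(0, y_max))
hist(subject_b, xlab = "Subject B", main = "", ylim = c(0, y_max))
par(old_par)
```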
Up to now, we have used the default settings of the histogram: notably, the bins have equal width and the frequency in each bin is plotted. These settings may be changed as appropriate. For example, you may want to specify the bin break‐points to represent the failures and the various classes of passes and honors.
bins <- c(0, 40, 60, 80, 100)
hist(prog1, xlab = "Marks (%)", main = "Programming Semester 1", breaks = bins)
yields Fig. 3.11.
Figure 3.11 A Histogram with Breaks of a Specified Width
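The object returned by hist can be examined for the same break‐points. The sketch below uses twelve invented marks (marks_demo, h_classes, widths and area are hypothetical names) and looks at the count in each class and at the total area of the bars.

```r
# Invented marks for illustration only
marks_demo <- c(12, 25, 33, 41, 47, 52, 58, 63, 69, 74, 81, 95)
bins <- c(0, 40, 60, 80, 100)

# Compute the histogram without drawing it
h_classes <- hist(marks_demo, breaks = bins, plot = FALSE)

h_classes$counts     # number of marks in each class
h_classes$equidist   # FALSE here: the first bin is wider than the others

# With densities, bar area = density * bin width, and the areas sum to one
widths <- diff(h_classes$breaks)
area   <- sum(h_classes$density * widths)
area
```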
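In Fig. 3.11, observe that the vertical axis now shows density rather than frequency: when the bins are not of equal width, hist plots densities by default, so that the total area of the bars is one.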
To get a histogram of percentages, write in R
h <- hist(prog1, plot = FALSE, breaks = 5) #this postpones the plot display
h$density <- h$counts/sum(h$counts)*100 #this calculates percentages
plot(h, xlab = "Marks (%)", freq = FALSE, ylab = "Percentage", main = "Programming Semester 1")
The output is given in Fig. 3.12. The # symbol introduces a comment: anything written after # on the same line is ignored.
Figure 3.12 Histogram with Percentages
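As a check on the percentage calculation, the same steps can be run on a few invented marks (marks_demo, h_demo and percent_demo are hypothetical names). The percentages should sum to 100, and overwriting the density component is what makes plot draw them when freq = FALSE.

```r
# Invented marks for illustration only
marks_demo <- c(12, 25, 33, 41, 47, 52, 58, 63, 69, 74, 81, 95)

# Compute the histogram without drawing it, using bins of width 20
h_demo <- hist(marks_demo, plot = FALSE, breaks = seq(0, 100, by = 20))

# Convert the counts to percentages of all observations
percent_demo <- h_demo$counts / sum(h_demo$counts) * 100
percent_demo
sum(percent_demo)

# Store the percentages where plot() looks when freq = FALSE
h_demo$density <- percent_demo
plot(h_demo, freq = FALSE, xlab = "Marks (%)", ylab = "Percentage",
     main = "Invented marks")
```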
3.3 STEM AND LEAF
The stem and leaf diagram is a more modern way of displaying data than the histogram. It is a depiction of the shape of the data using the actual numbers observed. Similar to the histogram, the stem and leaf gives the frequencies of categories of the variable, but it goes further than that and gives the actual values in each category.
The marks obtained in Programming in Semester 1 are depicted as a stem and leaf diagram using
stem(prog1)