Probability with R. Jane M. Horgan


this and alerts us to the possibility of an error.

To compare the performance of females and males in Architecture in Semester 1, write

gender <- factor(gender, levels = c("f", "m"), labels = c("Female", "Male"))

which changes the labels from “f” and “m” to “Female” and “Male,” respectively. Then

boxplot(arch1 ~ gender, ylab = "Marks (%)", main = "Architecture Semester 1", font.main = 1)

outputs Fig. 3.6.

Figure 3.6 A Gender Comparison

Notice the effect of using main = "Architecture Semester 1" that puts the title on the diagram. Also, the use of font.main = 1 ensures that the main title is in plain font.
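The relabeling step can be checked on a small example before plotting. The following is a minimal sketch: the six codes and marks are invented for illustration and are not the book's data.

```r
# Hypothetical data: the codes and marks below are invented for illustration.
gender <- c("f", "m", "m", "f", "m", "f")
arch1  <- c(62, 55, 71, 48, 66, 80)

# Recode "f"/"m" as a factor with descriptive labels.
gender <- factor(gender, levels = c("f", "m"), labels = c("Female", "Male"))

levels(gender)   # the new labels, in the order given by `levels`
table(gender)    # number of observations under each label

# Any code not listed in `levels` becomes NA, so it is worth checking.
anyNA(gender)

boxplot(arch1 ~ gender, ylab = "Marks (%)", main = "Architecture Semester 1", font.main = 1)
```

Checking for NA after the conversion is a cheap way to catch a mistyped code in the raw data.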

We can display plots as a matrix using the par function: par(mfrow = c(2,2)) causes the outputs to be displayed in a 2 × 2 array.

par(mfrow = c(2,2))
boxplot(arch1 ~ gender, main = "Architecture Semester 1", font.main = 1)
boxplot(arch2 ~ gender, main = "Architecture Semester 2", font.main = 1)
boxplot(prog1 ~ gender, main = "Programming Semester 1", font.main = 1)
boxplot(prog2 ~ gender, main = "Programming Semester 2", font.main = 1)

produces Fig. 3.7.

Figure 3.7 A Lattice of Boxplots

We see from Fig. 3.7 that female students seem to do less well than their male counterparts in Programming in Semester 1, where the median mark of the females is considerably lower than that of the males: it is lower even than the first quartile of the male marks. In the other subjects, there do not appear to be any substantial differences.
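A visual comparison of this kind can be backed up with the quartiles themselves. The sketch below uses simulated marks (the means, spreads, and group sizes are assumptions, not the book's data) and computes the quartiles within each group. Note that quantile and boxplot use slightly different definitions of the quartiles, so the values may differ a little from the box edges.

```r
set.seed(1)  # simulated marks, not the book's data
gender <- factor(rep(c("Female", "Male"), each = 20))
prog1  <- round(c(rnorm(20, mean = 50, sd = 12), rnorm(20, mean = 62, sd = 12)))

# Quartiles of the marks within each gender group:
# rows are the 25%, 50%, and 75% points, columns are the groups.
q <- sapply(split(prog1, gender), quantile, probs = c(0.25, 0.5, 0.75))
q

# Is the female median below the male first quartile?
# The answer depends on the simulated draw.
q["50%", "Female"] < q["25%", "Male"]
```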

To undo a matrix‐type output, write

par(mfrow = c(1,1))

which restores the graphics output to the full screen.
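An alternative to resetting the layout by hand is to save the previous settings and restore them afterwards. This is a sketch only, using simulated marks; the variable names op and marks are illustrative.

```r
set.seed(2)  # simulated marks standing in for the book's variables
marks <- round(runif(40, min = 20, max = 95))

op <- par(mfrow = c(1, 2))   # par() returns the previous settings invisibly
boxplot(marks, main = "Boxplot", font.main = 1)
hist(marks, main = "Histogram", font.main = 1, xlab = "Marks (%)")
par(op)                      # restore whatever layout was in force before

par("mfrow")                 # the layout that was in force before the change
```

Restoring from op works whatever the earlier layout was, so the code does not need to assume it was c(1,1).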

3.2 HISTOGRAMS

A histogram is a graphical display of frequencies in categories of a variable and is the traditional way of examining the “shape” of the data.

hist(prog1, xlab = "Marks (%)", main = "Programming Semester 1")

yields Fig. 3.8.

Figure 3.8 A Histogram with Default Breaks

As we can see from Fig. 3.8, hist gives the count of the observations that fall within each category, or “bin” as it is sometimes called. R chooses a “suitable” number of categories, unless otherwise specified. Alternatively, breaks may be used as an argument in hist to determine the number of categories. For example, to ask for five categories of equal width, include breaks = 5 as an argument. R treats a single number as a suggestion and may adjust it so that the break‐points fall on round values.

hist(prog1, xlab = "Marks (%)", main = "Programming Semester 1", breaks = 5)

gives Fig. 3.9.

Figure 3.9 A Histogram with Five Breaks of Equal Width
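The break‐points that hist actually uses can be inspected without drawing anything. The sketch below uses simulated marks (an assumption, not the book's data) and also shows one way to force exactly five equal‐width bins by supplying the break‐points explicitly.

```r
set.seed(3)  # simulated marks, not the book's data
prog1 <- round(runif(100, min = 10, max = 98))

h <- hist(prog1, breaks = 5, plot = FALSE)  # compute the bins without drawing
h$breaks          # the break-points R actually used
length(h$counts)  # number of bins, which need not be exactly 5
diff(h$breaks)    # bin widths, all equal

# To force exactly five equal-width bins, supply the break-points explicitly.
five <- seq(0, 100, length.out = 6)
h5 <- hist(prog1, breaks = five, plot = FALSE)
length(h5$counts)
```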

Recall that par can be used to represent all the subjects in one diagram. Type

par(mfrow = c(2,2))
hist(arch1, xlab = "Architecture", main = "Semester 1", ylim = c(0, 35))
hist(arch2, xlab = "Architecture", main = "Semester 2", ylim = c(0, 35))
hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35))
hist(prog2, xlab = "Programming", main = " ", ylim = c(0, 35))

to get Fig. 3.10. The ylim = c(0, 35) ensures that the y‐axis has the same scale for all four subjects.

Figure 3.10 Histogram of Each Subject in Each Semester
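Rather than choosing the upper limit of 35 by eye, a common limit can be computed from the data. The following is a minimal sketch with simulated marks for four hypothetical subjects; the names subjects and top are illustrative.

```r
set.seed(4)  # simulated marks for four hypothetical subjects
subjects <- list(
  "Architecture 1" = round(runif(60, 20, 95)),
  "Architecture 2" = round(runif(60, 25, 98)),
  "Programming 1"  = round(runif(60, 10, 90)),
  "Programming 2"  = round(runif(60, 15, 95))
)

# Height of the tallest bar across all four histograms, computed without plotting.
top <- max(sapply(subjects, function(x) max(hist(x, plot = FALSE)$counts)))

op <- par(mfrow = c(2, 2))
for (name in names(subjects)) {
  hist(subjects[[name]], xlab = "Marks (%)", main = name,
       font.main = 1, ylim = c(0, top))
}
par(op)  # restore the previous layout
```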

Up until now, we have used the default settings of the histogram: the bins have equal width, and the frequency in each bin is displayed. These settings may be changed as appropriate. For example, you may want to specify the bin break‐points to represent the failures and the various classes of passes and honors.

bins <- c(0, 40, 60, 80, 100)
hist(prog1, xlab = "Marks (%)", main = "Programming Semester 1", breaks = bins)

yields Fig. 3.11.

Figure 3.11 A Histogram with Breaks of a Specified Width

In Fig. 3.11, observe that the y‐axis now represents the density. When the bins are not of equal width, R returns a normalized histogram, so that its total area is equal to one.
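The normalization can be verified numerically: each bar's density is its count divided by the total count times the bin width, so the bar areas sum to one. The sketch below uses simulated marks, not the book's data.

```r
set.seed(5)  # simulated marks, not the book's data
prog1 <- round(runif(100, min = 5, max = 98))

bins <- c(0, 40, 60, 80, 100)
h <- hist(prog1, breaks = bins, plot = FALSE)  # compute the bins without drawing

widths <- diff(h$breaks)                       # 40, 20, 20, 20
h$density                                      # bar heights used on the density scale
h$counts / (sum(h$counts) * widths)            # the same heights, computed by hand

sum(h$density * widths)                        # total area of the bars
```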

To get a histogram of percentages, write in R

h <- hist(prog1, plot = FALSE, breaks = 5)   # this postpones the plot display
h$density <- h$counts/sum(h$counts)*100      # this calculates percentages
plot(h, xlab = "Marks (%)", freq = FALSE, ylab = "Percentage", main = "Programming Semester 1")

The output is given in Fig. 3.12. The # symbol introduces a comment: anything written after # on the same line is ignored.

Figure 3.12 Histogram with Percentages
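The percentages behind such a plot can also be listed as a table. This is a sketch under stated assumptions: the marks are simulated, and cut is called with include.lowest = TRUE so that its intervals mirror the default binning of hist.

```r
set.seed(6)  # simulated marks, not the book's data
prog1 <- round(runif(100, min = 5, max = 98))

h <- hist(prog1, plot = FALSE, breaks = 5)

# Assign each mark to one of the histogram's bins and count them.
tab <- table(cut(prog1, breaks = h$breaks, include.lowest = TRUE))

# Percentage of observations in each bin.
pct <- 100 * tab / sum(tab)
round(pct, 1)

sum(pct)  # the percentages account for every observation
```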

3.3 STEM AND LEAF

The stem and leaf diagram is a more modern way of displaying data than the histogram. It is a depiction of the shape of the data using the actual numbers observed. Similar to the histogram, the stem and leaf gives the frequencies of categories of the variable, but it goes further than that and gives the actual values in each category.

The marks obtained in Programming in Semester 1 are depicted as a stem and leaf diagram using

stem(prog1)
