Probability with R. Jane M. Horgan


      Measures of central tendency are typical or central points in the data. The most commonly used are the mean and the median.

      Mean: The mean is the sum of all values divided by the number of cases, excluding the missing values.
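      As a quick check of this definition, the following sketch computes a mean by hand and compares it with R's built-in mean function. The vector x holds made-up values chosen only for illustration; it is not one of the book's data sets.

```r
# Hypothetical values, not taken from the book's data sets
x <- c(12, 7, 3, 18, 10)

# The mean is the sum of the values divided by the number of values
manual_mean <- sum(x) / length(x)

manual_mean
mean(x)
```

      Both lines should print the same number, since mean carries out the same calculation.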

      To obtain the mean of the data in Example 1.1 stored in downtime, write

      mean(downtime)

      [1] 25.04348

      So the average downtime of all the computers in the laboratory is just over 25 minutes.

      Going back to the original data in Exercise 1.1 stored in marks, to obtain the mean, write

      mean(marks)

      [1] 57.44

      To obtain the mean marks for females, write

      mean(marks[1:23])

      [1] 65.86957

      For males,

      mean(marks[24:50])

      [1] 50.25926

      illustrating that the female average is substantially higher than the male average.

      To obtain the mean of the corrected data in Exercise 1.1, recall that the mark of 86 for the 34th student on the list was an error, and that it should have been 46. We changed it with

      marks[34] <- 46

      The new overall average is

      mean(marks)

      [1] 56.64

      and the new male average is

      mean(marks[24:50])

      [1] 48.77778

      increasing the gap between the male and female averages even further.
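      Selecting groups by index range, as above, works when the data are sorted by group. The sketch below shows one possible alternative, tapply, which computes a mean per group from a vector of labels. The names toy_marks and toy_gender and their values are invented for illustration and are not the marks data from Exercise 1.1.

```r
# Hypothetical marks and gender labels, not the book's data
toy_marks  <- c(70, 62, 58, 45, 51, 39)
toy_gender <- c("f", "f", "f", "m", "m", "m")

# Index ranges, as in the text: here the first three entries are female
mean(toy_marks[1:3])
mean(toy_marks[4:6])

# The same group means in one call, using the labels instead of positions
group_means <- tapply(toy_marks, toy_gender, mean)
group_means
```

      The label-based version does not depend on the order of the observations, which may be convenient when the data are not sorted by group.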

      If we perform a similar operation for the variables in the examination data given in Example 1.2, we run into trouble. Suppose we want the mean mark for Architecture in Semester 1. In R

      mean(arch1)

      gives

      [1] NA

      to indicate that some of these marks are “not available”. R will not perform arithmetic operations on objects containing NA unless specifically instructed to skip or remove the missing values. To do this, insert the argument na.rm = T or na.rm = TRUE (NA remove) into the function. The spelling TRUE is the safer of the two, since T is an ordinary variable that can be reassigned.

      For arch1, writing

      mean(arch1, na.rm = TRUE)

      yields

      [1] 63.56897
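      To make the effect of na.rm concrete, here is a small sketch on a made-up vector y with two missing entries. The values are hypothetical and are not the arch1 marks.

```r
# Hypothetical marks with two missing entries (not the book's arch1 data)
y <- c(60, NA, 72, 55, NA, 81)

mean(y)                 # NA: a missing value propagates through the sum
mean(y, na.rm = TRUE)   # missing values are dropped before averaging

# The same calculation by hand: sum and count only the observed values
observed <- y[!is.na(y)]
by_hand <- sum(observed) / length(observed)
by_hand
```

      Note that the divisor is the number of observed values, not the full length of the vector.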

      To obtain the mean of all the variables in results file, we use the R function sapply.

      sapply(results, mean, na.rm = TRUE)

      yields

        gender    arch1    prog1    arch2    prog2
            NA 63.56897 59.01709 51.97391 53.78378

      Notice that an NA is returned for gender. The reason for this is that the gender variable is nonnumeric, and R cannot calculate its mean. We could instead specify the columns that we want to work on.

      sapply(results[2:5], mean, na.rm = TRUE)

      gives

         arch1    prog1    arch2    prog2
      63.56897 59.01709 51.97391 53.78378
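      Rather than typing column positions, the numeric columns can also be picked out programmatically. The following is a minimal sketch on a hypothetical data frame called toy, which stands in for the results file; its column names and values are invented for illustration.

```r
# Hypothetical data frame standing in for the results file
toy <- data.frame(
  gender = c("f", "m", "f", "m"),
  test1  = c(50, NA, 70, 60),
  test2  = c(40, 55, NA, 61)
)

# Logical vector flagging which columns are numeric
is_num <- sapply(toy, is.numeric)

# Column means for the numeric columns only, skipping missing values
col_means <- sapply(toy[is_num], mean, na.rm = TRUE)
col_means
```

      This approach keeps working if columns are added or reordered, whereas a fixed range such as 2:5 would need to be updated.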

      Median: The median is the middle value of the data set; 50% of the observations are less than this value and 50% are more.

      In R

      median(downtime)

      yields

      [1] 25

      which means that 50% of the computers experienced less than 25 minutes of downtime, while 50% experienced more than 25 minutes of downtime.

      Also,

      median(marks)

      [1] 55.5

      Comparing these medians with the means obtained earlier, you will observe that the medians are not too far away from their respective means.
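      The definition of the median as the middle value can be checked by hand. The sketch below sorts the data and picks the middle observation, averaging the two middle observations when the number of values is even. The helper median_by_hand and the two vectors are hypothetical names and values used only for illustration.

```r
# Hypothetical values, not the book's downtime or marks data
odd  <- c(9, 2, 7, 4, 5)       # five values
even <- c(9, 2, 7, 4, 5, 11)   # six values

# Median from the sorted values: the middle one when n is odd,
# the average of the two middle ones when n is even
median_by_hand <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) s[(n + 1) / 2] else (s[n / 2] + s[n / 2 + 1]) / 2
}

median_by_hand(odd)
median(odd)

median_by_hand(even)
median(even)
```

      In each pair the hand calculation and R's median function should agree.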

      The median is particularly useful when there are extreme values in the data. Let us look at another example.

      Examining the nine apps with greatest usage on your smartphone, you may find the usage statistics (in MB) are

      App               Usage (MB)
      Facebook          39.72
      Chrome            35.27
      WhatsApp           5.73
      Google             5.60
      System Account     3.30
      Instagram          3.22
      Gmail              2.52
      Messenger          1.71
      Maps               1.55

      To enter the data, write

      usage <- c(39.72, 35.27, 5.73, 5.6, 3.3, 3.22, 2.52, 1.71, 1.55)

      The mean is

      mean(usage)

      [1] 10.95778

      while the median is

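      median(usage)

      [1] 3.3

      The two largest values, for Facebook and Chrome, pull the mean well above the median. Omitting them gives

      mean(usage[3:9])

      [1] 3.375714

      median(usage[3:9])

      [1] 3.22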

      Now, we see that there is not much difference between the mean and median.
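      The comparison can be laid out side by side. The sketch below uses the usage vector as entered in the text; the names full and trimmed are introduced here only for illustration.

```r
# Usage figures (MB) as entered in the text
usage <- c(39.72, 35.27, 5.73, 5.6, 3.3, 3.22, 2.52, 1.71, 1.55)

# Mean and median for all nine apps, then without the two largest values
full    <- c(mean = mean(usage),      median = median(usage))
trimmed <- c(mean = mean(usage[3:9]), median = median(usage[3:9]))

# Rows: data set used; columns: measure
rbind(full, trimmed)

# Change in each measure when the two largest values are dropped
full - trimmed
```

      The mean changes by several MB when the two large values are removed, while the median barely moves, which is why the median is preferred for data with extreme values.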

      When there are extremely high values in the data, using the mean as a measure of central tendency gives the wrong impression. A classic example of this is wage statistics where there may be a few instances of very high salaries, which will grossly inflate the average, giving the impression that salaries are higher than they

