The Big R-Book. Philippe J. S. De Brouwer

While the mean (and the average in particular) is widely used, it is quite vulnerable to outliers. It therefore makes sense to have a measure that is less influenced by outliers and instead answers the question: what would a typical observation look like? The median is such a measure.

The median is the middle value: 50% of the observations are lower and 50% are higher.

x <- c(1:5, 5e10, NA)
x
## [1] 1e+00 2e+00 3e+00 4e+00 5e+00 5e+10    NA
median(x)                 # no meaningful result with NAs
## [1] NA
median(x, na.rm = TRUE)   # ignore the NA
## [1] 3.5
# Note how the median is not impacted by the outlier,
# but the outlier dominates the mean:
mean(x, na.rm = TRUE)
## [1] 8333333336
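To make the "middle value" idea concrete, the median can also be computed by hand. The following is only a minimal sketch: the helper name manual_median is hypothetical (it is not a base R function), and it assumes that missing values should simply be dropped.

```r
# Hypothetical helper (not part of base R): compute the median by hand.
manual_median <- function(v) {
  s <- sort(v)                      # sort() drops NA values by default
  n <- length(s)
  if (n == 0) return(NA_real_)      # nothing left to summarise
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd length: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even length: average of the two middle values
  }
}

x <- c(1:5, 5e10, NA)
manual_median(x)            # agrees with median(x, na.rm = TRUE)
manual_median(c(7, 1, 3))   # odd number of observations
```

For an even number of observations there is no single middle value, which is why the two central values are averaged.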

8.1.3 The Mode

The mode is the value that has the highest probability of occurring. For a series of observations, this is the value that occurs most often. Note that the mode is also defined for variables that have no order relation. Even labels such as “green,” “yellow,” etc. have a mode, but not a mean or median, unless one adds further abstraction or a numerical representation.

In R, the functions mode() and storage.mode() return a character string describing how a variable is stored. R does not have a standard function to calculate the statistical mode, so let us create our own:

# my_mode
# Finds the first mode (only one)
# Arguments:
#   v -- numeric vector or factor
# Returns:
#   the first mode
my_mode <- function(v) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  uniqv[which.max(tabv)]
}

# now test this function
x <- c(1,2,3,3,4,5,60,NA)
my_mode(x)
## [1] 3
x1 <- c("relevant", "N/A", "undesired", "great", "N/A",
        "undesired", "great", "great")
my_mode(x1)
## [1] "great"

# text from https://www.r-project.org/about.html
t <- "R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS."
v <- unlist(strsplit(t, split = " "))
my_mode(v)
## [1] "and"

While this function works fine on the examples provided, it only returns the first mode encountered. In general, however, the mode is not necessarily unique and it might make sense to return them all. This can be done by modifying the code as follows:

# my_mode
# Finds the mode(s) of a vector v
# Arguments:
#   v -- numeric vector or factor
#   return.all -- boolean -- set to true to return all modes
# Returns:
#   the modal elements
my_mode <- function(v, return.all = FALSE) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) {
    uniqv[tabv == max(tabv)]
  } else {
    uniqv[which.max(tabv)]
  }
}

# example:
x <- c(1,2,2,3,3,4,5)
my_mode(x)
## [1] 2
my_mode(x, return.all = TRUE)
## [1] 2 3
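The frequency counts behind such a function can be cross-checked with base R's table(). This is only an illustrative sketch and not a replacement for my_mode(): table() drops NA values by default and returns the values as character labels rather than in their original type.

```r
x <- c(1, 2, 2, 3, 3, 4, 5)

counts <- table(x)          # frequency of each distinct value
counts

# All values that reach the highest frequency (returned as character labels):
is_modal <- as.vector(counts == max(counts))
names(counts)[is_modal]
```

Here both 2 and 3 occur twice, so both are reported as modal values.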

Hint – Use default values to keep code backwards compatible

We were confident that it was fine to override the definition of the function my_mode. Indeed, if the function was already used in some older code, then that code expects a single mode to be returned. That behaviour is unchanged, because we chose FALSE as the default value for the optional parameter return.all. If the default were TRUE, older code could produce wrong results, and if we had not provided a default value at all, older code would fail to run.
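The claim about backwards compatibility can be checked directly. In the sketch below, the names my_mode_v1 and my_mode_v2 are hypothetical and exist only so that the old and the new definition can live side by side.

```r
# Original single-mode version, kept under a hypothetical name for comparison.
my_mode_v1 <- function(v) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  uniqv[which.max(tabv)]
}

# Extended version: the optional argument defaults to FALSE.
my_mode_v2 <- function(v, return.all = FALSE) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) uniqv[tabv == max(tabv)] else uniqv[which.max(tabv)]
}

x <- c(1, 2, 2, 3, 3, 4, 5)
identical(my_mode_v1(x), my_mode_v2(x))   # old-style calls behave as before
my_mode_v2(x, return.all = TRUE)          # the new behaviour is opt-in
```

Existing calls with a single argument keep returning one value, while new code can explicitly request all modes.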

8.2. Measures of Variation or Spread

Variation or spread measures how different observations are compared to the mean or other central measure. If variation is small, one can expect observations to be closer to each other.
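A small illustration may help. The two vectors below are made up for this sketch: they share the same mean, yet their observations sit at very different distances from it. The base R function var() is used here as one possible measure of spread.

```r
a <- c(9, 10, 11)   # observations close to their mean
b <- c(0, 10, 20)   # same mean, but observations far apart

mean(a) == mean(b)  # both samples are centred on 10
a - mean(a)         # small deviations from the centre
b - mean(b)         # large deviations from the centre
var(a)              # sample variance of a
var(b)              # sample variance of b, much larger
```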

Definition: Variance
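In standard notation (the symbols below follow common convention and are not taken from this text), the variance of a random variable X with mean μ, together with the usual sample estimator that R's var() computes, can be written as:

```latex
\operatorname{Var}(X) = E\left[(X - \mu)^2\right], \qquad \mu = E[X]

s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```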

8.2.1 Standard Deviation

Definition: Standard deviation

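The estimator for the standard deviation is the square root of the sample variance:

$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$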

t <- rnorm(100, mean = 0, sd = 20)   # mean = 0 is also the default of rnorm()
var(t)
## [1] 248.2647
sd(t)
## [1] 15.75642
sqrt(var(t))
## [1] 15.75642
sqrt(sum((t - mean(t))^2) / (length(t) - 1))
## [1] 15.75642

8.2.2 Median Absolute Deviation

Definition: mad
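Following the documentation of R's mad() function, the median absolute deviation of observations x_1, ..., x_n can be sketched as below. The symbol c stands for the scaling constant (1.4826 by default in mad()); the notation is an assumption made for illustration.

```latex
\operatorname{mad}(x) = c \cdot \operatorname{median}_{i} \left( \left| x_i - \operatorname{median}_{j}(x_j) \right| \right)
```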

mad(t)
## [1] 14.54922
mad(t, constant = 1)
## [1] 9.813314

The default “constant=1.4826” (approximately $1/\Phi^{-1}(3/4)$ = 1/qnorm(3/4)) ensures consistency, i.e., $E[\operatorname{mad}(X_1, \ldots, X_n)] = \sigma$ for $X_i$ distributed as $N(\mu, \sigma^2)$ and large $n$.
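Both the origin of the constant and the consistency property can be checked numerically. The sketch below uses an arbitrary seed, sample size and value of sigma, all chosen purely for illustration.

```r
set.seed(42)                        # arbitrary seed, only for reproducibility
sigma <- 20
x <- rnorm(1e5, mean = 0, sd = sigma)

1 / qnorm(3 / 4)                    # approximately 1.4826

# mad() computed by hand: scaled median of absolute deviations from the median
manual_mad <- 1.4826 * median(abs(x - median(x)))
all.equal(manual_mad, mad(x))

mad(x)                              # close to sigma for a large normal sample
mad(x, constant = 1)                # unscaled version, noticeably smaller
```

With the default constant, the result estimates the standard deviation of normally distributed data, whereas constant = 1 returns the raw median of the absolute deviations.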

8.3. Measures of Covariation

When
