The Big R-Book. Philippe J. S. De Brouwer
While the mean (and the average in particular) is widely used, it is quite vulnerable to outliers. It would therefore make sense to have a measure that is less influenced by outliers and that instead answers the question: what would a typical observation look like? The median is such a measure.
The median is the middle value: 50% of the observations are lower and 50% are higher.
x <- c(1:5,5e10,NA)
x
## [1] 1e+00 2e+00 3e+00 4e+00 5e+00 5e+10 NA
median(x)               # no meaningful result with NAs
## [1] NA
median(x,na.rm = TRUE)  # ignore the NA
## [1] 3.5
# Note how the median is not impacted by the outlier,
# but the outlier dominates the mean:
mean(x, na.rm = TRUE)
## [1] 8333333336
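To make the definition concrete, the following minimal sketch computes the median by hand: sort the observations, then take the middle one (odd count) or the average of the two middle ones (even count). The helper name my_median is hypothetical and exists only for illustration; in practice one would simply call median().

```r
# my_median (hypothetical helper, for illustration only)
# Computes the median by sorting; NAs are ignored.
my_median <- function(v) {
  v <- sort(v)              # sort() drops NAs by default
  n <- length(v)
  if (n == 0) return(NA_real_)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]          # odd count: the middle value
  } else {
    mean(v[n / 2 + 0:1])    # even count: average of the two middle values
  }
}

x <- c(1:5, 5e10, NA)
my_median(x)
## [1] 3.5
```

The result agrees with median(x, na.rm = TRUE) above, and the outlier 5e10 only matters through its rank, not through its size.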
8.1.3 The Mode
The mode is the value that has the highest probability to occur. For a series of observations, this is the value that occurs most often. Note that the mode is also defined for variables that have no order relation: even labels such as “green,” “yellow,” etc. have a mode, but not a mean or median (unless one introduces a further abstraction or a numerical representation).
In R, the function mode() or storage.mode() returns a character string describing how a variable is stored. In fact, R does not have a standard function to calculate the mode, so let us create our own:
# my_mode
# Finds the first mode (only one)
# Arguments:
#    v -- numeric vector or factor
# Returns:
#    the first mode
my_mode <- function(v) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  uniqv[which.max(tabv)]
}

# now test this function
x <- c(1,2,3,3,4,5,60,NA)
my_mode(x)
## [1] 3
x1 <- c("relevant", "N/A", "undesired", "great", "N/A",
        "undesired", "great", "great")
my_mode(x1)
## [1] "great"

# text from https://www.r-project.org/about.html
t <- "R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS."
v <- unlist(strsplit(t,split=" "))
my_mode(v)
## [1] "and"
While this function works fine on the examples provided, it only returns the first mode encountered. In general, however, the mode is not necessarily unique and it might make sense to return them all. This can be done by modifying the code as follows:
# my_mode
# Finds the mode(s) of a vector v
# Arguments:
#    v -- numeric vector or factor
#    return.all -- boolean -- set to true to return all modes
# Returns:
#    the modal elements
my_mode <- function(v, return.all = FALSE) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) {
    uniqv[tabv == max(tabv)]
  } else {
    uniqv[which.max(tabv)]
  }
}

# example:
x <- c(1,2,2,3,3,4,5)
my_mode(x)
## [1] 2
my_mode(x, return.all = TRUE)
## [1] 2 3
We were confident that it was fine to override the definition of the function my_mode. Indeed, if the function was already used in some older code, then that code expects exactly one mode to be returned. That behaviour is unchanged, because we chose FALSE as the default value for the optional parameter return.all. If the default were TRUE, older code could produce wrong results, and if we provided no default value at all, older code would fail to run.
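The backward-compatibility argument can be checked directly. The sketch below keeps the old and the new definitions side by side under the hypothetical names my_mode_v1 and my_mode_v2 (in the text both are simply called my_mode), and adds a third hypothetical variant, my_mode_nodefault, whose return.all parameter has no default value.

```r
# Old definition: returns the first mode only.
my_mode_v1 <- function(v) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  uniqv[which.max(tabv)]
}

# New definition: optional parameter with default FALSE.
my_mode_v2 <- function(v, return.all = FALSE) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) uniqv[tabv == max(tabv)] else uniqv[which.max(tabv)]
}

# Hypothetical variant: same body, but no default for return.all.
my_mode_nodefault <- function(v, return.all) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) uniqv[tabv == max(tabv)] else uniqv[which.max(tabv)]
}

x <- c(1, 2, 2, 3, 3, 4, 5)

# An old-style call gives the same answer with both definitions.
identical(my_mode_v1(x), my_mode_v2(x))
## [1] TRUE

# Without a default value, the old-style call stops with an error.
res <- tryCatch(my_mode_nodefault(x), error = function(e) "error")
res
## [1] "error"
```

Adding a parameter with a conservative default is therefore the change that leaves existing calls untouched.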
8.2. Measures of Variation or Spread
Variation or spread measures how different observations are compared to the mean or other central measure. If variation is small, one can expect observations to be closer to each other.
Definition: Variance
$\mathrm{VAR}(X) = E\left[(X - E[X])^2\right]$
8.2.1 Standard Deviation
Definition: Standard deviation
$\mathrm{SD}(X) = \sqrt{E\left[(X - E[X])^2\right]}$
The estimator for standard deviation is:
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
t <- rnorm(100, mean=, sd=20)
var(t)
## [1] 248.2647
sd(t)
## [1] 15.75642
sqrt(var(t))
## [1] 15.75642
sqrt(sum((t - mean(t))^2)/(length(t) - 1))
## [1] 15.75642
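The hand-computed line above divides by length(t) - 1 rather than by the number of observations. The following minimal sketch contrasts the two denominators. The seed, the choice mean = 0, and the names obs, s_n1 and s_n are assumptions made for this illustration only.

```r
set.seed(1890)                        # arbitrary seed (assumption)
obs <- rnorm(100, mean = 0, sd = 20)  # 'obs' avoids masking base::t()
n   <- length(obs)

s_n1 <- sqrt(sum((obs - mean(obs))^2) / (n - 1))  # denominator n - 1
s_n  <- sqrt(sum((obs - mean(obs))^2) / n)        # denominator n

# sd() uses the n - 1 version:
all.equal(sd(obs), s_n1)
## [1] TRUE

# The two versions differ by the factor sqrt((n - 1) / n):
all.equal(s_n, sd(obs) * sqrt((n - 1) / n))
## [1] TRUE
```

For n = 100 the difference is about half a percent; it matters more for small samples.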
8.2.2 Median Absolute Deviation
Definition: mad
$\mathrm{mad}(X) = 1.4826 \cdot \mathrm{median}\left(\left|X - \mathrm{median}(X)\right|\right)$
mad(t)
## [1] 14.54922
mad(t,constant=1)
## [1] 9.813314
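The default “constant=1.4826” (approximately $1/\Phi^{-1}(3/4)$) ensures consistency, i.e. $E[\mathrm{mad}(X_1, \ldots, X_n)] = \sigma$ for $X_i$ distributed as $N(\mu, \sigma^2)$ and large $n$.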
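The value of the constant and the computation behind mad() can be checked numerically. The sketch below is illustrative only: my_mad is a hypothetical helper that mirrors the documented default behaviour of mad() (centre at the median, constant 1.4826), and the data vector reuses the outlier example from the section on the median.

```r
# Where the constant comes from: 1 / qnorm(3/4)
1 / qnorm(3 / 4)
## [1] 1.482602

# mad() computed by hand (my_mad is a hypothetical helper)
my_mad <- function(v, constant = 1.4826) {
  constant * median(abs(v - median(v)))
}

x <- c(1:5, 5e10)
all.equal(my_mad(x), mad(x))
## [1] TRUE

# The outlier barely affects mad(), but it dominates sd():
mad(x)
## [1] 2.2239
sd(x)    # of the order of 2e10
```

Like the median itself, mad() depends on the outlier only through its rank, which is why it stays small here while sd() explodes.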
8.3. Measures of Covariation
When