The Big R-Book. Philippe J. S. De Brouwer


      While the mean (and the average in particular) is widely used, it is quite vulnerable to outliers. It therefore makes sense to have a measure that is less influenced by outliers and that instead answers the question: what does a typical observation look like? The median is such a measure.

       central tendency – median

      The median is the middle value: 50% of the observations are lower and 50% are higher.

      x <- c(1:5, 5e10, NA)
      x
      ## [1] 1e+00 2e+00 3e+00 4e+00 5e+00 5e+10    NA
      median(x)               # no meaningful result with NAs
      ## [1] NA
      median(x, na.rm = TRUE) # ignore the NA
      ## [1] 3.5
      # Note how the median is not impacted by the outlier,
      # but the outlier dominates the mean:
      mean(x, na.rm = TRUE)
      ## [1] 8333333336
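      The result 3.5 above is not itself an observation: with an even number of values, the median is the average of the two middle ones. The following minimal sketch computes the median by hand to make this explicit. The helper name manual_median is hypothetical and not part of base R; in practice median() should be used.

```r
# manual_median is a hypothetical helper for illustration only.
manual_median <- function(v) {
  v <- sort(v)                     # sort() drops NAs by default
  n <- length(v)
  if (n == 0) return(NA_real_)     # nothing left to summarise
  if (n %% 2 == 1) {
    v[(n + 1) / 2]                 # odd n: the single middle value
  } else {
    mean(v[c(n / 2, n / 2 + 1)])   # even n: average of the two middle values
  }
}

x <- c(1:5, 5e10, NA)
manual_median(x)                   # 3.5, the average of 3 and 4
manual_median(x) == median(x, na.rm = TRUE)
```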

      8.1.3 The Mode

       mode

       central tendency – mode

      In R, the functions mode() and storage.mode() return a character string describing how a variable is stored. R does not have a standard function to calculate the statistical mode, so let us create our own:

       mode()

       storage.mode()

      # my_mode
      # Finds the first mode (only one)
      # Arguments:
      #   v -- numeric vector or factor
      # Returns:
      #   the first mode
      my_mode <- function(v) {
        uniqv <- unique(v)
        tabv  <- tabulate(match(v, uniqv))
        uniqv[which.max(tabv)]
      }

      # now test this function
      x <- c(1, 2, 3, 3, 4, 5, 60, NA)
      my_mode(x)
      ## [1] 3
      x1 <- c("relevant", "N/A", "undesired", "great", "N/A",
              "undesired", "great", "great")
      my_mode(x1)
      ## [1] "great"

      # text from https://www.r-project.org/about.html
      t <- "R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS."
      v <- unlist(strsplit(t, split = " "))
      my_mode(v)
      ## [1] "and"
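      As a quick check of the two points made so far, the sketch below first shows what the built-in mode() and storage.mode() report, and then cross-checks the result of my_mode() on the character vector with a frequency table built by table(). The variable name counts is introduced here for illustration only.

```r
# mode() and storage.mode() describe the type, not the most frequent value
x <- c(1, 2, 3, 3, 4, 5, 60, NA)
mode(x)          # "numeric"
storage.mode(x)  # "double"

# Cross-check the statistical mode with a frequency table
x1 <- c("relevant", "N/A", "undesired", "great",
        "N/A", "undesired", "great", "great")
counts <- table(x1)                 # frequency of each distinct value
names(counts)[which.max(counts)]    # "great"
```

      Note that table() drops NA values by default and that names() always returns a character string, so this cross-check is not a drop-in replacement for my_mode() on numeric data.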

       unique()

       Linux

       FreeBSD

       tabulate()

       uniqv()

      While this function works fine on the examples provided, it only returns the first mode encountered. In general, however, the mode is not necessarily unique and it might make sense to return them all. This can be done by modifying the code as follows:

      # my_mode
      # Finds the mode(s) of a vector v
      # Arguments:
      #   v          -- numeric vector or factor
      #   return.all -- boolean -- set to true to return all modes
      # Returns:
      #   the modal elements
      my_mode <- function(v, return.all = FALSE) {
        uniqv <- unique(v)
        tabv  <- tabulate(match(v, uniqv))
        if (return.all) {
          uniqv[tabv == max(tabv)]
        } else {
          uniqv[which.max(tabv)]
        }
      }

      # example:
      x <- c(1, 2, 2, 3, 3, 4, 5)
      my_mode(x)
      ## [1] 2
      my_mode(x, return.all = TRUE)
      ## [1] 2 3

      Hint – Use default values to keep code backwards compatible
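      The same idea can be applied once more. The sketch below adds an na.rm argument that defaults to FALSE, so that existing calls keep their behaviour (NA is still counted as a value) while new calls can opt in to dropping missing values. The function name my_mode_na and the na.rm argument are assumptions made for this illustration; they do not appear in the original code.

```r
# my_mode_na is a hypothetical variant of my_mode, for illustration only.
# Arguments:
#   v          -- numeric vector or factor
#   return.all -- boolean -- set to TRUE to return all modes
#   na.rm      -- boolean -- set to TRUE to ignore NAs (new, defaults to FALSE)
my_mode_na <- function(v, return.all = FALSE, na.rm = FALSE) {
  if (na.rm) v <- v[!is.na(v)]      # optionally drop missing values first
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) {
    uniqv[tabv == max(tabv)]
  } else {
    uniqv[which.max(tabv)]
  }
}

x <- c(1, NA, NA, NA, 2, 2)
my_mode_na(x)                 # NA: the default keeps the old behaviour
my_mode_na(x, na.rm = TRUE)   # 2
```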

       measures of spread

      Definition: Variance

$$ \mathrm{VAR}(X) = E\left[\left(X - E[X]\right)^2\right] $$

       variance

      8.2.1 Standard Deviation

      Definition: Standard deviation

$$ \mathrm{SD}(X) = \sqrt{\mathrm{VAR}(X)} $$

       spread – standard deviation

       standard deviation

      The estimator for standard deviation is:

$$ \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} $$

      t <- rnorm(100, mean = 0, sd = 20)  # mean = 0 is the default of rnorm()
      var(t)
      ## [1] 248.2647
      sd(t)
      ## [1] 15.75642
      sqrt(var(t))
      ## [1] 15.75642
      sqrt(sum((t - mean(t))^2) / (length(t) - 1))
      ## [1] 15.75642
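      Because the sample above is random, its output changes on every run. The following sketch repeats the calculation on a small fixed vector (chosen here purely for illustration) so that each step can be verified by hand, and it contrasts division by n with the division by n - 1 that var() and sd() use.

```r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data, mean is 5
n  <- length(x)
ss <- sum((x - mean(x))^2)        # sum of squared deviations: 32

ss / n                            # 4, dividing by n
ss / (n - 1)                      # 4.571429, dividing by n - 1
var(x)                            # same as ss / (n - 1)

sqrt(ss / (n - 1))                # 2.13809
sd(x)                             # same value
```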

       sd()

      8.2.2 Median absolute deviation

      Definition: mad

$$ \mathrm{mad}(X) = 1.4826 \cdot \mathrm{median}\left(\left|X - \mathrm{median}(X)\right|\right) $$

       mad

       median absolute deviation

      mad(t)
      ## [1] 14.54922
      mad(t, constant = 1)
      ## [1] 9.813314

       mad()

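      The default constant 1.4826 (approximately 1/qnorm(3/4)) ensures consistency, that is,

$$ E\left[\mathrm{mad}(X_1, \ldots, X_n)\right] = \sigma $$

      for $X_i$ distributed as $N(\mu, \sigma^2)$ and large $n$.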
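      The scaling constant and the robustness of the median absolute deviation can be checked directly. The sketch below is a minimal illustration on the same outlier-laden vector used for the median (without the NA); the variable name raw_mad is introduced here for illustration only.

```r
# The default scaling constant of mad()
1 / qnorm(3 / 4)                       # 1.482602

x <- c(1:5, 5e10)                      # five ordinary values and one outlier

# Unscaled median absolute deviation, computed by hand
raw_mad <- median(abs(x - median(x)))
raw_mad                                # 1.5
mad(x, constant = 1)                   # 1.5

# Scaled version, as returned by default
1.4826 * raw_mad                       # 2.2239
mad(x)                                 # 2.2239

# The standard deviation, in contrast, is dominated by the outlier
sd(x)
```

      The outlier moves the standard deviation to the order of 1e10, while mad() stays close to the spread of the five ordinary observations.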

       covariation

