The Big R-Book. Philippe J. S. De Brouwer


       stringr expands the standard functions to work with strings and provides a nice coherent set of functions that all start with str_. The package is built on top of stringi, which uses the ICU library that is written in C/C++, so it is fast too. For more information, see Chapter 17.5 “String Manipulation in the tidyverse” on page 299.
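
      As a minimal sketch of the str_ naming convention, the following lines apply a few stringr functions to a small character vector. The vector fruits is only an illustrative assumption and is not part of the book's examples.

```r
library(stringr)

# Hypothetical example data
fruits <- c("apple", "banana", "cherry")

# All functions start with str_ and take the string as first argument
n_chars  <- str_length(fruits)             # number of characters per element
upper    <- str_to_upper(fruits)           # convert to upper case
has_an   <- str_detect(fruits, "an")       # does the element contain "an"?
replaced <- str_replace(fruits, "a", "A")  # replace the first "a" only
```

      Because the string is always the first argument, these functions combine well with the pipe operator.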

       forcats provides tools to address common problems when working with categorical variables.⁷
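
      One such common problem is the order of the factor levels. The sketch below, which uses a made-up factor sizes as an assumption, shows how forcats can reorder levels by frequency or by hand.

```r
library(forcats)

# Hypothetical example data
sizes <- factor(c("small", "large", "small", "medium", "small", "large"))

lev_default <- levels(sizes)               # alphabetical by default
lev_freq    <- levels(fct_infreq(sizes))   # most frequent level first
lev_manual  <- levels(fct_relevel(sizes, "small", "medium", "large"))
```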

      7.2.2 The Non-core Tidyverse

      Besides the core tidyverse packages (the ones loaded with the command library(tidyverse)), there are many other packages that are part of the tidyverse. In this section we briefly describe the most important ones.

       Importing data: readxl (for .xls and .xlsx files) and haven (for SPSS, Stata, and SAS data).⁸
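
      As a hedged illustration of readxl, the sketch below reads one sheet from the example workbook that ships with the package. We assume that this workbook (datasets.xlsx) contains a sheet called iris; with your own data you would replace path by the location of your file.

```r
library(readxl)

# Path to an example workbook that is installed together with readxl
path <- readxl_example("datasets.xlsx")

sheets  <- excel_sheets(path)               # names of the sheets in the file
iris_xl <- read_excel(path, sheet = "iris") # import one sheet as a tibble
```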

       Wrangling data: lubridate for dates and date-times, hms for time-of-day values, and blob for storing binary data. lubridate, for example, is discussed in Chapter 17.6 “Dates with lubridate” on page 314.
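
      To give a first impression of lubridate, the following sketch parses a date, adds a period to it, and extracts some components. The date itself is an arbitrary assumption chosen for illustration.

```r
library(lubridate)

d  <- ymd("2020-02-28")       # parse a year-month-day string into a Date
d2 <- d + days(2)             # date arithmetic respects the leap year 2020
yr <- year(d)                 # extract the year
mo <- month(d)                # extract the month as a number
wd <- wday(d, label = TRUE)   # day of the week as a labelled factor
```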

       Programming: purrr for iterating within R objects, magrittr provides the famous pipe operator, %>%, plus some more specialised piping operators (like %$% and %<>%), and glue provides an enhancement to the paste() function.
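
      The operators and functions mentioned in this item can be sketched as follows. All object names (x, y, means, msg, cars_cor) are assumptions made for this illustration; mtcars is a dataset that comes with base R.

```r
library(magrittr)
library(purrr)
library(glue)

# %>% passes the left-hand side as first argument to the right-hand side
x <- c(1, 4, 9)
total <- x %>% sqrt() %>% sum()       # same as sum(sqrt(x))

# %$% exposes the columns of a data frame to the expression on the right
cars_cor <- mtcars %$% cor(mpg, wt)

# %<>% pipes and assigns the result back to the left-hand side
y <- c(3, 1, 2)
y %<>% sort()

# purrr: apply a function to each element and return a numeric vector
means <- map_dbl(list(a = 1:3, b = 4:6), mean)

# glue: evaluate R expressions between curly braces inside a string
msg <- glue("The mean of a is {means[['a']]}.")
```

      Compared to paste(), glue keeps the template readable because the values are embedded where they appear in the text.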

       Modelling: this part is not really ready yet, though recipes and rsample are already operational and show the direction this is taking. The aim is to replace modelr.⁹ Note that there is also the package broom, which turns models into tidy data.
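
      What “turning models into tidy data” means can be sketched with a simple linear model. The model below (fuel consumption as a function of weight in the mtcars data) is only an assumption for illustration.

```r
library(broom)

# Fit an ordinary linear regression on a dataset that ships with R
fit <- lm(mpg ~ wt, data = mtcars)

coefs   <- tidy(fit)    # one row per coefficient (term, estimate, ...)
summary <- glance(fit)  # one row with model-level statistics
```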

      Warning – Work in progress

      While the core tidyverse is stable, the packages that are not core still tend to change and improve. Check their online documentation when using them.

      7.3.1 Tibbles

      x <- seq(from = 0, to = 2 * pi, length.out = 100)
      s <- sin(x)
      c <- cos(x)
      z <- s + c
      plot(x, z, type = "l", col = "red", lwd = 7)
      lines(x, c, col = "blue", lwd = 1.5)
      lines(x, s, col = "darkolivegreen", lwd = 1.5)

      Figure: The sum of sine and cosine illustrated.

      Imagine further that our purpose is not only to plot these functions, but to use them in other applications. Then it would make sense to put them in a data frame. The following code does exactly the same using a data frame.

      x <- seq(from = 0, to = 2 * pi, length.out = 100)
      # Bind the vectors as columns (cbind, not rbind), so that each function gets its own column:
      df <- cbind(as.data.frame(x), cos(x), sin(x), cos(x) + sin(x))
      # plot etc.

      This is already more concise. With the tidyverse, it would look as follows (still without using the pipe):

      library(tidyverse)
      x <- seq(from = 0, to = 2 * pi, length.out = 100)
      tb <- tibble(x, sin(x), cos(x), cos(x) + sin(x))

      Figure: A tibble plots itself like a data frame.

      The code with a tibble is just a notch shorter, but that is not the point here. The main advantage of using a tibble is that it will usually do things that make more sense for the modern R-user. For example, consider how a tibble prints itself (compared to what a data frame does).

      # Note how concise and relevant the output is:
      print(tb)
      ## # A tibble: 100 x 4
      ##         x `sin(x)` `cos(x)` `cos(x) + sin(x)`
      ##     <dbl>    <dbl>    <dbl>             <dbl>
      ##  1 0        0         1                  1
      ##  2 0.0635   0.0634    0.998              1.06
      ##  3 0.127    0.127     0.992              1.12
      ##  4 0.190    0.189     0.982              1.17
      ##  5 0.254    0.251     0.968              1.22
      ##  6 0.317    0.312     0.950              1.26
      ##  7 0.381    0.372     0.928              1.30
      ##  8 0.444    0.430     0.903              1.33
      ##  9 0.508    0.486     0.874              1.36
      ## 10 0.571    0.541     0.841              1.38
      ## # … with 90 more rows

      # This does the same as for a data-frame:
      plot(tb)

      # Actually a tibble will still behave as a data frame:
      is.data.frame(tb)
      ## [1] TRUE

      Digression – Special characters in column names

      tb$`sin(x)`[1]
      ## [1] 0

      This convention is not specific to tibbles; it is used throughout R (e.g. the same back-ticks are needed in ggplot2, tidyr, dplyr, etc.).
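
      As a small sketch of the same back-ticks in another package, the code below uses dplyr to filter and select on a column with a non-standard name. The object res and the threshold 0.5 are assumptions chosen for illustration.

```r
library(tibble)
library(dplyr)

x  <- seq(from = 0, to = 2 * pi, length.out = 100)
tb <- tibble(x, sin(x), cos(x))

# Back-ticks allow dplyr verbs to refer to the column named sin(x)
res <- tb %>%
  filter(`sin(x)` > 0.5) %>%
  select(x, `sin(x)`)
```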

      Hint

      tb <- tibble(`1` = 1:3, `2` = sin(`1`), `1`*pi, 1*pi)
      tb
      ## # A tibble: 3 x 4
      ##     `1`   `2` `\`1\` * pi` `1 * pi`
      ##   <int> <dbl>        <dbl>    <dbl>
      ## 1     1 0.841         3.14     3.14
      ## 2     2 0.909         6.28     3.14
      ## 3     3 0.141         9.42     3.14

      However, is this good practice?

      So, why use a tibble instead of a data frame?

      1 It will do fewer things: it does not change strings into factors, create row names, or change the names of variables, and it does no partial matching but gives a warning message when you try to access a column that does not exist, etc.

      2 A tibble will report more errors instead of doing something silently (data type conversions, import, etc.), so it is safer to use.

      3 The specific print method for the tibble will not overrun your screen with thousands of lines; it reports only the first ten. If
