The Big R-Book. Philippe J. S. De Brouwer

clean and readable for humans. For example, prefer meaningful but long variable names over short but meaningless ones, be considerate towards people using auto-complete in RStudio (so put the identifying part in the first and not the last letters of a function name), etc.

The tidyverse is in permanent development, as are core R itself and many other packages. For further and the most up-to-date information, we refer to the website of the tidyverse: http://tidyverse.tidyverse.org.

Tidy Data

Tidy data is in essence data that is easy for people to understand and that is formatted and structured with the following rules in mind:

1. a tibble/data-frame for each dataset,

2. a column for each variable,

3. a row for each observation,

4. a value (or NA) in each cell (a “cell” is the intersection between row and column).

The concept of tidy data is so important that we will devote a whole section to tidy data (Section 17.2 “Tidy Data” on page 275) and how to make data tidy (Chapter 17 “Data Wrangling in the tidyverse” on page 265). For now, it is sufficient to have the previous rules in mind. This will allow us to introduce the tools of the tidyverse first and then later come back to making data tidy by using these tools.
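As a small illustration of the four rules, the following base-R sketch stores the same figures twice: once with a variable hidden in the column names, and once in tidy form. The data and all object and column names are made up for this example.

```r
# Untidy: the variable "year" is hidden in the column names,
# so each row holds two observations.
sales_wide <- data.frame(
  shop  = c("A", "B"),
  y2019 = c(10, 20),
  y2020 = c(12, NA),
  stringsAsFactors = FALSE
)

# Tidy: one column per variable (shop, year, sales),
# one row per observation, and a value or NA in every cell.
sales_tidy <- data.frame(
  shop  = rep(sales_wide$shop, times = 2),
  year  = rep(c(2019, 2020), each = nrow(sales_wide)),
  sales = c(sales_wide$y2019, sales_wide$y2020),
  stringsAsFactors = FALSE
)

print(sales_tidy)
```

Here the tidy version is built by hand to keep the sketch free of dependencies; the tidyverse tools introduced later automate this kind of reshaping.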

Tidy Conventions

The tidyverse also enforces some rules to keep code tidy. The aims are to make code easier to read, to reduce potential misunderstandings, etc.

For example, we remember from Chapter 6.2 “S3 Objects” on page 91 the convention that R uses to implement its S3 object-oriented programming framework. In that section we explained how R finds, for example, the right method (function) to use when printing an object via the generic dispatcher function print(). When an object of class “glm” is passed to print(), the function will dispatch the handling to the function print.glm().
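The dispatch mechanism can be observed with a few lines of base R. The class “account” and its fields below are hypothetical and serve only as an illustration; getS3method() is the utils function that looks up a registered method.

```r
# A hypothetical class "account" with its own print method.
print.account <- function(x, ...) {
  cat("Account of", x$owner, "with balance", x$balance, "\n")
  invisible(x)
}

acc <- structure(list(owner = "Ann", balance = 100), class = "account")
print(acc)  # print() dispatches to print.account()

# The methods named in the text can be looked up in the same way.
glm_printer <- getS3method("print", "glm")
df_printer  <- getS3method("print", "data.frame")
```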

However, this is also true for data-frames: the handling is dispatched to print.data.frame(). This example illustrates how at this point it becomes unclear whether the function print.data.frame() is the specific case of the print() function for a data.frame, or the special case to print a “frame” in the framework of a call to “print.data()”. Therefore, the tidyverse recommends naming conventions that avoid the dot (.) and use the snake_style or UpperCase style instead.
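The ambiguity is not only cosmetic. In the sketch below, a function that was meant as an ordinary helper is picked up as an S3 method simply because of the dot in its name. The names summary.report, summary_report and the class “report” are invented for this illustration.

```r
# Written as an ordinary helper, but the dot makes it look like an S3 method.
summary.report <- function(object, ...) "summary.report() was called"

# A hypothetical object that happens to have the class "report".
obj <- structure(list(), class = "report")

# The generic summary() now uses the helper as its method for class "report".
res <- summary(obj)
print(res)

# A snake_case name cannot be mistaken for a method of summary().
summary_report <- function(object) "summary_report() was called"
```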

Further information – Tidyverse philosophy

More about programming style in the tidyverse can be found in the online manifesto of the tidyverse website: https://tidyverse.tidyverse.org/articles/manifesto.html.

7.2. Packages in the Tidyverse

Loading the tidyverse will report on which packages are included:

# we assume that you installed the package before:
# install.packages("tidyverse")
# so load it:
library(tidyverse)
## - Attaching packages ----------- tidyverse 1.3.0 -
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## - Conflicts ------------- tidyverse_conflicts() -
## x purrr::compose() masks pryr::compose()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x purrr::partial() masks pryr::partial()

So, loading the tidyverse package actually loads a series of other packages. This collection of packages is called the “core-tidyverse.”

Further, loading tidyverse also informs you about potential conflicts. For example, we see that calling the function filter() will now resolve to dplyr::filter() (i.e. “the function filter in the package dplyr”), while before loading tidyverse, the function stats::filter() would have been called.⁴

Digression – Calling methods of not loaded packages

When a package is not loaded, it is still possible to call its member functions. To call a function from a certain package, we can use the :: operator.

In other words, when we use the :: operator, we specify in which package the function should be found. Therefore, it is possible to use a function from a package that is not loaded, or one that is superseded by a function with the same name from a package that was loaded later.
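A minimal sketch of both uses, relying only on packages that ship with R (the file name and the numbers are arbitrary examples):

```r
# 1. Use a function from a package that is not attached:
#    tools is installed with R, but not loaded by default.
ext <- tools::file_ext("report.csv")
print(ext)

# 2. Be explicit about which filter() is meant: this is always the
#    time-series filter from stats, even when dplyr has been attached
#    later and masks filter().
ma <- stats::filter(1:5, rep(1/3, 3))
print(as.numeric(ma))
```

The second call computes a centred moving average over three values, so the first and the last element are NA.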

R allows you to stand on the shoulders of giants: when making your analysis, you can rely on existing packages. It is best to use packages that are part of the tidyverse whenever there is a choice. Doing so makes your code more consistent and readable, and working with R becomes an overall more satisfying experience.

7.2.1 The Core Tidyverse

The core tidyverse includes some packages that are commonly used in data wrangling and modelling. Here is already a short word of explanation for each; later we will explore some of those packages in more detail.

tidyr provides a set of functions that help you tidy up data and make adhering to the rules of tidy data easier. The idea of tidy data is really simple: it is data where every variable has its own column, and every column is a variable. For more information, see Chapter 17.3 “Tidying Up Data with tidyr” on page 277.

dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. For more information, see Chapter 17 “Data Wrangling in the tidyverse” on page 265.

ggplot2 is a system to create graphics with a philosophy: it adheres to a “Grammar of Graphics” and is able to create really stunning results at a reasonable price (it is a notch more abstract to use than the core-R functionality). For both reasons, we will talk more about it in the sections about reporting: see Chapter 31 “A Grammar of Graphics with ggplot2” on page 687.

readr expands R's standard⁵ functionality to read in rectangular⁶ data. It is more robust, knows more data types and is faster than the core-R functionality. For more information, see Chapter 17.1.2 “Importing Flat Files in the Tidyverse” on page 267 and its subsections.

purrr is casually mentioned in the section about the OO model in R (see Chapter 6 on page 87), and extensively used in Chapter 25.1 “Model Quality Measures” on page 476. It is a rather complete and consistent set of tools for working with functions and vectors. Using purrr, it should be possible to replace most loops with calls to purrr functions that will work faster.

tibble is a new take on the data frame of core-R. It provides a new base type: tibbles. Tibbles are in essence data frames that do a little less (so there is less clutter on the screen and fewer unexpected things happen), but give more feedback (they show what went wrong instead of assuming that you have read all manuals and remember everything).
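Two of the descriptions above can be tried out directly. The sketch below assumes that the packages purrr and tibble are installed; the data and object names are invented for the illustration. It first replaces a simple loop with a purrr call, and then shows one way in which a tibble “does less” than a data frame: selecting a single column never silently drops to a vector.

```r
library(purrr)
library(tibble)

# A loop that squares each element ...
squares_loop <- numeric(3)
for (i in 1:3) {
  squares_loop[i] <- i^2
}

# ... and the equivalent purrr call, which returns a numeric vector.
squares_map <- map_dbl(1:3, function(i) i^2)

# The same data as a core-R data frame and as a tibble.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- tibble(x = 1:3, y = c("a", "b", "c"))

# Selecting one column: the data frame is simplified to a vector,
# the tibble stays a tibble.
col_df <- df[, 1]
col_tb <- tb[, 1]

print(class(col_df))
print(class(col_tb))
```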
