The Big R-Book. Philippe J. S. De Brouwer

dnorm(x,mean(SP500),sd(SP500)),col="blue",lwd=2)

A better way to check for normality is to study the Q-Q plot. A Q-Q plot compares the sample quantiles with the quantiles of a theoretical distribution (here the Normal distribution), and it makes very clear where deviations appear.

library(MASS)

qqnorm(SP500,col="red"); qqline(SP500,col="blue")

From the Q-Q plot in Figure 8.3 on page 153 (generated by the preceding code block), it is clear that the returns of the S&P-500 index are not Normally distributed. Outliers far from the mean appear much more often than the Normal distribution would predict. In other words: returns on stock exchanges have “fat tails.”

Figure 8.3: A Q-Q plot is a good way to judge if a set of observations is normally distributed or not.
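The fat-tail observation can also be checked numerically. The following is a minimal sketch, assuming the SP500 data from the MASS package used above; the variable names (tail_empirical, tail_normal, comparison) and the choice of a 3-standard-deviation cutoff are illustrative assumptions, not part of the original text.

```r
library(MASS)  # provides the SP500 data set used in the text

m <- mean(SP500)
s <- sd(SP500)

# Share of observations more than 3 standard deviations from the mean
tail_empirical <- mean(abs(SP500 - m) > 3 * s)

# The same share under a Normal distribution: P(|Z| > 3)
tail_normal <- 2 * pnorm(-3)

# Compare a few sample quantiles with the matching Normal quantiles,
# which is what the Q-Q plot does for every observation
p <- c(0.001, 0.01, 0.5, 0.99, 0.999)
comparison <- data.frame(
  p      = p,
  sample = quantile(SP500, p),
  normal = qnorm(p, mean = m, sd = s)
)

print(c(empirical = tail_empirical, normal = tail_normal))
print(comparison)
```

If the returns were Normally distributed, the two shares would be close and the extreme sample quantiles would roughly match the Normal ones.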

8.4.2 Binomial Distribution

The Binomial distribution models the number of successes in a fixed number of independent trials, each of which has only two possible outcomes. For example, the probability of finding exactly 6 heads in 10 tosses of a coin is given by the binomial distribution.
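As a quick check of the coin example, the probability of exactly 6 heads in 10 tosses can be computed from the binomial formula, choose(n, k) * p^k * (1 - p)^(n - k), and compared with R's dbinom(). This is a minimal sketch that assumes a fair coin; the variable names are illustrative.

```r
n <- 10   # number of tosses
k <- 6    # number of heads we are interested in
p <- 0.5  # probability of heads (assumption: a fair coin)

# Binomial formula written out by hand
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same probability from the built-in function
built_in <- dbinom(k, size = n, prob = p)

print(by_hand)
## [1] 0.2050781
print(all.equal(by_hand, built_in))
## [1] TRUE
```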

The Binomial Distribution in R

As for all distributions, R has four built-in functions for the binomial distribution:

dbinom(x, size, prob): The probability density (mass) function

pbinom(q, size, prob): The cumulative probability of an event

qbinom(p, size, prob): Gives a number whose cumulative value matches a given probability value

rbinom(n, size, prob): Generates random variables following the binomial distribution.

The following parameters are used:

x, q: A vector of numbers (quantiles)

p: A vector of probabilities

n: The number of observations

size: The number of trials

prob: The probability of success of each trial
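The four functions are closely related, and a short sketch can make the relationships concrete. The example below assumes 10 tosses of a fair coin; all variable names and the seed are illustrative choices.

```r
size <- 10   # number of trials (coin tosses)
prob <- 0.5  # probability of success in each trial
x <- 0:size  # every possible number of successes

point_prob <- dbinom(x, size, prob)  # P(X = x)
cum_prob   <- pbinom(x, size, prob)  # P(X <= x)

# The cumulative probabilities are the running sum of the point probabilities
print(all.equal(cumsum(point_prob), cum_prob))

# qbinom() returns the smallest value whose cumulative probability
# reaches the requested level
q25 <- qbinom(0.25, size, prob)
print(c(below = pbinom(q25 - 1, size, prob), at = pbinom(q25, size, prob)))

# rbinom() draws random outcomes, each between 0 and size
set.seed(123)  # arbitrary seed, only to make the run repeatable
draws <- rbinom(5, size, prob)
print(draws)
```

The printed pair shows that the cumulative probability just below q25 is still under 0.25, while at q25 it is at or above 0.25.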

An Example of the Binomial Distribution

The example below illustrates the binomial distribution and generates the plot in Figure 8.4.

Figure 8.4: The probability of getting at most x tails when flipping a fair coin, illustrated with the binomial distribution.

# Probability of getting 5 or fewer heads from 10 tosses of
# a coin.
pbinom(5,10,0.5)
## [1] 0.6230469

# visualize this for 1 to 10 tails
x <- 1:10
y <- pbinom(x,10,0.5)
plot(x,y,type="b",col="blue", lwd=3,
     xlab="Number of tails",
     ylab="prob of maximum x tails",
     main="Ten tosses of a coin")

# How many heads should we at least expect (with a probability
# of 0.25) when a coin is tossed 10 times.
qbinom(0.25,10,1/2)
## [1] 4

Similar to the Normal distribution, random draws of the Binomial distribution can be obtained via a function that starts with the letter ‘r’: rbinom().

# Find 20 random numbers of tails from an event of 10 tosses
# of a coin
rbinom(20,10,.5)
## [1] 5 7 2 6 7 4 6 7 3 2 5 9 5 9 5 5 5 5 5 6
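With many more draws, the simulated values should approach the theoretical distribution. The sketch below is one way to check this, assuming 100 000 draws and an arbitrary seed; the names draws, empirical and theoretical are illustrative.

```r
set.seed(42)  # arbitrary seed, only to make the run repeatable

size <- 10
prob <- 0.5
draws <- rbinom(100000, size, prob)

# Sample mean and variance should be close to the theoretical values
# size * prob = 5 and size * prob * (1 - prob) = 2.5
print(c(mean = mean(draws), var = var(draws)))

# Relative frequency of each outcome next to its theoretical probability
empirical   <- as.numeric(table(factor(draws, levels = 0:size))) / length(draws)
theoretical <- dbinom(0:size, size, prob)
print(round(cbind(outcome = 0:size, empirical, theoretical), 4))
```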

8.5 Creating an Overview of Data Characteristics

In Chapter 4 “The Basics of R” on page 21, we presented some of the basic functions of R that – of course – include some of the most important functions to describe data (such as mean and standard deviation).

Mileage may vary, but in many research projects people want to document what they have done and will need to include some summary statistics in their paper or model documentation. The standard summary of the relevant object might be sufficient.

N <- 100
t <- data.frame(id = 1:N, result = rnorm(N))
summary(t)
##        id             result
##  Min.   :  1.00   Min.   :-1.8278
##  1st Qu.: 25.75   1st Qu.:-0.5888
##  Median : 50.50   Median :-0.0487
##  Mean   : 50.50   Mean   :-0.0252
##  3rd Qu.: 75.25   3rd Qu.: 0.4902
##  Max.   :100.00   Max.   : 2.3215

This already produces a neat summary that can directly be used in most reports.²
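When a report needs statistics that summary() does not show (for example the standard deviation), a small helper can build a custom overview. The following is a sketch in base R only; the function describe() and the object names dat and overview are hypothetical, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable
N <- 100
dat <- data.frame(id = 1:N, result = rnorm(N))

# Hypothetical helper: a handful of descriptive statistics for one column
describe <- function(v) {
  c(n      = length(v),
    mean   = mean(v),
    sd     = sd(v),
    min    = min(v),
    median = median(v),
    max    = max(v))
}

# Apply the helper to every column; transpose so that each row is a variable
overview <- t(sapply(dat, describe))
print(round(overview, 3))
```

The result is a plain numeric matrix with one row per column of the data frame, which can be passed on to whatever table-formatting tool the report uses.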

Note – A tibble is a special form of data-frame

A tibble and data frame will produce the same summaries.

We might want to produce some specific information that somehow follows the format of the table. To illustrate this, we start from the dataset mtcars and assume that we want to make a summary per brand for the top brands (defined as those appearing most frequently in our database).

library(tidyverse) # not only for %>% but also for group_by, etc.

# In mtcars the type of the car is only in the row names,
# so we need to extract it to add it to the data
n <- rownames(mtcars)

# Now, add a column brand (use the first letters of the type)
t <- mtcars %>%
  mutate(brand = str_sub(n, 1, 4)) # add column

To achieve this, the function group_by() from dplyr will be very handy. Note that this function does not change the dataset as such; rather, it adds a layer of information about the grouping.

# First, we need to find out which are the most abundant brands
# in our dataset (set cutoff at 2: at least 2 cars in database)
top_brands <- count(t, brand)
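The remark that group_by() only adds grouping information can be illustrated with a small, self-contained sketch. This is not the author's continuation of the code above but an independent example under stated assumptions: it uses dplyr directly, derives brand with base R's substr(), and the names cars, grouped and per_brand are hypothetical.

```r
library(dplyr)

# Add a brand column from the first four letters of the row names
cars <- mtcars %>%
  mutate(brand = substr(rownames(mtcars), 1, 4))

# group_by() leaves the values untouched and only records the grouping
grouped <- cars %>% group_by(brand)
print(is_grouped_df(grouped))  # TRUE: the object now carries grouping metadata
print(group_vars(grouped))     # the grouping variable
print(nrow(grouped) == nrow(cars) && all(grouped$mpg == cars$mpg))

# The grouping takes effect once a summary is computed per group
per_brand <- grouped %>%
  summarise(n_cars = n(), avg_mpg = mean(mpg), .groups = "drop") %>%
  filter(n_cars >= 2) %>%   # assumed cutoff: at least 2 cars per brand
  arrange(desc(n_cars))
print(per_brand)
```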
