The Big R-Book. Philippe J. S. De Brouwer
dnorm(x, mean(SP500), sd(SP500)), col = "blue", lwd = 2)
A better way to check for normality is to study the Q-Q plot. A Q-Q plot compares the sample quantiles with the quantiles of the distribution and it makes very clear where deviations appear.
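library(MASS)                 # provides the SP500 dataset
qqnorm(SP500, col = "red")    # sample quantiles against Normal quantiles
qqline(SP500, col = "blue")   # reference line through the quartiles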
From the Q-Q plot in Figure 8.3 on page 153 (generated by the preceding code block), it is clear that the returns of the S&P-500 index are not Normally distributed. Outliers far from the mean appear much more often than the Normal distribution would predict. In other words: returns on stock exchanges have "fat tails."
Figure 8.3: A Q-Q plot is a good way to judge if a set of observations is normally distributed or not.
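To make the construction of a Q-Q plot concrete, the following minimal sketch builds the plotted pairs by hand. It uses simulated data instead of the S&P-500 returns: the variable names, the seed, and the choice of a t-distribution with 3 degrees of freedom are assumptions made for illustration only. The sketch relies on the fact that qqnorm() chooses its probabilities with ppoints().

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- rt(500, df = 3)              # simulated heavy-tailed "returns" (assumption)

# Theoretical Normal quantiles at the plotting positions used by qqnorm()
theo <- qnorm(ppoints(length(x)))
# The sample quantiles are simply the sorted observations
samp <- sort(x)

# qqnorm() can return the same pairs without plotting
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theo)       # compare the theoretical quantiles
all.equal(sort(qq$y), samp)       # compare the sample quantiles

# Draw the Q-Q plot from the hand-made pairs
plot(theo, samp, col = "red",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x, col = "blue")           # line through the first and third quartiles
```

Points that bend away from the line at both ends indicate tails that are heavier than those of the Normal distribution.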
8.4.2 Binomial Distribution
The Binomial distribution models the number of successes in a fixed number of independent trials, each of which has only two possible outcomes. For example, the probability of finding exactly 6 heads when tossing a coin 10 times is given by the binomial distribution.
The Binomial Distribution in R
As for all distributions, R has four built-in functions for the binomial distribution:
dbinom(x, size, prob): The density function (the probability of exactly x successes)
pbinom(x, size, prob): The cumulative probability of an event (the probability of at most x successes)
qbinom(p, size, prob): Gives the smallest number whose cumulative probability is at least the given probability p
rbinom(n, size, prob): Generates random draws following the binomial distribution
The following parameters are used:
x: A vector of numbers
p: A vector of probabilities
n: The number of observations
size: The number of trials
prob: The probability of success of each trial
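The four functions can be tried side by side. The following is a minimal sketch for the coin example (10 tosses of a fair coin); the variable names and the seed are assumptions chosen for illustration.

```r
size <- 10    # number of coin tosses
prob <- 0.5   # fair coin

# P(exactly 6 heads), i.e. choose(10, 6) / 2^10
p_exact <- dbinom(6, size, prob)

# P(at most 6 heads): the sum of the densities for 0, 1, ..., 6 heads
p_at_most <- pbinom(6, size, prob)
all.equal(p_at_most, sum(dbinom(0:6, size, prob)))

# Smallest number of heads whose cumulative probability reaches 0.8
q_80 <- qbinom(0.8, size, prob)

# Five simulated experiments of 10 tosses each (values depend on the seed)
set.seed(42)
draws <- rbinom(5, size, prob)
```

Here dbinom() returns 210/1024, or about 0.205, for exactly 6 heads, and pbinom() accumulates the densities from 0 up to the given value.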
An Example of the Binomial Distribution
The example below illustrates the binomial distribution and generates the plot in Figure 8.4.
Figure 8.4: The probability of getting at most x tails when flipping a fair coin, illustrated with the binomial distribution.
# Probability of getting 5 or fewer heads from 10 tosses of
# a coin.
pbinom(5, 10, 0.5)
## [1] 0.6230469

# Visualize this for 1 to 10 tails
x <- 1:10
y <- pbinom(x, 10, 0.5)
plot(x, y, type = "b", col = "blue", lwd = 3,
     xlab = "Number of tails",
     ylab = "prob of maximum x tails",
     main = "Ten tosses of a coin")

# How many heads should we at least expect (with a probability
# of 0.25) when a coin is tossed 10 times.
qbinom(0.25, 10, 1/2)
## [1] 4
Similar to the Normal distribution, random draws of the Binomial distribution can be obtained via a function that starts with the letter 'r': rbinom().
# Find 20 random numbers of tails from an event of 10 tosses
# of a coin
rbinom(20, 10, .5)
## [1] 5 7 2 6 7 4 6 7 3 2 5 9 5 9 5 5 5 5 5 6
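Because rbinom() draws random numbers, the output will differ from run to run. A minimal sketch of how the draws can be made reproducible is given below; the seed value and the variable name are arbitrary assumptions.

```r
set.seed(2024)                 # arbitrary seed (assumption) to reproduce the draws
tails <- rbinom(20, 10, 0.5)   # 20 experiments of 10 tosses each

# Re-using the same seed reproduces exactly the same draws
set.seed(2024)
identical(tails, rbinom(20, 10, 0.5))

# The sample mean should be close to the theoretical mean size * prob = 5
mean(tails)
```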
8.5. Creating an Overview of Data Characteristics
In Chapter 4 "The Basics of R" on page 21, we presented some of the basic functions of R, which of course include some of the most important functions to describe data (such as mean and standard deviation).
Mileage may vary, but in many research projects people want to document what they have done and will need to include some summary statistics in their paper or model documentation. The standard summary of the relevant object, obtained with summary(), might be sufficient.
N <- 100
t <- data.frame(id = 1:N, result = rnorm(N))
summary(t)
##        id             result
##  Min.   :  1.00   Min.   :-1.8278
##  1st Qu.: 25.75   1st Qu.:-0.5888
##  Median : 50.50   Median :-0.0487
##  Mean   : 50.50   Mean   :-0.0252
##  3rd Qu.: 75.25   3rd Qu.: 0.4902
##  Max.   :100.00   Max.   : 2.3215
This already produces a neat summary that can directly be used in most reports. (A tibble and a data frame will produce the same summaries.)
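If only some of these statistics are needed, they can also be computed one by one. The following minimal sketch reproduces the six numbers of summary() for a numeric vector; the data and the variable names are hypothetical.

```r
x <- c(1, 2, 3, 4, 10)   # hypothetical observations

by_hand <- c(min(x),
             quantile(x, 0.25),   # first quartile
             median(x),
             mean(x),
             quantile(x, 0.75),   # third quartile
             max(x))

# Same six numbers as summary(x), in the same order
all.equal(unname(by_hand), as.numeric(summary(x)))
```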
We might want to produce some specific information that somehow follows the format of the table. To illustrate this, we start from the dataset mtcars and assume that we want to make a summary per brand for the top brands (defined as the most frequently appearing in our database).
library(tidyverse)  # not only for %>% but also for group_by, etc.

# In mtcars the type of the car is only in the row names,
# so we need to extract it to add it to the data
n <- rownames(mtcars)

# Now, add a column brand (use the first letters of the type)
t <- mtcars %>%
  mutate(brand = str_sub(n, 1, 4))  # add column
To achieve this, the function group_by() from dplyr will be very handy. Note that this function does not change the dataset as such; it rather adds a layer of information about the grouping.
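The claim that group_by() only adds grouping information can be checked on a small example. The sketch below uses a hypothetical toy data frame (not the mtcars data), and its variable names are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data (not the mtcars data used in the text)
cars <- data.frame(brand = c("A", "A", "B"),
                   mpg   = c(20, 30, 15))

grouped <- group_by(cars, brand)

# The values are untouched; only grouping metadata has been added
group_vars(grouped)                       # names of the grouping columns
all.equal(as.data.frame(ungroup(grouped)), cars)

# Functions such as summarise() then work per group
per_brand <- summarise(grouped, n = n(), mean_mpg = mean(mpg))
```

The grouping only takes effect once a function such as summarise() is applied to the grouped object.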
# First, we need to find out which are the most abundant brands
# in our dataset (set cutoff at 2: at least 2 cars in database)
top_brands <- count(t, brand)