The Big R-Book. Philippe J. S. De Brouwer
    ## 3 Paula Female    92  26
    ## 4  Lisa Female    89  30
    ## 5 Laura Female    84  35

    # Get the last rows:
    tail(data_test)
    ##    Name Gender Score Age
    ## 1 Piotr   Male    78  42
    ## 2 Pawel   Male    88  38
    ## 3 Paula Female    92  26
    ## 4  Lisa Female    89  30
    ## 5 Laura Female    84  35

    # Extract columns 2 and 4 and keep all rows
    data_test.1 <- data_test[,c(2,4)]
    print(data_test.1)
    ##   Gender Age
    ## 1   Male  42
    ## 2   Male  38
    ## 3 Female  26
    ## 4 Female  30
    ## 5 Female  35

    # Extract columns 2 and 4 and keep only rows 2 to 4
    data_test[c(2:4),c(2,4)]
    ##   Gender Age
    ## 2   Male  38
    ## 3 Female  26
    ## 4 Female  30
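The extractions above address columns by position. As a minimal sketch, the same selections can be written with column names, which keeps working if the column order changes. The definition of data_test below is an assumption, rebuilt from the values printed above; the names by_name, sub_name and over_85 are purely illustrative.

```r
# Assumed reconstruction of data_test from the printed output above
data_test <- data.frame(
  Name   = c("Piotr", "Pawel", "Paula", "Lisa", "Laura"),
  Gender = c("Male", "Male", "Female", "Female", "Female"),
  Score  = c(78, 88, 92, 89, 84),
  Age    = c(42, 38, 26, 30, 35)
)

# Select columns by name instead of by position, keeping all rows
by_name <- data_test[, c("Gender", "Age")]

# Same cells as data_test[c(2:4), c(2, 4)]
sub_name <- data_test[2:4, c("Gender", "Age")]

# Rows can also be selected with a logical condition
over_85 <- data_test[data_test$Score > 85, c("Name", "Score")]
print(over_85)
```

Selecting by name is usually the more robust choice in scripts, since inserting a column does not silently shift the meaning of an index.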
Before R 4.0.0, the default behaviour of data.frame() was to convert strings to factors. Decades ago this was useful for performance reasons, but it is now usually unwanted, and since R 4.0.0 the default is stringsAsFactors = FALSE. On older versions, or to be explicit, pass stringsAsFactors = FALSE to the data.frame() function.
    d <- data.frame(
      Name   = c("Piotr", "Pawel", "Paula", "Lisa", "Laura"),
      Gender = c("Male", "Male", "Female", "Female", "Female"),
      Score  = c(78, 88, 92, 89, 84),
      Age    = c(42, 38, 26, 30, 35),
      stringsAsFactors = FALSE
      )
    d$Gender <- factor(d$Gender)  # manually factorize gender
    str(d)
    ## 'data.frame':    5 obs. of  4 variables:
    ##  $ Name  : chr  "Piotr" "Pawel" "Paula" "Lisa" ...
    ##  $ Gender: Factor w/ 2 levels "Female","Male": 2 2 1 1 1
    ##  $ Score : num  78 88 92 89 84
    ##  $ Age   : num  42 38 26 30 35
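When a data frame has already been created with factor columns, it can help to check which columns are factors and convert only the unwanted ones back. The following is a minimal sketch on a small, made-up data frame; stringsAsFactors = TRUE is set on purpose to reproduce the older default, and the names d_old and is_fac are illustrative only.

```r
# Force the pre-4.0.0 default so that both string columns become factors
d_old <- data.frame(
  Name   = c("Piotr", "Pawel", "Paula"),
  Gender = c("Male", "Male", "Female"),
  stringsAsFactors = TRUE
)

# Which columns are factors?
is_fac <- sapply(d_old, is.factor)
print(is_fac)

# Convert Name back to character and keep Gender as a factor
d_old$Name <- as.character(d_old$Name)
str(d_old)
```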
4.3.8.3 Editing Data in a Data Frame
One usually reads in large amounts of data, and an IDE such as RStudio makes it easy to view and manually modify data frames. It is still useful to know how to do this when no graphical interface is available: the functions de(), data.entry(), and edit() ship with R, so they are also at hand when working on a server.
    de(x)                 # fails if x is not defined
    de(x <- c(NA))        # works
    x <- de(x <- c(NA))   # will also save the changes
    data.entry(x)         # de is short for data.entry
    x <- edit(x)          # use the standard editor (vi in *nix)
Of course, there are also multiple ways to address data directly in R.
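    # Both lines assign the value 80, but to different cells:
    data_test$Score[1] <- 80  # Score of row 1 (same as data_test[1,3] <- 80)
    data_test[3,1] <- 80      # row 3, column 1 (Name); Name is a factor here,
                              # so 80 is not a valid level and the cell becomes NA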
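To make the addressing schemes concrete, here is a small sketch showing four ways to write to the same cell. The data frame scores and the copies a, b, c2 and d2 are hypothetical and exist only for this comparison.

```r
# Small illustrative data frame (not the book's data_test)
scores <- data.frame(
  Name  = c("Piotr", "Pawel", "Paula"),
  Score = c(78, 88, 92),
  stringsAsFactors = FALSE
)

# Four ways to set the Score in the first row to 80
a  <- scores; a$Score[1]       <- 80   # column by $, then element
b  <- scores; b[1, "Score"]    <- 80   # row index, column name
c2 <- scores; c2[1, 2]         <- 80   # row index, column index
d2 <- scores; d2[["Score"]][1] <- 80   # column by [[ ]], then element

print(a)
```

In the index forms the row comes first and the column second, so [3, 1] addresses the first column of the third row, not the third column of the first row.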
4.3.8.4 Modifying Data Frames
Add Columns to a Data-frame
Typically, the variables are in the columns, and adding a column corresponds to adding a newly observed variable. This is done by assigning to a new column name or via the function cbind().
    # Expand the data frame, simply define the additional column:
    data_test$End_date <- as.Date(c("2014-03-01", "2017-02-13",
                          "2014-10-10", "2015-05-10", "2010-08-25"))
    print(data_test)
    ##    Name Gender Score Age   End_date
    ## 1 Piotr   Male    80  42 2014-03-01
    ## 2 Pawel   Male    88  38 2017-02-13
    ## 3  <NA> Female    92  26 2014-10-10
    ## 4  Lisa Female    89  30 2015-05-10
    ## 5 Laura Female    84  35 2010-08-25

    # Or use the function cbind() to combine data frames along columns:
    Start_date <- as.Date(c("2012-03-01", "2013-02-13",
                  "2012-10-10", "2011-05-10", "2001-08-25"))

    # Use this vector directly:
    df <- cbind(data_test, Start_date)
    print(df)
    ##    Name Gender Score Age   End_date Start_date
    ## 1 Piotr   Male    80  42 2014-03-01 2012-03-01
    ## 2 Pawel   Male    88  38 2017-02-13 2013-02-13
    ## 3  <NA> Female    92  26 2014-10-10 2012-10-10
    ## 4  Lisa Female    89  30 2015-05-10 2011-05-10
    ## 5 Laura Female    84  35 2010-08-25 2001-08-25

    # or first convert the vector to a one-column data frame
    # and bind that data frame:
    df <- data.frame(Start_date = Start_date)
    df <- cbind(data_test, df)
    print(df)
    ##    Name Gender Score Age   End_date Start_date
    ## 1 Piotr   Male    80  42 2014-03-01 2012-03-01
    ## 2 Pawel   Male    88  38 2017-02-13 2013-02-13
    ## 3  <NA> Female    92  26 2014-10-10 2012-10-10
    ## 4  Lisa Female    89  30 2015-05-10 2011-05-10
    ## 5 Laura Female    84  35 2010-08-25 2001-08-25
Adding Rows to a Data-frame
Adding rows corresponds to adding observations. This is done via the function rbind().
    # To add a row, we need the rbind() function:
    data_test.to.add <- data.frame(
      Name     = c("Ricardo", "Anna"),
      Gender   = c("Male", "Female"),
      Score    = c(66, 80),
      Age      = c(70, 36),
      End_date = as.Date(c("2016-05-05", "2016-07-07"))
      )
    data_test.new <- rbind(data_test, data_test.to.add)
    print(data_test.new)
    ##      Name Gender Score Age   End_date
    ## 1   Piotr   Male    80  42 2014-03-01
    ## 2   Pawel   Male    88  38 2017-02-13
    ## 3    <NA> Female    92  26 2014-10-10
    ## 4    Lisa Female    89  30 2015-05-10
    ## 5   Laura Female    84  35 2010-08-25
    ## 6 Ricardo   Male    66  70 2016-05-05
    ## 7    Anna Female    80  36 2016-07-07
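For data frames, rbind() matches columns by name, so the rows to add need the same column names as the target but not necessarily the same column order. The sketch below illustrates this on small hypothetical data frames (obs, extra and bad are made-up names); the failing call is wrapped in tryCatch() so that the example runs to the end.

```r
obs <- data.frame(Name = c("Piotr", "Pawel"), Score = c(80, 88),
                  stringsAsFactors = FALSE)

# Columns in a different order are matched by name
extra <- data.frame(Score = 66, Name = "Ricardo", stringsAsFactors = FALSE)
combined <- rbind(obs, extra)
print(combined)

# A column name that does not occur in obs makes rbind() fail
bad <- data.frame(Name = "Anna", Points = 80, stringsAsFactors = FALSE)
result <- tryCatch(rbind(obs, bad), error = function(e) "error")
print(result)
```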
Merging data frames
Merging extracts the rows of two data frames for which a given set of columns match.
    data_test.1 <- data.frame(
      Name   = c("Piotr", "Pawel", "Paula", "Lisa", "Laura"),
      Gender = c("Male", "Male", "Female", "Female", "Female"),
      Score  = c(78, 88, 92, 89, 84),
      Age    = c(42, 38, 26, 30, 35)
      )
    data_test.2 <- data.frame(
      Name   = c("Piotr", "Pawel", "notPaula", "notLisa", "Laura"),
      Gender = c("Male", "Male", "Female", "Female", "Female"),
      Score  = c(78, 88, 92, 89, 84),
      Age    = c(42, 38, 26, 30, 135)
      )
    data_test.merged <- merge(x = data_test.1, y = data_test.2,
                              by.x = c("Name", "Age"), by.y = c("Name", "Age"))

    # Only records that match in name and age are in the merged table:
    print(data_test.merged)
    ##    Name Age Gender.x Score.x Gender.y Score.y
    ## 1 Pawel  38     Male      88     Male      88
    ## 2 Piotr  42     Male      78     Male      78
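By default merge() keeps only the matching rows. Its arguments all.x, all.y and all control whether unmatched rows are kept as well, with NA filling the missing side. A minimal sketch on two small hypothetical data frames (left and right are illustrative names, not objects from the text):

```r
left <- data.frame(
  Name  = c("Piotr", "Pawel", "Paula"),
  Age   = c(42, 38, 26),
  Score = c(78, 88, 92),
  stringsAsFactors = FALSE
)
right <- data.frame(
  Name  = c("Piotr", "Pawel", "notPaula"),
  Age   = c(42, 38, 26),
  Score = c(78, 88, 92),
  stringsAsFactors = FALSE
)

# Default: only rows that match on Name and Age (2 rows)
inner <- merge(left, right, by = c("Name", "Age"))

# Keep every row of left; unmatched rows get NA in Score.y (3 rows)
left_all <- merge(left, right, by = c("Name", "Age"), all.x = TRUE)

# Keep every row of both data frames (4 rows)
full <- merge(left, right, by = c("Name", "Age"), all = TRUE)
print(full)
```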
Short-cuts
R allows the use of short-cuts for column names, provided that they are unique. For example, in the data frame data_test there is a column Name. There are no other columns whose names start with the letter "N"; hence, this one letter is enough to address this column.
    data_test$N
    ## [1] Piotr Pawel Paula Lisa  Laura
    ## Levels: Laura Lisa Paula Pawel Piotr
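The uniqueness condition matters: when a prefix matches more than one column, the $ operator returns NULL rather than raising an error. The sketch below shows this on a small hypothetical data frame (people is a made-up name), together with the stricter default of the [[ ]] operator.

```r
people <- data.frame(
  Name  = c("Piotr", "Pawel"),
  Score = c(78, 88),
  Sex   = c("M", "M"),
  stringsAsFactors = FALSE
)

unique_prefix <- people$N   # "N" matches only Name
ambiguous     <- people$S   # "S" matches Score and Sex: NULL
strict        <- people[["N"]]                 # [[ ]] matches exactly by default: NULL
partial       <- people[["N", exact = FALSE]]  # partial matching on request

print(unique_prefix)
print(ambiguous)
```

Because an ambiguous short-cut fails silently, full column names are the safer choice in code that is meant to be kept.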
Use “short-cuts” sparingly and only when working