Sports Analytics in Practice with R. Ted Kwartler
in effect the factor level alone represents specific “meta” information such as the other teams in the conference, and even perhaps some of the team’s schedule. This meta-information is inherited as a pattern within the larger data set, not explicitly defined within the object type. While this may be confusing, it will make sense as the object types and classes move from single values to multiple values later in this chapter. The code below creates a single object, `teamA`, with a factor defined as the Eastern conference. The function to declare a value as a factor is `as.factor`.
teamA <- as.factor('Eastern_Conference')
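To see how levels behave once a factor holds more than one value, here is a small sketch. The `conferences` object and its five labels are hypothetical, made up purely for illustration.

```r
# Hypothetical conference labels for five teams (illustrative values only)
conferences <- as.factor(c('Eastern_Conference', 'Western_Conference',
                           'Eastern_Conference', 'Western_Conference',
                           'Eastern_Conference'))

# The distinct levels are stored once, however often they repeat
levels(conferences)

# Count how many teams fall under each level
table(conferences)

# The single-value factor from the text has exactly one level
teamA <- as.factor('Eastern_Conference')
nlevels(teamA)
```

The repeated labels collapse into two levels, which is the "non-unique" behavior that separates a factor from free text.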
In addition to factors, the last commonplace variable type is “character.” Character objects represent natural language, for example, text from social media or fan forums that needs to be analyzed. The field of character and string analysis is referred to as Natural Language Processing (NLP). These methods underpin popular smart speakers and voice assistants, among other everyday technologies such as e-mail spam filters. This book devotes one chapter to gauging fan engagement on a popular forum, so this data type will be covered extensively. However, one chapter covers only the basics of NLP, and much more can be accomplished with additional methods, code, and academic literature. Below is a fictitious social media post from a fan. Character values can be declared with `as.character`, but, as written here, that is not necessary because the quoted text is already a character value.
fanTweet <- "I love baseball"
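A quick check, sketched here with base R functions only, suggests why `as.character` is optional for quoted text and when it does matter. The `jerseyNumber` object is a hypothetical example, not part of the book's code.

```r
fanTweet <- "I love baseball"

# Quoted text is already stored as a character object
class(fanTweet)

# nchar() counts the characters in the string, spaces included
nchar(fanTweet)

# as.character() matters when converting another type, such as a number
jerseyNumber <- as.character(42)  # hypothetical value for illustration
class(jerseyNumber)
```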
Table 1.2 reviews the common data types used in R and within the book. There are additional special values like `NULL` and `NA`, but these are more straightforward and require less explanation. Once you have run all the code in the table, you can call `class` on each object to check that R is interpreting the object type as expected.
Table 1.2 Common R data types including integer, numeric, logical, factor, and character.
Name | Code | Description |
---|---|---|
“integer” | `x <- 5L` | A whole number without a decimal point |
“numeric” | `y <- 5.123` | A floating point number |
“logical” | `z <- TRUE` or `z <- T` (capital T or F is acceptable too) | A logical “Boolean” value, either TRUE or FALSE. R will interpret TRUE as 1 and FALSE as 0 for some operations |
“factor” | `playerPosition <- as.factor("forward")` | A factor is a distinct class often representing non-unique information. The factor classes are referred to as “levels.” Here, a player position is defined as a factor with the level “forward” |
“character” | `fanComment <- "I love the hot dogs at the stadium"` | Character values, known as strings, represent natural language. Unlike factors, they can be repeating or mutually exclusive. A growing subset of analytics work includes Natural Language Processing (NLP) |
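The check described before Table 1.2 could look like the following sketch. It recreates the table's objects, using straight quotes for the strings, and calls `class` on each one.

```r
x <- 5L
y <- 5.123
z <- TRUE
playerPosition <- as.factor("forward")
fanComment <- "I love the hot dogs at the stadium"

class(x)               # "integer"
class(y)               # "numeric"
class(z)               # "logical"
class(playerPosition)  # "factor"
class(fanComment)      # "character"

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(c(TRUE, FALSE, TRUE))  # 2
```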
Previously, the objects created, such as `xVal` and `i`, represented a single value. R’s coding environment relies on specific data types and corresponding classes that can be more complex than a single value. For instance, R can create and work with “vectors.” Vectors are merely columns of data, which may be familiar if you’re coming to R from a spreadsheet program. To create a numeric vector, you employ the combine function, `c`. In the following code, a vector of numbers called `xVec` is created. The `xVec` object uses some of the objects previously created along with additional values that are explicitly declared within the `c` function. Each value within the vector is separated by a comma. Once `xVec` is created, calling it in the console returns multiple values, with objects such as `xVal` replaced by their numeric equivalents.
xVec <- c(xVal, i, newObj, 345, 678)
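Because `xVal`, `i`, and `newObj` were created earlier in the chapter, the sketch below assigns them stand-in values so that it runs on its own. The values 1, 4, and 1 are assumptions inferred from the first column of Table 1.3.

```r
# Stand-in values; assumed from Table 1.3, not defined in this section
xVal   <- 1
i      <- 4
newObj <- 1

xVec <- c(xVal, i, newObj, 345, 678)

xVec          # prints the five numeric values, not the object names
length(xVec)  # number of elements in the vector
class(xVec)   # a vector of numbers is still "numeric"
```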
Scaling up from a single vector, one method for arranging multiple columns into a single object is `cbind`. The `cbind` function arranges vectors in a column-wise fashion. Similarly, the `rbind` function stacks vectors as rows. The resulting object is no longer “numeric” or any of the other types discussed so far, but instead a “matrix.” A matrix arranges data into rows and columns within a single object. This code creates `xMatrix` using `cbind`, simply repeating the previous vector `xVec` to create a second column. Once executed, the `xMatrix` variable is in the environment and, when called, shows a five-row by two-column arrangement of the data in a single object. Calling `class` on the object will return “matrix” (R 4.0.0 and later also list “array”).
xMatrix <- cbind(xVec, xVec)
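The difference between `cbind` and `rbind` is easiest to see from the dimensions of the result. This sketch uses stand-in values for `xVec` (assumed from Table 1.3), and `xMatrixRows` is a hypothetical name introduced here for the row-wise version.

```r
# Stand-in for the vector built earlier (values assumed from Table 1.3)
xVec <- c(1, 4, 1, 345, 678)

xMatrix     <- cbind(xVec, xVec)  # vectors placed side by side as columns
xMatrixRows <- rbind(xVec, xVec)  # vectors stacked as rows (hypothetical name)

dim(xMatrix)      # rows then columns: 5 2
dim(xMatrixRows)  # rows then columns: 2 5

# "matrix" is reported; R 4.0.0 and later also report "array"
class(xMatrix)

# is.matrix() gives the same answer regardless of R version
is.matrix(xMatrix)
```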
R has another method for arranging data as rows and columns called a data frame. The data frame object type is useful when you are working with mixed data types, for example, a player roster with names as characters, teams as factors, statistics as numeric, and so on. All of these vectors can be organized into a single object using `data.frame`. This code is a bit more complex because it nests functions when constructing the data frame. Within the `data.frame` function call, the first column is named “number1.” It is assigned the value of `xVec`, the numeric values previously constructed. The next column, “logical2,” is separated by a comma and employs the `c` function to combine logical values. Next, the “factor3” column is declared. This column uses multiple functions: `c` combines a vector of “a,” “b,” “a,” “b,” and “b,” and then `as.factor` changes it from a simple character vector to a factor. Finally, the fourth column, “string4,” consists of various character strings. Once instantiated in the console, the `xDataFrame` object can be called to illustrate the mixed data types held within the single object. Table 1.3 shows the results of creating the `xDataFrame` object.
Table 1.3 The constructed data frame with mixed data types.
number1 | logical2 | factor3 | string4 |
---|---|---|---|
1 | TRUE | a | string1 |
4 | TRUE | b | s2 |
1 | FALSE | a | s3 |
345 | FALSE | b | s4 |
678 | TRUE | b | s5 |
xDataFrame <- data.frame(number1 = xVec,
                         logical2 = c(T, T, F, F, T),
                         factor3 = as.factor(c('a', 'b', 'a', 'b', 'b')),
                         string4 = c('string1', 's2', 's3', 's4', 's5'))
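To confirm that each column kept its own type, one option is to inspect the data frame with `sapply` and `str`. The sketch uses stand-in values for `xVec` (assumed from Table 1.3) and adds an explicit `stringsAsFactors = FALSE`, which is not in the original call.

```r
xVec <- c(1, 4, 1, 345, 678)  # stand-in values assumed from Table 1.3

xDataFrame <- data.frame(number1  = xVec,
                         logical2 = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                         factor3  = as.factor(c('a', 'b', 'a', 'b', 'b')),
                         string4  = c('string1', 's2', 's3', 's4', 's5'),
                         stringsAsFactors = FALSE)  # added here as a precaution

class(xDataFrame)          # "data.frame"
dim(xDataFrame)            # 5 rows, 4 columns
sapply(xDataFrame, class)  # one class per column
str(xDataFrame)            # compact summary of every column
```

Setting `stringsAsFactors = FALSE` is a precaution: before R 4.0.0, `data.frame` converted character columns to factors by default, so “string4” would otherwise come back as a factor on older installations.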
R can employ either a matrix or a data frame to arrange data in rows and columns. In both object types, the columns and rows must be complete. For example, you cannot cleanly `cbind` a vector with three values to another with two values. This makes the data “ragged,” and for matrices or data frames it requires you to fill in the empty cell with NA. However, some functions require one object class over another. The difference is that a matrix must have all values be of the same data type. For example, every value in all of the columns must be numeric, or all must be logical. If this is not the case, R will automatically coerce the data to a single type, typically character, which can cause issues. As a result, most often in this