Working with Strings and Factors
Objectives
- Manipulate and work with string data using the stringr package
- Manipulate and work with factors using the forcats package
Overview
In this short section, I will provide some additional demonstrations for working with string and factor data in R using the stringr and forcats packages. These packages are part of the tidyverse and make working with string and factor data much easier than using the base R methods. Cheat sheets for stringr and forcats can be found here. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.
library(stringr)
library(forcats)
library(dplyr)
Working with Strings
Before I can manipulate strings, I need to read in some data that contains strings. To demonstrate, I will read in matts_movies.csv. Using summary(), you can see that six columns are provided: “Movie.Name”, “Director”, “Release.Year”, “My.Rating”, “Genre”, and “Own”. I have set the stringsAsFactors argument to FALSE so that all text columns are read in as character data as opposed to factors. Later, I will convert specific columns to factors as needed. For this exercise, we will treat the “Movie.Name” and “Director” columns as character strings and the “Genre” column as a factor.
setwd("D:/ossa/strings_and_factors")
movies <- read.csv("matts_movies.csv", sep=",", header=TRUE, stringsAsFactors=FALSE)
summary(movies)
Movie.Name Director Release.Year My.Rating
Length:1852 Length:1852 Min. :1921 Min. :0.670
Class :character Class :character 1st Qu.:1995 1st Qu.:6.320
Mode :character Mode :character Median :2004 Median :7.020
Mean :1999 Mean :6.953
3rd Qu.:2009 3rd Qu.:7.850
Max. :2014 Max. :9.990
Genre Own
Length:1852 Length:1852
Class :character Class :character
Mode :character Mode :character
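To see what the stringsAsFactors argument changes, here is a minimal sketch that reads the same small CSV text twice. The two-row table and the column names title and genre are made up for illustration and are not taken from matts_movies.csv.

```r
# Hypothetical two-row CSV, supplied inline so no file is needed.
csv_text <- "title,genre\nMovie A,Drama\nMovie B,Comedy"

# Text columns stay as character vectors.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
# Text columns are converted to factors on read.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$genre)
class(as_fct$genre)
```

Since R 4.0.0, FALSE is the default for read.csv(), so setting it explicitly mainly matters for older R versions and for making the intent clear.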
Let’s start by finding all titles that contain the word “River”. This can be accomplished using the str_detect() function. Instead of creating a new data object, I am saving the result as a new column in the table. I then use the result to keep only the titles that contain this word. TRUE indicates that the title contains “River” while FALSE indicates that it does not. Because str_detect() matches anywhere within the string, a title containing “Rivers” is also returned.
movies$river <- (str_detect(movies$Movie.Name, "River"))
movies %>% filter(river==TRUE)
Movie.Name Director Release.Year My.Rating
1 Mystic River Clint Eastwood 2003 9.56
2 Frozen River Courtney Hunt 2008 7.69
3 A River Runs Through It Robert Redford 1992 7.66
4 Joan Rivers: A Piece of Work Ricki Stern 2010 6.43
5 The River Wild Curtis Hanson 1994 5.12
Genre Own river
1 Drama Yes TRUE
2 Independent No TRUE
3 Drama Yes TRUE
4 Documentary No TRUE
5 Thriller No TRUE
The str_length() function returns the length of a string. As demonstrated in the result below, this count includes spaces. In the second example, I am obtaining the length of each title as a new column. I am then using the result to count the titles that have a length greater than 20. 324 titles have a length greater than 20.
str_length("A B C D E")
[1] 9
movies$len <- str_length(movies$Movie.Name)
movies %>% filter(len > 20) %>% count()
n
1 324
str_to_upper() will convert the string to all upper case as demonstrated in the next example. There are similar operations available for converting to lower case and title case.
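The lower-case and title-case counterparts mentioned above are str_to_lower() and str_to_title() in stringr. As a minimal sketch, here are all three applied to one string whose mixed capitalization is made up for illustration:

```r
library(stringr)

# Made-up input with inconsistent capitalization.
title <- "a river runs THROUGH it"

upper  <- str_to_upper(title)  # every letter upper case
lower  <- str_to_lower(title)  # every letter lower case
titled <- str_to_title(title)  # first letter of each word upper case, the rest lower

print(c(upper, lower, titled))
```

Title case capitalizes every word, including short ones such as “it”, so it may not match the style used in a published title.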
head(str_to_upper(movies$Movie.Name))
[1] "ALMOST FAMOUS" "THE SHAWSHANK REDEMPTION"
[3] "GROUNDHOG DAY" "DONNIE DARKO"
[5] "CHILDREN OF MEN" "ANNIE HALL"
One complexity of string manipulation in R, and also in other programming languages, is that some characters, such as “\”, have special meaning and cannot be written directly within a string. So, escape characters must be used to make sure the symbol is interpreted as plain text. In this case, “\” is itself the escape character, so a literal backslash is written as “\\”. This is another way to write a file path in Windows other than converting backslashes to forward slashes (e.g., “C:/Data” and “C:\\Data” are both acceptable).
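One point that can be confusing is that print() displays a string in its escaped form, so a stored single backslash still appears doubled. The following sketch, which assumes nothing beyond stringr and base R, compares print() with writeLines() and checks the length of the stored string:

```r
library(stringr)

# "\\" in the source code is one literal backslash in the stored string.
path <- "C:\\R_Examples"

print(path)       # shows the escaped form, surrounded by quotes
writeLines(path)  # shows the raw characters the string contains

# The escaped backslash counts as a single character.
n_chars <- str_length(path)
```

Here n_chars is 13, which confirms that only one backslash is stored even though two are typed and two are printed.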
#This will yield an error.
#path <- "C:\R_Examples"
#This will not.
path <- "C:\\R_Examples"
print(path)
[1] "C:\\R_Examples"
Working with Factors
Remember that factors in R are actually represented by integer codes. These codes are then linked to text labels, called levels. In the movie example below I have used the fct_count() function from forcats to count the number of movies in each defined genre. Note that I first have to convert the column from a character to a factor data type. Using nlevels(), I obtain the number of defined levels for the factor. Matt has differentiated 18 genres, although “comedy” and “Comedy” are counted as separate levels because factor levels are case-sensitive.
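Before turning to the movie data, the link between codes and labels can be sketched with a toy factor in base R. The four genre values here are made up for illustration:

```r
# Made-up genre values used only for illustration.
genre <- factor(c("Drama", "Comedy", "Drama", "Action"))

levels(genre)      # level labels, sorted alphabetically by default
as.integer(genre)  # the integer code stored for each element

# Each code is an index into levels(), so this recovers the original text.
recovered <- levels(genre)[as.integer(genre)]
```

Because “Action” sorts first, it receives code 1, “Comedy” code 2, and “Drama” code 3, so the stored codes are 3, 2, 3, 1.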
movies$Genre <- as.factor(movies$Genre)
head(fct_count(movies$Genre))
# A tibble: 6 x 2
f n
<fct> <int>
1 Action 197
2 Classic 47
3 comedy 1
4 Comedy 275
5 Documentary 78
6 Drama 321
nlevels(movies$Genre)
[1] 18
The fct_infreq() function allows you to reorder factor levels based on frequency. Here, you can see that the most common genres were drama, comedy, and action.
common_genres <- fct_infreq(movies$Genre)
head(fct_count(common_genres))
# A tibble: 6 x 2
f n
<fct> <int>
1 Drama 321
2 Comedy 275
3 Action 197
4 Independent 190
5 Thriller 164
6 Foreign 157
In the example below, I have kept only movies in three genres: drama, documentary, and comedy. However, if I call nlevels(), there are still 18 levels defined, because filtering rows does not remove levels that are no longer used. So, I will need to drop unused levels. This can be accomplished with fct_drop(). Printing the number of levels after applying this function will confirm that unused levels have been removed.
movies2 <- movies %>% filter(Genre == "Drama" | Genre == "Documentary" | Genre == "Comedy")
nlevels(movies2$Genre)
[1] 18
movies2$Genre <- fct_drop(movies2$Genre)
nlevels(movies2$Genre)
[1] 3
Factor levels can be recoded and/or combined using fct_collapse(). Here, I am combining the “Drama” and “Comedy” levels into a new level called “fiction” and recoding “Documentary” to “nonfiction”. Recoding can also be accomplished using the recode() function from dplyr.
movies2$Genre <- fct_collapse(movies2$Genre, fiction = c("Drama", "Comedy"), nonfiction = c("Documentary"))
nlevels(movies2$Genre)
[1] 2
My preferred method for changing the names of factor levels is the recode() function from dplyr, which is demonstrated below.
movies2$Genre <- recode(movies2$Genre, fiction = "F", nonfiction = "NF")
levels(movies2$Genre)
[1] "F" "NF"
Regular Expressions
If you work with character or string data, it is worth learning about regular expressions, which are a language for describing patterns in strings. The stringr cheat sheet includes a page devoted to regular expressions. This language is also used for searching strings in other programming environments, such as Python.
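As a small illustration of what a pattern can express, the sketch below applies three common regular expression features with stringr. The titles are taken from the output earlier in this section; the sentence passed to str_extract() is made up for illustration.

```r
library(stringr)

titles <- c("Mystic River", "Joan Rivers: A Piece of Work",
            "The River Wild", "Annie Hall")

# "\\b" is a word boundary, so "River\\b" matches "River" but not "Rivers".
whole_word <- str_detect(titles, "River\\b")

# "^" anchors the match to the start of the string.
starts_with_the <- str_detect(titles, "^The ")

# "[0-9]{4}" matches four consecutive digits.
year <- str_extract("Released in 2003", "[0-9]{4}")
```

Note that the backslash in the pattern is doubled for the same reason as in file paths: R first interprets the string escape, and the regular expression engine then sees a single backslash.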
Concluding Remarks
There are many additional functions, analyses, and tasks that can be applied to strings and factors in R. If you are interested in this topic, please consult the stringr and forcats documentation for additional use cases and examples.