9 Extract, Reduce, Subset Data
9.1 Extraction with $
Base R uses the dollar sign, $, to identify data structures inside larger data structures, such as vectors inside dataframes and the elements of a list. The larger data structure sits immediately to the left of $ and the smaller one immediately to its right.
Note that dollar signs also have a special meaning in R Markdown: a pair of them marks inline math, which renders in italics. We can “escape” this functionality by preceding $ with a backslash, “\”. As with other formatting characters, these backslashes are visible when you switch to the source code editor within RStudio.
Below we create vectors, a tibble, and a list, and then demonstrate extraction using $.
Code
# load the tidyverse, which provides tibble(), select(), filter(), and %>%
library(tidyverse)
# create a tibble
age <- c(20:25)
name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
tbl <- tibble(name, age)
tbl
# Create a list
my_list <- list('item1' = age,
                'item2' = name,
                'item3' = tbl,
                'nested_list' = list('one' = age^2,
                                     'two' = name,
                                     'three' = tbl))
# extract and summarize age from tbl
summary(tbl$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.00 21.25 22.50 22.50 23.75 25.00
Code
# extract a vector from deep within a list
my_list$nested_list$two
[1] "greg" "sally" "sean" "carlos" "becka" "doug"
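A chained extraction like the one above can be read one $ at a time, left to right. The sketch below uses a hypothetical list called inventory (not part of this chapter's data) and only base R:

```r
# Hypothetical nested list, used only for illustration
inventory <- list(
  counts = c(3, 5, 8),
  nested = list(labels = c("a", "b", "c"))
)

# The first $ returns the inner list
inner <- inventory$nested

# The second $ returns the vector stored inside that inner list
labels <- inner$labels

# Chaining both steps in one expression gives the same vector
identical(labels, inventory$nested$labels)
```

Breaking a long chain into steps like this is a handy way to check what each level of a nested structure contains.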
The $ operator can also create new variables from existing variables, as shown below.
Code
tbl$age_plus5 <- tbl$age + 5
head(tbl, 1)
# A tibble: 1 × 3
name age age_plus5
<chr> <int> <dbl>
1 greg 20 25
9.2 Relational and Logical Operators
Relational operators return logical values, TRUE or FALSE, and include <, >, ==, <=, >=, and != (not equal to). Multiple relational expressions can be combined with the logical operators & (“and”) and | (“inclusive or”) inside filter() to subset a tibble into fewer rows that meet specific criteria.
Using the console window, run each command below one at a time. All but the first two lines use relational operators.
Code
x <- 7
print(x)
x == 4
x == 7
x < 4
x <= 4
x >= 7
x != 0
x < 1 | x > 3
x != 0 & x > 10
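The comparisons above test a single value, but relational and logical operators are vectorized: applied to a vector, they return one TRUE or FALSE per element. That per-element result is what lets filter() decide which rows to keep. A minimal base R sketch, where the vector name ages is illustrative:

```r
# Illustrative vector (hypothetical name; same values as the ages in this chapter)
ages <- 20:25

# A relational operator returns one logical value per element
at_least_21 <- ages >= 21
at_least_21

# Combine two conditions with & ("and") and | ("inclusive or")
in_range <- ages >= 21 & ages < 24
at_edges <- ages < 21 | ages > 24

# A logical vector can be used to subset the original vector
ages[in_range]   # 21 22 23
ages[at_edges]   # 20 25
```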
A single equals sign, =, is NOT a relational operator. = can be used for assignment just like <- (see note below). For example, the command below creates a vector called x.
Code
x = 4
Although either <- or = can be used for assignment, The Tidyverse Style Guide states that <- should be used for assignment and = should be used inside function calls, where it names arguments. Sticking with stylistic conventions makes your code more readable to others and in some cases avoids sneaky problems.
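One such sneaky problem can be sketched as follows. Inside a function call, = and <- do different things: = matches a value to an argument name, while <- performs an assignment in your workspace as a side effect. The helper add_five() and the names value and total are hypothetical, chosen only for this illustration:

```r
# Hypothetical helper used only for this illustration
add_five <- function(value) value + 5

# = names the argument; no object called `value` is created in the workspace
a <- add_five(value = 10)
exists("value", inherits = FALSE)   # FALSE

# <- assigns first, so an object called `total` now exists as a side effect
b <- add_five(total <- 10)
exists("total", inherits = FALSE)   # TRUE

# Both calls return the same result
a == b
```

Both calls return 15, which is why the stray object created by the second form is easy to miss.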
9.3 Select() and Filter()
select() extracts columns and filter() extracts rows from tibbles. Both commands come from dplyr, a package within the tidyverse ecosystem of packages.
Run each line of code below and view the output.
Code
# create a smaller tibble with select columns.
select(tbl, name)
# A tibble: 6 × 1
name
<chr>
1 greg
2 sally
3 sean
4 carlos
5 becka
6 doug
Code
select(tbl, name, age)
# A tibble: 6 × 2
name age
<chr> <int>
1 greg 20
2 sally 21
3 sean 22
4 carlos 23
5 becka 24
6 doug 25
Code
# filter the rows of a tibble using relational operators.
filter(tbl, age >= 21, name != 'sally')
# A tibble: 4 × 3
name age age_plus5
<chr> <int> <dbl>
1 sean 22 27
2 carlos 23 28
3 becka 24 29
4 doug 25 30
Code
# combine select() and filter() statements with the pipe operator.
tbl %>%
  select(age, name) %>%
  filter(age >= 21, name != 'sally')
# A tibble: 4 × 2
age name
<int> <chr>
1 22 sean
2 23 carlos
3 24 becka
4 25 doug
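Conditions separated by commas inside filter() are combined as if joined by &, so every condition must be TRUE for a row to be kept. An "or" condition has to be written explicitly with |. A small sketch, assuming the tibble and dplyr packages are installed; people is a hypothetical tibble that mirrors tbl:

```r
library(tibble)
library(dplyr)

# Hypothetical data that mirrors the chapter's tbl
people <- tibble(
  name = c("greg", "sally", "sean", "carlos", "becka", "doug"),
  age  = 20:25
)

# Comma-separated conditions must all be TRUE (same as using &)
both <- filter(people, age >= 21, name != "sally")

# | keeps rows where at least one condition is TRUE
either <- filter(people, age < 21 | name == "doug")

both$name     # "sean" "carlos" "becka" "doug"
either$name   # "greg" "doug"
```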
9.4 Subsetting with Brackets (AKA Braces)
9.4.1 Single Brackets [ ]
While Data Essentials with R will mostly use tidyverse commands, students should also be familiar with the [ ] and [[ ]] methods. The most important details about brackets are how they behave on tibbles and dataframes. Brackets used with tibbles and dataframes can take one or two arguments:
- Single arguments identify columns by name or index:
  - df['col_name']
  - df[col_index]
- When two arguments are used, they identify rows first then columns:
  - df[row_numbers, 'col_name']
Single brackets used with tibbles always return a tibble, even if the tibble contains a single value. Single brackets used with dataframes may return a vector or a dataframe, depending on context. The consistent behavior of tibbles is an advantage over dataframes.
Code
# extract the age_plus5 column
tbl[3]
# A tibble: 6 × 1
age_plus5
<dbl>
1 25
2 26
3 27
4 28
5 29
6 30
Code
tbl['age_plus5']
# A tibble: 6 × 1
age_plus5
<dbl>
1 25
2 26
3 27
4 28
5 29
6 30
Code
# extract the row 3, column 2 element: sean's age, 22
tbl[3, 2]
# A tibble: 1 × 1
age
<int>
1 22
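The context-dependent behavior of dataframes can be seen with base R alone. The sketch below uses a hypothetical dataframe called df; a tibble given the same three requests would return a tibble each time.

```r
# Hypothetical base R dataframe (not a tibble)
df <- data.frame(
  name = c("greg", "sally", "sean"),
  age  = 20:22
)

one_arg  <- df["age"]                  # one argument: still a dataframe
two_args <- df[, "age"]                # two arguments, one column: simplified to a vector
kept     <- df[, "age", drop = FALSE]  # drop = FALSE keeps the dataframe structure

class(one_arg)    # "data.frame"
class(two_args)   # "integer"
class(kept)       # "data.frame"
```

Code that expects a dataframe back from df[, "age"] will break on the vector it actually receives, which is the kind of surprise the consistent tibble behavior avoids.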
9.4.2 Double brackets [[ ]]
Double brackets are most often used with lists, but they also work with vectors, tibbles, and dataframes. Double brackets strip the parent data structure from the returned object, so the returned object has the simplest data structure possible.
Run each line of code below one at a time. Track the “temp” object in your environment panel. When temp is created as a tibble it will appear under “Data”. When temp is created as a vector it will appear under “Values”.
Code
temp <- tbl[c('name', 'age_plus5')]
temp
# A tibble: 6 × 2
name age_plus5
<chr> <dbl>
1 greg 25
2 sally 26
3 sean 27
4 carlos 28
5 becka 29
6 doug 30
Code
temp <- tbl[2]
temp
# A tibble: 6 × 1
age
<int>
1 20
2 21
3 22
4 23
5 24
6 25
Code
temp <- tbl[2,3]
temp
# A tibble: 1 × 1
age_plus5
<dbl>
1 26
Code
temp <- tbl[c(1,2), c(2,3)]
temp
# A tibble: 2 × 2
age age_plus5
<int> <dbl>
1 20 25
2 21 26
Code
temp <- tbl$age_plus5[2]
temp
[1] 26
Code
temp <- tbl[[2]]
temp
[1] 20 21 22 23 24 25
Code
temp <- tbl[[2,3]]
temp
[1] 26
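The same contrast applies to lists, where double brackets are most common. In the base R sketch below, scores is a hypothetical list with a shape similar to my_list:

```r
# Hypothetical list, used only for illustration
scores <- list(item1 = 20:25, item2 = c("greg", "sally"))

kept_list <- scores["item1"]     # single brackets: a list of length 1
bare_vec  <- scores[["item1"]]   # double brackets: the integer vector itself
by_dollar <- scores$item1        # $ returns the same element here

class(kept_list)                 # "list"
class(bare_vec)                  # "integer"
identical(bare_vec, by_dollar)   # TRUE
```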
Bracket methods are quick and commonly used in help vignettes and online forums. Explicit commands like select() and filter() might be more typing, but are easier to read and understand. Also, select() and filter() can be piped with other tidyverse commands, unlike the bracket methods.
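To compare the two styles side by side, the sketch below makes the same request twice: once with brackets and once with dplyr verbs. It assumes the tibble and dplyr packages are installed, and people is a hypothetical tibble that mirrors tbl:

```r
library(tibble)
library(dplyr)

# Hypothetical data that mirrors the chapter's tbl
people <- tibble(
  name = c("greg", "sally", "sean", "carlos", "becka", "doug"),
  age  = 20:25
)

# Bracket form: rows where age >= 23, then the name and age columns
by_brackets <- people[people$age >= 23, c("name", "age")]

# dplyr form of the same request
by_verbs <- people %>%
  filter(age >= 23) %>%
  select(name, age)

# Both approaches return the same rows and columns
identical(by_brackets$name, by_verbs$name)
identical(by_brackets$age, by_verbs$age)
```

The bracket form is shorter, while the piped form spells out each step in order.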