9 Extract, Reduce, Subset Data
9.1 Extraction with $
Base R uses the dollar sign, $, to extract a named element from a larger data structure, such as a column (vector) from a dataframe or an element from a list. The larger data structure goes immediately to the left of $, and the name of the smaller data structure goes immediately to the right.
Note that dollar signs also have a special meaning in R Markdown: a pair of them delimits inline math, which renders the enclosed text in italics. We can “escape” this functionality by preceding $ with a backslash, “\”. As with other formatting characters, these backslashes are visible by switching to the source code editor within RStudio.
Below we create vectors, a tibble, and a list, and then demonstrate extraction using $.
Code
# create a tibble
library(tibble)  # provides tibble()
age <- c(20:25)
name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
tbl <- tibble(name, age)
tbl
# Create a list
my_list = list('item1' = age,
               'item2' = name,
               'item3' = tbl,
               'nested_list' = list('one' = age^2,
                                    'two' = name,
                                    'three' = tbl))
# extract and summarize age from tbl
summary(tbl$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.00 21.25 22.50 22.50 23.75 25.00
Code
# extract a vector from deep within a list
my_list$nested_list$two
[1] "greg" "sally" "sean" "carlos" "becka" "doug"
The $ operator can also create new variables from existing variables, as shown below.
Code
tbl$age_plus5 <- tbl$age + 5
head(tbl, 1)
# A tibble: 1 × 3
name age age_plus5
<chr> <int> <dbl>
1 greg 20 25
9.2 Relational and Logical Operators
Relational operators return logical values, TRUE or FALSE, and include <, >, ==, <=, >=, and != (not equal to). Multiple relational operators can be combined with logical operators & (“and”) and | (“inclusive or”) inside filter() to subset a tibble into fewer rows that meet specific criteria.
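The combination described above can be sketched as follows. This is a minimal example, assuming the tibble and dplyr packages are installed; the tibble is rebuilt here so the block runs on its own, and the object names both and either are only illustrative.

```r
library(tibble)
library(dplyr)

# Rebuild the chapter's example tibble so this block is self-contained
tbl <- tibble(name = c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug'),
              age  = 20:25)

# A relational operator returns one logical value per element
tbl$age >= 23
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE

# & keeps rows where both conditions are TRUE
both <- filter(tbl, age >= 22 & name != 'sean')

# | keeps rows where at least one condition is TRUE
either <- filter(tbl, age < 21 | age > 24)
```

Here both holds the rows for carlos, becka, and doug, while either holds the rows for greg and doug.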
9.3 select() and filter()
select() extracts columns from a tibble, and filter() extracts rows. Both commands come from dplyr, a package within the tidyverse ecosystem of packages.
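A short sketch of each command on its own, assuming the tibble and dplyr packages are installed. The tibble mirrors the one used earlier in the chapter, and the result names names_ages and over_22 are made up for this illustration.

```r
library(tibble)
library(dplyr)

# Rebuild the chapter's example tibble, including the derived column
tbl <- tibble(name = c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug'),
              age  = 20:25)
tbl$age_plus5 <- tbl$age + 5

# select() keeps the named columns; every row is retained
names_ages <- select(tbl, name, age)

# filter() keeps rows that meet the condition; every column is retained
over_22 <- filter(tbl, age > 22)

dim(names_ages)  # 6 rows, 2 columns
dim(over_22)     # 3 rows, 3 columns
```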
9.4 Subsetting with Brackets (AKA Braces)
9.4.1 Single Brackets [ ]
While Data Essentials with R will mostly use tidyverse commands, students should also be familiar with the [ ] and [[ ]] methods. The most important details about brackets are how they behave on tibbles and dataframes. Brackets used with tibbles and dataframes can take one or two arguments:
- Single arguments identify columns by name or index:
  - df['col_name']
  - df[col_number]
- When two arguments are used, they identify rows first then columns:
  - df[row_numbers, 'col_name']
Single brackets used with tibbles always return a tibble, even if the result contains a single value. Single brackets used with dataframes may return a vector or a dataframe, depending on context: with two arguments, selecting a single column returns a vector by default. The consistent behavior of tibbles is an advantage over dataframes.
Code
# extract the age_plus5 column
tbl[3]
# A tibble: 6 × 1
age_plus5
<dbl>
1 25
2 26
3 27
4 28
5 29
6 30
Code
tbl['age_plus5']
# A tibble: 6 × 1
age_plus5
<dbl>
1 25
2 26
3 27
4 28
5 29
6 30
Code
# extract the row 3, column 2 element, 22 (sean's age)
tbl[3, 2]
# A tibble: 1 × 1
age
<int>
1 22
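The difference between tibbles and dataframes mentioned at the start of this subsection can be seen by running the same bracket expressions on both. This is a sketch, assuming the tibble package is installed; the dataframe df and the drop = FALSE argument of base R are not part of the chapter's own examples and are introduced here only for comparison.

```r
library(tibble)

name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
age  <- 20:25
tbl  <- tibble(name, age)
df   <- data.frame(name, age)   # base R dataframe with the same columns

# One argument: each returns its own class with a single column
class(df['age'])
# [1] "data.frame"

# Two arguments selecting one column:
tbl_col <- tbl[, 'age']   # still a tibble
df_col  <- df[, 'age']    # dropped to a plain integer vector
class(df_col)
# [1] "integer"

# drop = FALSE asks the dataframe to keep its structure
df_kept <- df[, 'age', drop = FALSE]
class(df_kept)
# [1] "data.frame"
```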
9.4.2 Double Brackets [[ ]]
Double brackets are often used with lists, but they also work with vectors, tibbles, and dataframes. Double brackets strip the parent data structure from the returned object, so the returned object has the simplest data structure possible. For example, tbl[['age']] returns a plain vector rather than a one-column tibble.
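A brief sketch of double brackets on a list, a tibble, and a vector, assuming the tibble package is installed. The objects are rebuilt here so the block runs on its own, and my_list is a shortened version of the list created earlier in the chapter.

```r
library(tibble)

name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
age  <- 20:25
tbl  <- tibble(name, age)
my_list <- list('item1' = age, 'item2' = name, 'item3' = tbl)

# Single brackets keep the parent structure: a list of length 1
kept <- my_list['item1']
class(kept)
# [1] "list"

# Double brackets strip it: the integer vector itself
stripped <- my_list[['item1']]
class(stripped)
# [1] "integer"

# On a tibble, [[ ]] returns the column as a plain vector, like tbl$age
tbl[['age']]
# [1] 20 21 22 23 24 25

# On a vector, [[ ]] returns exactly one element
name[[3]]
# [1] "sean"
```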
Bracket methods are quick and commonly used in help vignettes and online forums. Explicit commands like select() and filter() might be more typing, but are easier to read and understand. Also, select() and filter() can be piped cleanly with other tidyverse commands, which is awkward with the bracket methods.
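To make the comparison concrete, here is the same subset written both ways. This is a sketch, assuming the tibble and dplyr packages are installed; the %>% pipe is the one exported by dplyr, and the result names bracket_result and piped_result are illustrative.

```r
library(tibble)
library(dplyr)

tbl <- tibble(name = c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug'),
              age  = 20:25)

# Bracket version: rows where age exceeds 22, then the name column
bracket_result <- tbl[tbl$age > 22, 'name']

# dplyr version: each step is named and reads top to bottom
piped_result <- tbl %>%
  filter(age > 22) %>%
  select(name)

# Both approaches return the same three names
identical(bracket_result$name, piped_result$name)
# [1] TRUE
```

The bracket version is shorter, but it repeats tbl and packs the row and column logic into one expression, while the piped version separates the two steps.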