9  Extract, Reduce, Subset Data

9.1 Extraction with $

Base R uses the dollar sign, $, to extract a named component from a larger data structure, such as a column (vector) from a dataframe or an element from a list. The larger data structure goes immediately to the left of $ and the name of the smaller data structure goes immediately to the right.

Note that dollar signs also have a special meaning in R Markdown: text between a pair of dollar signs is typeset as inline math, which renders in an italic math font. We can “escape” this behavior by preceding $ with a backslash, “\”. As with other formatting characters, these backslashes are visible when you switch to the source code editor within RStudio.

Below we create vectors, a tibble, and a list, and then demonstrate extraction using $.

Code
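# load the tidyverse, which provides tibble()
library(tidyverse)

# create two vectors and combine them into a tibble
age <- 20:25
name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
tbl <- tibble(name, age)

# create a list that holds the vectors, the tibble, and a nested list
my_list <- list('item1' = age,
                'item2' = name,
                'item3' = tbl,
                'nested_list' = list('one' = age^2,
                                     'two' = name,
                                     'three' = tbl))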

# extract and summarize age from tbl
summary(tbl$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   21.25   22.50   22.50   23.75   25.00 
Code
# extract a vector from deep within a list
my_list$nested_list$two
[1] "greg"   "sally"  "sean"   "carlos" "becka"  "doug"  
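
As a minimal sketch of how $ relates to other lookups, the block below uses base R only, with a plain data.frame and a list under the made-up names people_df and demo_list, so no packages are needed. It shows that $ returns the same object as [[ ]] with a quoted name (double brackets are covered in section 9.4.2), and that asking a list for a name it does not contain returns NULL rather than raising an error.

```r
# Hypothetical example objects, built with base R only.
first_names <- c("greg", "sally", "sean")
people_df <- data.frame(age = c(20L, 21L, 22L))
demo_list <- list(item1 = people_df$age,
                  nested_list = list(two = first_names))

# $ pulls the column out as a plain vector.
ages <- people_df$age
ages
# [1] 20 21 22

# $ and [[ ]] with a quoted name return the same object.
same_column <- identical(people_df$age, people_df[["age"]])
same_column
# [1] TRUE

# Chained $ walks down one level at a time.
deep_value <- demo_list$nested_list$two

# A name that is not in the list yields NULL, not an error.
missing_item <- demo_list$no_such_item
is.null(missing_item)
# [1] TRUE
```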

The $ operator can also add a new column to a tibble, computed from existing columns, as shown below.

Code
tbl$age_plus5 <- tbl$age + 5
head(tbl, 1)
# A tibble: 1 × 3
  name    age age_plus5
  <chr> <int>     <dbl>
1 greg     20        25
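
The same assignment pattern can be sketched on a base data.frame. The name scores and its columns are made up for illustration, and the block assumes base R only. It shows that assigning to a column name that does not yet exist adds a column, while assigning to an existing name overwrites that column.

```r
# Hypothetical data frame with a single column.
scores <- data.frame(age = 20:22)

# A new name on the left of <- adds a column.
scores$age_plus5 <- scores$age + 5
ncol(scores)
# [1] 2

# An existing name on the left overwrites that column in place.
scores$age_plus5 <- scores$age_plus5 * 2
scores$age_plus5
# [1] 50 52 54
```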

9.2 Relational and Logical Operators

Relational operators return logical values, TRUE or FALSE, and include <, >, ==, <=, >=, and != (not equal to). Multiple relational operators can be combined with logical operators & (“and”) and | (“inclusive or”) inside filter() to subset a tibble into fewer rows that meet specific criteria.

Using the console window, run each command below one at a time. All but the first two lines use relational operators.

Code
x <- 7
print(x)
x == 4 
x == 7
x < 4
x <= 4
x >= 7
x != 0 
x < 1 | x > 3  
x != 0 & x > 10 
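
The commands above compare a single value. Relational and logical operators also work element by element on whole vectors, which is what makes them useful for subsetting rows. The sketch below uses two made-up vectors, ages and first_names, and base R only.

```r
ages <- 20:25
first_names <- c("greg", "sally", "sean", "carlos", "becka", "doug")

# Each element is compared in turn, giving one TRUE/FALSE per element.
ages >= 21
# [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# & keeps positions where both tests are TRUE.
keep <- ages >= 21 & first_names != "sally"
keep
# [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

# | keeps positions where at least one test is TRUE.
either <- ages < 21 | ages > 24
first_names[either]
# [1] "greg" "doug"
```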
IMPORTANT = vs ==

= is NOT a relational operator; it does not test for equality the way == does. = can be used for assignment just like <- (see note below). For example, the command below assigns the value 4 to x.

Code
x = 4
NOTE R Code Style Guide

Although either <- or = can be used for assignment, The Tidyverse Style Guide recommends using <- for assignment and reserving = for naming arguments inside function calls. Sticking with stylistic conventions makes your code more readable to others and in some cases avoids subtle bugs, such as creating a variable by accident when you only meant to name an argument.
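
One such problem can be sketched as follows. This is an illustrative example, not one taken from the style guide: inside a function call, = names an argument, while <- performs an assignment in your workspace before passing the value along. Note that the first line removes any existing x so that the checks are meaningful.

```r
# Start from a clean slate (this deletes x if you already have one).
if (exists("x", inherits = FALSE)) rm(x)

# = inside a call names the argument x of mean(); no variable is created.
m1 <- mean(x = c(2, 4, 6))
created_by_equals <- exists("x", inherits = FALSE)
created_by_equals
# [1] FALSE

# <- inside a call assigns first, so a variable x now exists in the workspace.
m2 <- mean(x <- c(2, 4, 6))
created_by_arrow <- exists("x", inherits = FALSE)
created_by_arrow
# [1] TRUE
```

Both calls return the same mean, 4, so the stray variable is easy to miss.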

9.3 select() and filter()

select() extracts columns from a tibble and filter() extracts rows. Both commands come from dplyr, a package within the tidyverse ecosystem of packages.

Run each line of code below and view the output.

Code
# create a smaller tibble containing only the selected columns.
select(tbl, name)  
# A tibble: 6 × 1
  name  
  <chr> 
1 greg  
2 sally 
3 sean  
4 carlos
5 becka 
6 doug  
Code
select(tbl, name, age)
# A tibble: 6 × 2
  name     age
  <chr>  <int>
1 greg      20
2 sally     21
3 sean      22
4 carlos    23
5 becka     24
6 doug      25
Code
# filter the rows of a tibble using relational operators.
filter(tbl, age >= 21, name != 'sally')
# A tibble: 4 × 3
  name     age age_plus5
  <chr>  <int>     <dbl>
1 sean      22        27
2 carlos    23        28
3 becka     24        29
4 doug      25        30
Code
# combine select() and filter() statements with the pipe operator.
tbl %>% 
    select(age, name) %>% 
    filter(age >= 21, name != 'sally')
# A tibble: 4 × 2
    age name  
  <int> <chr> 
1    22 sean  
2    23 carlos
3    24 becka 
4    25 doug  
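
The filter() calls above separate conditions with commas. As a small sketch, assuming the dplyr package is installed and using a made-up data frame called people, the block below shows that comma-separated conditions behave like &, and how | can be used when either condition is enough.

```r
library(dplyr)

# Hypothetical data, mirroring the names and ages used in this chapter.
people <- data.frame(
  name = c("greg", "sally", "sean", "carlos", "becka", "doug"),
  age  = 20:25
)

# Comma-separated conditions are combined with "and" ...
with_commas <- filter(people, age >= 21, name != "sally")

# ... so this is the same request written with &.
with_and <- filter(people, age >= 21 & name != "sally")
identical(with_commas$name, with_and$name)
# [1] TRUE

# | keeps rows that satisfy either condition (or both).
youngest_or_oldest <- filter(people, age < 21 | age > 24)
youngest_or_oldest$name
```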

9.4 Subsetting with Square Brackets

9.4.1 Single Brackets [ ]

While Data Essentials with R will mostly use tidyverse commands, students should also be familiar with the [ ] and [[ ]] methods. The most important details about brackets are how they behave on tibbles and dataframes. Brackets used with tibbles and dataframes can take one or two arguments:

  • Single arguments identify columns by name or index:
    • df['col_name']
  • When two arguments are used, they identify rows first then columns:
    • df[row_numbers, 'col_name']

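Single brackets used with tibbles always return a tibble, even if the result contains a single value. Single brackets used with dataframes may return a vector or a dataframe, depending on context: df['col_name'] returns a dataframe, but df[row_numbers, 'col_name'] simplifies a single column to a vector unless drop = FALSE is supplied. The consistent behavior of tibbles is an advantage over dataframes.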
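
The dataframe side of this contrast can be sketched with base R alone. The data frame people_df is a made-up example, and the tibble side is shown in the code that follows.

```r
# Hypothetical base R data frame.
people_df <- data.frame(
  name = c("greg", "sally", "sean"),
  age  = c(20L, 21L, 22L)
)

# One argument: a data frame comes back, even for a single column.
one_arg <- people_df["age"]
is.data.frame(one_arg)
# [1] TRUE

# Two arguments, one column: simplified to a vector by default.
two_args <- people_df[, "age"]
is.data.frame(two_args)
# [1] FALSE

# drop = FALSE asks for a data frame back.
kept <- people_df[, "age", drop = FALSE]
is.data.frame(kept)
# [1] TRUE
```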

Code
# extract the age_plus5 column
tbl[3]
# A tibble: 6 × 1
  age_plus5
      <dbl>
1        25
2        26
3        27
4        28
5        29
6        30
Code
tbl['age_plus5']
# A tibble: 6 × 1
  age_plus5
      <dbl>
1        25
2        26
3        27
4        28
5        29
6        30
Code
# extract the row 3, column 2 element: 22, sean's age
tbl[3, 2]
# A tibble: 1 × 1
    age
  <int>
1    22

9.4.2 Double Brackets [[ ]]

Double brackets are most often used with lists, but they also work with vectors, tibbles, and dataframes. Double brackets extract a single element and strip the parent data structure from the returned object. For example, tbl[[2]] returns the second column as a plain vector rather than as a one-column tibble.

Run each line of code below one at a time. Track the “temp” object in your environment panel. When temp is created as a tibble it will appear under “Data”. When temp is created as a vector it will appear under “Values”.

Code
temp <- tbl[c('name', 'age_plus5')]
temp
# A tibble: 6 × 2
  name   age_plus5
  <chr>      <dbl>
1 greg          25
2 sally         26
3 sean          27
4 carlos        28
5 becka         29
6 doug          30
Code
temp <- tbl[2]
temp
# A tibble: 6 × 1
    age
  <int>
1    20
2    21
3    22
4    23
5    24
6    25
Code
temp <- tbl[2, 3]
temp
# A tibble: 1 × 1
  age_plus5
      <dbl>
1        26
Code
temp <- tbl[c(1, 2), c(2, 3)]
temp
# A tibble: 2 × 2
    age age_plus5
  <int>     <dbl>
1    20        25
2    21        26
Code
temp <- tbl$age_plus5[2]
temp
[1] 26
Code
temp <- tbl[[2]]
temp
[1] 20 21 22 23 24 25
Code
temp <- tbl[[2, 3]]
temp
[1] 26
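
The examples above use a tibble. The same distinction on a list can be sketched as follows, using base R only and a made-up list called demo_list: single brackets return a shorter list, while double brackets return the element itself.

```r
ages <- 20:25
people <- c("greg", "sally", "sean", "carlos", "becka", "doug")

# Hypothetical list with a nested list inside it.
demo_list <- list(item1 = ages,
                  item2 = people,
                  nested_list = list(one = ages^2, two = people))

# Single brackets keep the parent structure: a list of length 1.
single <- demo_list["item1"]
class(single)
# [1] "list"

# Double brackets strip it: the integer vector itself.
double <- demo_list[["item1"]]
class(double)
# [1] "integer"

# Chained [[ ]] reaches the same element as chained $.
same_element <- identical(demo_list[["nested_list"]][["two"]],
                          demo_list$nested_list$two)

# [[ ]] also accepts a name stored in a variable.
wanted <- "item2"
from_variable <- demo_list[[wanted]]
```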

Bracket methods are quick to type and commonly appear in help vignettes and online forums. Explicit commands like select() and filter() take more typing, but are easier to read and understand. Also, select() and filter() chain naturally with other tidyverse commands through the pipe, whereas bracket methods are awkward to use inside a pipeline.