8 Data Structures And Metadata

8.1 Vectors

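Atomic vectors, or simply “vectors,” are collections of values that all share the same data type. The types covered here are character (text), factor (categorical), numeric (integer or double), and logical (TRUE or FALSE); a few others are rarely encountered at the introductory level. Strictly speaking, a factor is an integer vector with text labels attached, but it behaves like a type of its own.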

8.1.1 Character

Character vectors are composed of text or “strings.” Quotes (single or double) are required when creating or referencing character values. A character vector might be all the words that exist in a novel, or all the words used in tweets over a time period.
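As a small illustration, the sketch below builds a character vector and inspects it. The vector name words and its contents are made up for this example.

```r
# A hypothetical character vector; single and double quotes are interchangeable
words <- c("call", 'me', "Ishmael")

typeof(words)   # "character"
length(words)   # 3 values
nchar(words)    # number of characters in each string: 4 2 7
```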

8.1.2 Factor

Factor vectors look like text when printed, but R commands treat their values as mutually exclusive groupings (categories). Internally, R stores each category as an integer code with a text label. Factor vectors are ideal for variables like sex, ethnicity, and treatment group.

The mutually exclusive values of a factor variable are called levels. For example, the levels of a variable called treatment_group might include ‘placebo’, ‘low_dose’, and ‘high_dose’. Quotes (single or double) are required when creating or referencing factor levels.
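The treatment_group example can be sketched with the base R factor() command. The six assignments below are invented for illustration; the levels argument sets the order of the categories.

```r
# Hypothetical treatment assignments for six participants
treatment_group <- factor(
  c("placebo", "low_dose", "high_dose", "placebo", "high_dose", "low_dose"),
  levels = c("placebo", "low_dose", "high_dose")
)

levels(treatment_group)   # the three categories, in the order given above
table(treatment_group)    # number of observations in each level
typeof(treatment_group)   # "integer": each level is stored as an integer code
```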

8.1.3 Numeric

Numeric vectors come in two flavors, integer and double. The difference is usually unimportant, because R automatically converts integers to doubles when a calculation requires it. Briefly, integer vectors use less memory, but double vectors can hold fractional values and much larger numbers. An L after a number indicates an integer type. Whole numbers typed or displayed in R are not necessarily integers, however, as shown below.

Code

typeof(2)

[1] "double"

Code

typeof(2L)

[1] "integer"
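To see the automatic conversion in action, the sketch below checks the type of a few results. The object names are chosen only for this example.

```r
int_sum   <- 2L + 3L   # integer plus integer stays integer
mixed_sum <- 2L + 3    # mixing an integer with a double gives a double
quotient  <- 6L / 2L   # division returns a double, even for whole results

typeof(int_sum)        # "integer"
typeof(mixed_sum)      # "double"
typeof(quotient)       # "double"

.Machine$integer.max   # largest value an integer can hold: 2147483647
```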

8.1.4 Logical

Logical vectors can only assume two values, TRUE and FALSE (plus NA for missing values). Filtering commands that allow us to analyze or graph observations meeting specific criteria use logical vectors.
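A minimal sketch of filtering with a logical vector follows. The ages values and the cutoff of 22 are hypothetical.

```r
ages <- c(20, 21, 22, 23, 24, 25)   # hypothetical values

over_22 <- ages > 22    # a comparison returns a logical vector
over_22                 # FALSE FALSE FALSE TRUE TRUE TRUE

ages[over_22]           # keep only the values where over_22 is TRUE: 23 24 25
sum(over_22)            # TRUE counts as 1, so this counts the matches: 3
```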

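The “c” command (c = “collect” or “combine”) joins individual values into a vector. The code below creates four vectors: x, age, name, and age_plus5. Only name strictly needs the c command: a single value such as 7 is already a vector of length one, the colon operator in 20:25 already returns a vector, and age + 5 adds 5 to every value of age.

Code

x <- 7                   # a single value is a vector of length 1
age <- c(20:25)          # integers 20 through 25; 20:25 alone gives the same vector
name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
age_plus5 <- age + 5     # adds 5 to each value of age; the result is a double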

8.2 Dataframes & Tibbles

Dataframes and tibbles are rectangular arrangements of data with variables in columns and observations in rows. Tibbles are updated versions of dataframes provided by the tibble package, which is part of the tidyverse. Forums and textbooks often use “dataframe” to refer to both tibbles and dataframes.

The tidyverse commands read_csv() and tibble() create tibbles. The older data_frame() command also creates tibbles, but it is deprecated in favor of tibble().

The base R commands read.csv() and data.frame() create dataframes.

Tibbles have a few advantages over dataframes, although for most code these data structures work interchangeably and any differences are subtle. Data Essentials With R will use tibbles almost exclusively. Below we demonstrate common ways of creating dataframes and tibbles. By default, the head() command displays the first 6 rows of a data structure.

Code

# create a dataframe from vectors.
df <- data.frame(age, name, age_plus5)
head(df)

  age   name age_plus5
1  20   greg        25
2  21  sally        26
3  22   sean        27
4  23 carlos        28
5  24  becka        29
6  25   doug        30

Code

# create a tibble from vectors (tibble() requires the tidyverse to be loaded)
library(tidyverse)
tbl <- tibble(age, name, age_plus5)
head(tbl)

# A tibble: 6 × 3
    age name   age_plus5
  <int> <chr>      <dbl>
1    20 greg          25
2    21 sally         26
3    22 sean          27
4    23 carlos        28
5    24 becka         29
6    25 doug          30

Code

# import data from a .csv file into a tibble
cholesterol <- read_csv(file = "data/Friends_Cholesterol.csv")
head(cholesterol)

# A tibble: 6 × 11
  name       age gender height group weight_i hdl_i ldl_i weight_f hdl_f ldl_f
  <chr>    <dbl>  <dbl>  <dbl> <dbl>    <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
1 William     26      0     69     0      172    49    81      156    48    86
2 Richard     37      0     67     0      184    54    95      165    53    93
3 Joseph      45      0     66     0      191    49   106      180    46   112
4 Daniel      35      0     72     0      181    50    98      168    51   100
5 Jennifer    44      1     64     0      203    44   107      189    47   111
6 Barbara     23      1     61     0      193    59   103      187    59   111
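Two of the subtle differences between dataframes and tibbles can be sketched as follows. The objects df_demo and tbl_demo are made up for this example, and the tibble portion runs only if the tibble package is installed.

```r
df_demo <- data.frame(age = 20:22, name = c("greg", "sally", "sean"))

# A base dataframe simplifies a single selected column to a plain vector
class(df_demo[, "age"])    # "integer"

# ... and $ accepts a partial column name
df_demo$na                 # matches the name column

if (requireNamespace("tibble", quietly = TRUE)) {
  tbl_demo <- tibble::as_tibble(df_demo)

  # A tibble stays a tibble when a single column is selected with [ ]
  class(tbl_demo[, "age"])

  # A tibble does not match partial names: tbl_demo$na is NULL, with a warning
}
```

The tibble behavior is stricter, which tends to surface typos in column names earlier.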

8.3 Lists

The next level up in the data structure hierarchy is a list. Lists are ordered collections of elements, where each element can be any data structure, including another list. Elements are usually given names, which makes a list a collection of key-value pairs: the key (name) is an identifier and the value is the element. Named lists are similar to dictionaries in the Python language.

Because of their flexibility, lists often store the output of commands and statistical tests, which may involve tables, tibbles, and vectors. Below we create a list of 5 objects and then display the list structure with the str() command. The keys do not have to be quoted, but doing so helps the keys stand out from values.

Code

# Create a list
my_list <- list('item1' = x, 
                'item2' = age, 
                'data1' = df, 
                'data2' =  tbl,
                'my_list2' = list('one' = x, 
                                  'two' = name, 
                                  'three' = age_plus5))

str(my_list)

List of 5
 $ item1   : num 7
 $ item2   : int [1:6] 20 21 22 23 24 25
 $ data1   :'data.frame':   6 obs. of  3 variables:
  ..$ age      : int [1:6] 20 21 22 23 24 25
  ..$ name     : chr [1:6] "greg" "sally" "sean" "carlos" ...
  ..$ age_plus5: num [1:6] 25 26 27 28 29 30
 $ data2   : tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ age      : int [1:6] 20 21 22 23 24 25
  ..$ name     : chr [1:6] "greg" "sally" "sean" "carlos" ...
  ..$ age_plus5: num [1:6] 25 26 27 28 29 30
 $ my_list2:List of 3
  ..$ one  : num 7
  ..$ two  : chr [1:6] "greg" "sally" "sean" "carlos" ...
  ..$ three: num [1:6] 25 26 27 28 29 30
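Once a list exists, its elements can be retrieved by name or by position. The sketch below uses a small, made-up list called results rather than my_list so that it runs on its own.

```r
results <- list(id = 7,
                scores = c(90, 85),
                nested = list(label = "demo"))

results$scores          # access an element by name with $
results[["scores"]]     # the same element, using [[ ]] and a quoted name
results[[2]]            # the same element again, by position
results$nested$label    # reach into a nested list: "demo"
names(results)          # the keys: "id" "scores" "nested"

unnamed <- list(1, "a", TRUE)   # names are optional
names(unnamed)                  # NULL
```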

8.4 Metadata

Metadata is information about data, including attributes. Every dataframe or tibble has at least three attributes: column names, row names (or numbers), and class. In RStudio, click the white triangle in a blue circle icon next to an object in the Environment tab to view its metadata. The str() command (str for “structure”) also displays metadata for any object, as shown below.

Code

# show metadata for age
str(tbl$age)

 int [1:6] 20 21 22 23 24 25

The output indicates that age is an integer vector, “int”, and has 6 values, “[1:6]”. The first few values of the age variable are also displayed.

Code

str(tbl)

tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
 $ age      : int [1:6] 20 21 22 23 24 25
 $ name     : chr [1:6] "greg" "sally" "sean" "carlos" ...
 $ age_plus5: num [1:6] 25 26 27 28 29 30

The metadata output shows that tbl is a “tibble” with 6 rows and 3 columns, “[6 × 3]”. The metadata for each column in the tibble is also displayed. The information in parentheses indicates that a tibble is an S3 object with the classes tbl_df, tbl, and data.frame; S3 is one of R's styles of object-oriented programming. The details of S3 and object-oriented programming are beyond the scope of Data Essentials With R.
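The three attributes mentioned at the start of this section can also be inspected directly with base R commands. The sketch below uses a small, made-up dataframe called df_demo.

```r
df_demo <- data.frame(age = 20:22, name = c("greg", "sally", "sean"))

attr_names <- names(attributes(df_demo))
attr_names             # includes "names", "class", and "row.names"

names(df_demo)         # column names: "age" "name"
class(df_demo)         # "data.frame"
row.names(df_demo)     # row names, stored as text: "1" "2" "3"
dim(df_demo)           # rows and columns: 3 2
```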