8 Data Structures And Metadata
8.1 Vectors
Atomic vectors, or simply “vectors,” are collections of values of the same data type. Data types include character (text), factor (categorical), numeric (integer or double), logical (TRUE or FALSE), and a few others rarely encountered at the introductory level.
8.1.1 Character
Character vectors are composed of text or “strings.” Quotes (single or double) are required when creating or referencing character values. A character vector might be all the words that exist in a novel, or all the words used in tweets over a time period.
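As a minimal sketch, the code below creates and inspects a small character vector. The name `words` and its contents are made up for illustration.

```r
# Hypothetical example data: a small character vector
words <- c("call", "me", "Ishmael")

typeof(words)   # "character"
length(words)   # 3 values in the vector
nchar(words)    # characters in each string: 4 2 7

# Single and double quotes create the same value
identical('data', "data")   # TRUE
```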
8.1.2 Factor
Factor vectors look like text when displayed, but R commands treat their values as mutually exclusive groupings (categories). Internally, R stores each factor value as an integer code paired with a text label. Factor vectors are ideal for variables like sex, ethnicity, and treatment group.
The mutually exclusive values of a factor variable are called levels. For example, the levels of a variable called treatment_group might include ‘placebo’, ‘low_dose’, and ‘high_dose’. Quotes (single or double) are required when creating or referencing factor levels.
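The sketch below builds a factor with the three levels named above. The six treatment assignments are hypothetical, and the `levels` argument is used here only to fix the order of the levels.

```r
# Hypothetical treatment assignments for six participants
treatment_group <- factor(
  c("placebo", "low_dose", "high_dose", "placebo", "high_dose", "low_dose"),
  levels = c("placebo", "low_dose", "high_dose")
)

levels(treatment_group)        # "placebo" "low_dose" "high_dose"
class(treatment_group)         # "factor"
typeof(treatment_group)        # "integer": levels are stored as integer codes
table(treatment_group)         # number of observations in each level
treatment_group == "placebo"   # quotes are needed when referencing a level
```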
8.1.3 Numeric
Numeric vectors come in two flavors, integer and double. The difference is usually unimportant, and R will automatically switch between integer and double as needed. Briefly, integer vectors use less memory, but double vectors can hold larger numbers. An L after a number indicates an integer type. However, whole numbers displayed in R are not always integers, as shown below.
Code
typeof(2)
[1] "double"
Code
typeof(2L)
[1] "integer"
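A few more checks, offered as a sketch, illustrate how R moves between integer and double during arithmetic and why doubles can hold larger numbers.

```r
typeof(2L + 3L)    # "integer": integer arithmetic stays integer
typeof(2L + 0.5)   # "double": mixing integer and double gives a double
typeof(6L / 3L)    # "double": division returns a double even for a whole-number result

.Machine$integer.max   # 2147483647, the largest value an integer can hold
2147483647 + 1         # 2147483648, which fits because the numbers are doubles
```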
8.1.4 Logical
Logical vectors can only assume two values, TRUE and FALSE. Filtering commands that allow us to analyze or graph observations meeting specific criteria use logical vectors.
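As a minimal sketch of filtering, a comparison produces a logical vector, which can then select the values that meet the criterion. The `ages` values are hypothetical.

```r
# Hypothetical ages, for illustration only
ages <- c(20, 35, 42, 18, 51)

over_30 <- ages > 30   # FALSE TRUE TRUE FALSE TRUE
typeof(over_30)        # "logical"

ages[over_30]          # keeps only the values where over_30 is TRUE: 35 42 51
sum(over_30)           # TRUE counts as 1, so this is the number of matches: 3
```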
The “c” command (c = “collect” or “combine”) creates vectors. The code below uses the c command to create four vectors: x, age, name, and age_plus5.
Code
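x <- 7                  # a numeric vector of length 1
age <- c(20:25)         # the integers 20 through 25
name <- c('greg', 'sally', 'sean', 'carlos', 'becka', 'doug')
age_plus5 <- age + 5    # arithmetic is applied to every element of age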
8.2 Dataframes & Tibbles
Dataframes and tibbles are rectangular arrangements of data with variables in columns and observations in rows. Tibbles are an updated version of dataframes provided by the tibble package, which is part of the tidyverse. Forums and textbooks often use “dataframe” to refer to both tibbles and dataframes.
The tidyverse commands read_csv() and tibble() create tibbles, as does data_frame(), an older command that is deprecated in favor of tibble().
The base R commands read.csv() and data.frame() create dataframes.
Tibbles have a few advantages over dataframes, such as printed output that shows each column's data type, although for most code these data structures work interchangeably and any differences are subtle. Data Essentials With R will use tibbles almost exclusively. Below we demonstrate common ways of creating dataframes and tibbles. By default, the head() command displays the first 6 rows of a data structure.
Code
# create a dataframe from vectors.
df <- data.frame(age, name, age_plus5)
head(df)
  age   name age_plus5
1  20   greg        25
2  21  sally        26
3  22   sean        27
4  23 carlos        28
5  24  becka        29
6  25   doug        30
Code
# create a tibble from vectors
library(tidyverse)  # provides tibble() and read_csv()
tbl <- tibble(age, name, age_plus5)
head(tbl)
# A tibble: 6 × 3
    age name   age_plus5
  <int> <chr>      <dbl>
1    20 greg          25
2    21 sally         26
3    22 sean          27
4    23 carlos        28
5    24 becka         29
6    25 doug          30
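One of the subtle differences between the two structures appears when selecting a single column with square brackets. The sketch below assumes the tibble package is installed and uses small hypothetical objects, `df_demo` and `tbl_demo`, so that it runs on its own.

```r
library(tibble)  # assumption: the tibble package (part of the tidyverse) is installed

# Hypothetical data, a shortened version of the vectors used in this chapter
df_demo  <- data.frame(age = 20:22, name = c("greg", "sally", "sean"))
tbl_demo <- tibble(age = 20:22, name = c("greg", "sally", "sean"))

# Selecting one column with [ , ] simplifies a dataframe to a vector...
class(df_demo[, "age"])    # "integer"

# ...but a tibble stays a tibble
class(tbl_demo[, "age"])   # "tbl_df" "tbl" "data.frame"

# With $, both return the same plain vector
identical(df_demo$age, tbl_demo$age)   # TRUE
```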
Code
# import data from a .csv file into a tibble
cholesterol <- read_csv(file = "data/Friends_Cholesterol.csv")
head(cholesterol)
# A tibble: 6 × 11
  name       age gender height group weight_i hdl_i ldl_i weight_f hdl_f ldl_f
  <chr>    <dbl>  <dbl>  <dbl> <dbl>    <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
1 William     26      0     69     0      172    49    81      156    48    86
2 Richard     37      0     67     0      184    54    95      165    53    93
3 Joseph      45      0     66     0      191    49   106      180    46   112
4 Daniel      35      0     72     0      181    50    98      168    51   100
5 Jennifer    44      1     64     0      203    44   107      189    47   111
6 Barbara     23      1     61     0      193    59   103      187    59   111
8.3 Lists
The next level up in the data structure hierarchy is a list. Lists are ordered collections of key-value pairs, where the key is a name that identifies an element and the value can be any data structure, including another list. Named lists are similar to dictionaries in the Python language.
Because of their flexibility, lists often store the output of commands and statistical tests, which may involve tables, tibbles, and vectors. Below we create a list of 5 objects and then display the list structure with the str() command. The keys do not have to be quoted, but doing so helps the keys stand out from values.
Code
# Create a list
my_list = list('item1' = x,
               'item2' = age,
               'data1' = df,
               'data2' = tbl,
               'my_list2' = list('one' = x,
                                 'two' = name,
                                 'three' = age_plus5))
str(my_list)
List of 5
$ item1 : num 7
$ item2 : int [1:6] 20 21 22 23 24 25
$ data1 :'data.frame': 6 obs. of 3 variables:
..$ age : int [1:6] 20 21 22 23 24 25
..$ name : chr [1:6] "greg" "sally" "sean" "carlos" ...
..$ age_plus5: num [1:6] 25 26 27 28 29 30
$ data2 : tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
..$ age : int [1:6] 20 21 22 23 24 25
..$ name : chr [1:6] "greg" "sally" "sean" "carlos" ...
..$ age_plus5: num [1:6] 25 26 27 28 29 30
$ my_list2:List of 3
..$ one : num 7
..$ two : chr [1:6] "greg" "sally" "sean" "carlos" ...
..$ three: num [1:6] 25 26 27 28 29 30
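Elements of a list can be retrieved by key or by position. The sketch below uses a small hypothetical list, `results`, rather than my_list, so that it runs on its own.

```r
# Hypothetical list with three named elements, one of them a nested list
results <- list(id = 7,
                ages = 20:25,
                nested = list(one = "a", two = c(TRUE, FALSE)))

names(results)           # the keys: "id" "ages" "nested"
results$ages             # extract an element by key
results[["ages"]]        # same element, with the key quoted
results[[2]]             # same element, extracted by position
class(results["ages"])   # "list": single brackets return a smaller list
results$nested$two       # reach into the nested list: TRUE FALSE
```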
8.4 Metadata
Metadata is information about data, including attributes. Every dataframe or tibble has at least three attributes: column names, row names (or numbers), and class. Click the white triangle in a blue circle icon next to objects in the Environment tab to view their metadata. The str(data) command (str for “structure”) will also display metadata for any object, as shown below.
Code
# show metadata for age
str(tbl$age)
int [1:6] 20 21 22 23 24 25
The output indicates that age is an integer vector, “int”, and has 6 values, “[1:6]”. The first few values of the age variable are also displayed.
Code
str(tbl)
tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
$ age : int [1:6] 20 21 22 23 24 25
$ name : chr [1:6] "greg" "sally" "sean" "carlos" ...
$ age_plus5: num [1:6] 25 26 27 28 29 30
The metadata output shows that tbl is a tibble with 6 rows and 3 columns, “[6 × 3]”. The metadata for each column in the tibble is also displayed. The information in parentheses indicates that tbl is an S3 object whose classes are tbl_df, tbl, and data.frame. S3 is one of R's systems for object-oriented programming. The details of S3 and object-oriented programming are beyond the scope of Data Essentials With R.
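The individual attributes can also be inspected directly. The sketch below uses base R commands on a small hypothetical dataframe, `df_demo`.

```r
# Hypothetical dataframe for illustration
df_demo <- data.frame(age = 20:22, name = c("greg", "sally", "sean"))

attributes(df_demo)   # a list holding the names, class, and row.names attributes
names(df_demo)        # column names: "age" "name"
row.names(df_demo)    # row names: "1" "2" "3"
class(df_demo)        # "data.frame"
dim(df_demo)          # number of rows and columns: 3 2
```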