11 Explore Data
Exploratory data analysis (EDA) is learning about a dataset without focusing on preconceived expectations or hypotheses. EDA can reveal unexpected insights, and influence how data are subsequently analyzed. While every data situation is unique, a basic EDA should explore the completeness of the data (AKA missingness), univariate distributions for categorical and quantitative variables, outliers and erroneous or impossible values, and correlations between variables. Simple plots like histograms and bar graphs are created to explore distributions, while tables with measures of center, variation, and frequency are created to explore variables numerically. In this section we start with data that are already wrangled and ready for analysis. However, data wrangling and exploratory data analysis are often intermixed, as data wrangling enables exploratory data analysis, and exploratory data analysis can reveal additional problems with the data.
11.1 Characterize Missing Values
Knowing which variables and observations have missing values can be important for an analysis. At a minimum we should know how many missing values exist and if the missing values are randomly distributed or concentrated at certain timepoints, groups, or other conditions. Below we demonstrate a few simple methods for characterizing missing values using a tibble (tbl) produced in the previous lesson.
Code
# How many missing values exist?
%>% is.na %>% sum tbl
[1] 36
Code
# How many rows have missing values?
%>%
tbl filter(!complete.cases(tbl)) %>%
nrow()
[1] 19
Code
# Which variables have missing values?
summary(tbl)
name age age_c sex height
William : 2 Min. :21.00 20 - 30:18 male :36 Min. :54.00
Richard : 2 1st Qu.:32.50 31 - 40:22 female:38 1st Qu.:61.75
Joseph : 2 Median :40.50 41 - 55:40 NA's : 6 Median :65.00
Daniel : 2 Mean :39.62 Mean :64.65
Jennifer: 2 3rd Qu.:47.25 3rd Qu.:67.00
Barbara : 2 Max. :55.00 Max. :74.00
(Other) :68
group timepoint weight bmi hdl
control:40 initial:40 Min. :156.0 Min. :20.03 Min. :44.00
statin :40 final :40 1st Qu.:177.0 1st Qu.:27.68 1st Qu.:50.00
Median :187.0 Median :31.37 Median :54.00
Mean :185.7 Mean :31.88 Mean :53.69
3rd Qu.:194.0 3rd Qu.:35.28 3rd Qu.:56.50
Max. :214.0 Max. :49.18 Max. :64.00
NA's :6 NA's :6 NA's :5
ldl tc
Min. : 47.00 Min. :104.0
1st Qu.: 82.00 1st Qu.:137.0
Median : 93.50 Median :148.0
Mean : 94.32 Mean :148.2
3rd Qu.:107.50 3rd Qu.:162.0
Max. :140.00 Max. :190.0
NA's :4 NA's :9
Code
# Are the total cholesterol missing values balanced across timepoints?
%>%
tbl group_by(timepoint) %>%
filter(is.na(tc)) %>%
summarize("Total Cholesterol NA's" = n())
# A tibble: 2 × 2
timepoint `Total Cholesterol NA's`
<fct> <int>
1 initial 4
2 final 5
11.2 Descriptive Statistics
11.2.1 Base R Commands
Base R uses intuitively named descriptive statistics commands that work with tibbles and vectors. Often the command is as simple as one word with a data object passed into the command.
Code
# Base R commands with tibbles or dataframes
mean(tbl$height)
min(81, 57, 66, 94, 79, 44, 19, 100, 64, 20)
max(tbl$height)
range(81, 57, 66, 94, 79, 44, 19, 100, 64, 20)
sd(tbl$age)
# Base R commands with vectors
mean(0:100)
min(0:100)
max(0:100)
range(0:100)
sd(0:100)
11.2.2 The summary command
Like many commands in R, summary() uses different methods for different object types. As shown below, summary() will produce different outputs for tibbles, factor vectors, and numeric vectors.
Code
summary(tbl$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.00 32.50 40.50 39.62 47.25 55.00
Code
summary(tbl$sex)
male female NA's
36 38 6
Code
summary(tbl)
name age age_c sex height
William : 2 Min. :21.00 20 - 30:18 male :36 Min. :54.00
Richard : 2 1st Qu.:32.50 31 - 40:22 female:38 1st Qu.:61.75
Joseph : 2 Median :40.50 41 - 55:40 NA's : 6 Median :65.00
Daniel : 2 Mean :39.62 Mean :64.65
Jennifer: 2 3rd Qu.:47.25 3rd Qu.:67.00
Barbara : 2 Max. :55.00 Max. :74.00
(Other) :68
group timepoint weight bmi hdl
control:40 initial:40 Min. :156.0 Min. :20.03 Min. :44.00
statin :40 final :40 1st Qu.:177.0 1st Qu.:27.68 1st Qu.:50.00
Median :187.0 Median :31.37 Median :54.00
Mean :185.7 Mean :31.88 Mean :53.69
3rd Qu.:194.0 3rd Qu.:35.28 3rd Qu.:56.50
Max. :214.0 Max. :49.18 Max. :64.00
NA's :6 NA's :6 NA's :5
ldl tc
Min. : 47.00 Min. :104.0
1st Qu.: 82.00 1st Qu.:137.0
Median : 93.50 Median :148.0
Mean : 94.32 Mean :148.2
3rd Qu.:107.50 3rd Qu.:162.0
Max. :140.00 Max. :190.0
NA's :4 NA's :9
When we load packages such as tidyverse, those packages often add additional methods to commands like summary()
. Presently, base R
has 32 methods available for summary()
. Loading tidyverse adds another 15 methods for a total of 47. A powerful R feature is that we do not have to choose from this list explicitly. R recognizes the type of object and applies the appropriate method.
11.2.3 R
looping functions
Base R includes a collection of “apply” functions that are designed to apply a function over the elements of a list, tibble or vector. The most common are apply()
, sapply()
and lapply()
.
11.2.3.1 apply()
apply()
runs a command over either the rows (MARGIN = 1) or columns (MARGIN = 2) of a tibble, dataframe, or matrix (not covered). Below we use apply() to calculate the mean of every numeric variable in tbl. The first argument identifies a subset of tbl, where we have used select_if() to choose only the variables that are numeric.
Code
apply(X = select_if(tbl, is.numeric),
MARGIN = 2,
FUN = mean)
age height weight bmi hdl ldl tc
39.625 64.650 NA NA NA NA NA
11.2.3.2 lapply()
lapply()
functions like apply() but always returns output as a list.
Code
%>% select_if(is.numeric) %>% lapply(mean) tbl
$age
[1] 39.625
$height
[1] 64.65
$weight
[1] NA
$bmi
[1] NA
$hdl
[1] NA
$ldl
[1] NA
$tc
[1] NA
11.2.3.3 sapply()
sapply()
functions like apply()
, but returns output using the most simple base R
data structure possible, which could be a list, dataframe, or vector depending on the scenario.
Code
%>% select_if(is.numeric) %>% sapply(mean) tbl
age height weight bmi hdl ldl tc
39.625 64.650 NA NA NA NA NA
11.2.4 Ignoring NA’s
R commands don’t ignore NA by default, as demonstrated below.
Code
mean(tbl$hdl)
[1] NA
Code
%>%
tbl ggplot(aes(x = sex, y = hdl)) +
geom_bar(stat = 'summary',
fun = mean)
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
11.2.4.1 na.rm = TRUE
The na.rm = TRUE tells R to remove (ignore) NAs
Code
mean(tbl$hdl, na.rm =TRUE)
[1] 53.69333
11.2.4.2 drop_na
The drop_na commad will drop all rows with any missing values (default) or will drop rows with missing values for specific columns.
Code
# drop all rows with any missing values
%>%
tbl %>%
drop_na summary
name age age_c sex height
William : 2 Min. :21.00 20 - 30:16 male :31 Min. :54.00
Richard : 2 1st Qu.:30.00 31 - 40:14 female:30 1st Qu.:62.00
Joseph : 2 Median :41.00 41 - 55:31 Median :65.00
Jennifer: 2 Mean :39.69 Mean :65.08
Jessica : 2 3rd Qu.:47.00 3rd Qu.:67.00
Emily : 2 Max. :55.00 Max. :74.00
(Other) :49
group timepoint weight bmi hdl
control:29 initial:30 Min. :156.0 Min. :20.03 Min. :44.00
statin :32 final :31 1st Qu.:177.0 1st Qu.:27.32 1st Qu.:50.00
Median :185.0 Median :30.50 Median :54.00
Mean :185.2 Mean :31.38 Mean :53.84
3rd Qu.:194.0 3rd Qu.:34.76 3rd Qu.:57.00
Max. :205.0 Max. :49.18 Max. :64.00
ldl tc
Min. : 47.00 Min. :104.0
1st Qu.: 81.00 1st Qu.:137.0
Median : 93.00 Median :146.0
Mean : 93.48 Mean :147.3
3rd Qu.:109.00 3rd Qu.:161.0
Max. :140.00 Max. :190.0
Code
# drop rows with missing values in specific columns
%>%
tbl drop_na(sex, hdl) %>%
ggplot(aes(x = sex, y = hdl)) +
geom_bar(stat = 'summary',
fun = mean)
11.3 Distributions & Bivariate Relationships
Distributions present the values of a variable and the frequency each value is observed. Exploring distributions tell us qualitative information about center, variation, shape, and outliers and identify strengths and limitations in the data, such as normal distributions (a strength) and imbalanced experimental designs (a weakness). In the exploratory data analysis phase, we generally want to explore many possible relationships and are not yet creating polished graphs for publication.
11.3.1 GGally::ggpairs
The tidyverse-aligned command ggpairs() from the GGally package creates plot mosaics that are helpful for exploring many distributions and relationships with just a few lines of code.
11.3.1.1 Categorical Distributions and Bivariate Relationships
Code
%>%
tbl filter(timepoint == "initial") %>%
select(age_c, sex, group) %>%
drop_na() %>%
ggpairs(showStrips = TRUE)
11.3.1.2 Quantitative distributions & Bivariate Relationships
Code
%>%
tbl filter(timepoint == 'initial') %>%
select(where(is.numeric)) %>%
drop_na() %>%
ggpairs()
11.3.1.3 Specific Quantitative distributions & Bivariate Relationships, Colored by Factor
Code
<-
pm %>%
tbl filter(timepoint == 'initial') %>%
drop_na() %>%
ggpairs(aes(color = sex), columns = c('age', 'bmi', 'hdl', 'ldl', 'tc'))
pm
11.3.1.4 Isolate and Customize Plots
With the plot mosaic assigned to an object name (“pm” above), we can use base R brackets to isolate and customize a plot with additional ggplot layers. Below, we isolate and enhance the subplot located in row 5, column 1. A detailed coverage of ggplot layers is provided in lesson 12.
Code
5,1] +
pm[labs(y = "total cholesterol (ng/dl)") +
geom_smooth(method = 'lm', se = F) +
theme_bw()
11.4 Frequency tables
Frequency tables show the counts at each level of a categorical variable. When a frequency table includes a second categorical variable, it is called a two-way table. The base R table()
command provides this information, as demonstrated below.
Code
table(tbl$sex, tbl$group)
control statin
male 14 22
female 22 16
Code
# Add HTML formatting with commands from knitr::kable and kableExtra::kable_classic
table(tbl$sex, tbl$group) %>%
kable() %>%
kable_classic(full_width = F)
control | statin | |
---|---|---|
male | 14 | 22 |
female | 22 | 16 |
11.5 Frequency Tables with Janitor Package
With multiple categorical variables, we often wish to know conditional percentages in addition to observation counts. The Janitor package provides the tabyl() (spelled with a ‘y’) command for creating frequency tables with conditional percentages, and has the added benefit of being tidyverse aligned. However, janitor is not imported with tidyverse and must be loaded with a separate library command, or added to the package list in p_load().
Code
library(janitor)
# Load and preview the starwars data
data(starwars, package = 'dplyr')
glimpse(starwars)
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
Code
# Convert character variables into factors using mutate_at
%<>%
starwars mutate_at(vars(hair_color, skin_color, eye_color, sex, gender, homeworld, species),
as.factor)
# A simple one-way frequency table using tabyl with HTML formatting provided by kable() %>% kable_classic(full_width = F)
%>%
starwars tabyl(eye_color) %>%
adorn_rounding(2) %>%
kable() %>% kable_classic(full_width = F)
eye_color | n | percent |
---|---|---|
black | 10 | 0.11 |
blue | 19 | 0.22 |
blue-gray | 1 | 0.01 |
brown | 21 | 0.24 |
dark | 1 | 0.01 |
gold | 1 | 0.01 |
green, yellow | 1 | 0.01 |
hazel | 3 | 0.03 |
orange | 8 | 0.09 |
pink | 1 | 0.01 |
red | 5 | 0.06 |
red, blue | 1 | 0.01 |
unknown | 3 | 0.03 |
white | 1 | 0.01 |
yellow | 11 | 0.13 |
Code
# A two-way table for eye color and gender for non-droids using tabyl
%>%
starwars filter(species != 'Droid') %>%
tabyl(gender, eye_color) %>%
kable() %>% kable_classic(full_width = F)
gender | black | blue | blue-gray | brown | dark | gold | green, yellow | hazel | orange | pink | red | red, blue | unknown | white | yellow |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feminine | 2 | 6 | 0 | 4 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
masculine | 7 | 12 | 1 | 15 | 1 | 1 | 1 | 1 | 8 | 1 | 2 | 0 | 2 | 0 | 9 |
11.5.1 “adorn” frequency table with details
After the initial table setup using tabyl()
, we “adorn” the table with additional features linked together in a pipe chain. Lastly, because the output is still a tibble, we can pipe into the kable() and kable_classic commands to create an aesthetically pleasing table ready for presentation or publication.
Code
%>%
starwars filter(species == 'Human') %>%
tabyl(gender, eye_color) %>%
# add totals, conditional percents, rounding
adorn_totals(c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(rounding = "half up", digits = 0) %>%
# add sample sizes in parentheses
adorn_ns() %>%
# add a title
adorn_title("combined",
row_name = "Gender",
col_name = "Eye Color") %>%
# add nice aesthetics
%>% kable_classic(full_width = FALSE) kable
Gender/Eye Color | black | blue | blue-gray | brown | dark | gold | green, yellow | hazel | orange | pink | red | red, blue | unknown | white | yellow | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feminine | 0% (0) | 33% (3) | 0% (0) | 44% (4) | 0% (0) | 0% (0) | 0% (0) | 11% (1) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 11% (1) | 0% (0) | 0% (0) | 100% (9) |
masculine | 0% (0) | 35% (9) | 4% (1) | 46% (12) | 4% (1) | 0% (0) | 0% (0) | 4% (1) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 8% (2) | 100% (26) |
Total | 0% (0) | 34% (12) | 3% (1) | 46% (16) | 3% (1) | 0% (0) | 0% (0) | 6% (2) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 3% (1) | 0% (0) | 6% (2) | 100% (35) |
11.6 Pivot tables with group_by()
& summarize()
Pivot tables present descriptive information about data grouped at the levels of one or more categorical variables. This is a useful approach when we wish to compare characteristics between males and females, or across ethnicity, age groups, treatment groups, etc.
The tidyverse commands group_by()
and summarize()
work together to create new tibbles, “pivot tables”, with summary statistics, grouped by one or more factor variables. When we produce these summary tibbles for reports and presentations, we often break the naming rules by including spaces and enclosing the name in quotes as shown below. Lastly, we use commands from the knitr and kableExtra packages to format our tibble with publication-ready aesthetics.
Pivot table with one grouping variable:
Code
%>%
tbl
drop_na(sex) %>%
group_by(sex) %>%
summarize(Count = n(),
'Age Mean' = mean(age) %>% round,
'Age min' = min(age),
'Age max'= max(age),
'Age SD' = sd(age) %>% round) %>%
kable() %>% # kable is from knitr
kable_classic(full_width = F) # kable_classic is from kableExtra
sex | Count | Age Mean | Age min | Age max | Age SD |
---|---|---|---|---|---|
male | 36 | 39 | 21 | 55 | 11 |
female | 38 | 39 | 23 | 55 | 9 |
Pivot table with two grouping variables and nice formatting:
Code
%>%
tbl
drop_na(sex) %>%
group_by(sex, group) %>%
summarize(Count = n(),
'Age Mean' = mean(age) %>% round,
'Age min' = min(age),
'Age max'= max(age),
'Age SD' = sd(age) %>% round) %>%
kable() %>% # kable is from knitr
kable_classic(full_width = F) # kable_classic is from kableExtra
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
sex | group | Count | Age Mean | Age min | Age max | Age SD |
---|---|---|---|---|---|---|
male | control | 14 | 39 | 26 | 54 | 9 |
male | statin | 22 | 39 | 21 | 55 | 12 |
female | control | 22 | 38 | 23 | 51 | 8 |
female | statin | 16 | 41 | 23 | 55 | 11 |