11 Explore Data

Exploratory data analysis (EDA) is learning about a dataset without focusing on preconceived expectations or hypotheses. EDA can reveal unexpected insights, and influence how data are subsequently analyzed. While every data situation is unique, a basic EDA should explore the completeness of the data (AKA missingness), univariate distributions for categorical and quantitative variables, outliers and erroneous or impossible values, and correlations between variables. Simple plots like histograms and bar graphs are created to explore distributions, while tables with measures of center, variation, and frequency are created to explore variables numerically. In this section we start with data that are already wrangled and ready for analysis. However, data wrangling and exploratory data analysis are often intermixed, as data wrangling enables exploratory data analysis, and exploratory data analysis can reveal additional problems with the data.

11.1 Characterize Missing Values

Knowing which variables and observations have missing values can be important for an analysis. At a minimum we should know how many missing values exist and if the missing values are randomly distributed or concentrated at certain timepoints, groups, or other conditions. Below we demonstrate a few simple methods for characterizing missing values using a tibble (tbl) produced in the previous lesson.

Code

# How many missing values exist?
tbl %>% is.na %>% sum

[1] 36

Code

# How many rows have missing values?
tbl %>%
    filter(!complete.cases(tbl)) %>%
    nrow()

[1] 19

Code

# Which variables have missing values?
summary(tbl)

       name         age            age_c        sex         height     
 William : 2   Min.   :21.00   20 - 30:18   male  :36   Min.   :54.00  
 Richard : 2   1st Qu.:32.50   31 - 40:22   female:38   1st Qu.:61.75  
 Joseph  : 2   Median :40.50   41 - 55:40   NA's  : 6   Median :65.00  
 Daniel  : 2   Mean   :39.62                            Mean   :64.65  
 Jennifer: 2   3rd Qu.:47.25                            3rd Qu.:67.00  
 Barbara : 2   Max.   :55.00                            Max.   :74.00  
 (Other) :68                                                           
     group      timepoint      weight           bmi             hdl       
 control:40   initial:40   Min.   :156.0   Min.   :20.03   Min.   :44.00  
 statin :40   final  :40   1st Qu.:177.0   1st Qu.:27.68   1st Qu.:50.00  
                           Median :187.0   Median :31.37   Median :54.00  
                           Mean   :185.7   Mean   :31.88   Mean   :53.69  
                           3rd Qu.:194.0   3rd Qu.:35.28   3rd Qu.:56.50  
                           Max.   :214.0   Max.   :49.18   Max.   :64.00  
                           NA's   :6       NA's   :6       NA's   :5      
      ldl               tc       
 Min.   : 47.00   Min.   :104.0  
 1st Qu.: 82.00   1st Qu.:137.0  
 Median : 93.50   Median :148.0  
 Mean   : 94.32   Mean   :148.2  
 3rd Qu.:107.50   3rd Qu.:162.0  
 Max.   :140.00   Max.   :190.0  
 NA's   :4        NA's   :9

Code

# Are the total cholesterol missing values balanced across timepoints?
tbl %>%
    group_by(timepoint) %>%
    filter(is.na(tc)) %>%
    summarize("Total Cholesterol NA's" = n())

# A tibble: 2 × 2
  timepoint `Total Cholesterol NA's`
  <fct>                        <int>
1 initial                          4
2 final                            5

11.2 Descriptive Statistics

11.2.1 Base R Commands

Base R uses intuitively named descriptive statistics commands that work with tibbles and vectors. Often the command is as simple as one word with a data object passed into the command.

Code

# Base R commands with tibbles or dataframes
mean(tbl$height)
min(81, 57, 66, 94, 79, 44, 19, 100, 64, 20)
max(tbl$height)
range(81, 57, 66, 94, 79, 44, 19, 100, 64, 20)
sd(tbl$age)

# Base R commands with vectors
mean(0:100)
min(0:100)
max(0:100)
range(0:100)
sd(0:100)

11.2.2 The summary command

Like many commands in R, summary() uses different methods for different object types. As shown below, summary() will produce different outputs for tibbles, factor vectors, and numeric vectors.

Code

summary(tbl$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   32.50   40.50   39.62   47.25   55.00

Code

summary(tbl$sex)

  male female   NA's 
    36     38      6

Code

summary(tbl)

       name         age            age_c        sex         height     
 William : 2   Min.   :21.00   20 - 30:18   male  :36   Min.   :54.00  
 Richard : 2   1st Qu.:32.50   31 - 40:22   female:38   1st Qu.:61.75  
 Joseph  : 2   Median :40.50   41 - 55:40   NA's  : 6   Median :65.00  
 Daniel  : 2   Mean   :39.62                            Mean   :64.65  
 Jennifer: 2   3rd Qu.:47.25                            3rd Qu.:67.00  
 Barbara : 2   Max.   :55.00                            Max.   :74.00  
 (Other) :68                                                           
     group      timepoint      weight           bmi             hdl       
 control:40   initial:40   Min.   :156.0   Min.   :20.03   Min.   :44.00  
 statin :40   final  :40   1st Qu.:177.0   1st Qu.:27.68   1st Qu.:50.00  
                           Median :187.0   Median :31.37   Median :54.00  
                           Mean   :185.7   Mean   :31.88   Mean   :53.69  
                           3rd Qu.:194.0   3rd Qu.:35.28   3rd Qu.:56.50  
                           Max.   :214.0   Max.   :49.18   Max.   :64.00  
                           NA's   :6       NA's   :6       NA's   :5      
      ldl               tc       
 Min.   : 47.00   Min.   :104.0  
 1st Qu.: 82.00   1st Qu.:137.0  
 Median : 93.50   Median :148.0  
 Mean   : 94.32   Mean   :148.2  
 3rd Qu.:107.50   3rd Qu.:162.0  
 Max.   :140.00   Max.   :190.0  
 NA's   :4        NA's   :9

NOTE Summary Methods

When we load packages such as tidyverse, those packages often add additional methods to commands like summary(). Presently, base R has 32 methods available for summary(). Loading tidyverse adds another 15 methods for a total of 47. A powerful R feature is that we do not have to choose from this list explicitly. R recognizes the type of object and applies the appropriate method.

11.2.3 `R` looping functions

Base R includes a collection of “apply” functions that are designed to apply a function over the elements of a list, tibble or vector. The most common are apply(), sapply() and lapply().

11.2.3.1 `apply()`

apply() runs a command over either the rows (MARGIN = 1) or columns (MARGIN = 2) of a tibble, dataframe, or matrix (not covered). Below we use apply() to calculate the mean of every numeric variable in tbl. The first argument identifies a subset of tbl, where we have used select_if() to choose only the variables that are numeric.

Code

apply(X = select_if(tbl, is.numeric),
      MARGIN = 2,
      FUN = mean)

   age height weight    bmi    hdl    ldl     tc 
39.625 64.650     NA     NA     NA     NA     NA

11.2.3.2 `lapply()`

lapply() functions like apply() but always returns output as a list.

Code

tbl %>% select_if(is.numeric) %>% lapply(mean)

$age
[1] 39.625

$height
[1] 64.65

$weight
[1] NA

$bmi
[1] NA

$hdl
[1] NA

$ldl
[1] NA

$tc
[1] NA

11.2.3.3 `sapply()`

sapply() functions like apply(), but returns output using the most simple base R data structure possible, which could be a list, dataframe, or vector depending on the scenario.

Code

tbl %>% select_if(is.numeric) %>% sapply(mean)

   age height weight    bmi    hdl    ldl     tc 
39.625 64.650     NA     NA     NA     NA     NA

11.2.4 Ignoring NA’s

R commands don’t ignore NA by default, as demonstrated below.

Code

mean(tbl$hdl)

[1] NA

Code

tbl %>%
    ggplot(aes(x = sex, y = hdl)) +
    geom_bar(stat = 'summary',
             fun = mean)

Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).

11.2.4.1 na.rm = TRUE

The na.rm = TRUE tells R to remove (ignore) NAs

Code

mean(tbl$hdl, na.rm =TRUE)

[1] 53.69333

11.2.4.2 drop_na

The drop_na commad will drop all rows with any missing values (default) or will drop rows with missing values for specific columns.

Code

# drop all rows with any missing values
tbl %>%
    drop_na %>%
    summary

       name         age            age_c        sex         height     
 William : 2   Min.   :21.00   20 - 30:16   male  :31   Min.   :54.00  
 Richard : 2   1st Qu.:30.00   31 - 40:14   female:30   1st Qu.:62.00  
 Joseph  : 2   Median :41.00   41 - 55:31               Median :65.00  
 Jennifer: 2   Mean   :39.69                            Mean   :65.08  
 Jessica : 2   3rd Qu.:47.00                            3rd Qu.:67.00  
 Emily   : 2   Max.   :55.00                            Max.   :74.00  
 (Other) :49                                                           
     group      timepoint      weight           bmi             hdl       
 control:29   initial:30   Min.   :156.0   Min.   :20.03   Min.   :44.00  
 statin :32   final  :31   1st Qu.:177.0   1st Qu.:27.32   1st Qu.:50.00  
                           Median :185.0   Median :30.50   Median :54.00  
                           Mean   :185.2   Mean   :31.38   Mean   :53.84  
                           3rd Qu.:194.0   3rd Qu.:34.76   3rd Qu.:57.00  
                           Max.   :205.0   Max.   :49.18   Max.   :64.00  
                                                                          
      ldl               tc       
 Min.   : 47.00   Min.   :104.0  
 1st Qu.: 81.00   1st Qu.:137.0  
 Median : 93.00   Median :146.0  
 Mean   : 93.48   Mean   :147.3  
 3rd Qu.:109.00   3rd Qu.:161.0  
 Max.   :140.00   Max.   :190.0

Code

# drop rows with missing values in specific columns
tbl %>%
    drop_na(sex, hdl) %>%
    ggplot(aes(x = sex, y = hdl)) +
    geom_bar(stat = 'summary',
             fun = mean)

11.3 Distributions & Bivariate Relationships

Distributions present the values of a variable and the frequency each value is observed. Exploring distributions tell us qualitative information about center, variation, shape, and outliers and identify strengths and limitations in the data, such as normal distributions (a strength) and imbalanced experimental designs (a weakness). In the exploratory data analysis phase, we generally want to explore many possible relationships and are not yet creating polished graphs for publication.

11.3.1 GGally::ggpairs

The tidyverse-aligned command ggpairs() from the GGally package creates plot mosaics that are helpful for exploring many distributions and relationships with just a few lines of code.

11.3.1.1 Categorical Distributions and Bivariate Relationships

Code

tbl %>% 
    filter(timepoint == "initial") %>%
    select(age_c, sex, group) %>%
    drop_na() %>%
    ggpairs(showStrips = TRUE)

11.3.1.2 Quantitative distributions & Bivariate Relationships

Code

tbl %>%
    filter(timepoint == 'initial') %>% 
    select(where(is.numeric)) %>%
    drop_na() %>%
    ggpairs()

11.3.1.3 Specific Quantitative distributions & Bivariate Relationships, Colored by Factor

Code

pm <- 
    tbl %>%
    filter(timepoint == 'initial') %>%
    drop_na() %>%
    ggpairs(aes(color = sex), columns = c('age', 'bmi', 'hdl', 'ldl', 'tc'))
pm

11.3.1.4 Isolate and Customize Plots

With the plot mosaic assigned to an object name (“pm” above), we can use base R brackets to isolate and customize a plot with additional ggplot layers. Below, we isolate and enhance the subplot located in row 5, column 1. A detailed coverage of ggplot layers is provided in lesson 12.

Code

pm[5,1] + 
    labs(y = "total cholesterol (ng/dl)") + 
    geom_smooth(method = 'lm', se = F) +
    theme_bw()

11.4 Frequency tables

Frequency tables show the counts at each level of a categorical variable. When a frequency table includes a second categorical variable, it is called a two-way table. The base R table() command provides this information, as demonstrated below.

Code

table(tbl$sex, tbl$group)

        
         control statin
  male        14     22
  female      22     16

Code

# Add HTML formatting with commands from knitr::kable and kableExtra::kable_classic

table(tbl$sex, tbl$group) %>%
    kable() %>% 
    kable_classic(full_width = F)

	control	statin
male	14	22
female	22	16

11.5 Frequency Tables with Janitor Package

With multiple categorical variables, we often wish to know conditional percentages in addition to observation counts. The Janitor package provides the tabyl() (spelled with a ‘y’) command for creating frequency tables with conditional percentages, and has the added benefit of being tidyverse aligned. However, janitor is not imported with tidyverse and must be loaded with a separate library command, or added to the package list in p_load().

Code

library(janitor) 

# Load and preview the starwars data
data(starwars, package = 'dplyr')
glimpse(starwars)

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

Code

# Convert character variables into factors using mutate_at
starwars %<>% 
    mutate_at(vars(hair_color, skin_color, eye_color, sex, gender, homeworld, species), 
              as.factor)

# A simple one-way frequency table using tabyl with HTML formatting provided by kable() %>% kable_classic(full_width = F)
starwars %>% 
    tabyl(eye_color) %>% 
    adorn_rounding(2) %>%
    kable() %>% kable_classic(full_width = F)

eye_color	n	percent
black	10	0.11
blue	19	0.22
blue-gray	1	0.01
brown	21	0.24
dark	1	0.01
gold	1	0.01
green, yellow	1	0.01
hazel	3	0.03
orange	8	0.09
pink	1	0.01
red	5	0.06
red, blue	1	0.01
unknown	3	0.03
white	1	0.01
yellow	11	0.13

Code

# A two-way table for eye color and gender for non-droids using tabyl
starwars %>%
    filter(species != 'Droid') %>%
    tabyl(gender, eye_color) %>% 
    kable() %>% kable_classic(full_width = F)

gender	black	blue	blue-gray	brown	dark	gold	green, yellow	hazel	orange	pink	red	red, blue	unknown	white	yellow
feminine	2	6	0	4	0	0	0	2	0	0	0	0	1	0	1
masculine	7	12	1	15	1	1	1	1	8	1	2	0	2	0	9

11.5.1 “adorn” frequency table with details

After the initial table setup using tabyl(), we “adorn” the table with additional features linked together in a pipe chain. Lastly, because the output is still a tibble, we can pipe into the kable() and kable_classic commands to create an aesthetically pleasing table ready for presentation or publication.

Code

starwars %>%
    filter(species == 'Human') %>%
    tabyl(gender, eye_color) %>%

    # add totals, conditional percents, rounding
    adorn_totals(c("row", "col")) %>%
    adorn_percentages("row") %>% 
    adorn_pct_formatting(rounding = "half up", digits = 0) %>%
    
    # add sample sizes in parentheses
    adorn_ns() %>% 
    
    # add a title
    adorn_title("combined", 
                row_name = "Gender", 
                col_name = "Eye Color") %>%
    
    # add nice aesthetics
    kable %>% kable_classic(full_width = FALSE)

Gender/Eye Color	black	blue	blue-gray	brown	dark	gold	green, yellow	hazel	orange	pink	red	red, blue	unknown	white	yellow	Total
feminine	0% (0)	33% (3)	0% (0)	44% (4)	0% (0)	0% (0)	0% (0)	11% (1)	0% (0)	0% (0)	0% (0)	0% (0)	11% (1)	0% (0)	0% (0)	100% (9)
masculine	0% (0)	35% (9)	4% (1)	46% (12)	4% (1)	0% (0)	0% (0)	4% (1)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	8% (2)	100% (26)
Total	0% (0)	34% (12)	3% (1)	46% (16)	3% (1)	0% (0)	0% (0)	6% (2)	0% (0)	0% (0)	0% (0)	0% (0)	3% (1)	0% (0)	6% (2)	100% (35)

11.6 Pivot tables with `group_by()` & `summarize()`

Pivot tables present descriptive information about data grouped at the levels of one or more categorical variables. This is a useful approach when we wish to compare characteristics between males and females, or across ethnicity, age groups, treatment groups, etc.

The tidyverse commands group_by() and summarize() work together to create new tibbles, “pivot tables”, with summary statistics, grouped by one or more factor variables. When we produce these summary tibbles for reports and presentations, we often break the naming rules by including spaces and enclosing the name in quotes as shown below. Lastly, we use commands from the knitr and kableExtra packages to format our tibble with publication-ready aesthetics.

Pivot table with one grouping variable:

Code

tbl %>% 
    
    drop_na(sex) %>%
    
    group_by(sex) %>% 
    
    summarize(Count = n(),
              'Age Mean' = mean(age) %>% round,  
              'Age min' = min(age),
              'Age max'= max(age),
              'Age SD' = sd(age) %>% round) %>%
    
    kable() %>% # kable is from knitr
    
    kable_classic(full_width = F) # kable_classic is from kableExtra

sex	Count	Age Mean	Age min	Age max	Age SD
male	36	39	21	55	11
female	38	39	23	55	9

Pivot table with two grouping variables and nice formatting:

Code

tbl %>%
    
    drop_na(sex) %>%
    
    group_by(sex, group) %>%
    
    summarize(Count = n(),
              'Age Mean' = mean(age) %>% round,  
              'Age min' = min(age),
              'Age max'= max(age),
              'Age SD' = sd(age) %>% round) %>%
    
    kable() %>% # kable is from knitr
    
    kable_classic(full_width = F) # kable_classic is from kableExtra

`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.

sex	group	Count	Age Mean	Age min	Age max	Age SD
male	control	14	39	26	54	9
male	statin	22	39	21	55	12
female	control	22	38	23	51	8
female	statin	16	41	23	55	11

11.1 Characterize Missing Values

11.2 Descriptive Statistics

11.2.1 Base R Commands

11.2.2 The summary command

11.2.3 R looping functions

11.2.3.1 apply()

11.2.3.2 lapply()

11.2.3.3 sapply()

11.2.4 Ignoring NA’s

11.2.4.1 na.rm = TRUE

11.2.4.2 drop_na

11.3 Distributions & Bivariate Relationships

11.3.1 GGally::ggpairs

11.3.1.1 Categorical Distributions and Bivariate Relationships

11.3.1.2 Quantitative distributions & Bivariate Relationships

11.3.1.3 Specific Quantitative distributions & Bivariate Relationships, Colored by Factor

11.3.1.4 Isolate and Customize Plots

11.4 Frequency tables

11.5 Frequency Tables with Janitor Package

11.5.1 “adorn” frequency table with details

11.6 Pivot tables with group_by() & summarize()

11.2.3 `R` looping functions

11.2.3.1 `apply()`

11.2.3.2 `lapply()`

11.2.3.3 `sapply()`

11.6 Pivot tables with `group_by()` & `summarize()`