2 Introduction

2.1 Why learn about data before learning statistics?

In the biological sciences (and other disciplines), many sharp and highly motivated students struggle to learn the technical nuances of data, computers, software applications, and coding languages while simultaneously learning the abstract concepts of statistics. Many students are so intimidated by college statistics that they put off taking a course until their second, third, or fourth year, by which time they have already been struggling with data in their coursework and research.

The obvious question is: why do we pack so much into statistics courses? Historically, we did not. Not too long ago, our courses were paper-based and lightly supported by computers. Graduate-level courses might have included a weekly computer lab, hosted on university computers with expensive university-licensed software. Raw data were comparatively scarce and impractical to integrate deeply into most classes. Our goal, especially at the undergraduate level, was mainly to produce intelligent consumers of statistics.

Today, data are abundant and accessible, and virtually all college students own or can access a computer. Free, open-source statistical applications dominate the landscape and run on nearly any computer. Furthermore, for research work at any level, we want students who can work with raw data and produce statistics, not only consume them. Consequently, many introductory statistics courses have evolved to include the additional burden of working with data, even though those courses were already rigorous.

Much of data analysis is the tedious business of acquiring data and fixing problems with it. After data wrangling, exploratory data analysis and graphing typically come before estimation, hypothesis testing, and modeling, and these steps can be cumbersome as well. Meanwhile, the most abstract and challenging statistical methods are often reduced to a single line of code.
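As a rough illustration of this imbalance, the sketch below walks through a miniature analysis in base R. It uses the built-in `mtcars` data set purely for convenience; that choice, the `cars` object, and the `transmission` column name are assumptions made for this example and do not come from the lessons themselves.

```r
# Assumption: mtcars (shipped with R) stands in for a real data set.
cars <- mtcars

# Wrangling: recode the 0/1 transmission indicator as a labelled factor.
# In mtcars, am = 0 means automatic and am = 1 means manual.
cars$transmission <- factor(cars$am,
                            levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Exploration: group means and a quick graph.
aggregate(mpg ~ transmission, data = cars, FUN = mean)
boxplot(mpg ~ transmission, data = cars)

# Inference: a Welch two-sample t-test, written as a single line.
t.test(mpg ~ transmission, data = cars)
```

Even in this tidy example, preparing and exploring the data takes more lines than the hypothesis test. With real data, which rarely arrive this clean, the gap is usually much wider.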

Data Essentials with R seeks to remove early obstacles to exploring data in coursework and research. Students who finish all lessons will be able to import, clean, and wrangle data; produce summary statistics, tables, and visualizations; and create professional write-ups of their analyses using Quarto files.

2.2 Why R

R is a primary tool in academia for data analysis and statistics. It is a programming language built for statistics that is powerful, free, and open source, and it is a relatively easy first language to learn.

2.3 Why Quarto

Quarto is a versatile publishing system that supports multiple programming languages (including R and Python) and lets authors combine narrative text and data analysis in the same document. From an educator’s perspective, Quarto is a tool we can use to provide explanations and interpretations intermixed with executable code and output. Similarly, students can use Quarto to write and annotate code and to provide their own interpretations and explanations.

2.4 Why RStudio

RStudio is an Integrated Development Environment (IDE) for data analysis. RStudio contains features for writing and executing code, viewing graphs and output, and managing data. RStudio is accessible to beginners but is also an industry standard for data analysis with R.