Working with health data in R and RStudio
Corinne Riddell
August 28, 2020
Learning objectives for today:
1. What is a data frame
2. How to read a comma-separated values (CSV) file using read_csv()
3. Get to know the data using str(), head(), dim(), and names()
4. Manipulate the data frame using the R package dplyr’s main functions:
• rename()
• select()
• arrange()
• filter()
• mutate()
• group_by()
• summarize()
Readings
• There are no chapters from the textbook for this lecture.
• Here are some additional online resources (optional, but helpful!):
– Data Frames
– 15 min intro to dplyr
– Data wrangling cheat sheet
What is a data frame?
• A data frame is R's way of storing a data set as a table of rows and columns.
• We read data into R from common sources like Excel spreadsheets (.xls or .xlsx), text files (.txt),
comma-separated values files (.csv), and other formats.
• The simplest format of data contains one row for each individual in the study.
• The first column of the data identifies the individual (perhaps by a name or an ID variable).
• Subsequent columns are variables that have been recorded or measured.
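As a minimal sketch of this layout, the base R code below builds a small data frame by hand. The object name toy_lakes, the column names, and every value in it are made up for illustration; none of them come from the B&M data.

```r
# Hypothetical example: every name and value below is invented for illustration.
toy_lakes <- data.frame(
  lake_id = c("A", "B", "C"),   # first column identifies each individual (here, a lake)
  ph      = c(6.1, 7.3, 5.8),   # subsequent columns are measured variables
  depth_m = c(4, 12, 7)
)

toy_lakes        # print the data frame: one row per lake
nrow(toy_lakes)  # number of individuals (rows)
ncol(toy_lakes)  # number of columns (the ID plus the measured variables)
```

Each row describes one lake and each column holds one variable, which is the "simplest format" described above.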
Lake data from Baldi and Moore (B&M)
• Exercise 1.25 from Edition 4 of B&M
• Six rows of data from a study of mercury concentration across 53 lakes
• I’ve added three fabricated rows
• I’ve placed these data in the Day-2 folder
• Let’s find it there
readr is a library to import data into R
• To access readr’s functions we load the library like this:
library(readr)
• Click the green arrow to run the code, or place your cursor on the line of code and press Cmd + Enter
(Mac) or Ctrl + Enter (PC).
• A green rectangle that temporarily appears next to the code shows you that it has run.
read_csv() to load the lake data in R
• read_csv() is a function from the readr library used to import CSV files.
• code template: your_data <- read_csv("pathway_to_data.csv")
• The <- is called the assignment operator. It says to save the imported data into an object called
your_data.
lake_data <- read_csv("Data_mercury_lake.csv")
## Parsed with column specification:
## cols(
##   lakes = col_character(),
##   ph = col_double(),
##   chlorophyll = col_double(),
##   mercury_in_fish = col_double(),
##   number_fish_sampled = col_double(),
##   age_data = col_character()
## )
• Any time you see “##” on the html slides or in the PDF lecture files, the text on those lines is the
output of running the code directly above it. So the lines above are the output displayed when you
run the read_csv() function.
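To try the template without the course file, here is a small sketch that first writes a throwaway CSV and then imports it. The file contents, the column names, and the object names toy_path and toy_data are all invented for illustration; only read_csv() and the <- operator come from the lecture.

```r
library(readr)

# Hypothetical stand-in for a real data file: write two invented rows
# to a temporary CSV so the example runs anywhere.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("site,ph,n_sampled",
    "North,6.5,10",
    "South,7.2,4"),
  toy_path
)

# read_csv() returns the imported data; <- stores it in an object named toy_data.
toy_data <- read_csv(toy_path)

toy_data         # site is read as character; ph and n_sampled as double
class(toy_data)  # a tibble, which is a kind of data frame
```

Without the assignment, read_csv() would only print the data and nothing would be saved for later use.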
Exercise 1
1. Execute the above code using either the green arrow or by placing your cursor on it and pressing the
keyboard shortcut (Cmd + Enter on Mac or Ctrl + Enter on PC).
2. Note that the data appears in the Environment pane in the top right.
• Notice the number of observations and the number of variables.
3. Click the tiny table icon to the right of lake_data in the Environment pane to open the Viewer
tab and inspect the data.
Check your understanding!
Four functions to get to know a dataset
• head(your_data): Shows the first six rows of the supplied dataset
• dim(your_data): Provides the number of rows by the number of columns
• names(your_data): Lists the variable names of the columns in the dataset
• str(your_data): Summarizes the above information and more
I use these functions all the time, often several times per session, to remind myself what the
variable names are and what the data look like.
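The four functions can be tried on any data frame. The sketch below uses a hypothetical eight-row data frame called toy_scores (the name and values are invented for illustration) so that head() visibly returns fewer rows than the full data.

```r
# Hypothetical data frame with eight invented rows, so head() has something to trim.
toy_scores <- data.frame(
  id    = 1:8,
  score = seq(10, 80, by = 10)
)

head(toy_scores)   # first six rows only (id 1 through 6)
dim(toy_scores)    # 8 2: eight rows, two columns
names(toy_scores)  # "id" "score"
str(toy_scores)    # dimensions, column names, column types, and the first few values
```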
head()
First six rows:
head(lake_data)