
1 Tidyverse functionality

The tidyverse is a collection of add-on packages that you can optionally use in R to process and visualise your data easily and clearly. You do not need to use every package in the collection, nor do you need to load them all, but for simplicity’s sake, it’s easier to load the whole thing at once and then not worry about it.

## Packages required for this lesson:
#install.packages(c("tidyverse","palmerpenguins"))
library(tidyverse) # RStudio should prompt you if a package is required to run a notebook but isn't installed.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

1.1 Quick summary

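  • %>% and |> are each a type of “pipe”
  • Each passes the result of the code on its left into the first argument (usually the data argument) of the function on its right
  • Neither saves its result: the original object stays unchanged unless you assign the output to a name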
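
Here is a minimal sketch of those three points. The vector x and the name x_root are made up for illustration and are not part of the lesson's data; dplyr is loaded only because it makes %>% available, and it is already attached if you ran library(tidyverse).

```r
library(dplyr) # makes %>% available; already attached by library(tidyverse)

x <- c(1, 4, 9) # made-up values

sqrt(x)      # ordinary function call
x %>% sqrt() # magrittr pipe: x becomes the first argument of sqrt()
x |> sqrt()  # native pipe (needs R 4.1 or later): same result

# Piping does not change x; assign the output if you want to keep it
x_root <- x %>% sqrt()
x      # still 1 4 9
x_root # 1 2 3
```

All three calls print the same values, so which style to use is mostly a question of readability.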

For now, our examples will use two datasets from the palmerpenguins package, called penguins and penguins_raw.

# An example as a reminder
str(penguins) # base R code
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
penguins %>% str() # tidy R code
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

How would you write the base R function head(penguins_raw) using tidyverse?

# Check it here, then raise your hand when you know it works!

2 Processing data

Before we start learning anything about the shape of our data and statistical results, we often need to process and organise the data.

2.1 Logical operators

These will be useful throughout your work in R, but also pretty much any other programming language you encounter. They return values of TRUE or FALSE when evaluated. This type of value is called a boolean value generally, but R specifically calls it a logical value (abbreviated lgl).

  • == equivalent to
  • > greater than
  • < less than
  • >= greater than or equal to
  • <= less than or equal to
  • != NOT equivalent to (“bang-equals”)
  • & and (conjunction)
  • | or (disjunction)
class(1==1)
## [1] "logical"
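
As a small sketch of how these operators behave on a whole column at once, here they are applied to a made-up vector (mass and the result names are illustrative only). Note that any comparison involving NA returns NA rather than TRUE or FALSE.

```r
mass <- c(3200, 3750, 5000, NA) # made-up values

at_least_3500 <- mass >= 3500               # FALSE  TRUE  TRUE    NA
in_middle     <- mass >= 3500 & mass < 4800 # FALSE  TRUE FALSE    NA
at_extremes   <- mass < 3500 | mass >= 4800 #  TRUE FALSE  TRUE    NA
not_3750      <- mass != 3750               #  TRUE FALSE  TRUE    NA

in_middle
class(in_middle) # "logical"
```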

2.2 Filter

If we only want to look at measurements from penguins_raw taken from 2008-11-01 up to, but not including, 2008-11-09, we can filter the dataset (which is like subsetting). Think of the function filter as saying “create a filter that allows these things through”:

penguins_raw %>% 
  filter(`Date Egg` >= "2008-11-01", # conditions separated by commas are combined with & (and)
         `Date Egg` <  "2008-11-09")
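
To check what the two conditions let through, here is a sketch on a tiny made-up table. The tibble eggs and its id column are hypothetical stand-ins for penguins_raw; only the `Date Egg` column name comes from the real data.

```r
library(dplyr)

# Hypothetical stand-in for penguins_raw with only the columns we need
eggs <- tibble(
  id         = c("A", "B", "C", "D"),
  `Date Egg` = as.Date(c("2008-10-30", "2008-11-01", "2008-11-08", "2008-11-09"))
)

# Conditions separated by commas must all be TRUE ...
with_commas <- eggs %>%
  filter(`Date Egg` >= as.Date("2008-11-01"),
         `Date Egg` <  as.Date("2008-11-09"))

# ... which is the same as joining them with &
with_and <- eggs %>%
  filter(`Date Egg` >= as.Date("2008-11-01") & `Date Egg` < as.Date("2008-11-09"))

with_commas$id # "B" "C": the start date is kept, the end date is not
```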

2.3 Add columns

How can you make a new column?

# `mutate` just means "create", the way a mutated frog might have a "new" third leg
penguins %>% 
  # formula for area of a triangle is: base * height / 2
  mutate(bill_area_mm2 = bill_depth_mm * bill_length_mm / 2) 
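
The same calculation on two made-up rows makes it easy to verify by hand what mutate does. The tibble bills and the name bills_area are illustrative only.

```r
library(dplyr)

# Made-up measurements, not real penguins
bills <- tibble(bill_depth_mm  = c(18, 20),
                bill_length_mm = c(40, 50))

# The formula is applied to every row; assign the result to keep the new column
bills_area <- bills %>%
  mutate(bill_area_mm2 = bill_depth_mm * bill_length_mm / 2)

bills_area$bill_area_mm2 # 360 500
names(bills)             # the original table still has only two columns
```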

2.3.1 Activity on your own

Pipe the dataset penguins into a mutate function that creates a new column called year_fctr. This column should contain exactly the same content as the column year, but instead of being of class int (an integer, or numeric value), it should be of class fctr (a factor, or categorical value), which you can create using the function as.factor(). You may need to search for examples in the Help documentation or online:

# duplicate the `year` column but change the class from int to fctr

2.4 Case when

The function case_when() is simple and powerful because it checks each row of the dataset against an ordered list of rules and applies the first rule that matches, without you having to write a loop in your code. Because the function is vectorised, this is usually quicker to write and to run than an explicit loop.

Here’s an example of how one might create a column that translates the values in body_mass_g in penguins to a word, which turns our numeric data into categorical data:

penguins %>% 
  mutate(birdsize = case_when(body_mass_g <  3500 ~ "small", # choose first category to label
                              # choose another easy-to-delimit category to label
                              body_mass_g >= 4800 ~ "large",
                              # label all other cases, especially if they're harder to delimit
                              TRUE ~ "medium"))
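
One detail worth checking is what the catch-all TRUE rule does with missing values. The sketch below uses a made-up vector (mass, labels_default and labels_na are illustrative names): a comparison with NA is never TRUE, so an NA falls through to the last rule unless an earlier rule handles it.

```r
library(dplyr)

mass <- c(3000, 4000, 5000, NA) # made-up body masses in grams

# Rules are checked from top to bottom and the first TRUE one wins
labels_default <- case_when(mass <  3500 ~ "small",
                            mass >= 4800 ~ "large",
                            TRUE         ~ "medium")
labels_default # "small" "medium" "large" "medium": the NA is labelled "medium" too

# Putting an explicit NA rule first keeps missing values missing
labels_na <- case_when(is.na(mass)  ~ NA_character_,
                       mass <  3500 ~ "small",
                       mass >= 4800 ~ "large",
                       TRUE         ~ "medium")
labels_na # "small" "medium" "large" NA
```

Whether missing masses should count as “medium” is a judgement call for your own analysis; the point is only that the order of the rules decides it.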

2.4.1 Activity on your own

Now, how would you create a column in penguins that groups flipper length (flipper_length_mm) into “short”, “medium” and “long”? You can find the range of possible values using the code provided below to get you started.

# find the range of flipper lengths
penguins %>% pull(flipper_length_mm) %>% range(na.rm = TRUE)
## [1] 172 231
# use `mutate` and `case_when` to create a new column with contingent values
# you can copy the code directly from the previous chunk and only change the relevant values

What if we want a fourth category, so that we end up with “short”, “medium-short”, “medium-long” and “long”? One way to do this is to give a rule both a lower and an upper bound by combining two conditions with &, rather than listing a single condition on each line.
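
As a sketch of that pattern on unrelated, made-up data (temp_c and its labels are hypothetical, so the flipper version is still yours to write):

```r
library(dplyr)

temp_c <- c(-5, 8, 17, 30) # made-up temperatures, not penguin data

temp_label <- case_when(temp_c <  0                ~ "freezing",
                        temp_c >= 0  & temp_c < 12 ~ "cold",
                        temp_c >= 12 & temp_c < 25 ~ "mild",
                        temp_c >= 25               ~ "hot")
temp_label # "freezing" "cold" "mild" "hot"
```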

# try creating four categories of flipper length

2.5 Group and summarise

What if we want to get aggregate values from our dataset, rather than looking at it as a whole?

group_by() is a function that flags certain columns for operations applied by category. summarise() checks which columns are flagged and performs operations based on the combination of values in those columns.

Nothing appears to change when we use group_by by itself:

# look at each step of the code by itself to understand what it is doing
penguins %>% 
  group_by(species,year)
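
Although the printed rows look the same, the grouping is stored with the table. A small sketch on made-up data (birds and grouped are illustrative names) shows how to inspect it with dplyr's group_vars(), n_groups() and is_grouped_df():

```r
library(dplyr)

# Made-up miniature of the penguins data
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  year    = c(2007, 2008, 2007),
  mass    = c(3700, 3800, 5000)
)

grouped <- birds %>%
  group_by(species, year)

nrow(grouped) == nrow(birds) # TRUE: no rows were added or removed
is_grouped_df(grouped)       # TRUE: but the table is now flagged as grouped
group_vars(grouped)          # "species" "year"
n_groups(grouped)            # 3 combinations of species and year
```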

How many observations are there for each combination of “species” and “year”? What is the mean body mass for each of those combinations?

# using `summarise()`
penguins %>% 
  group_by(species, year) %>% 
  summarise(.groups = "drop", # this is optional, but 'dropping' groups simplifies the output
            counts = n(),
            massPerYear = mean(body_mass_g, 
                               na.rm=TRUE)) # what does the `na.rm=TRUE` argument do?
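
To see the effect of na.rm for yourself, here is a sketch on a made-up table with one missing value. The names birds, birds_summary, mean_default and mean_no_na are illustrative only.

```r
library(dplyr)

# Made-up data: one Adelie mass is missing
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  mass    = c(3000, NA, 5000, 5200)
)

birds_summary <- birds %>%
  group_by(species) %>%
  summarise(.groups = "drop",
            counts       = n(),                      # counts rows, including the NA row
            mean_default = mean(mass),               # NA if any value in the group is NA
            mean_no_na   = mean(mass, na.rm = TRUE)) # NAs are dropped before averaging

birds_summary
```

For Adelie, mean_default is NA while mean_no_na is 3000; for Gentoo, both are 5100.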

2.5.1 Activity on your own

We can use group_by and summarise to do a lot more than just count. Using penguins, group the dataset by species and summarise over species to find the mean (mean()) and standard deviation (sd()) of body mass for each species:

# calculate mean and st dev values of `body_mass_g` for `species` values
# remember to specify that NAs should be removed using the argument demonstrated above

(This is VERY useful for graphing and creating summary statistics tables!)

2.6 Join and reshape

Due to constraints on time, read through this section on your own.

One type of data processing that can be a huge hassle without a programming language is merging or joining datasets.

In order to illustrate this, I will create two small datasets that imitate a survey.

Here is the demographic data:

participant <- c("John", "Simone", "Aaliyah", "Marcus")
gender <- c("m", "f", "f", "m")
age <- c(24, 18, 38, NA)

demographics <- tibble(participant, gender, age)

demographics # to view the table below

Here is some quantitative survey data that also has some more qualitative responses:

q1 <- c("yes", "yes", "yes", "no") %>% as.factor()
q2 <- c(4, 3, 4, 5) %>% as.factor()
q3 <- c(1, 4, 5, 2) %>% as.factor()
q4 <- c("rarely", "often", "always", "sometimes") %>% as.factor()
q5 <- c(1, 2, 1, 1) %>% as.factor()
q6 <- c(3, 1, 5, 5) %>% as.factor()

survey <- tibble(participant, q1, q2, q3, q4, q5, q6)

survey # to view the table below

Combine the two datasets using the column they have in common:

# save it as `mySurvey`
full_join(demographics, survey, 
          by = "participant") -> mySurvey

mySurvey # to view the table below
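
In the survey example every participant appears in both tables. The sketch below uses two made-up tables that only partly overlap (left_tbl and right_tbl are hypothetical) to show how full_join() differs from two other dplyr joins, inner_join() and left_join():

```r
library(dplyr)

# Made-up tables: only Simone appears in both
left_tbl  <- tibble(participant = c("John", "Simone"),    age = c(24, 18))
right_tbl <- tibble(participant = c("Simone", "Aaliyah"), q1  = c("yes", "no"))

joined_full  <- full_join(left_tbl, right_tbl, by = "participant")  # 3 rows, gaps filled with NA
joined_inner <- inner_join(left_tbl, right_tbl, by = "participant") # 1 row: only Simone
joined_left  <- left_join(left_tbl, right_tbl, by = "participant")  # 2 rows: everyone in left_tbl

joined_full
```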

Reshape the survey questions from wide to long format, so that each row holds one participant’s response to one question:

# save it as `mySurvey_long` with two new columns: `questions` and `responses`
mySurvey %>% 
  pivot_longer(cols = c("q1", "q2", "q3", "q4", "q5", "q6"), # you can also use col indices, e.g. 4:9
               names_to = "questions",
               values_to = "responses") -> mySurvey_long

mySurvey_long

Why would you want to do this? It’s not as easy for human eyes to read, but it’s much easier to graph, as we’ll see tomorrow.
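
Long data is also easier to summarise. As a sketch on a made-up numeric survey (wide, long and question_means are illustrative names), a single group_by() and summarise() covers every question at once:

```r
library(dplyr)
library(tidyr)

# Made-up wide survey with two numeric questions
wide <- tibble(participant = c("P1", "P2"),
               q2 = c(4, 3),
               q3 = c(1, 5))

long <- wide %>%
  pivot_longer(cols = c(q2, q3),
               names_to = "questions",
               values_to = "responses")

nrow(long) # 4: one row per participant per question

# One summary step handles all the questions
question_means <- long %>%
  group_by(questions) %>%
  summarise(mean_response = mean(responses))

question_means # q2: 3.5, q3: 3
```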

We can also reverse the operation, if you receive long data and you prefer wide data.

# `pivot_wider` is the function to reverse this, if you prefer each question to be in its own column
mySurvey_long %>% 
  pivot_wider(names_from = "questions",
              values_from = "responses")

3 Challenge activities

Work on these activities on your own or in small groups this afternoon to practice what we’ve learned here.

Putting together the data wrangling and data visualisation sections of this lesson, we can create a bar plot. Bar plots are different from other types of plots because they require some calculation to happen between the raw data set and the plot. We can do this calculation ourselves very easily, and then specify to R that we’ve already done it and we just want R to plot it using the values provided. Here’s how to do that:

  • Group penguins by species
  • Summarise to find the mean flipper length
    • Remember to remove NA values!
  • Pipe the resulting summary table into ggplot()
  • Specify the x-axis is mapped from species so each bar will represent each species
  • Specify the y-axis is mapped from your summarised mean values
  • You can optionally fill the bars by species as well
  • Add the geom_bar() geometry as a new layer in your plot
    • Within this geometry, specify that the argument stat is "identity":
      • stat = "identity"
    • This tells R that the numbers to use are the exact ones provided
  • Optionally specify that you’d like the theme_bw() layer to make the plot more accessible
# try it out here!
# you can copy code from previous chunks AND from any internet searches you do

Below is code for a really useful type of plot called an interaction plot. There is a lot going on, both in how the data are summarised and how the plot is constructed. Go through the code line by line and figure out what each piece does. Write comments to yourself so you can remember later! (# comments start with a hash symbol)

penguins %>% 
  filter(!is.na(sex)) %>% 
  group_by(species, sex) %>% 
  summarise(
            .groups = "drop",
            count = n(),
            mean_mass = mean(body_mass_g),
            sd_mass = sd(body_mass_g),
            standard_error = sd_mass / sqrt(count)
            ) %>% 
  ggplot(aes(
             x = sex,
             y = mean_mass,
             colour = species,
             group = species
             )
         ) +
  theme_bw() +
  geom_point() +
  geom_path() +
  geom_errorbar(aes(
                    ymin = mean_mass - standard_error,
                    ymax = mean_mass + standard_error
                    ),
                width = .1
                ) +
  ggtitle("Interaction of species and sex for aggregate body mass in grams") +
  ylab("mean body mass (g)") +
  NULL
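
As a hint for one piece of the summary step, the standard_error line divides the standard deviation by the square root of the group size. Here is that arithmetic on four made-up values (mass, n and se are illustrative names) so you can check it by hand:

```r
mass <- c(3000, 3200, 3400, 3600) # made-up body masses in grams

n  <- length(mass)        # 4 observations
se <- sd(mass) / sqrt(n)  # standard error of the mean, about 129.1

# The error bars in the plot span one standard error either side of the mean
c(lower = mean(mass) - se, upper = mean(mass) + se)
```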

4 Additional resources

  1. Data Wrangling Cheatsheet PDF

  2. R for Data Science (R4DS) is a free book about using R and tidyverse to do all types of data science.