The tidyverse is a collection of packages (add-on tools) that you can optionally use in R to process and visualise your data easily and clearly. You do not need to use all of the included packages, nor do you need to load them all, but for simplicity’s sake, it’s easier to load the whole thing and then not worry about it.
## Packages required for this lesson:
#install.packages(c("tidyverse","palmerpenguins"))
library(tidyverse) # RStudio should prompt you if a package is required to run a notebook but isn't installed.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
`%>%` and `|>` are each a type of “pipe”. A pipe takes the result of one line and passes it along as the data argument of the next line.
For our examples for now, we’ll use the datasets `penguins` and `penguins_raw` from the `palmerpenguins` package.
# An example as a reminder
str(penguins) # base R code
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
penguins %>% str() # tidy R code
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
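Pipes also work when a function takes extra arguments. The following is a minimal sketch using a made-up number rather than the lesson data: whatever sits on the left of the pipe becomes the first argument, and any remaining arguments stay inside the parentheses.

```r
library(dplyr) # re-exports the %>% pipe

base_version   <- round(3.14159, 2)     # base R: the data sits inside the parentheses
pipe_version   <- 3.14159 %>% round(2)  # %>% supplies the left-hand side as the first argument
native_version <- 3.14159 |> round(2)   # |> does the same, and needs R 4.1 or later

identical(base_version, pipe_version)   # TRUE
identical(base_version, native_version) # TRUE
```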
How would you write the base R function `head(penguins_raw)` using a tidyverse pipe?
# Check it here, then raise your hand when you know it works!
Before we start learning anything about the shape of our data and statistical results, we often need to process and organise the data.
Logical operators will be useful throughout your work in R, but also in pretty much any other programming language you encounter. They return values of TRUE or FALSE when evaluated. This type of value is generally called a boolean value, but R specifically calls it a logical value (abbreviated `lgl`).
`==` equivalent to
`>` greater than
`<` less than
`>=` greater than or equal to
`<=` less than or equal to
`!=` NOT equivalent to (“bang-equals”)
`&` and (conjunction)
`|` or (disjunction)
class(1 == 1)
## [1] "logical"
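As a quick illustration, the operators can be applied to a whole vector at once, returning one logical value per element. The vector `x` below is made up for this sketch.

```r
x <- c(2, 5, 8) # hypothetical values, not from the lesson data

x > 4           # FALSE  TRUE  TRUE
x != 5          #  TRUE FALSE  TRUE
x > 4 & x < 8   # FALSE  TRUE FALSE (both conditions must hold)
x < 3 | x > 7   #  TRUE FALSE  TRUE (at least one condition must hold)
```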
If we only want to look at measures taken from 2008-11-01 up to (but not including) 2008-11-09 in `penguins_raw`, we can filter the dataset (which is like subsetting). Think of the function `filter` as saying “create a filter that allows these things through”:
penguins_raw %>%
  filter(`Date Egg` >= "2008-11-01", # conditions separated by commas are combined with AND (&)
         `Date Egg` < "2008-11-09")
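To make the role of the comma explicit, here is a small sketch on `penguins`. The thresholds and categories are chosen arbitrarily for illustration. Separating conditions with a comma behaves like joining them with `&`, while `|` has to be written out.

```r
library(dplyr)
library(palmerpenguins)

# comma-separated conditions: every condition must be TRUE for a row to pass
with_comma <- penguins %>% filter(species == "Gentoo", body_mass_g > 5000)

# the same filter written with an explicit `&`
with_and <- penguins %>% filter(species == "Gentoo" & body_mass_g > 5000)

nrow(with_comma) == nrow(with_and) # TRUE: both keep the same rows

# `|` keeps rows where at least one condition is TRUE
either <- penguins %>% filter(species == "Chinstrap" | island == "Torgersen")
```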
How can you make a new column?
# `mutate` just means "create", the way a mutated frog might have a "new" third leg
penguins %>%
  # formula for area of a triangle is: base * height / 2
  mutate(bill_area_mm2 = bill_depth_mm * bill_length_mm / 2)
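One detail worth checking for yourself: `mutate()` returns a new table and leaves the original alone unless you save the result. The sketch below uses a hypothetical column name, `body_mass_kg`, to show this.

```r
library(dplyr)
library(palmerpenguins)

# save the result under a new name; `body_mass_kg` is a made-up column name
with_kg <- penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000)

"body_mass_kg" %in% names(with_kg)  # TRUE
"body_mass_kg" %in% names(penguins) # FALSE: the original dataset is untouched
```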
Pipe the dataset `penguins` into a `mutate` function that creates a new column called `year_fctr`. This column should contain the exact same content as the column `year`, but instead of being of class `integer` (abbreviated `int`; a numeric value), it should be of class `factor` (abbreviated `fct`; a categorical value), converted using the function `as.factor()`. You may need to search for examples in the Help documentation or online:
# duplicate the `year` column but change the class from integer to factor
The function `case_when()` is simple and powerful because it applies an ordered set of rules to every row of the dataset without you having to create a loop in your code. For each row, the first rule whose condition is TRUE determines the result. This saves time and effort compared with writing the loop yourself.
Here’s an example of how one might create a column that translates the values in `body_mass_g` in `penguins` to a word, which turns our numeric data into categorical data:
penguins %>%
  mutate(birdsize = case_when(body_mass_g < 3500 ~ "small", # choose first category to label
                              # choose another easy-to-delimit category to label
                              body_mass_g >= 4800 ~ "large",
                              # label all other cases, especially if they're harder to delimit
                              TRUE ~ "medium"))
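Because a catch-all `TRUE` rule matches every row that no earlier rule claimed, it also catches missing values. In the example above, a penguin with an `NA` body mass would be labelled "medium". The sketch below uses a made-up vector of scores and shows one way to handle this, by testing for `NA` first.

```r
library(dplyr)

scores <- c(12, 55, 91, NA) # hypothetical values for illustration

labels <- case_when(
  is.na(scores) ~ "missing", # checked first, so NA never reaches the catch-all
  scores < 50   ~ "low",
  scores < 90   ~ "mid",     # only reached when scores >= 50
  TRUE          ~ "high"     # everything that is left
)

labels # "low" "mid" "high" "missing"
```

The order matters: the first rule that evaluates to TRUE wins, so 55 is "mid" even though it would also satisfy the catch-all.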
Now, how would you create a column in `penguins` that groups flipper length (`flipper_length_mm`) into “short”, “medium” and “long”? You can find the range of possible values using the code provided below to get you started.
# find the range of flipper lengths
penguins %>% pull(flipper_length_mm) %>% range(na.rm = TRUE)
## [1] 172 231
# use `mutate` and `case_when` to create a new column with contingent values
# you can copy the code directly from the previous chunk and only change the relevant values
What if we want a fourth category, so that we end up with “short”, “medium-short”, “medium-long” and “long”? You’ll have to use more complex logical operations (such as `&`) rather than just listing each term on its own line.
# try creating four categories of flipper length
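If the pattern is unclear, here is a sketch of range conditions built with `&`, using made-up temperatures instead of flipper lengths. The cut-off values and labels are arbitrary assumptions.

```r
library(dplyr)

temps <- c(-5, 8, 17, 30) # hypothetical temperatures in degrees Celsius

bands <- case_when(
  temps < 0                ~ "freezing",
  temps >= 0 & temps < 15  ~ "cool",  # both bounds spelled out with `&`
  temps >= 15 & temps < 25 ~ "mild",
  temps >= 25              ~ "hot"
)

bands # "freezing" "cool" "mild" "hot"
```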
What if we want to get aggregate values from our dataset, rather than looking at it as a whole?
`group_by()` is a function that flags certain columns for operations applied by category.
`summarise()` checks which columns are flagged and performs operations based on the combination of values in those columns.
Nothing appears to change when we use `group_by` by itself:
# look at each step of the code by itself to understand what it is doing
penguins %>%
  group_by(species, year)
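Although the rows look the same, the grouping is stored as extra information on the table. A minimal way to inspect it, assuming the dplyr helpers `is_grouped_df()` and `group_vars()`:

```r
library(dplyr)
library(palmerpenguins)

grouped <- penguins %>% group_by(species, year)

is_grouped_df(penguins)           # FALSE
is_grouped_df(grouped)            # TRUE
group_vars(grouped)               # "species" "year"
nrow(grouped) == nrow(penguins)   # TRUE: no rows were added or removed
```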
How many observations are there per “species” and “year”? What is the mean body mass for each species in each year?
# using `summarise()`
penguins %>%
  group_by(species, year) %>%
  summarise(.groups = "drop", # this is optional, but 'dropping' groups simplifies the output
            counts = n(),
            massPerYear = mean(body_mass_g,
                               na.rm = TRUE)) # what does the `na.rm=TRUE` argument do?
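To see the effect of `na.rm` in isolation, here is a small base R sketch with a made-up vector containing one missing value. Note also that counting rows, as `n()` does, includes rows where a measurement is missing.

```r
masses <- c(3750, 3800, NA, 3450) # hypothetical body masses with one missing value

mean(masses)               # NA: a single missing value makes the result unknown
mean(masses, na.rm = TRUE) # 11000 / 3: the mean of the three known values

length(masses)             # 4: counts every element, including the NA
sum(!is.na(masses))        # 3: counts only the non-missing elements
```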
We can use `group_by` and `summarise` to do a lot more than just count. Using `penguins`, group the dataset by species and summarise over species to find the mean (`mean()`) and standard deviation (`sd()`) of body mass for each species:
# calculate mean and st dev values of `body_mass_g` for `species` values
# remember to specify that NAs should be removed using the argument demonstrated above
(This is VERY useful for graphing and creating summary statistics tables!)
Due to constraints on time, read through this section on your own.
One type of data processing that can be a huge hassle without a programming language is merging or joining datasets.
In order to illustrate this, I will create two small datasets that imitate a survey.
Here is the demographic data:
participant <- c("John", "Simone", "Aaliyah", "Marcus")
gender <- c("m", "f", "f", "m")
age <- c(24, 18, 38, NA)
demographics <- tibble(participant, gender, age)
demographics # to view the table below
Here is some quantitative survey data that also has some more qualitative responses:
q1 <- c("yes", "yes", "yes", "no") %>% as.factor()
q2 <- c(4, 3, 4, 5) %>% as.factor()
q3 <- c(1, 4, 5, 2) %>% as.factor()
q4 <- c("rarely", "often", "always", "sometimes") %>% as.factor()
q5 <- c(1, 2, 1, 1) %>% as.factor()
q6 <- c(3, 1, 5, 5) %>% as.factor()
survey <- tibble(participant, q1, q2, q3, q4, q5, q6)
survey # to view the table below
Combine the two datasets using the column they have in common:
# save it as `mySurvey`
full_join(demographics, survey,
          by = "participant") -> mySurvey
mySurvey # to view the table below
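In the survey example every participant appears in both tables, so the choice of join makes no visible difference. The sketch below uses two made-up tables whose keys only partly overlap, to show how `full_join()` differs from `inner_join()` and `left_join()`.

```r
library(dplyr)

# hypothetical tables: "a" appears only on the left, "d" only on the right
left_tbl  <- tibble(id = c("a", "b", "c"), x = 1:3)
right_tbl <- tibble(id = c("b", "c", "d"), y = c(TRUE, FALSE, TRUE))

full  <- full_join(left_tbl, right_tbl, by = "id")  # keeps a, b, c, d; gaps become NA
inner <- inner_join(left_tbl, right_tbl, by = "id") # keeps only b, c
left  <- left_join(left_tbl, right_tbl, by = "id")  # keeps a, b, c

c(full = nrow(full), inner = nrow(inner), left = nrow(left)) # 4, 2, 3
```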
Reshape the survey question columns from wide to long format, so that each row holds one participant's response to one question:
# save it as `mySurvey_long` with two new columns: `questions` and `responses`
mySurvey %>%
  pivot_longer(cols = c("q1", "q2", "q3", "q4", "q5", "q6"), # you can also use column indices, e.g. 4:9
               names_to = "questions",
               values_to = "responses") -> mySurvey_long
mySurvey_long
Why would you want to do this? It’s not as easy for human eyes to read, but it’s much easier to graph, as we’ll see tomorrow.
We can also reverse the operation, if you receive long data and you prefer wide data.
# `pivot_wider` is the function to reverse this, if you prefer each question to be in its own column
mySurvey_long %>%
  pivot_wider(names_from = "questions",
              values_from = "responses")
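The round trip can be checked on a tiny made-up table. In this sketch the names `wide`, `long`, `time` and `score` are arbitrary: pivoting longer and then wider again should give back the table we started with.

```r
library(dplyr)
library(tidyr)

# hypothetical wide data: one row per person, one column per time point
wide <- tibble(id = c("a", "b"), t1 = c(1, 2), t2 = c(3, 4))

long <- wide %>%
  pivot_longer(cols = c(t1, t2), names_to = "time", values_to = "score")

back <- long %>%
  pivot_wider(names_from = time, values_from = score)

dim(long)            # 4 rows, 3 columns: one row per person per time point
all.equal(wide, back) # TRUE
```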
Work on these activities on your own or in small groups this afternoon to practice what we’ve learned here.
Putting together the data wrangling and data visualisation sections of this lesson, we can create a bar plot. Bar plots are different from other types of plots because they require some calculation to happen between the raw data set and the plot. We can do this calculation ourselves very easily, and then specify to R that we’ve already done it and we just want R to plot it using the values provided. Here’s how to do that:
Summarise `penguins` by species
Pipe the result into `ggplot()`
Map the x-axis to `species` so each bar will represent each species
Map [...] to `species` as well
Add the `geom_bar()` geometry as a new layer in your plot
Specify that `stat` is `"identity"`: `stat = "identity"`
Add a `theme_bw()` layer to make the plot more accessible
# try it out here!
# you can copy code from previous chunks AND from any internet searches you do
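For reference, here is the general shape of a bar plot drawn from values that have already been calculated, using a made-up table of fruit totals rather than the penguin data. Mapping `fill` to the category is an assumption made for this sketch.

```r
library(ggplot2)

# hypothetical, already-summarised data: one row per bar
fruit_totals <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  total = c(12, 7, 3)
)

p <- ggplot(fruit_totals, aes(x = fruit, y = total, fill = fruit)) +
  geom_bar(stat = "identity") + # plot the y values as given instead of counting rows
  theme_bw()

p # print the plot
```

`geom_col()` is a shorthand for `geom_bar(stat = "identity")`, so either can be used here.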
Below is code for a really useful type of plot called an interaction plot. There is a lot going on, both in how the data are summarised and how the plot is constructed. Go through the code line by line and figure out what each piece does. Write comments to yourself so you can remember later! (`# comments start with a hash symbol`)
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    .groups = "drop",
    count = n(),
    mean_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g),
    standard_error = sd_mass / sqrt(count)
  ) %>%
  ggplot(aes(
    x = sex,
    y = mean_mass,
    colour = species,
    group = species
  )) +
  theme_bw() +
  geom_point() +
  geom_path() +
  geom_errorbar(aes(
    ymin = mean_mass - standard_error,
    ymax = mean_mass + standard_error
  ),
  width = .1
  ) +
  ggtitle("Interaction of species and sex for aggregate body mass in grams") +
  ylab("mean body mass (g)") +
  NULL
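The error bars above are built from the standard error of the mean, which is the standard deviation divided by the square root of the number of observations. A base R sketch of that one calculation, on a made-up vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9) # hypothetical measurements

se <- sd(x) / sqrt(length(x))  # standard error of the mean

# the lower and upper ends of an error bar centred on the mean
c(lower = mean(x) - se, upper = mean(x) + se)
```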
R for Data Science (R4DS) is a free book about using R and tidyverse to do all types of data science.