## Packages required for this lesson:
#install.packages(c("tidyverse","palmerpenguins"))
library(tidyverse) # Rstudio should prompt you if a package is required to run a notebook but isn't installed.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(palmerpenguins)

1 Visualisation

1.1 Why dataviz?

If you don’t know the shape of your data, you might not draw appropriate conclusions about the statistical tests you perform.

Check out the Datasaurus Dozen, which all have the same x/y means and standard deviations, sometimes called the Simpson Paradox.

Dataviz is also engaging, communicative, creative, efficient, and fun.

1.2 Base R plots

There are some ways to create plots quickly, and it’s good to know a little about that before we get too far.

plot(penguins$bill_length_mm, penguins$bill_depth_mm)

You can learn more about what to do with these plots later, but I want to show you a Normal Quantile-Quantile plot (QQ plot) output so you can see how easy it is to make even before you learn why you’d want to make it. In short: it helps you determine if your data are sufficiently normally distributed to perform statistical tests that assume you are using normally distributed data. If the points (mostly) fall along the line, you’re in luck!

qqnorm(penguins$bill_length_mm)
qqline(penguins$bill_length_mm)

These data are not perfectly normally distributed, but they’re close enough and where they deviate, they don’t deviate drastically. Very rarely will you find data that is perfect!

As an example of data that is not normally distributed, we can look at longitude coordinates or depth value from the quakes dataset (another dataset that is built into R automatically).

# longitude (`long`) is not normal because data was gathered from two islands, so it is bimodal
qqnorm(quakes$long)
qqline(quakes$long)

# depth is not normal because the highest and lowest values deviate extremely from the
#   theoretical Q-Q line
qqnorm(quakes$depth)
qqline(quakes$depth)

You will become a better (visual) judge with practice and exposure.

2 Visual grammar

In the tidyverse, the package for making elegant plots is called ggplot2. It works a lot like how pipes work by building up graphics in layers. This might seem to be a bit of a faff at first, but it ends up being powerful (and easy once you know how the components work). But since it was originally designed as a separate package, it uses + instead of %>%.

To begin, we need to specify what data we’re using.

# this will print a blank plot
penguins %>% 
  ggplot() + 
  theme_bw() + # this line makes it easy to print and more accessible for visual disabilities
  NULL # this line allows us to end each preceding content line with a + regardless of order or comments (#)

       # ending with NULL is completely optional. it's just a nice trick for troubleshooting later on.

This is the background of the plot – a check to see that penguins is something that can be plotted from.

To start adding things to the plot, we need to specify what we want the plot to extract from penguins. That means we have to tell it what the axes are. Make the x-axis bill_length_mm and the y-axis bill_depth_mm.

# this will print a *mostly* blank plot...
penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) +
  theme_bw() +
  NULL

Now the plot knows a bit more about what we’re asking, but not enough to show up the data. This is how ggplot() differs from just plot(). By itself, plot() infers how we want to display our data. This is great when it’s correct and not great when it’s wrong (which it often is, without additional specifications). In contrast, ggplot() requires the specifications from the start, but they’re integrated more smoothly.

Now let’s tell it what kind of plot to make. This is called the plot’s “geometry”, abbreviated “geom”.

# make a scatter plot
penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) +
  theme_bw() +
  geom_point() +
  NULL

## Warning: Removed 2 rows containing missing values (`geom_point()`).

You can get rid of the message at the top by adding , warning = FALSE after the code chunk identifier:

{r, warning = FALSE}

We don’t care about the warning because it’s just saying we have some empty cells in our dataset. R knows how to deal with empty cells, so there’s really no reason for us to worry.

You can do a lot of other fancy stuff in the {r} space, but you’ll have to look that up on your own time.

2.1 Lines of best fit

Our eyes are not designed to find subtle patterns in data, and our minds will often “see” patterns and differences where none exist. Therefore, we should always confirm what we see with statistics. One way to do that is to draw a line of best fit. This works best with continuous data (as below), but it can also work with categorical and binomial data (in the appendix).

# add a line of best fit
penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) +
  theme_bw() +
  geom_point() +
  geom_smooth(method = "lm", 
              formula = "y ~ x") + # line of best fit based on the lm() method
  NULL

This line suggests that as bill length gets longer, bill depth gets smaller, overall. However, there is clearly a lot of variance in this data! Perhaps we could add another variable to the point plot to figure out where that “noise” is coming from.

# fill in plot with colour contigent on species
penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             colour = species)) + # add colour to the base aesthetics
  theme_bw() +
  geom_point() +
  NULL

This gives a very different picture of the relationship between bill length and bill depth! Let’s re-add the smooths to see how those lines of best fit are oriented.

penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             colour = species)) + # add a line to plot colour contingent on a mapping
  theme_bw() +
  geom_point() +
  geom_smooth(method = "lm",
              formula = "y ~ x") +
  NULL

Notice how the colour propagates to both the points and to the smooths. This is because any aesthetics specified in the ggplot() function is “inherited” by the rest of the geometries. If you don’t want something to be inherited, you can specify it in only the geometry you want it to be part of.

That said, both “pictures” of the data are correct:

penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) + 
  theme_bw() +
  geom_point(aes(colour = species)) + # colour by species for points only
  geom_smooth(colour = "black", 
              method = "lm",
              formula = "y ~ x") + # line of best fit regardless of species
  geom_smooth(aes(colour = species), 
              method = "lm",
              formula = "y ~ x") + # line of best fit contingent on species
  NULL

These lines of best fit give us insight into what our statistical models will be telling us, when we run them.

2.2 Types of plots

Let’s take a look at what’s needed to make a histogram or density plot.

ggplot(penguins, 
       aes(x = bill_length_mm,
           fill = species)) +
  theme_bw() +
  #geom_histogram(bins = 40) + # stacked histogram
  #geom_density(alpha = .5) + # smooth density (probability) distribution
  NULL

Another common and useful type of plot is the box and whisker plot.

ggplot(penguins, 
       aes(x = island,
           y = flipper_length_mm,
           fill = species)) +
  theme_bw() +
  geom_boxplot()

Bar plots are sometimes controversial, but they can also be very useful. They take slightly different arguments than other types of plots because of how the bar height is ‘calculated’.

penguins %>% 
  filter(!is.na(flipper_length_mm)) %>% 
  group_by(species) %>% 
  summarise(flipper_length_mm_mean = mean(flipper_length_mm),
            flipper_length_mm_sd = sd(flipper_length_mm),
            flipper_length_mm_se = sd(flipper_length_mm)/sqrt(n())) %>% 
  ggplot(aes(x = species,
             y = flipper_length_mm_mean,
             fill = species)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = flipper_length_mm_mean - flipper_length_mm_sd, 
                    ymax = flipper_length_mm_mean + flipper_length_mm_sd),
                width = .2) +
  NULL

You may also want to explore fancier types of plots, or combine types we’ve already encountered. This is easy with ggplot()’s modular construction and visual grammar.

ggplot(penguins, 
       aes(x = island,
           y = flipper_length_mm,
           fill = species,
           shape = species)) + 
  theme_bw() +
  geom_violin() +
  geom_boxplot(position = position_dodge(.9),
               alpha = .3) +
  geom_point(position = position_jitterdodge(jitter.width = .3, 
                                             dodge.width = .9, 
                                             jitter.height = 0),
             alpha = .5) +
  scale_fill_manual(values = c("cadetblue", "salmon", "goldenrod1")) +
  NULL

Notes:

shape is the shape of the point (see more here)
- You can set the shapes you want by using scale_shape_manual()
scale_fill_manual() allows you to choose your own colours (see more here)
- You can choose your own, or use pre-set color palettes, e.g. with the viridis package

3 Visual analysis

Statistics is (like) dark magic: if you have mastered it, it means you’ve dedicated your life (soul) to it and have no time or capacity to do other things. If you don’t know what you’re doing at all but try to do it anyway, it can hurt you or people around you. If you’re somewhere in the middle, it is best to go slow, hedge your bets, use it judiciously, and ask for help often.

Why is statistics something to be wary of?

It’s unintuitive. Human brains were not designed to understand statistics intuitively.
Human brains are too good at detecting patterns – even if none exist.
Human brains like attributing causality to things regardless of underlying mechanisms.
Developing your expertise in your field or subfield of choice is a major undertaking, and statistics is an entirely independent area to develop expertise in as well.
Be honest with yourself: how comfortable are you teaching yourself complex maths?
Statistics is a tool for:
1. telling us what we already know, and/or
2. telling us that our squishy human brains are only human (and wrong about what we think we know)

3.1 Continuous data

Let’s create our own toy dataset:

set.seed(15) # set the "randomness" so we all get the same "random" numbers
x = rnorm(50) # 50 random numbers from a normal distribution
y = 2 * x + 5 # for each x, multiply by 2 (slope) and add 5 (intercept)

Here is what this data look like. Too perfect, everything on a perfect line, even with randomness:

tibble(x, y) %>% # create a table with two columns
  ggplot(aes(x=x, y=y)) + # establish the base of a plot
  geom_point() + # use points to plot the data
  theme_bw() + # use a nice theme
  geom_abline(intercept = 5, slope = 2, colour="red") # add a red line with the specified slope

Let’s add more noise, like any complex system would have:

e = rnorm(50) # random noise

# realistic model
y2 = 2 * x + 5 + e # slope = 2, intercept = 5, random noise ("error" or epsilon) = e

tbl1 <- tibble(x, y2) # combine into a dataset

How does the new noise change the data?

tbl1 %>% # using this dataset
  ggplot(aes(x=x, y=y2)) + # create a base plot with x on the x-axis and y2 on the y axis
  geom_point() + # make it a scatter plot (points)
  theme_bw() + # make it pretty
  geom_abline(intercept = 5, slope = 2, colour="red") # add the red line to indicate intended slope and intercept

Is the red line still the best way to approximate this data?

model <- lm(y2 ~ x) # calculate slope and intercept automatically

What are the calculated slope and intercept?

coef(model) # `coef` stands for coefficients

## (Intercept)           x 
##    4.886943    2.199415

Plot the data with the intended shape of the data (red) and the calculated shape (green):

tbl1 %>% # using the toy dataset
  ggplot(aes(x=x, y=y2)) + # establish the base of the plot
  geom_point() + # draw the data as points
  geom_vline(xintercept=0, colour="black") + # add a vertical black line at the "intercept" (y axis)
  geom_abline(intercept = 5, slope = 2, colour="red") + # add the intended shape of the data (red)
  geom_abline(intercept = coef(model)[1], slope = coef(model)[2], colour="green3") + # add the calculated shape (green)
  theme_bw() # make it pretty

How do the intended shape and calculated shape differ? Why might this be?

Moreover, there’s actually a way to do this calculation within a plot:

tbl1 %>% 
  ggplot(aes(x=x, y=y2)) +
  geom_point() +
  geom_smooth(method="lm",
              formula = 'y ~ x') + # here is where the calculation happens within the plot
  geom_vline(xintercept=0, colour="black") +
  geom_abline(intercept = 5, slope = 2, colour="red") +
  geom_abline(intercept = coef(model)[1], slope = coef(model)[2], colour="green3") +
  theme_bw()

But what happens if we add a LOT more noise?

set.seed(2) # set the randomizer 'seed' so we all have the same 'random' values
y3 = 2 * x + 5 + rnorm(50)*15 # add 15 times as much noise to the y values
model.noise = lm(y3 ~ x)

tibble(x, y3)  %>% 
  ggplot(aes(x=x, y=y3)) +
  geom_point() +
  geom_smooth(method="lm",
              formula = 'y ~ x',
              colour = "green4") + 
  geom_vline(xintercept=0, colour="black") +
  geom_abline(intercept = 5, slope = 2, colour="red") +
  geom_abline(intercept = coef(model.noise)[1], slope = coef(model.noise)[2], colour="green3") +
  theme_bw()

If you have enough noise in your data, you can still predict where the true slope+intercept could be, but not with enough confidence to know where it truly is. This is why the red line (the true value) is within the greyed out area, or our “confidence interval”. However, the confidence interval also includes the possibility for a slope going in the opposite direction. Thus if we didn’t already know what the true value was supposed to be, we wouldn’t even be able to say whether the slope was positive or negative!

4 Workshop activities

We have used geom_point() and geom_smooth() to plot data. R does most of the calculations for us, so we only need to specify certain dimensions of the plot and it does the rest. Now try getting R to plot a geom_boxplot() box and whisker plot with species on the x-axis and one of the numerical columns from penguins on the y-axis. Add the aesthetic fill to colour in the boxes. Try filling the boxes by species, but also try island or year, to see what happens.

# you can copy much of the code from the preceding chunks and edit it to change the geometries and aesthetic mappings.

5 Additional resources:

How to link dataviz to statistical analysis, specifically linear models: (Types of) Linear Models, another lesson I gave at the same summer school. This will be extremely useful to prepare for the following sessions, if you are new to linear models or would like to understand more about how to visualise different types of data and their lines of best fit.
ggplot2: Elegant Graphics for Data Analysis is a free book about using ggplot2 to create concise, informative, and beautiful graphics to communicate your data easily.
Dataviz Cheatsheet PDF

Adventures in R Session 3: Dataviz