You will need to load in (and possibly install) the following packages:
library(tidyverse)
library(viridis)
You will also need the following datasets:
And here is the .Rmd file to follow along with me.
Visualising data is more than just turning numbers into pictures. It’s telling a story, creating a narrative, and providing analysis all at once. To visualise data well, there are a few aspects you must consider:
Read in a data file to form the following plots.
spdata <- read.csv("data/simpson-paradox-data.csv")
First, plot the data as simply and in as raw a form as possible to see if there are any clear problems or internal structures.
plot(spdata$y~spdata$x)
You may also want to look at the raw data to get a sense of how it’s structured or whether there are any major problems.
head(spdata)
A summary of the data can also be useful in getting a sense of its shape.
summary(spdata)
x y z
Min. : 48.13 Min. : 28.28 Length:304
1st Qu.:187.58 1st Qu.:127.53 Class :character
Median :243.09 Median :193.93 Mode :character
Mean :244.71 Mean :198.23
3rd Qu.:307.36 3rd Qu.:270.22
Max. :479.85 Max. :374.08
We could make our own summaries.
spdata %>%
group_by() %>%
summarise(number=n(),
x.mean=mean(x),
x.sd=sd(x),
x.se=x.sd/sqrt(number),
y.mean=mean(y),
y.sd=sd(y),
y.se=y.sd/sqrt(number))
We could look at statistical tests.
cor.test(spdata$x, spdata$y)
Pearson's product-moment correlation
data: spdata$x and spdata$y
t = 7.8973, df = 302, p-value = 5.332e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3159325 0.5028124
sample estimates:
cor
0.4137212
And that could be it.
However, without doing more detailed and principled exploration, you might miss something in your data that would cause a Simpson’s Paradox.
ggplot(spdata, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method="lm") +
theme_bw()
Above we see what looks like a clear and significant positive correlation. But from looking at the data, we should know that there is another column of interest that could affect the shape. (Column `z`, which contains letters.)
ggplot(spdata, aes(x=x, y=y)) +
geom_point(aes(colour=z)) +
geom_smooth(aes(colour=z), method="lm") +
theme_bw()
To make this even more clear, we can overlay the groups’ regression lines with the overall regression line.
ggplot(spdata, aes(x=x, y=y)) +
geom_point(aes(colour=z)) +
geom_smooth(method="lm") +
geom_smooth(aes(colour=z), method="lm") +
theme_bw()
Each of these tells us something true, but they aren’t all equally useful for informing us about the shape of our data.
For more examples: Datasaurus Dozen
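We can also check the within-group relationships numerically rather than visually; a quick sketch using the same `spdata`:

```r
library(tidyverse)

# correlation between x and y, computed within each z group
spdata %>%
  group_by(z) %>%
  summarise(r = cor(x, y))
```

If the within-group correlations differ in sign (or substantially in size) from the overall correlation, you are looking at a Simpson's Paradox.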
By the same token, not every graph gives you the same information.
Read in the next data file.
bpdata <- read.csv("data/boxplot-problems-data.csv")
If we plot this data using a boxplot, we get a lot of information about the median and quartile distribution of the data.
bpdata %>%
ggplot(aes(y=y,x=z,fill=z)) +
geom_boxplot()
But these four categories look to have approximately the same median. The `d` category has a narrow distribution, but `a` and `b` look pretty similar and `c` is only slightly more broadly distributed. It’s possible that there are no statistical differences between these categories!
# Bonferroni-corrected alpha for three comparisons:
0.05/3
[1] 0.01666667
So, any p-values below 0.0167 can be considered statistically significant.
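Base R can perform this correction for you by scaling the p-values instead of the alpha; a small sketch using `p.adjust()` with made-up p-values:

```r
# Bonferroni adjustment multiplies each p-value by the number of tests,
# which is equivalent to comparing the raw p-values against 0.05/3
p.adjust(c(0.02, 0.04, 0.30), method = "bonferroni")
# [1] 0.06 0.12 0.90
```

An adjusted p-value below 0.05 is then significant, just as a raw p-value below 0.0167 would be.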
bpdata %>% # AB, AC, AD
mutate(z=z %>% as.factor) %>%
lm(y ~ z, data=.) %>%
summary() # A is different to D at alpha=0.05
Call:
lm(formula = y ~ z, data = .)
Residuals:
Min 1Q Median 3Q Max
-282.440 -104.122 1.443 94.873 308.170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 237.711 7.994 29.736 <2e-16 ***
zb 4.621 10.040 0.460 0.6454
zc 10.712 13.755 0.779 0.4362
zd 24.290 10.588 2.294 0.0219 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 138 on 1357 degrees of freedom
Multiple R-squared: 0.004832, Adjusted R-squared: 0.002632
F-statistic: 2.196 on 3 and 1357 DF, p-value: 0.08675
bpdata %>% # BC, BD
mutate(z=z %>% as.factor,
z=fct_relevel(z, "b")) %>%
lm(y ~ z, data=.) %>%
summary() # B is different to D at alpha=0.05
Call:
lm(formula = y ~ z, data = .)
Residuals:
Min 1Q Median 3Q Max
-282.440 -104.122 1.443 94.873 308.170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 242.333 6.075 39.890 <2e-16 ***
za -4.621 10.040 -0.460 0.6454
zc 6.091 12.735 0.478 0.6325
zd 19.669 9.226 2.132 0.0332 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 138 on 1357 degrees of freedom
Multiple R-squared: 0.004832, Adjusted R-squared: 0.002632
F-statistic: 2.196 on 3 and 1357 DF, p-value: 0.08675
bpdata %>% # CD
mutate(z=z %>% as.factor,
z=fct_relevel(z, "c")) %>%
lm(y ~ z, data=.) %>%
summary() # All ns
Call:
lm(formula = y ~ z, data = .)
Residuals:
Min 1Q Median 3Q Max
-282.440 -104.122 1.443 94.873 308.170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 248.423 11.193 22.194 <2e-16 ***
za -10.712 13.755 -0.779 0.436
zb -6.091 12.735 -0.478 0.633
zd 13.578 13.172 1.031 0.303
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 138 on 1357 degrees of freedom
Multiple R-squared: 0.004832, Adjusted R-squared: 0.002632
F-statistic: 2.196 on 3 and 1357 DF, p-value: 0.08675
We’ve run some simple stats and taken a look at the data, but we can’t see the differences, and it’s not very convincing that they’re really there, especially after correcting for multiple comparisons. So what’s going on?
bpdata %>%
ggplot(aes(y=y,fill=z,x=z)) +
geom_violin()
Oh, these all have differently shaped density distributions! We couldn’t see that in the boxplot because it only showed us information about the quartile distribution.
But we still don’t know enough. For instance, we can see there are fewer points around 200 in category `c` than in the other categories, but what does this translate to? Are there no points there or just very few?
bpdata %>%
ggplot(aes(y=y,fill=z,x=z)) +
geom_violin() +
geom_point(position=position_jitter(.2), alpha=.5)
It turns out that `c` has no points in the middle and that it’s a strongly bimodal distribution. What does using a violin plot imply to your audience, then?
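To see the bimodality even more directly, we could plot a histogram per category; a sketch using the same `bpdata`:

```r
library(tidyverse)

# faceted histograms make the gap in category c's distribution explicit
bpdata %>%
  ggplot(aes(x = y, fill = z)) +
  geom_histogram(bins = 30) +
  facet_wrap(~z)
```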
Create the data set.
tibble(c("a","a","b","b"),c("c","d","c","d"),c(121, 120, 122, 125)) %>%
rename(factor="c(\"a\", \"a\", \"b\", \"b\")",
level="c(\"c\", \"d\", \"c\", \"d\")",
value="c(121, 120, 122, 125)") -> toydata
There are many things being communicated, even in a simple plot of a dataset such as this.
toydata %>%
ggplot(aes(x=factor,y=value,colour=level)) +
geom_point()+
geom_path(aes(group=level))
What is this plot telling us?
- There is an interaction between `c` and `d`, whereby the increase from `a` to `b` is larger for `d` than for `c`
- A continuous change between the levels for `c` and `d` (the connecting lines)
- An implicit ordering, with `a` coming before `b` (left to right)
- Direct comparability of `c` and `d`, since they appear overlaid

toydata %>%
ggplot(aes(x=factor,y=value,fill=level)) +
geom_bar(position="dodge",stat="identity")
What is this plot telling us?
- There is an interaction between `c` and `d`, whereby the increase from `a` to `b` is larger for `d` than for `c`
- Conditions `a` and `b` are independent, as are conditions `c` and `d`
Busyness can confuse and mislead your audience, but over-simplicity can as well. In order to balance presenting enough data against presenting too much data, consider how many “dimensions” you are showing, and how much of the data and how many plot elements you display at once.
Let’s create a new dataset to play with.
set.seed(3); spdata %>%
mutate(rValX = rnorm(x),
rValY = rnorm(y) %>% abs() %>% log()) -> vizdata
What’s wrong with this graph?
vizdata %>%
ggplot(aes(x=rValX,
y=rValY,
colour=z)) +
geom_point()
vizdata %>%
ggplot(aes(x=rValX,
y=rValY,
fill=x, # add dimension for column `x`
size=y, # add dimension for column `y`
colour=z)) +
geom_point(shape=21) + # change 'point character' to allow for colour + fill
scale_fill_gradientn(colours = rainbow(7)) + # add visually striking gradient scale
theme_bw() + # make background more plain
# add the labels:
xlab("Randomized normal distribution of `x`") +
ylab("Log of the absolute value of a\nrandomized normal distribution of `y`") +
ggtitle("My cool plot", subtitle = "UKCLC2020")
viridis

From Intro to Viridis:
These color scales are designed to be:
- Colorful, spanning as wide a palette as possible so as to make differences easy to see,
- Perceptually uniform, meaning that values close to each other have similar-appearing colors and values far away from each other have more different-appearing colors, consistently across the range of values,
- Robust to colorblindness, so that the above properties hold true for people with common forms of colorblindness, as well as in grey scale printing, and
- Pretty, oh so pretty
vizdata %>%
ggplot(aes(x=rValX,
y=rValY,
fill=x,
size=y,
shape=z, # add shape
colour=z)) +
geom_point(alpha=.75) + # add transparency
scale_shape_manual(values=c(21, 22, 23, 24)) + # manually choose which shapes to plot
scale_fill_viridis(option="plasma") + # choose a better palette
theme_bw() +
xlab("Randomized normal distribution of `x`") +
ylab("Log of the absolute value of a\nrandomized normal distribution of `y`") +
ggtitle("My cool plot", subtitle = "UKCLC2020")
vizdata %>%
ggplot(aes(x=rValX,
y=rValY,
fill=x,
size=y,
shape=z,
colour=z)) +
geom_point(alpha=.75) +
scale_shape_manual(values=c(21, 22, 23, 24)) +
scale_fill_viridis(option="plasma") +
theme_bw() +
xlab("Randomized normal distribution of `x`") +
ylab("Log of the absolute value of a\nrandomized normal distribution of `y`") +
ggtitle("My cool plot", subtitle = "UKCLC2020") +
facet_grid(~z) # facet the data by column `z` category
A picture is worth a thousand words. The purpose of a graph is to support the “story” you are telling about your data. It should be true, but not everything true about it is relevant.
In our `vizdata` dataset, `x` and `y` are meaningful, and as it turns out, `rValX` and `rValY` are… random. So those values might not be as relevant to our audience. We will want to visualise them for ourselves as we explore the shape of the data in preparation for analysis, but we don’t necessarily need to include every possible dimension.
Our eyes are more sensitive to spatial location than to colour gradients, so let’s use colour for the less important properties of the data.
Shape and (outline) colour are doubly redundant now that we’re faceting the data by column `z` category, so we can get rid of those dimensions altogether.
Since the correlation between x and y is known to be important, we can add a smooth to draw our audience’s attention to this property.
vizdata %>%
ggplot(aes(x=x,
y=y,
colour=rValX,
size=rValY)) +
geom_point(alpha=.5) +
scale_colour_viridis(option="plasma") + # use 'colour' (not 'fill') because the default point shape has no fill
geom_smooth(method = "lm", fill=NA, colour="black", lwd=0.5) + #add visually distinctive smooth
theme_bw() +
xlab("Column `x` value") +
ylab("Column `y` value") +
ggtitle("My cool plot", subtitle = "UKCLC2020") +
facet_grid(~z)
This graph crams ALL of the information in our dataset into a single plot. But it’s not necessary to do so. After all, our original plot showing the four categories and their smooths gives us the same relevant information.
vizdata %>%
ggplot(aes(x=x,
y=y)) +
geom_point(alpha=.5) +
geom_smooth(method = "lm", fill=NA, colour="black", lwd=0.5) + #add visually distinctive smooth
theme_bw() +
xlab("Column `x` value") +
ylab("Column `y` value") +
ggtitle("My cool plot", subtitle = "UKCLC2020") +
facet_grid(~z)
Now, this plot has a lot less information than the previous one, but it’s very easy to print and definitely has no colourblindness accessibility issues. There are many ways to consider your audience’s priorities.
During this time, attendees will learn how the structure of a dataset can be changed to best suit their analytical purposes. I will then introduce `ggplot2` and other plotting methods in R, with a focus on what types of graphs communicate what information.
First, let’s load in the dataset you’ll be using for the group challenge.
challenge <- read_csv("data/simulated-data.csv")
You can read more about the dataset and its structure in the next section.
head(challenge)
The basic tools for data manipulation are `mutate`, `filter`, `group_by`/`summarise`, and `pivot_longer`/`pivot_wider`. There are many, many others, but these four will get you far.
mutate
To mutate is to create a new column. It’s not the best name for the function, but it gets the job done.
Let’s say we want to turn our `subj` column into a factor:
challenge %>%
mutate(subj = as.factor(subj))
This overwrites the column that’s already named `subj`.
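You can verify the conversion; a quick sketch (assuming the `challenge` data loaded above):

```r
challenge %>%
  mutate(subj = as.factor(subj)) %>%
  pull(subj) %>%   # extract the column as a vector
  class()
# [1] "factor"
```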
To create a brand new column, name it something else:
challenge %>%
mutate(region.fct = as.factor(region))
filter
This one is more straightforward. Just as a low-pass filter in acoustics lets only low frequencies through, the `filter` function lets only the rows with the defined properties pass through.
challenge %>%
filter(word == "the",
age <= 40,
rating != 1 &
rating != 2)
group_by/summarise
This pair of functions lets you perform descriptive statistics on your whole dataframe all at once, based on the groups you define.
challenge %>%
group_by(gram, freq) %>%
summarise(mean = mean(rt),
sd = sd(rt),
se = sd/sqrt(n()))
pivot_longer
We’ll only talk about `pivot_longer`, since long data is generally better for `ggplot` than wide data, but the two functions operate on the same principles.
challenge %>%
# same code as above
group_by(gram, freq) %>%
summarise(mean = mean(rt),
sd = sd(rt),
se = sd/sqrt(n())) %>%
# new code
pivot_longer(cols = c("mean","sd","se"), # cols = 3:5
names_to = "statistic",
values_to = "value")
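Since `pivot_wider` is the mirror image of `pivot_longer`, it can undo this reshaping; a self-contained sketch with made-up numbers:

```r
library(tidyverse)

long <- tibble(gram      = c("yes", "yes", "no", "no"),
               statistic = c("mean", "sd", "mean", "sd"),
               value     = c(350, 40, 420, 55))

long %>%
  pivot_wider(names_from = statistic, values_from = value)
# two rows (one per `gram` level), with columns `mean` and `sd`
```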
ggplot2
The `ggplot2` package works by layering plots, as we have seen. Now I’ll step through each basic layer so you can understand what the previous code was doing.
The first layer is the base plot. It simply wakes up `ggplot`, which is now listening for more information to add.
ggplot(data = spdata)
#spdata %>%
# ggplot()
The first information to add is what the axes of our plot will be. This sort of information is ‘inherited’ by all following plot layers, so we want to define it here. Because these defined elements influence the visual properties of the plot, they’re called “aesthetics”, abbreviated `aes`.
ggplot(data = spdata,
aes(x=x,
y=y,
colour=z))
These properties will, by default, be inherited by all following layers, when appropriate. You can override this with `inherit.aes = FALSE`, or you can assign aesthetics specifically to some layers and not others within the geometry layers.
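For instance, an aesthetic mapped inside a single geometry applies only to that layer; a sketch reusing `spdata`:

```r
# colour is mapped only in geom_point, so the smooth stays a single
# overall regression line rather than one line per z group
ggplot(spdata, aes(x = x, y = y)) +
  geom_point(aes(colour = z)) +
  geom_smooth(method = "lm")
```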
We haven’t plotted the data yet because R doesn’t know what form the data should take (e.g., points, bars, box and whisker, histogram, etc.). We tell R what geometry (abbreviated `geom`) to use with a function we add (`+`) after the base plot.
Here we’re adding two geometries: `geom_point` to make the scatterplot and `geom_smooth` to draw the regression lines.
ggplot(data = spdata,
aes(x=x,
y=y,
colour=z)) +
geom_point() +
geom_smooth(method="lm")
As you can see, `geom_smooth` has an argument that isn’t within an `aes()` function. This means it applies to the properties of the layer without regard to the source dataset (`spdata`). In this case, the method for drawing the regression is “lm”, which is irrelevant to the original dataset, but we can use arguments outside of `aes()` to “hard-code” properties in the graph as well.
ggplot(data = spdata,
aes(x=x,
y=y,
colour=z)) +
geom_point(shape=15) + # also called "pch" for "point character"
geom_smooth(method="lm",
lty="dashed", # lty = "Line TYpe"
lwd=.5) # lwd = "Line WiDth"
There are a mind-blowing number of extras you can add to your plots, as we saw above. Some are specific to `ggplot2` and some come from other packages.
ggplot(data = spdata,
aes(x=x,
y=y,
colour=z)) +
geom_point(shape=15) + # also called "pch" for "point character"
geom_smooth(method="lm",
lty="dashed", # lty = "Line TYpe"
lwd=.5, # lwd = "Line WiDth"
fill="pink") +
theme_classic() +
scale_colour_manual(name="Category Z:",
values = c("red","gold","turquoise","blueviolet")) +
theme(legend.position = "bottom")
For the next 45 minutes, you can work on your own or in a small group to create a graph from the simulated-data.csv file.
This challenge brings together the skills learned in the previous part of the workshop.
When you have finished your graph, save it (or screenshot it) and send me the image file at lauren[DOT]ackerman[AT]ncl[DOT]ac[DOT]uk. Make sure that the file name is something by which I can identify you (e.g., your name or a “team name”).
We will return for the final 20 minutes to look at each others’ creations and compete for the title of DataViz Champion! (I know it’s silly, but it’s also a lot of fun.)
The “Simulated Data” file is a comma separated value (CSV) file, which R can read in easily, either using the `read.csv()` function or by clicking “Import dataset”.
This dataset is simulated experimental data for a reading time, rating, and comprehension experiment, which means there are several types of responses. Each column is described below.
There are two crossed factors, each with two levels:
* gram = yes / no
* freq = high / low
This means there are four conditions:
1. High frequency grammatical
2. Low frequency grammatical
3. High frequency ungrammatical
4. Low frequency ungrammatical
Sentence: “The old VERB the boat.”
| VERB | Frequency | Grammaticality |
|---|---|---|
| man | low | yes |
| steer | high | yes |
| put | high | no |
| stare | low | no |
This experiment doesn’t really ask a question, because it’s not a real experiment. Instead, it sets up a paradigm that will allow you to compare across conditions using three different types of data:
Since each item contains five regions, but rating and accuracy data are per item rather than per region, you will have to take a subset of the total dataset in order to properly graph and analyse rating or accuracy data. If you don’t, you’ll be graphing or analysing five copies of this data at once.
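For example, filtering to a single region leaves one row per item; a sketch that assumes regions are numbered and that `1` is a valid region value:

```r
# one row per item, so each rating is counted exactly once
challenge %>%
  filter(region == 1) %>%
  group_by(gram, freq) %>%
  summarise(mean.rating = mean(rating))
```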