The main function for a pairwise correlation is cor.test.
Have a look at the example. Note that we have used the apa package to turn this into nice output.
df = read_csv("WHR_2017.csv")
Rows: 155 Columns: 13
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (2): country, region
dbl (11): happiness_rank, happiness_score, whisker_high, whisker_low, gdp_per_capita, family, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
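The correlation itself can be run like this (a sketch; the object name `cor` is assumed from the `cor_apa(cor)` call below, and note that it shadows base R's `cor()` function):

```r
# Sketch: cor.test() returns an htest object containing r, df and p.
# The object name `cor` matches the one passed to cor_apa() below,
# though it shadows base R's cor() function.
cor <- cor.test(df$gdp_per_capita, df$happiness_score)
cor
```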
Okay, we have reported this, but it doesn’t look very “publication-ready”. R has a range of packages which convert the results of analyses into publication-ready formats.
cor_apa(cor)
r(153) = .81, p < .001
Often, you will need to report your results "in-line" (i.e. in the body of your text). To do this in R Markdown you use backticks: three backticks fence a code block, but in-line code takes just a single pair of backticks (and no curly brackets).
For example… the correlation between GDP per capita and happiness scores is .
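In the R Markdown source, that sentence is written with inline R code between single backticks, along these lines (a sketch; `cor` is assumed to hold the cor.test result, whose `$estimate` element is the correlation coefficient):

```markdown
The correlation between GDP per capita and happiness scores is
`r round(cor$estimate, 2)`.
```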
Explore other correlations in the data!!
This is a bit more tricky. The best approach is to create a dataframe containing only those variables on which you wish to conduct the pairwise correlations. The cor function from base R creates a correlation matrix, but the results are not very publication-ready. They can be tidied up via the kable function (from the knitr package). There is also the fantastic corrplot package, which will give you a great graphical representation.
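A sketch of that workflow; the choice of columns here is an assumption (any subset of the numeric columns listed in the column specification above would do):

```r
library(knitr)    # for kable()
library(corrplot) # for corrplot()

# Keep only the variables we want to correlate
# (column names assumed from the read_csv specification above)
df_num <- df %>% select(happiness_score, gdp_per_capita, family)

cor_matrix <- cor(df_num)     # base R correlation matrix
kable(cor_matrix, digits = 2) # tidied up for publication
corrplot(cor_matrix)          # graphical representation
```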
df = read_csv("WHR_2017.csv")
mod_hs_gdp <- lm(happiness_score ~ gdp_per_capita, data = df)
# "mod" = model.
# "hs_gdp" = hs is predicted by gdp
mod_hs_gdp # This shows you the key statistics, but it's not very informative
summary(mod_hs_gdp) # This is much more informative but it's not pretty to look at
tidy(summary(mod_hs_gdp)) # `tidy` from the `broom` package makes it into a very nice table
# or to make this a bit more "publication-ready"
knitr::kable(tidy(summary(mod_hs_gdp)))
The table looks nice, but often, at least in Psychology, results are reported in-line. A good way to do this is to create your own function.
report_coef = function(mod){
  sm = summary(mod)
  results_vector = rep("", length(mod$coefficients))
  for (i in 1:length(mod$coefficients)){
    results_vector[i] = paste0("β = ", round(sm$coefficients[i, 1], 2),
                               ", t = ", round(sm$coefficients[i, 3], 2),
                               ", p = ", sprintf("%.3f", round(sm$coefficients[i, 4], 3)))
  }
  return(results_vector)
}
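For example (a sketch; the `[2]` indexing assumes the slope is the second coefficient, after the intercept):

```r
report_coef(mod_hs_gdp)     # a string for every coefficient
report_coef(mod_hs_gdp)[2]  # just the slope, ready for in-line reporting
```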
There was a significant effect of GDP per capita: β = 3.2, t = 23.62, p = 0.000
NB the code to create the function is tricky. It took me 20 minutes to write with a lot of googling. A newbie would spend much longer. However, if you are writing a thesis which involves the in-line reporting of the results of a large number of linear models, this could save you a lot of time in the long run.
But is it a good model for the data? Let's do a plot to find out.
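A sketch of the diagnostic plots, using the same base R tools that appear in the worked answer further down:

```r
# Diagnostics for the happiness ~ GDP model
residuals = resid(mod_hs_gdp)
hist(residuals, breaks = 20)                      # distribution of residuals
qqnorm(residuals); qqline(residuals)              # QQ-plot against the normal
plot(fitted(mod_hs_gdp), residuals); abline(0, 0) # residuals vs fitted values
```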
Residuals are close to normally distributed, according to both the histograms and QQ-plot. The relationship between residuals and fitted values does not suggest a problem.
Now switch the dependent variable to life expectancy. Do you think that the relationship will be linear? Why? Why not?
Test your assumptions by plotting and examining the residuals.
You can use the above code as a crib, but try to refer to it as little as possible. The answer is below, but try not to look at it when doing this exercise.
df = read_csv("WHR_2017.csv")
mod_le_gdp = lm(life_expectancy ~ gdp_per_capita, data = df)
# "mod" = model.
# "le_gdp" = le (life expectancy) is predicted by gdp
mod_le_gdp # This shows you the key statistics, but it's not very informative
Call:
lm(formula = life_expectancy ~ gdp_per_capita, data = df)
Coefficients:
(Intercept) gdp_per_capita
0.08361 0.47499
summary(mod_le_gdp) # This is much more informative but it's not pretty to look at
Call:
lm(formula = life_expectancy ~ gdp_per_capita, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.44149 -0.06442 0.02448 0.08073 0.21928
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08361 0.02622 3.189 0.00173 **
gdp_per_capita 0.47499 0.02450 19.391 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1279 on 153 degrees of freedom
Multiple R-squared: 0.7108, Adjusted R-squared: 0.7089
F-statistic: 376 on 1 and 153 DF, p-value: < 2.2e-16
tidy(summary(mod_le_gdp)) # `tidy` from the `broom` package makes it into a very nice table
residuals = resid(mod_le_gdp)
hist(residuals, breaks = 20)
plot(fitted(mod_le_gdp), residuals); abline(0,0)
qqnorm(residuals); qqline(residuals)
g = ggplot(data = df, aes(x = gdp_per_capita, y = life_expectancy))
g = g + geom_point()
g = g + geom_smooth()
g
g = ggplot(data = df, aes(x = gdp_per_capita, y = life_expectancy, label = country))
g = g + geom_point(colour = "light blue")
g = g + geom_smooth(se = FALSE)
g = g + geom_text(size = 2.5, nudge_y = -0.02, alpha = 0.6)
g
So, the relationship is fairly linear, but there are a couple of issues. First of all, there is a natural ceiling to life expectancy, so it does not improve beyond a certain GDP. Secondly, there are some countries where life expectancy is much lower than their GDP would predict. Can you guess which countries fall into this category?
You can add piping within models / plots, e.g.
g = ggplot(data = df %>% filter(region != "Africa"),
aes(x = gdp_per_capita, y = life_expectancy))
g = g + geom_point()
g = g + geom_smooth()
g
There is a variety of models designed for data where distributions depart radically from the normal distribution.
We’re going to look at logistic regression.
Because the outcome is dichotomous, its relationship with a predictor cannot be a straight line. Therefore we need to use a special mathematical transformation (the logit, or log-odds) to turn it into a straight line. This transformation is "built into" the model.
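A minimal, self-contained sketch of what the logit does: it maps probabilities (between 0 and 1) onto the whole real number line, which is what lets the model fit a straight line.

```r
p <- c(0.1, 0.5, 0.9)
qlogis(p)          # the logit: log(p / (1 - p))
log(p / (1 - p))   # the same thing by hand
plogis(qlogis(p))  # the inverse logit recovers the probabilities
```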
We’re going to turn one of our variables into a dichotomous variable using a “median-split”
as.numeric(df$happiness_score >= median(df$happiness_score))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[47] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[93] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[139] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
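To use this in a model, the result needs to be stored as a new column. The column name happiness_score_ms is taken from the model formula shown in the output below:

```r
# "ms" = median split (the column name is taken from the model formula below)
df$happiness_score_ms <- as.numeric(df$happiness_score >= median(df$happiness_score))
```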
Now we're going to run our regression model. We're going to look at whether GDP influences whether happiness is above or below the median.
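The fitting call itself is not shown in the original; judging from the Call: line in the summary output below, it was along these lines:

```r
# Sketch, reconstructed from the Call: line in the summary output below
mod <- glm(happiness_score_ms ~ scale(gdp_per_capita),
           data = df, family = "binomial")
```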
summary(mod)
Call:
glm(formula = happiness_score_ms ~ scale(gdp_per_capita), family = "binomial",
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.07237 -0.39310 0.07613 0.49431 2.06556
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2485 0.2514 -0.988 0.323
scale(gdp_per_capita) 2.8926 0.4629 6.248 4.15e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 214.87 on 154 degrees of freedom
Residual deviance: 109.48 on 153 degrees of freedom
AIC: 113.48
Number of Fisher Scoring iterations: 6
Because the coefficient is hard to interpret, we convert it into an odds ratio. This value will be 1 if there is no relationship, > 1 if there is a positive relationship, and < 1 if there is a negative relationship. We use the exp function to do this.
exp(coef(mod))
(Intercept) scale(gdp_per_capita)
0.7800031 18.0402769
exp(mod$coefficients) # equivalent: mod$coefficients is what coef(mod) returns
(Intercept) scale(gdp_per_capita)
0.7800031 18.0402769
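As a quick check of the relationship between the two outputs above: the odds ratio is just e raised to the model coefficient.

```r
# The log-odds coefficient for scale(gdp_per_capita) from the summary above
b <- 2.8926
exp(b)  # approximately 18.04, matching the odds ratio printed above
```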
NB This odds ratio is far too large. A possible issue is that we have not "scaled" our independent variable. Scaling gives the variable a mean of zero and a standard deviation of one.
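A self-contained sketch of what scale() does (the values here are made up for illustration):

```r
x <- c(2, 4, 6, 8, 10)
s <- scale(x)  # (x - mean(x)) / sd(x)
mean(s)        # approximately 0
sd(s)          # approximately 1
```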
mod = glm(happiness_score_ms ~ scale(gdp_per_capita), data = df, family = "binomial")
exp(coef(mod))
This is a more “sensible” odds ratio.