Lecture 09 - Class Activity - Correlation and Linear Regression

Author

Bill Perry

In-class Activity 9: Correlation and Linear Regression

Introduction

This document demonstrates key concepts in correlation and regression analysis using ecological examples, focusing on:

  1. Understanding correlation vs. regression
  2. Calculating and interpreting correlation coefficients
  3. Testing correlation assumptions
  4. Performing simple linear regression
  5. Checking regression assumptions
  6. Interpreting regression output and ANOVA tables

We’ll practice these concepts with ecological datasets taken from, or simulated to match, the examples in the lecture.

Part 1: Load Required Packages and Data

# Load required packages
library(tidyverse)  # For data manipulation and visualization
library(patchwork)  # For combining plots
library(car)        # For regression diagnostics
library(broom)      # For tidy model output

# Set seed for reproducible results
set.seed(123)

# Create the datasets from the lecture
# Lion data from Example 17.1
l_df <- tibble(
  proportion_black = c(0.21, 0.14, 0.11, 0.13, 0.12, 0.13, 0.12, 0.18, 0.23, 0.22, 
                      0.20, 0.17, 0.15, 0.27, 0.26, 0.21, 0.30, 0.42, 0.43, 0.59, 
                      0.60, 0.72, 0.29, 0.10, 0.48, 0.44, 0.34, 0.37, 0.34, 0.74, 0.79, 0.51),
  age_years = c(1.1, 1.5, 1.9, 2.2, 2.6, 3.2, 3.2, 2.9, 2.4, 2.1, 
               1.9, 1.9, 1.9, 1.9, 2.8, 3.6, 4.3, 3.8, 4.2, 5.4, 
               5.8, 6.0, 3.4, 4.0, 7.3, 7.3, 7.8, 7.1, 7.1, 13.1, 8.8, 5.4)
)

# Booby data from Example 16.1
b_df <- tibble(
  visits_as_nestling = c(1, 7, 15, 4, 11, 14, 23, 14, 9, 5, 4, 10, 
                         13, 13, 14, 12, 13, 9, 8, 18, 22, 22, 23, 31),
  future_aggression = c(-0.80, -0.92, -0.80, -0.46, -0.47, -0.46, -0.23, -0.16, 
                        -0.23, -0.23, -0.16, -0.10, -0.10, 0.04, 0.13, 0.19, 
                        0.25, 0.23, 0.15, 0.23, 0.31, 0.18, 0.17, 0.39)
)

# Prairie stability data from Example 17.3
p_df <- tibble(
  species_number = rep(c(1, 2, 4, 8, 16), times = c(32, 32, 32, 32, 33)),
  log_stability = 1.20 + 0.033 * species_number + rnorm(161, 0, 0.35)
)
Package Overview
  • tidyverse: Collection of packages for data science
  • patchwork: Combine multiple ggplot2 plots easily
  • car: Companion to Applied Regression (diagnostic tools)
  • broom: Convert statistical objects into tidy data frames

Part 2: Correlation Analysis

Correlation Analysis: Data Types and Assumptions

Data Types Required:

  • X variable: Continuous numerical
  • Y variable: Continuous numerical
  • Both variables should be measured (not manipulated)

Assumptions for Pearson Correlation:

  • Random sampling from the population
  • Bivariate normality (both variables normally distributed)
  • Linear relationship between variables
  • No extreme outliers

Calculating Correlation Coefficients

Let’s start with the Nazca booby data to explore correlation:

# Calculate Pearson correlation coefficient
booby_corr <- cor(b_df$visits_as_nestling, b_df$future_aggression)
booby_corr
[1] 0.5337225
# Perform correlation test
booby_cor_test <- cor.test(b_df$visits_as_nestling, b_df$future_aggression)

booby_cor_test

    Pearson's product-moment correlation

data:  b_df$visits_as_nestling and b_df$future_aggression
t = 2.9603, df = 22, p-value = 0.007229
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1660840 0.7710999
sample estimates:
      cor 
0.5337225 
# Calculate R-squared (variance explained)
r_squared <- booby_corr^2
r_squared
[1] 0.2848597
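
As a sanity check on what cor() is doing, the same coefficient can be computed from the covariance and the two standard deviations; this short sketch should reproduce booby_corr exactly:

# Pearson r = covariance of x and y, scaled by both standard deviations
r_manual <- cov(b_df$visits_as_nestling, b_df$future_aggression) /
  (sd(b_df$visits_as_nestling) * sd(b_df$future_aggression))
r_manual   # should match booby_corr above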

Visualizing the Correlation

# Create scatterplot with correlation
booby_plot <- ggplot(b_df, aes(x = visits_as_nestling, y = future_aggression)) +
  geom_point(size = 3, alpha = 0.7)
  # To overlay the least-squares line, add this layer with a + after geom_point():
  # geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2)

booby_plot

Activity 1: Interpret the Correlation

Based on the output above, answer these questions:

  1. Direction: Is the correlation positive or negative? What does this mean biologically?
    • Your answer: __________________________
  2. Strength: How would you classify this correlation (weak, moderate, strong)?
    • Your answer: __________________________
  3. Significance: Is the correlation statistically significant? What is the p-value?
    • Your answer: __________________________
  4. Variance explained: What percentage of variance in adult aggression is explained by nestling visits?
    • Your answer: __________________________

Testing Correlation Assumptions

# Test normality of each variable
shapiro_visits <- shapiro.test(b_df$visits_as_nestling)
shapiro_aggression <- shapiro.test(b_df$future_aggression)
shapiro_visits

    Shapiro-Wilk normality test

data:  b_df$visits_as_nestling
W = 0.95783, p-value = 0.3965
shapiro_aggression

    Shapiro-Wilk normality test

data:  b_df$future_aggression
W = 0.91575, p-value = 0.04709
# Create diagnostic plots
p1 <- ggplot(b_df, aes(x = visits_as_nestling)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Visits", x = "Visits as Nestling", y = "Count") +
  theme_minimal()

p2 <- ggplot(b_df, aes(x = future_aggression)) +
  geom_histogram(bins = 10, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Aggression", x = "Future Aggression", y = "Count") +
  theme_minimal()

p3 <- ggplot(b_df, aes(sample = visits_as_nestling)) +
  stat_qq() + stat_qq_line() +
  labs(title = "Q-Q Plot: Visits", x = "Theoretical", y = "Sample") +
  theme_minimal()

p4 <- ggplot(b_df, aes(sample = future_aggression)) +
  stat_qq() + stat_qq_line() +
  labs(title = "Q-Q Plot: Aggression", x = "Theoretical", y = "Sample") +
  theme_minimal()

# Combine plots
(p1 + p2) / (p3 + p4)

When Assumptions Are Violated

If normality assumptions are violated (p < 0.05 in Shapiro-Wilk test), consider:

  1. Spearman’s rank correlation (non-parametric alternative)
  2. Data transformation (log, square root, etc.)
  3. Removing outliers (if justified)

Let’s try Spearman’s correlation:

# Calculate Spearman's rank correlation
spearman_test <- cor.test(b_df$visits_as_nestling, 
                          b_df$future_aggression, 
                          method = "spearman")
Warning in cor.test.default(b_df$visits_as_nestling, b_df$future_aggression, :
Cannot compute exact p-value with ties
spearman_test

    Spearman's rank correlation rho

data:  b_df$visits_as_nestling and b_df$future_aggression
S = 1213.5, p-value = 0.01976
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.472374 
# Compare with Pearson
print(paste("Pearson r:", round(booby_corr, 3)))
[1] "Pearson r: 0.534"
print(paste("Spearman rho:", round(spearman_test$estimate, 3)))
[1] "Spearman rho: 0.472"

Part 3: Simple Linear Regression

Now let’s move from correlation to regression using the lion nose data.

Linear Regression: Data Types and Assumptions

Data Types Required:

  • X variable (predictor): Continuous numerical
  • Y variable (response): Continuous numerical
  • X can be fixed or controlled; Y is the outcome of interest

Assumptions for Linear Regression:

  • Linearity: Relationship between X and Y is linear

  • Independence: Observations are independent

  • Homoscedasticity: Constant variance of residuals

  • Normality: Residuals are normally distributed

  • No influential outliers

Fitting a Linear Regression Model

# Fit linear regression model
lion_model <- lm(age_years ~ proportion_black, data = l_df)

# Get model summary
summary(lion_model)

Call:
lm(formula = age_years ~ proportion_black, data = l_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5449 -1.1117 -0.5285  0.9635  4.3421 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.8790     0.5688   1.545    0.133    
proportion_black  10.6471     1.5095   7.053 7.68e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared:  0.6238,    Adjusted R-squared:  0.6113 
F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08
Activity 2: Interpret the Regression Output

From the regression output above:

  1. Regression equation: Write the equation in the form: age = __________________________ + __________________________ × proportion_black
    • Your answer: __________________________
  2. Slope interpretation: What does the slope value mean in biological terms?
    • Your answer: __________________________
  3. R-squared: What percentage of variation in age is explained by nose blackness?
    • Your answer: __________________________
  4. Significance: Is the relationship statistically significant? How do you know?
    • Your answer: __________________________
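
The broom package loaded in Part 1 can return the same coefficients as a tidy data frame, which makes it easy to pull out the intercept and slope for question 1; a short sketch:

# Coefficient table as a data frame: term, estimate, std.error, statistic, p.value
tidy(lion_model)

# The fitted equation uses the two values in the estimate column:
# predicted age = intercept + slope * proportion_black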

Visualizing the Regression

# Create regression plot with confidence interval
lion_plot <- ggplot(l_df, aes(x = proportion_black, y = age_years)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "red", fill = "pink", alpha = 0.3) 

lion_plot
`geom_smooth()` using formula = 'y ~ x'

Confidence vs. Prediction Intervals
  • Confidence Interval: Range for the mean age of ALL lions with that nose blackness
  • Prediction Interval: Range for an INDIVIDUAL lion with that nose blackness
  • Prediction intervals are always wider than confidence intervals
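
A quick way to see the difference is to ask predict() for both interval types at the same nose blackness; a minimal sketch (the value 0.50 is just an illustrative choice, not from the lecture):

# Predicted age for a nose that is 50% black, with each interval type
new_lion <- tibble(proportion_black = 0.50)

predict(lion_model, newdata = new_lion, interval = "confidence")
predict(lion_model, newdata = new_lion, interval = "prediction")

# The prediction interval is wider because it adds the scatter of individual
# lions around the line to the uncertainty in the line itself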

Part 4: Testing Regression Assumptions

Diagnostic Plots

# Create diagnostic plots
par(mfrow = c(2, 2))
plot(lion_model)

par(mfrow = c(1, 1))

Interpreting Diagnostic Plots

Understanding Regression Diagnostic Plots
  1. Residuals vs Fitted:
    • Look for: Random scatter around horizontal line at 0
    • Problems: Patterns indicate non-linearity or heteroscedasticity
  2. Q-Q Plot:
    • Look for: Points following the diagonal line
    • Problems: Deviations indicate non-normal residuals
  3. Scale-Location:
    • Look for: Random scatter with horizontal trend line
    • Problems: Increasing spread indicates heteroscedasticity
  4. Residuals vs Leverage:
    • Look for: Points within Cook’s distance lines
    • Problems: Points outside indicate influential observations
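
The quantities behind plot 4 can also be pulled out directly, which makes it easier to flag specific observations; a brief sketch using base R extractor functions:

# Leverage (hat values) and Cook's distance for each lion
lion_diag <- tibble(
  observation = 1:nrow(l_df),
  leverage = hatvalues(lion_model),
  cooks_d = cooks.distance(lion_model)
)

# Sort by Cook's distance; a common rule of thumb worries about values near 1
head(arrange(lion_diag, desc(cooks_d)), 3)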

Formal Tests of Assumptions

# Test for normality of residuals
shapiro_residuals <- shapiro.test(residuals(lion_model))
shapiro_residuals

    Shapiro-Wilk normality test

data:  residuals(lion_model)
W = 0.93879, p-value = 0.0692
# Test for homoscedasticity (Breusch-Pagan test)
library(lmtest)
bp_test <- bptest(lion_model)
bp_test

    studentized Breusch-Pagan test

data:  lion_model
BP = 6.8946, df = 1, p-value = 0.008646
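
The car package loaded in Part 1 provides similar checks; a sketch of two of them (alternatives to, not replacements for, the tests above):

# Score test for non-constant error variance (same question as Breusch-Pagan)
ncvTest(lion_model)

# Bonferroni-adjusted test of the most extreme studentized residual
outlierTest(lion_model)
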
Activity 3: Assess Assumption Violations

Based on the diagnostic plots and tests:

  1. Linearity: Does the relationship appear linear? (Check Residuals vs Fitted plot)
    • Your answer: __________________________
  2. Normality: Are the residuals normally distributed? (Check Q-Q plot and Shapiro test)
    • Your answer: __________________________
  3. Homoscedasticity: Is the variance constant? (Check Scale-Location plot and BP test)
    • Your answer: __________________________
  4. Influential points: Are there any concerning influential observations?
    • Your answer: __________________________

Part 5: ANOVA for Regression

Understanding Variance Partitioning

# Get ANOVA table for regression
anova_table <- anova(lion_model)
anova_table
Analysis of Variance Table

Response: age_years
                 Df  Sum Sq Mean Sq F value    Pr(>F)    
proportion_black  1 138.544 138.544   49.75 7.677e-08 ***
Residuals        30  83.543   2.785                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Calculate sums of squares manually to understand partitioning
ss_total <- sum((l_df$age_years - mean(l_df$age_years))^2)
ss_residual <- sum(residuals(lion_model)^2)
ss_regression <- ss_total - ss_residual

print("Manual calculation of sums of squares:")
[1] "Manual calculation of sums of squares:"
print(paste("SS Total:", round(ss_total, 2)))
[1] "SS Total: 222.09"
print(paste("SS Regression:", round(ss_regression, 2)))
[1] "SS Regression: 138.54"
print(paste("SS Residual:", round(ss_residual, 2)))
[1] "SS Residual: 83.54"
print(paste("SS Regression + SS Residual:", round(ss_regression + ss_residual, 2)))
[1] "SS Regression + SS Residual: 222.09"

Visualizing Variance Components

# Create a plot showing variance components
# Get predicted values
l_df$predicted <- predict(lion_model)
mean_age <- mean(l_df$age_years)

# Select one point to illustrate
example_point <- 10

# Create the visualization
variance_plot <- ggplot(l_df, aes(x = proportion_black, y = age_years)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", size = 1) +
  geom_hline(yintercept = mean_age, linetype = "dashed", color = "darkgreen") +
  # Add lines for one example point
  geom_segment(aes(x = proportion_black[example_point], 
                   y = age_years[example_point],
                   xend = proportion_black[example_point], 
                   yend = predicted[example_point]),
               color = "red", size = 1) +
  geom_segment(aes(x = proportion_black[example_point], 
                   y = predicted[example_point],
                   xend = proportion_black[example_point], 
                   yend = mean_age),
               color = "darkgreen", size = 1) +
  # Add labels
  annotate("text", x = 0.15, y = mean_age + 0.5, 
           label = "Mean", color = "darkgreen") +
  annotate("text", x = l_df$proportion_black[example_point] + 0.05, 
           y = (l_df$age_years[example_point] + l_df$predicted[example_point])/2,
           label = "Residual", color = "red") +
  annotate("text", x = l_df$proportion_black[example_point] + 0.05, 
           y = (l_df$predicted[example_point] + mean_age)/2,
           label = "Regression", color = "darkgreen") +
  labs(title = "Variance Components in Regression",
       subtitle = "Total variation = Regression + Residual",
       x = "Proportion Black", y = "Age (years)") +
  theme_minimal()

variance_plot

Part 6: Comparing Multiple Datasets

Let’s practice regression with the prairie biodiversity data:

# Fit regression for prairie data
prairie_model <- lm(log_stability ~ species_number, data = p_df)

# Get summary
summary(prairie_model)

Call:
lm(formula = log_stability ~ species_number, data = p_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8146 -0.2165 -0.0094  0.2228  0.7780 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.222902   0.039094  31.281  < 2e-16 ***
species_number 0.028881   0.004694   6.153 5.94e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3271 on 159 degrees of freedom
Multiple R-squared:  0.1923,    Adjusted R-squared:  0.1872 
F-statistic: 37.86 on 1 and 159 DF,  p-value: 5.94e-09
# Create plot
prairie_plot <- ggplot(p_df, aes(x = species_number, y = log_stability)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "darkgreen", fill = "lightgreen")

prairie_plot
`geom_smooth()` using formula = 'y ~ x'

Activity 4: Compare the Two Regressions

Compare the lion and prairie regression models:

  1. Which model explains more variance? (Compare R² values)
    • Your answer: __________________________
  2. Which has a stronger relationship? (Compare standardized slopes or correlation)
    • Your answer: __________________________
  3. Which has more precise estimates? (Compare standard errors relative to estimates)
    • Your answer: __________________________
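
One convenient way to line the two models up side by side (useful for all three questions) is broom's glance(), which returns one row of fit statistics per model; a minimal sketch:

# One-row fit summaries for each model, stacked with an identifying column
bind_rows(
  lion = glance(lion_model),
  prairie = glance(prairie_model),
  .id = "model"
)

# For question 2: in simple regression the correlation coefficient is the
# standardized slope, so the two r values can be compared directly
cor(l_df$proportion_black, l_df$age_years)
cor(p_df$species_number, p_df$log_stability)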

Summary and Key Takeaways

What We Learned Today
  1. Correlation vs. Regression:
    • Correlation: Measures association between two variables
    • Regression: Predicts one variable from another
  2. Assumptions Matter:
    • Always check assumptions before interpreting results
    • Use appropriate alternatives when assumptions are violated
  3. Interpretation:
    • R² tells us proportion of variance explained
    • Slopes tell us rate of change
    • P-values tell us if relationships are statistically significant
  4. Practical Considerations:
    • Correlation ≠ Causation
    • Outliers can have major impacts
    • Sample size affects power to detect relationships
Common Mistakes to Avoid
  1. Using correlation when you mean regression (or vice versa)
  2. Ignoring assumption violations
  3. Extrapolating beyond the range of data (see the sketch after this list)
  4. Confusing confidence and prediction intervals
  5. Over-interpreting R² values
  6. Forgetting about biological significance vs. statistical significance
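
As a concrete look at mistake 3, the lion model will happily predict an age for a nose blacker than any in the sample; a cautionary sketch (0.95 is a hypothetical value, well above the observed maximum of 0.79):

# predict() does not know that the observed data stop at proportion_black = 0.79
predict(lion_model, newdata = tibble(proportion_black = 0.95),
        interval = "prediction")

# The model returns a number either way; whether it is biologically meaningful
# this far outside the data is something the regression cannot tell us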

Additional Resources
