Topics Covered
This lecture covers two fundamental statistical techniques in biology: correlation and regression analysis. Based on Chapters 16-17 from Whitlock & Schluter’s The Analysis of Biological Data (3rd edition), we’ll explore:
- Correlation Analysis: measuring the strength and direction of association between two numerical variables (Chapter 16)
- Regression Analysis: modeling the relationship between a response variable and a predictor variable (Chapter 17)
Correlation analysis measures the strength and direction of a relationship between two numerical variables:
The Pearson correlation coefficient (r) is defined as:
\[r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2 \sum_{i}(Y_i - \bar{Y})^2}}\]
Equivalently, in terms of the covariance and standard deviations:
\[r = \frac{\text{Covariance}(X, Y)}{s_X \cdot s_Y}\]
Where \(s_X\) and \(s_Y\) are the standard deviations of X and Y.
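A minimal sketch showing the two forms agree in R (the vectors `x` and `y` here are made-up illustrative data):

```r
# Two equivalent ways to compute Pearson's r
x <- c(1.2, 2.5, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.7, 5.1, 4.6)

cov(x, y) / (sd(x) * sd(y))  # covariance divided by the product of SDs
cor(x, y)                    # built-in Pearson correlation; same value
```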
Nazca boobies (Sula granti) - Do aggressive behaviors as a chick predict future aggressive behavior as an adult?
For a Pearson correlation coefficient (r) of 0.53372:
[1] 0.5337225
Pearson's product-moment correlation
data: booby_data$visits_as_nestling and booby_data$future_aggression
t = 2.9603, df = 22, p-value = 0.007229
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1660840 0.7710999
sample estimates:
cor
0.5337225
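The output above can be reproduced with the calls below (a sketch assuming a data frame `booby_data` with the columns `visits_as_nestling` and `future_aggression`, as named in the output):

```r
# Pearson correlation between nestling visits and adult aggression
cor(booby_data$visits_as_nestling, booby_data$future_aggression)

# Full test: t statistic, df, p-value, and 95% CI for the correlation
cor.test(booby_data$visits_as_nestling, booby_data$future_aggression)
```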
Interpretation: The correlation coefficient of r = 0.534 suggests that Nazca boobies who experienced more visits from non-parent adults as nestlings tend to display more aggressive behavior as adults. This supports the hypothesis that early experiences influence adult behavior patterns in this species.
Standard Error: the uncertainty of r is estimated as

\[SE_r = \sqrt{\frac{1 - r^2}{n - 2}}\]
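Plugging in the booby values (r = 0.534, n = 24) recovers the t statistic reported by `cor.test()` above:

\[SE_r = \sqrt{\frac{1 - 0.534^2}{24 - 2}} = \sqrt{\frac{0.7151}{22}} \approx 0.180\]

\[t = \frac{r}{SE_r} = \frac{0.534}{0.180} \approx 2.96, \quad df = 22\]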
We also need to be sure the relationship is not curved; see the assumptions below.
As described in Section 16.3, correlation analysis has key assumptions:
- The individuals are a random sample from the population
- The two variables follow a bivariate normal distribution (the relationship is linear, the cloud of points is elliptical, and each variable is normally distributed)
Let’s check these assumptions using the lion data from Example 17.1 Lion Noses:
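The tests below can be run as follows (assuming a data frame `lion_data` with the columns named in the output):

```r
# Test each variable for departures from normality
shapiro.test(lion_data$proportion_black)
shapiro.test(lion_data$age_years)
```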
Shapiro-Wilk normality test
data: lion_data$proportion_black
W = 0.88895, p-value = 0.003279
Shapiro-Wilk normality test
data: lion_data$age_years
W = 0.87615, p-value = 0.001615
Both Shapiro-Wilk p-values are below 0.05, so normality is rejected for each variable. Options include:
- Transform one or both variables (log, square root, etc.)
- Use a non-parametric correlation: Spearman's rank correlation or Kendall's tau (τ)
- Examine the data for outliers or influential points
To understand the amount of variation explained, you can square Spearman's rho. For the lion data value of ρ = 0.74486:

\[\rho^2 = 0.74486^2 = 0.5548\]
This means approximately 55.48% of the variance in ranks of one variable can be explained by the ranks of the other variable. This is similar to how R² works in linear regression, but specifically for ranked data.
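A sketch of the call that produces the output below, using the same `lion_data` columns:

```r
# Spearman rank correlation is robust to non-normality in the raw variables
cor.test(lion_data$proportion_black, lion_data$age_years,
         method = "spearman")
```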
Spearman's rank correlation rho
data: lion_data$proportion_black and lion_data$age_years
S = 1392.1, p-value = 1.013e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7448561
Additional considerations:
- The correlation coefficient depends on the range: restricting the range of X or Y shrinks \(|r|\) (see the simulation sketch after this list)
- Measurement error affects correlation: noise in either variable attenuates \(r\) toward zero
- Correlation vs. causation: a strong correlation does not by itself establish that one variable causes the other
- Correlation significance test: the test of \(H_0: \rho = 0\) uses \(t = r/SE_r\) with \(df = n - 2\), as shown above
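A minimal simulation sketch of the range effect: correlating over the full range of X versus only a narrow slice (all numbers here are illustrative):

```r
set.seed(1)
x <- runif(200, 0, 10)               # predictor over a wide range
y <- 2 + 0.8 * x + rnorm(200, 0, 2)  # linear relationship plus noise

cor(x, y)               # full range: relatively strong r
keep <- x > 4 & x < 6   # restrict X to a narrow slice
cor(x[keep], y[keep])   # restricted range: |r| shrinks
```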
Simple linear regression models the relationship between a response variable (Y) and a predictor variable (X).
The population regression model is:
\[Y = \alpha + \beta X + \varepsilon\]
Where:
- Y is the response variable
- X is the predictor variable
- α (alpha) is the intercept (the value of Y when X = 0)
- β (beta) is the slope (the change in Y per unit change in X)
- ε (epsilon) is the error term (random deviation from the line)
The sample regression equation is:
\[\hat{Y} = a + bX\]
Where:
- \(\hat{Y}\) is the predicted value of Y
- a is the estimate of α (intercept)
- b is the estimate of β (slope)
Method of Least Squares: The line is chosen to minimize the sum of squared vertical distances (residuals) between observed and predicted Y values.
From Example 17.1 in the textbook, the regression line for the lion data is:

\[\widehat{\text{age}} = 0.88 + 10.65 \times \text{proportion black}\]
This means:
- When a lion has no black on its nose (proportion = 0), its predicted age is 0.88 years
- For each 0.1 increase in the proportion of black, age increases by about 1.065 years
- The slope (10.65) indicates that lions with more black on their noses tend to be older
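The model behind the summary below can be fit as follows (using the `lion_data` frame from earlier):

```r
# Fit the least-squares regression of age on nose blackness
lion_model <- lm(age_years ~ proportion_black, data = lion_data)
summary(lion_model)
```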
Call:
lm(formula = age_years ~ proportion_black, data = lion_data)
Residuals:
Min 1Q Median 3Q Max
-2.5449 -1.1117 -0.5285 0.9635 4.3421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8790 0.5688 1.545 0.133
proportion_black 10.6471 1.5095 7.053 7.68e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared: 0.6238, Adjusted R-squared: 0.6113
F-statistic: 49.75 on 1 and 30 DF, p-value: 7.677e-08
The calculation for slope (b) is:
\[b = \frac{\sum_i(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i(X_i - \bar{X})^2}\]
Given:
- \(\bar{X} = 0.3222\)
- \(\bar{Y} = 4.3094\)
- \(\sum_i(X_i - \bar{X})^2 = 1.2221\)
- \(\sum_i(X_i - \bar{X})(Y_i - \bar{Y}) = 13.0123\)

\[b = \frac{13.0123}{1.2221} = 10.647\]
Intercept (a): \(a = \bar{Y} - b\bar{X} = 4.3094 - 10.647(0.3222) = 0.879\)
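The same quantities can be computed directly in R (a sketch using the `lion_data` columns):

```r
x <- lion_data$proportion_black
y <- lion_data$age_years

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept
c(intercept = a, slope = b)  # matches coef(lion_model)
```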
Making predictions:
To predict the age of a lion with 0.50 proportion of black on its nose:
\[\hat{Y} = 0.88 + 10.65(0.50) = 6.2 \text{ years}\]
Confidence intervals vs. prediction intervals: a confidence interval gives the plausible range for the mean of Y at a given X, while a prediction interval gives the plausible range for a single new observation of Y at that X, so prediction intervals are always wider.
Both intervals are narrowest near \(\bar{X}\) and widen as X moves away from the mean.
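Both intervals can be obtained from `predict()` (a sketch using the `lion_model` fit above):

```r
new_lion <- data.frame(proportion_black = 0.50)

# Interval for the mean age of lions with 0.50 black on the nose
predict(lion_model, new_lion, interval = "confidence")

# Wider interval for the age of a single new lion with 0.50 black
predict(lion_model, new_lion, interval = "prediction")
```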
Next example: plant biodiversity and ecosystem stability in prairie plots (the biodiversity-stability hypothesis; see the interpretation below). The first rows of the data:

# A tibble: 6 × 2
species_number log_stability
<dbl> <dbl>
1 1 0.763
2 1 1.45
3 1 1.51
4 1 0.747
5 1 0.983
6 1 1.12
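The model summarized below can be fit as follows (assuming the `prairie_data` frame previewed above):

```r
# Regress log-transformed stability on plant species number
prairie_model <- lm(log_stability ~ species_number, data = prairie_data)
summary(prairie_model)
anova(prairie_model)  # ANOVA table shown further below
```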
Call:
lm(formula = log_stability ~ species_number, data = prairie_data)
Residuals:
Min 1Q Median 3Q Max
-0.82774 -0.25344 -0.00426 0.27498 0.75240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.252629 0.041023 30.535 < 2e-16 ***
species_number 0.025984 0.004926 5.275 4.28e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3433 on 159 degrees of freedom
Multiple R-squared: 0.149, Adjusted R-squared: 0.1436
F-statistic: 27.83 on 1 and 159 DF, p-value: 4.276e-07
[1] "rsquared is: 0.148953385305455"
Analysis of Variance Table
Response: log_stability
Df Sum Sq Mean Sq F value Pr(>F)
species_number 1 3.2792 3.2792 27.829 4.276e-07 ***
Residuals 159 18.7358 0.1178
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hypothesis test asks whether the slope equals zero: \(H_0: \beta = 0\) versus \(H_A: \beta \neq 0\).
With df = n - 2 = 161 - 2 = 159
Interpretation:
The slope estimate is 0.026, indicating that log stability increases by about 0.026 units for each additional plant species in the plot.
The p-value is very small (4.28e-07), providing strong evidence to reject the null hypothesis that species number has no effect on ecosystem stability.
R² = 0.149, meaning that approximately 14.9% of the variation in log stability is explained by the number of plant species.
This supports the biodiversity-stability hypothesis: more diverse plant communities have more stable biomass production over time.
Linear regression has four key assumptions. The error \(\varepsilon\), estimated by the residuals \(e_i = y_i - \hat{y}_i\), is assumed to:
- be normally distributed at each \(x_i\)
- have the same variance at each \(x_i\)
- have a mean of 0 at each \(x_i\) (i.e., the relationship between X and Y is linear)

In addition, the observations must be a random sample. Let's check these assumptions for the lion regression model:
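The normality check below can be run as follows (using the `lion_model` fit earlier; `plot()` on a fitted model gives the standard graphical diagnostics):

```r
# Shapiro-Wilk test of the residuals from the fitted lion model
shapiro.test(residuals(lion_model))

# Graphical diagnostics: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
plot(lion_model)
```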
Shapiro-Wilk normality test
data: residuals(lion_model)
W = 0.93879, p-value = 0.0692
With p = 0.069 > 0.05, there is no strong evidence that the residuals depart from normality, so this assumption is reasonably met.
If assumptions are violated:
1. Transform the data (Section 17.6)
2. Use weighted least squares for heteroscedasticity
3. Consider non-linear models (Section 17.8)
Estimates of the standard errors and confidence intervals for the slope and intercept are used to determine confidence bands. Under repeated sampling, the 95% confidence band will contain the true population regression line 95 times out of 100. In practice this is done in R.
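A sketch of how to obtain these in R for the lion model:

```r
# 95% confidence intervals for the intercept and slope
confint(lion_model, level = 0.95)

# Confidence band: fitted values with 95% CI across the observed X range
band <- predict(lion_model, interval = "confidence")
head(band)  # columns: fit, lwr, upr
```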
In addition to estimating the population parameters (\(\beta_0\), \(\beta_1\)), we want to test hypotheses about them.
Total variation in Y is "partitioned" into three components:
- \(SS_{total}\): the sum of squared deviations of each observation (\(y_i\)) from the mean (\(\bar{y}\)), with \(df = n - 1\)
- \(SS_{regression}\): the sum of squared deviations of the predicted values (\(\hat{y}_i\)) from the mean (\(\bar{y}\)), with \(df = 1\)
- \(SS_{residual}\): the sum of squared deviations of each observation (\(y_i\)) from its predicted value (\(\hat{y}_i\)), with \(df = n - 2\)
The sums of squares and degrees of freedom add up:
\(SS_{regression} +SS_{residual} = SS_{total}\)
\(df_{regression}+df_{residual} = df_{total}\)
The sums of squares are converted to mean squares by dividing each by its degrees of freedom (\(MS = SS/df\)); the test statistic is \(F = MS_{regression}/MS_{residual}\).
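Plugging in the values from the prairie ANOVA table above confirms the reported F statistic:

\[MS_{regression} = \frac{3.2792}{1} = 3.2792, \qquad MS_{residual} = \frac{18.7358}{159} = 0.1178\]

\[F = \frac{3.2792}{0.1178} \approx 27.8\]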