Covered
Multiple Linear Regression model
What if more than one predictor (X) variable?
| Dependent variable | Continuous independent variable | Categorical independent variable |
|---|---|---|
| Continuous | Regression | ANOVA |
| Categorical | Logistic regression | |
Abundance of ants can be modeled as a function of more than one predictor (e.g., elevation and latitude)
Instead of a line, the relationship is modeled with a (hyper)plane
Used in a similar way to simple linear regression:
Crawley 2012: “Multiple regression models provide some of the most profound challenges faced by the analyst”:
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
yi: value of Y for the ith observation, when X1 = xi1, X2 = xi2,…, Xp = xip
β0: population intercept, the mean value of Y when X1= 0, X2 = 0,…, Xp = 0
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
β1: partial regression slope, change in Y per unit change in X1 holding other X-vars constant
β2: partial regression slope, change in Y per unit change in X2 holding other X-vars constant
βp: partial regression slope, change in Y per unit change in Xp holding other X-vars constant
What does “partial” mean? It is the relationship between a predictor variable and the response variable while holding all other predictor variables constant. It tells you the isolated effect of one variable, controlling for the influence of the others.
model: ant_spp ~ elevation + latitude
A 1-m increase in elevation: ant species richness decreases by 0.012 species, holding latitude constant
Equivalently, a 100-m increase in elevation loses about 1.2 species
A 1-degree increase in latitude: ant species richness decreases by about 2 species, holding elevation constant
Moving north = fewer ant species, even when comparing sites at the same elevation
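The summary output below can be reproduced with a call like the following (a minimal sketch, assuming the data are in the `ant_df` data frame used throughout):

```r
# Fit the multiple regression of ant species richness on elevation and latitude
ant_mod <- lm(ant_spp ~ elevation + latitude, data = ant_df)
summary(ant_mod)
```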
Call:
lm(formula = ant_spp ~ elevation + latitude, data = ant_df)
Residuals:
Min 1Q Median 3Q Max
-6.1180 -2.3759 0.3218 1.9070 5.8369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98.49651 26.50701 3.716 0.00147 **
elevation -0.01226 0.00411 -2.983 0.00765 **
latitude -2.00981 0.61956 -3.244 0.00427 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.022 on 19 degrees of freedom
Multiple R-squared: 0.5543, Adjusted R-squared: 0.5074
F-statistic: 11.82 on 2 and 19 DF, p-value: 0.000463
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
The regression equation can be used for prediction by substituting new values for the predictor (X) variables
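For example, a sketch using `predict()` on the model fitted above; the new elevation and latitude values here are hypothetical:

```r
# Predict ant species richness at new (hypothetical) sites
new_sites <- data.frame(elevation = c(100, 500), latitude = c(42, 44))
predict(ant_mod, newdata = new_sites)
```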
Variance: SS_Total is partitioned into SS_Regression and SS_Residual
SS_Regression is the variance in Y explained by the model
SS_Residual is the variance not explained by the model
| Source of variation | SS | df | MS | Interpretation |
|---|---|---|---|---|
| Regression | \(\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\) | \(p\) | \(\frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{p}\) | Difference between each predicted value and the mean |
| Residual | \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) | \(n-p-1\) | \(\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1}\) | Difference between each observation and its predicted value |
| Total | \(\sum_{i=1}^{n} (y_i - \bar{y})^2\) | \(n-1\) | | Difference between each observation and the mean |
| Source of variation | SS | df | MS |
|---|---|---|---|
| Regression | \(\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\) | \(p\) | \(\frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{p}\) |
| Residual | \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) | \(n-p-1\) | \(\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1}\) |
| Total | \(\sum_{i=1}^{n} (y_i - \bar{y})^2\) | \(n-1\) | |
\[F_{1,\,n-p-1} = \frac{MS_{\text{Extra}}}{\text{Full } MS_{\text{Residual}}}\]
Can also use a t-test (R provides this value)
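In R, the same single-term F-tests can be obtained with `drop1()` on the full model (a sketch, using the model fitted earlier):

```r
# Extra-SS F-test for each predictor when it is dropped from the full model
drop1(ant_mod, test = "F")
```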
Explained variance (r2) is calculated the same way as for simple regression:
\[r^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]
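A small sketch of computing r² from the sums of squares of the fitted model:

```r
# r^2 = 1 - SS_Residual / SS_Total
ss <- anova(ant_mod)[["Sum Sq"]]   # SS for each term, residual SS last
1 - ss[length(ss)] / sum(ss)
```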
Ideally have at least 10× as many observations as predictors (a 10:1 ratio) to avoid “overfitting”
Added Variable (AV) plots (also called partial regression plots)
These show the relationship between Y and X₁ after removing the linear effects of all other predictors
Create AV plots for all predictors - avPlots(model)
Or for just one predictor - avPlot(model, variable = "proportion_black")
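A sketch with the `car` package, applied here to the ant model rather than the `proportion_black` example:

```r
library(car)

avPlots(ant_mod)                          # AV plots for all predictors
avPlot(ant_mod, variable = "elevation")   # AV plot for one predictor
```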
A regression of Y vs. each X separately does not consider the effect of the other predictors:
we want to know the shape of the relationship while holding the other predictors constant
elevation latitude ant_spp
elevation 1.0000000 0.1787454 -0.5545244
latitude 0.1787454 1.0000000 -0.5879407
ant_spp -0.5545244 -0.5879407 1.0000000
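The correlation matrix above can be produced with `cor()` (a sketch):

```r
# Pairwise Pearson correlations among the predictors and the response
cor(ant_df[, c("elevation", "latitude", "ant_spp")])
```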
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i \quad \text{vs.} \quad y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \epsilon_i\]
Interaction terms lead to “Curvature” of the regression (hyper)plane
Adding interactions:
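A minimal sketch of adding an interaction in R; fitting an elevation × latitude interaction is illustrative, not part of the worked example:

```r
# '*' expands to both main effects plus their interaction term
int_mod <- lm(ant_spp ~ elevation * latitude, data = ant_df)
summary(int_mod)
```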
Multiple linear regression accommodates both continuous and categorical predictors (gender, vegetation type, etc.). Categorical variables are entered as “dummy” variables; the number of dummy variables = number of categories − 1
| Fertility | fert1 | fert2 |
|---|---|---|
| Low | 0 | 0 |
| Med | 1 | 0 |
| High | 0 | 1 |
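A sketch of how R constructs these dummy variables for a three-level factor (the variable name is illustrative):

```r
# A 3-level factor is coded as (3 - 1) = 2 dummy variables
fertility <- factor(c("Low", "Med", "High", "Low"), levels = c("Low", "Med", "High"))
model.matrix(~ fertility)   # intercept plus fertilityMed and fertilityHigh columns
```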
\[\text{Adjusted } r^2 = 1 - \frac{SS_{\text{Residual}}/(n - (p + 1))}{SS_{\text{Total}}/(n - 1)}\]

\[\text{Akaike Information Criterion (AIC)} = n[\ln(SS_{\text{Residual}})] + 2(p + 1) - n\ln(n)\]
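Both quantities can be extracted from a fitted model in R; note that R's `AIC()` includes additional constant terms, so its absolute values differ from the textbook formula above, although model rankings are unchanged (a sketch):

```r
summary(ant_mod)$adj.r.squared   # adjusted r^2
AIC(ant_mod)                     # Akaike Information Criterion
```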
Can fit all possible models
Automated forward (and backward) stepwise procedures: start with no terms (all terms) and add (remove) the term with the largest (smallest) contribution to model fit at each step
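A sketch of automated stepwise selection by AIC using `step()`:

```r
# Start from the full model and consider adding/removing terms in both directions
step(ant_mod, direction = "both")
```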
========== FULL MODEL (Both Variables) ==========
Call:
lm(formula = ant_spp ~ elevation + latitude, data = ant_df)
Residuals:
Min 1Q Median 3Q Max
-6.1180 -2.3759 0.3218 1.9070 5.8369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98.49651 26.50701 3.716 0.00147 **
elevation -0.01226 0.00411 -2.983 0.00765 **
latitude -2.00981 0.61956 -3.244 0.00427 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.022 on 19 degrees of freedom
Multiple R-squared: 0.5543, Adjusted R-squared: 0.5074
F-statistic: 11.82 on 2 and 19 DF, p-value: 0.000463
========== ELEVATION ONLY MODEL ==========
Call:
lm(formula = ant_spp ~ elevation, data = ant_df)
Residuals:
Min 1Q Median 3Q Max
-4.9010 -2.6947 -0.9502 2.9657 7.6363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.589088 1.385615 9.086 1.55e-08 ***
elevation -0.014641 0.004913 -2.980 0.0074 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.671 on 20 degrees of freedom
Multiple R-squared: 0.3075, Adjusted R-squared: 0.2729
F-statistic: 8.881 on 1 and 20 DF, p-value: 0.0074
========== LATITUDE ONLY MODEL ==========
Call:
lm(formula = ant_spp ~ latitude, data = ant_df)
Residuals:
Min 1Q Median 3Q Max
-6.2223 -2.1188 0.0599 2.1267 6.4990
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.8532 30.9803 3.546 0.00203 **
latitude -2.3401 0.7199 -3.251 0.00401 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.569 on 20 degrees of freedom
Multiple R-squared: 0.3457, Adjusted R-squared: 0.313
F-statistic: 10.57 on 1 and 20 DF, p-value: 0.004006
========== MODEL COMPARISON TABLE ==========
Model Adjusted_R2 AIC R2_vs_Full AIC_vs_Full
1 Both Variables 0.5074180 115.8646 0.0000000 0.000000
2 Elevation Only 0.2728722 123.5608 -0.2345459 7.696164
3 Latitude Only 0.3129580 122.3132 -0.1944600 6.448613
========== KEY FINDINGS ==========
Full Model AIC: 115.86 (Adjusted R²: 0.5074)
Elevation Only AIC: 123.56 (Adjusted R²: 0.2729)
Latitude Only AIC: 122.31 (Adjusted R²: 0.3130)
--- Differences from Full Model ---
Removing latitude: AIC increases by 7.70, Adj R² decreases by 0.2345
Removing elevation: AIC increases by 6.45, Adj R² decreases by 0.1945
========== CONCLUSION ==========
✓ Full model has LOWEST AIC (best fit)
✓ Full model has HIGHEST Adjusted R² (explains most variance)
✗ Removing either variable worsens model performance
Both elevation and latitude are important predictors of ant species diversity!
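A sketch of reproducing the model comparison above (AIC and adjusted R² for the three candidate models):

```r
# ant_mod is the full model fitted earlier; fit the two single-predictor models
elev_mod <- lm(ant_spp ~ elevation, data = ant_df)
lat_mod  <- lm(ant_spp ~ latitude,  data = ant_df)

AIC(ant_mod, elev_mod, lat_mod)   # AIC for each candidate model
sapply(list(both = ant_mod, elevation = elev_mod, latitude = lat_mod),
       function(m) summary(m)$adj.r.squared)   # adjusted r^2 for each
```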
Usually we want to know the relative importance of predictors in explaining Y
Using F-tests (or t-tests) on partial regression slopes:
Using coefficient of partial determination:
\[r_{X_j}^2 = \frac{SS_{\text{Extra}}}{\text{Reduced }SS_{\text{Residual}}}\]
SS_Extra: the reduction in SS_Residual (equivalently, the increase in SS_Regression) when the predictor is added to the reduced model
Standardize all variables (mean = 0, sd = 1)
Scales are then identical, and a larger (absolute) partial regression slope (PRS) indicates a more important variable
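A sketch of refitting the model on standardized variables so the partial regression slopes are directly comparable:

```r
# scale() centers each variable to mean 0 and rescales to sd 1
std_mod <- lm(scale(ant_spp) ~ scale(elevation) + scale(latitude), data = ant_df)
coef(std_mod)   # standardized partial regression slopes
```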
pEta_sqr (Partial Eta Squared):
Elevation: 0.3189 (31.9%) - elevation explains 31.9% of the variance remaining after controlling for latitude
Latitude: 0.3564 (35.6%) - latitude explains 35.6% of the variance remaining after controlling for elevation
Both are important predictors! Latitude is slightly stronger
dR_sqr (change in R² when the predictor is dropped from the full model):
Elevation: 0.2087 (20.9%)
Latitude: 0.2468 (24.7%)
If you removed latitude from the model, R² would drop by 0.247 (24.7 percentage points)
== Method 2: Type II ANOVA (car package) ==
Anova Table (Type II tests)
Response: ant_spp
Sum Sq Df F value Pr(>F)
elevation 81.224 1 8.8955 0.007652 **
latitude 96.085 1 10.5231 0.004272 **
Residuals 173.487 19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
== Coefficients Table ==
Coefficients SSR df pEta_sqr dR_sqr
elevation 81.22 1 0.3189 0.2087
latitude 96.09 1 0.3564 0.2468
Residuals 173.49 19 0.5000 0.4457
== Summary Statistics ==
Sum of squared errors (SSE): 173.5
Sum of squared total (SST): 389.3
Model R²: 0.5543
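A sketch of reproducing the Type II sums of squares and the pEta_sqr and dR_sqr columns above:

```r
library(car)

a2 <- Anova(ant_mod, type = "II")             # Type II ANOVA for the full model
ss_effect <- a2[["Sum Sq"]][1:2]              # SS for elevation and latitude
ss_resid  <- a2[["Sum Sq"]][3]                # residual SS
ss_total  <- sum(anova(ant_mod)[["Sum Sq"]])  # total SS

data.frame(term     = rownames(a2)[1:2],
           pEta_sqr = round(ss_effect / (ss_effect + ss_resid), 4),
           dR_sqr   = round(ss_effect / ss_total, 4))
```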