Analysis of variance: single-factor designs (multi-factor designs to come)
What if the response is continuous and the predictor(s) are categorical?
| Dependent variable | Independent variable: Continuous | Independent variable: Categorical |
|---|---|---|
| Continuous | Regression | ANOVA |
| Categorical | Logistic regression |  |
ANOVA as Regression
With one categorical variable, ANOVA is equivalent to regression with dummy variables.
In fact, when we run ANOVAs we will use the same code as for regression!
regression: model <- lm(response ~ predictor, data = df)
anova: model <- lm(response ~ factor, data = df)
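Under the hood, R converts the factor into dummy (indicator) variables, which is why the same lm() call works for both. A minimal way to see this, assuming a data frame like the circ_data used later in these notes:

# Sketch: the dummy variables R builds for a categorical predictor
# (assumes circ_data has a factor column named treatment)
head(model.matrix(~ treatment, data = circ_data))
# The first column is the intercept; the remaining 0/1 columns indicate
# the non-reference levels of treatment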
Analysis of variance - the most powerful approach known for simultaneously testing if the means of k groups are equal - works by assessing whether individuals chosen from different groups are, on average, more different than individuals chosen from the same group.
The null hypothesis of ANOVA is that the population means μᵢ are the same for all treatments.
Rejecting H₀ in ANOVA
Even if all groups had the same true mean, the data would likely show different sample means for each group due to sampling error.
The key insight of ANOVA is that we can estimate how much variation among group means ought to be present from sampling error alone if the null hypothesis is true.
ANOVA lets us determine whether there is more variance among the sample means than we would expect by chance alone. If so, then we can infer that there are real differences among the population means.
Two key measures of variation are calculated and compared: the mean square among groups (MS_groups), which reflects variation among the group means, and the mean square within groups (MS_error), which reflects variation among individuals within groups.
The comparison is done with an F-ratio:
\[F = \frac{MS_{groups}}{MS_{error}}\]
The total variation in Y can be expressed as a sum of squares:
\(SS_{total} = \sum_{i=1}^{a}\sum_{j=1}^{n}(Y_{ij} - \bar{Y})^2\)
This can be partitioned into two components:
Among Groups (Treatment): \(SS_{among} = \sum_{i=1}^{a}\sum_{j=1}^{n}(\bar{Y}_i - \bar{Y})^2 = n\sum_{i=1}^{a}(\bar{Y}_i - \bar{Y})^2\)
Within Groups (Error): \(SS_{within} = \sum_{i=1}^{a}\sum_{j=1}^{n}(Y_{ij} - \bar{Y}_i)^2\)
These components are additive: \(SS_{total} = SS_{among} + SS_{within}\)
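As a concrete check, this partition can be computed by hand and compared with what lm() and anova() report. A minimal sketch, assuming circ_data holds the circadian-rhythm data with columns phase_shift and treatment (circ_model is the model name used later in these notes):

# Sums of squares computed by hand (also works with unequal group sizes)
grand_mean  <- mean(circ_data$phase_shift)
group_means <- ave(circ_data$phase_shift, circ_data$treatment)  # each row's group mean

ss_total  <- sum((circ_data$phase_shift - grand_mean)^2)
ss_among  <- sum((group_means - grand_mean)^2)
ss_within <- sum((circ_data$phase_shift - group_means)^2)
c(among = ss_among, within = ss_within, total = ss_total)  # among + within = total

# The same partition via the linear model (this is the output shown below)
circ_model <- lm(phase_shift ~ treatment, data = circ_data)
summary(circ_model)
anova(circ_model)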
Call:
lm(formula = phase_shift ~ treatment, data = circ_data)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.27857 -0.36125  0.03857  0.61147  1.06571 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.30875    0.24888  -1.241  0.22988   
treatmentEyes  -1.24268    0.36433  -3.411  0.00293 **
treatmentKnees -0.02696    0.36433  -0.074  0.94178   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7039 on 19 degrees of freedom
Multiple R-squared:  0.4342,    Adjusted R-squared:  0.3746 
F-statistic: 7.289 on 2 and 19 DF,  p-value: 0.004472

Analysis of Variance Table
Response: phase_shift
          Df Sum Sq Mean Sq F value   Pr(>F)   
treatment  2 7.2245  3.6122  7.2894 0.004472 **
Residuals 19 9.4153  0.4955                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Key Connection to Regression
This is the same partitioning we saw in regression analysis: \(SS_{total} = SS_{regression} + SS_{residual}\), where \(SS_{regression}\) plays the role of \(SS_{among}\) and \(SS_{residual}\) plays the role of \(SS_{within}\).
Both partitions measure how much variation is explained by our model vs. left unexplained (error).
The ANOVA table organizes all computations leading to a test of the null hypothesis of no differences among population means.
Example: for a one-way ANOVA with 3 groups (a = 3) and 4 replicates per group (n = 4), df_among = a − 1 = 2, df_error = a(n − 1) = 9, and df_total = an − 1 = 11. For the circadian rhythm data shown below (3 groups with 8, 7, and 7 observations), the corresponding df are 2 and 19.
Call:
lm(formula = phase_shift ~ treatment, data = circ_data)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.27857 -0.36125  0.03857  0.61147  1.06571 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.30875    0.24888  -1.241  0.22988   
treatmentEyes  -1.24268    0.36433  -3.411  0.00293 **
treatmentKnees -0.02696    0.36433  -0.074  0.94178   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7039 on 19 degrees of freedom
Multiple R-squared:  0.4342,    Adjusted R-squared:  0.3746 
F-statistic: 7.289 on 2 and 19 DF,  p-value: 0.004472

Analysis of Variance Table
Response: phase_shift
          Df Sum Sq Mean Sq F value   Pr(>F)   
treatment  2 7.2245  3.6122  7.2894 0.004472 **
Residuals 19 9.4153  0.4955                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# A tibble: 3 × 4
  treatment   Mean    SD     N
  <fct>      <dbl> <dbl> <int>
1 Control   -0.309 0.618     8
2 Eyes      -1.55  0.706     7
3 Knees     -0.336 0.791     7

Comparing ANOVA and Regression Tables
An ANOVA table compared to a regression table:
| Source | df | SS | MS | F | p |
|---|---|---|---|---|---|
| Treatment | a − 1 | SS_treatment | MS_treatment | F | p |
| Error | a(n − 1) | SS_error | MS_error |  |  |
| Total | an − 1 | SS_total |  |  |  |
Is equivalent to an ANOVA table from a regression model:
| Source | df | SS | MS | F | p |
|---|---|---|---|---|---|
| Regression | k | SS_regression | MS_regression | F | p |
| Error | n − k − 1 | SS_residual | MS_residual |  |  |
| Total | n − 1 | SS_total |  |  |  |
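The critical value in the next output can be obtained with qf(). A minimal sketch, assuming circ_model is the fitted ANOVA model from above:

# Observed F versus the critical F at alpha = 0.05 (assumes circ_model)
library(tibble)
f_obs  <- anova(circ_model)$`F value`[1]   # 7.29 for the circadian data
f_crit <- qf(0.95, df1 = 2, df2 = 19)      # upper 5% point of the F(2, 19) distribution
tibble(Metric = c("F-observed", "F-critical (α = 0.05)"),
       Value  = c(f_obs, f_crit))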
# A tibble: 2 × 2
  Metric                Value
  <chr>                 <dbl>
1 F-observed             7.29
2 F-critical (α = 0.05)  3.52

The t-statistic for comparing two independent groups is: \(t = \frac{\bar{X}_1 - \bar{X}_2}{SE_{\text{difference}}}\)
It measures how many standard errors apart the two means are, and it can be positive or negative.
F-statistic in ANOVA is: \(F = \frac{\text{Mean Square Between Groups}}{\text{Mean Square Within Groups}} = \frac{MS_B}{MS_W}\)
ratio of variance between groups to variance within groups
Unlike \(t\), \(F\) is always non-negative (it’s a ratio of squared quantities).
the numerators capture the difference between groups (signal)
the denominators capture variability within groups (noise)
\(MS_B\) is directly related to \((\bar{X}_1 - \bar{X}_2)^2\)
\(MS_W\) is related to the pooled variance
t-test: \(df = n_1 + n_2 - 2\)
F-test (two groups): \(df_1 = 1\) (numerator), \(df_2 = n_1 + n_2 - 2\) (denominator)
An F-distribution with \(df_1 = 1\) and \(df_2\) is the square of a t-distribution with \(df_2\): \(F_{1, df_2} = t_{df_2}^2\)
This shows that ANOVA is a generalization of the t-test.
When \(a = 2\), you get the same result either way,
but ANOVA extends naturally to comparing three or more groups, which you can’t do directly with a t-test (without running into multiple comparison problems).
Let’s verify this relationship with a simple example in R:
Using circ_data (Control vs Knees groups):
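The exact code is not shown above; a sketch that would produce output like the lines below, assuming circ_data with columns phase_shift and treatment:

# Keep only the Control and Knees groups and drop the unused factor level
two_groups <- droplevels(subset(circ_data, treatment %in% c("Control", "Knees")))

# Two-sample t-test (equal variances assumed) and one-way ANOVA on the same data
t_res <- t.test(phase_shift ~ treatment, data = two_groups, var.equal = TRUE)
f_res <- anova(lm(phase_shift ~ treatment, data = two_groups))

t_val <- unname(t_res$statistic)
cat(" t-statistic:", round(t_val, 4), "\n",
    "t^2:", round(t_val^2, 4), "\n",
    "F-statistic:", round(f_res$`F value`[1], 4), "\n",
    "Difference (should be ~0):", round(t_val^2 - f_res$`F value`[1], 4), "\n")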
 t-statistic: 0.0741 
 t^2: 0.0055 
 F-statistic: 0.0055 
 Difference (should be ~0): 0 

The F-ratio is calculated as: \(F = \frac{MS_{among}}{MS_{error}}\)
- Under the null hypothesis (all means equal), the F-ratio should be approximately 1
- Larger F-ratios suggest that the among-group variance exceeds what chance alone would produce
- The F-ratio follows an F-distribution with \(a - 1\) and \(a(n - 1)\) df
- With the circadian rhythm data: \(F_{2,19} = 7.29\), p = 0.004, so we reject the null hypothesis
Call:
lm(formula = phase_shift ~ treatment, data = circ_data)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.27857 -0.36125  0.03857  0.61147  1.06571 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.30875    0.24888  -1.241  0.22988   
treatmentEyes  -1.24268    0.36433  -3.411  0.00293 **
treatmentKnees -0.02696    0.36433  -0.074  0.94178   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7039 on 19 degrees of freedom
Multiple R-squared:  0.4342,    Adjusted R-squared:  0.3746 
F-statistic: 7.289 on 2 and 19 DF,  p-value: 0.004472

Anova Table (Type II tests)
Response: phase_shift
          Sum Sq Df F value   Pr(>F)   
treatment 7.2245  2  7.2894 0.004472 **
Residuals 9.4153 19                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R² summarizes the contribution of group differences to total variation:
\[R^2 = \frac{SS_{among}}{SS_{total}}\]
This is interpreted as the “fraction of the variation in Y that is explained by groups.”
For the circadian rhythm data: \[R^2 = \frac{7.224}{16.639} = 0.43\]
43% of the total variation in phase shift is explained by differences in light treatment, with the remaining 57% being unexplained variation.
This is exactly the same calculation as R² in regression: \[R^2 = \frac{SS_{regression}}{SS_{total}}\]
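The same value can be pulled directly from the fitted model; a minimal sketch, assuming circ_model from earlier:

# R^2 from the sums of squares and from the model summary (assumes circ_model)
ss <- anova(circ_model)$`Sum Sq`
ss[1] / sum(ss)                 # SS_among / SS_total, about 0.43
summary(circ_model)$r.squared   # the same value reported by lm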
If the ANOVA assumptions are violated, options include:
- Transform Y (e.g., log, square root)
- Use robust or non-parametric alternatives
- Use generalized linear models (GLMs)
Levene’s test of homogeneity of variance
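The original call is not shown; the output below is consistent with leveneTest() from the car package (an assumption), e.g.:

# Levene's test across treatment groups (center = median is the default)
library(car)
leveneTest(phase_shift ~ treatment, data = circ_data)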
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  0.1586 0.8545
      19               

Shapiro-Wilk Normality Test
The null hypothesis is that the residuals are normally distributed, so you want a non-significant result here.
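The output below matches a call like this (circ_model is the fitted ANOVA model named in the output):

# Shapiro-Wilk test on the residuals of the fitted model
shapiro.test(residuals(circ_model))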
    Shapiro-Wilk normality test
data:  residuals(circ_model)
W = 0.95893, p-value = 0.468

Shared Assumptions with Regression
ANOVA and regression share virtually identical assumptions because they are both linear models:
| Assumption | ANOVA | Regression | 
|---|---|---|
| Linearity | Each group has its own mean; effects are additive (no interaction in one-way ANOVA) | Relationship between X and Y is linear | 
| Normality | Residuals are normal | Residuals are normal | 
| Equal variance | Variance is the same across all groups | Variance is the same across all X values | 
| Independence | Observations are independent | Observations are independent | 
When ANOVA rejects H₀, we need to determine which groups differ
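A sketch of the emmeans calls behind the next two outputs; these mirror the code shown later in this section and assume circ_model is the fitted ANOVA model:

library(emmeans)

# All pairwise comparisons with the Tukey adjustment
emmeans(circ_model, "treatment") |>
  pairs(adjust = "tukey")

# Compact letter display: groups sharing a letter are not shown to differ
emmeans(circ_model, "treatment") |>
  multcomp::cld(Letters = letters, adjust = "tukey")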
 contrast        estimate    SE df t.ratio p.value
 Control - Eyes     1.243 0.364 19   3.411  0.0079
 Control - Knees    0.027 0.364 19   0.074  0.9970
 Eyes - Knees      -1.216 0.376 19  -3.231  0.0117
P value adjustment: tukey method for comparing a family of 3 estimates 

 treatment emmean    SE df lower.CL upper.CL .group
 Eyes      -1.551 0.266 19    -2.25   -0.855  a    
 Knees     -0.336 0.266 19    -1.03    0.361   b   
 Control   -0.309 0.249 19    -0.96    0.343   b   
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 3 estimates 
P value adjustment: tukey method for comparing a family of 3 estimates 
significance level used: alpha = 0.05 
NOTE: If two or more means share the same grouping symbol,
      then we cannot show them to be different.
      But we also did not show them to be the same.

When ANOVA rejects H₀, we need to determine which groups differ
[1] "Control" "Eyes"    "Knees"   contrast         estimate    SE df t.ratio p.value
 control vs eyes     1.243 0.364 19   3.411  0.0029
 control vs knees    0.027 0.364 19   0.074  0.9418The Main Post-Hoc Tests
The key tradeoff is between power (ability to detect real differences) and Type I error control (avoiding false positives). More conservative tests = better error control but less power.
# Packages used below: emmeans for estimated marginal means and contrasts,
# dplyr/tidyr/tibble for assembling the comparison tables
library(emmeans)
library(dplyr)
library(tidyr)
library(tibble)

# ============================================
# TUKEY-KRAMER (HSD)
# ============================================
# All pairwise comparisons
tukey_result <- emmeans(circ_model, "treatment") |> 
  pairs(adjust = "tukey")
# tukey_result
# # With compact letter display
# tukey_cld <- emmeans(circ_model, "treatment") |> 
#   multcomp::cld(Letters = letters, adjust = "tukey")
# # tukey_cld
# ============================================
# BONFERRONI
# ============================================
# All pairwise comparisons with Bonferroni adjustment
bonferroni_result <- emmeans(circ_model, "treatment") |> 
  pairs(adjust = "bonferroni")
# bonferroni_result
# # With compact letter display
# bonferroni_cld <- emmeans(circ_model, "treatment") |> 
#   multcomp::cld(Letters = letters, adjust = "bonferroni")
# # bonferroni_cld
# ============================================
# ŠIDÁK (SIDAK)
# ============================================
# All pairwise comparisons with Šidák adjustment
sidak_result <- emmeans(circ_model, "treatment") |> 
  pairs(adjust = "sidak")
# sidak_result
# # With compact letter display
# sidak_cld <- emmeans(circ_model, "treatment") |> 
#   multcomp::cld(Letters = letters, adjust = "sidak")
# # sidak_cld
# ============================================
# DUNNETT'S TEST
# ============================================
# Compare all treatments to control
# Need to specify control group as reference
dunnett_result <- emmeans(circ_model, "treatment") |> 
  contrast(method = "trt.vs.ctrl", ref = 1, adjust = "dunnett")
# dunnett_result
# ============================================
# COMPARISON TABLE
# ============================================
# Create a comparison tibble showing p-values from different methods
# First, get the actual comparison names from your results
comparison_summary <- tibble(
  Comparison = summary(tukey_result)$contrast,
  Tukey = summary(tukey_result)$p.value,
  Bonferroni = summary(bonferroni_result)$p.value,
  Sidak = summary(sidak_result)$p.value
)
# ============================================
# SUMMARY TABLE WITH ALL RESULTS
# ============================================
# Create a comprehensive comparison
all_methods <- bind_rows(
  summary(tukey_result) |> 
    as_tibble() |> 
    select(contrast, estimate, SE, p.value) |> 
    mutate(Method = "Tukey"),
  
  summary(bonferroni_result) |> 
    as_tibble() |> 
    select(contrast, estimate, SE, p.value) |> 
    mutate(Method = "Bonferroni"),
  
  summary(sidak_result) |> 
    as_tibble() |> 
    select(contrast, estimate, SE, p.value) |> 
    mutate(Method = "Sidak")
)
# Pivot wider for easy comparison
all_methods |> 
  select(contrast, Method, p.value) |> 
  pivot_wider(names_from = Method, values_from = p.value) |> 
  arrange(Tukey)

# A tibble: 3 × 4
  contrast          Tukey Bonferroni   Sidak
  <chr>             <dbl>      <dbl>   <dbl>
1 Control - Eyes  0.00787    0.00879 0.00877
2 Eyes - Knees    0.0117     0.0132  0.0131 
3 Control - Knees 0.997      1       1.000  

When ANOVA rejects H₀, we need to determine which groups differ.
Formal scientific writing example:
“The effect of light treatment on circadian rhythm phase shift was analyzed using a one-way ANOVA. There was a significant effect of treatment on phase shift (F(2, 19) = 7.29, p = 0.004, η² = 0.43). Post-hoc comparisons using Tukey’s HSD test indicated that the mean phase shift for the Eyes treatment (M = -1.55 h, SD = 0.71) was significantly different from both the Control treatment (M = -0.31 h, SD = 0.62) and the Knees treatment (M = -0.34 h, SD = 0.79). However, the Control and Knees treatments did not significantly differ from each other. These results suggest that light exposure to the eyes, but not to the knees, impacts circadian rhythm phase shifts.”
Non-parametric tests make fewer assumptions about the data distribution.
Consider non-parametric tests when the parametric assumptions (normal residuals, equal variances) are clearly violated, for example by outliers or strong skew.
Trade-off: when the parametric assumptions do hold, non-parametric tests generally have somewhat less power.
The assumptions of the Kruskal-Wallis test are much more relaxed than those of parametric ANOVA: observations need only be independent, and the response need only be ordinal.
Let’s create a modified dataset with clear violations:
# Create data with outliers and skewness
set.seed(42)
violated_circ_df <- circ_data %>%
  mutate(
    phase_shift_violated = case_when(
      treatment == "Control" ~ phase_shift,
      treatment == "Knees" ~ phase_shift * 3.25,
      treatment == "Eyes" ~ {
        n <- length(phase_shift)
        c(phase_shift[1:(n-2)], phase_shift[(n-1):n] - 7)
      }
    )
  )
# Fit model with violated assumptions
violated_model <- lm(phase_shift_violated ~ treatment, 
                     data = violated_circ_df)

What we changed: the Knees values are multiplied by 3.25 (inflating that group's spread), and two of the Eyes values are shifted down by 7 units (creating outliers and skew).
Clear violation: p < 0.05, and the points deviate from the Q-Q line.
    Shapiro-Wilk normality test
data:  resid(violated_model)
W = 0.89473, p-value = 0.02335

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  1.5618 0.2355
      19               

Even though the assumptions are met for the original data, let's compare parametric and non-parametric results:
# Kruskal-Wallis test on original data
kruskal_original_result <- kruskal.test(
  phase_shift ~ treatment, 
  data = circ_data)
# Display results
kruskal_original_result
    Kruskal-Wallis rank sum test
data:  phase_shift by treatment
Kruskal-Wallis chi-squared = 9.4231, df = 2, p-value = 0.008991

Interpretation:
- H = test statistic (chi-squared approximation)
- df = degrees of freedom (k - 1 = 3 - 1 = 2)
- p-value = 0.009 (significant at α = 0.05)
- Conclusion: At least one group differs from the others
Compare to parametric ANOVA:
Analysis of Variance Table
Response: phase_shift
          Df Sum Sq Mean Sq F value   Pr(>F)   
treatment  2 7.2245  3.6122  7.2894 0.004472 **
Residuals 19 9.4153  0.4955                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How Kruskal-Wallis works: all observations are ranked together, and the test asks whether the mean ranks differ among groups more than expected by chance.
Rank Sums:
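A sketch of how these rank summaries can be computed, assuming circ_data as before:

# Rank all observations together, then summarize the ranks by group
library(dplyr)
circ_data |>
  mutate(overall_rank = rank(phase_shift)) |>
  group_by(treatment) |>
  summarise(mean_rank = mean(overall_rank),
            sum_rank  = sum(overall_rank),
            n         = n())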
# A tibble: 3 × 4
  treatment mean_rank sum_rank     n
  <fct>         <dbl>    <dbl> <int>
1 Control       14.6       117     8
2 Eyes           5.29       37     7
3 Knees         14.1        99     7

Running Kruskal-Wallis on data with assumption violations:
# Kruskal-Wallis test on violated data
kruskal_violated_result <- kruskal.test(
  phase_shift_violated ~ treatment, 
  data = violated_circ_df)
kruskal_violated_result
    Kruskal-Wallis rank sum test
data:  phase_shift_violated by treatment
Kruskal-Wallis chi-squared = 6.8453, df = 2, p-value = 0.03263

Analysis of Variance Table
Response: phase_shift_violated
          Df  Sum Sq Mean Sq F value Pr(>F)  
treatment  2  41.806 20.9029  2.8361 0.0836 .
Residuals 19 140.037  7.3704                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Just like ANOVA, we need post-hoc tests to identify which groups differ:
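The output below is consistent with a pairwise.wilcox.test() call like this one (nonparam_circ_df is the name that appears in the output; it is assumed here to hold the original circadian data):

# Pairwise Wilcoxon rank-sum tests with a Bonferroni adjustment
pairwise.wilcox.test(
  x = nonparam_circ_df$phase_shift,
  g = nonparam_circ_df$treatment,
  p.adjust.method = "bonferroni"
)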
    Pairwise comparisons using Wilcoxon rank sum exact test 
data:  nonparam_circ_df$phase_shift and nonparam_circ_df$treatment 
      Control Eyes  
Eyes  0.0037  -     
Knees 1.0000  0.0787
P value adjustment method: bonferroni 

Options for p.adjust.method:
- "bonferroni": most conservative
- "holm": less conservative than Bonferroni
- "BH": Benjamini-Hochberg (controls the false discovery rate)
- "none": no adjustment (not recommended)

# Post-hoc tests on violated data
pairwise_violated_result <- pairwise.wilcox.test(
  x = violated_circ_df$phase_shift_violated,
  g = violated_circ_df$treatment,
  p.adjust.method = "bonferroni"
)
pairwise_violated_result
    Pairwise comparisons using Wilcoxon rank sum exact test 
data:  violated_circ_df$phase_shift_violated and violated_circ_df$treatment 
      Control Eyes  
Eyes  0.0037  -     
Knees 1.0000  1.0000
P value adjustment method: bonferroni 

Comparison table:
Original data pairwise p-values:
   Control vs Eyes:  0.024 *
   Control vs Knees: 1.000
   Eyes vs Knees:    0.078

Violated data pairwise p-values:
   Control vs Eyes:  0.015 *
   Control vs Knees: 1.000
   Eyes vs Knees:    0.015 *

Dunn’s test is specifically designed for Kruskal-Wallis post-hoc comparisons:
# Dunn's test on original data (dunnTest() is in the FSA package)
library(FSA)
dunn_original_result <- dunnTest(
  phase_shift ~ treatment,
  data = circ_data,
  method = "bonferroni")
dunn_original_result

       Comparison          Z     P.unadj      P.adj
1  Control - Eyes  2.7789287 0.005453849 0.01636155
2 Control - Knees  0.1434629 0.885924641 1.00000000
3    Eyes - Knees -2.5517789 0.010717452 0.03215236

# Dunn's test on violated data
dunn_violated_result <- dunnTest(
  phase_shift_violated ~ treatment,
  data = violated_circ_df,
  method = "bonferroni"
)
dunn_violated_result

       Comparison         Z     P.unadj      P.adj
1  Control - Eyes  2.614212 0.008943349 0.02683005
2 Control - Knees  1.126449 0.259975463 0.77992639
3    Eyes - Knees -1.440520 0.149720243 0.44916073

Interpreting Dunn's test:
- Large |Z| values indicate bigger differences between groups
- |Z| > 1.96 corresponds approximately to p < 0.05 (before adjustment)
- The sign of Z indicates the direction of the difference
Essential elements to include: the test statistic H with its degrees of freedom, the p-value, an effect size, the group medians and IQRs (with sample sizes), and the results of any post-hoc comparisons.
Example write-up: “A Kruskal-Wallis test was conducted to compare the effect of light treatment on circadian rhythm phase shift across three groups: Control (n = 8), Knees (n = 7), and Eyes (n = 7). Median phase shifts were -0.48 hours (IQR = 0.84) for Control, -0.29 hours (IQR = 1.04) for Knees, and -1.48 hours (IQR = 0.66) for Eyes. The test revealed a statistically significant difference in phase shift between the groups, H(2) = 8.66, p = 0.013, ε² = 0.37. Post-hoc pairwise comparisons using Dunn’s test with Bonferroni correction indicated that the Eyes treatment differed significantly from Control (p = 0.024), but no other pairwise differences were significant. These results suggest that light exposure to the eyes has a different effect on circadian rhythm than light to other body parts.”
In-text citation format:
The Kruskal-Wallis test revealed significant differences between treatment groups, H(2) = 8.66, p = .013.
Table format:
Table 1. Descriptive Statistics for Phase Shift by Treatment

# A tibble: 3 × 6
  treatment     n Median   IQR   Min   Max
  <fct>     <int>  <dbl> <dbl> <dbl> <dbl>
1 Control       8 -0.485 0.89  -1.27  0.53
2 Eyes          7 -1.48  0.675 -2.83 -0.78
3 Knees         7 -0.29  0.93  -1.61  0.73

Note. IQR = interquartile range. Kruskal-Wallis H(2) = 8.66, p = .013

Common mistakes to avoid: