# Load required packages
library(car)          # For ANOVA tests
library(emmeans)      # For estimated marginal means
library(mvnormtest)   # For multivariate normality test
library(biotools)     # For Box's M test
library(candisc)      # For canonical discriminant analysis
library(heplots)      # For multivariate plots
# library(MASS)         # For linear discriminant analysis
library(broom)        # For model summaries
library(patchwork)    # For combining plots
library(janitor)      # For cleaning names
library(tidyverse)    # For data manipulation and visualization
# # Set options
# options(scipen = 999)Lecture 16 - Class Activity MANOVA
Lecture 16: Multivariate Analysis of Variance (MANOVA)
What is MANOVA?
MANOVA (Multivariate Analysis of Variance) extends ANOVA to multiple response variables: - Compares group centroids in multivariate space - Tests whether groups differ on multiple dependent variables simultaneously - Controls family-wise error rate - Accounts for correlations between dependent variables
When to Use MANOVA
Use MANOVA when you have: - Response variables: Multiple continuous variables (correlated) - Predictor variable: One or more categorical variables (factors/groups) - Goal: Test for group differences across all response variables simultaneously
Key Assumptions of MANOVA
- Independence of observations
- Multivariate normality within groups
- Homogeneity of covariance matrices (Box’s M test)
- No extreme multivariate outliers
- Linear relationships among dependent variables
- No multicollinearity (but some correlation is expected)
Always check multivariate normality and homogeneity of covariance matrices before proceeding with MANOVA. These assumptions are more stringent than univariate ANOVA.
Part 1: Iris Data Analysis
Data Overview
We’ll analyze morphological measurements of three iris species: - Iris setosa - Iris versicolor
- Iris virginica
We have four measurements: sepal length, sepal width, petal length, and petal width. MANOVA will test whether species differ across all four measurements simultaneously.
# Load the iris data from the data subdirectory
iris_df <- read_csv("data/iris.csv")Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (4): sepal_length, sepal_width, petal_length, petal_width
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.# Clean names and prepare data
iris_df <- iris_df %>% clean_names()
# View data structure
head(iris_df)# A tibble: 6 × 5
  sepal_length sepal_width petal_length petal_width species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa # Check sample sizes by species
iris_df %>% 
  count(species)# A tibble: 3 × 2
  species        n
  <chr>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50# Create long format for visualization
iris_long_df <- iris_df %>%
  pivot_longer(
    cols = -species,
    names_to = "variable",
    values_to = "measure"
  )
# Plot all variables by species
iris_boxplot <- iris_long_df %>% 
  ggplot(aes(species, measure, fill = species)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  theme_minimal() +
  labs(title = "Iris Measurements by Species",
       x = "Species",
       y = "Measurement Value") +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))
iris_boxplotStep 1: Visualize Relationships Between Variables
Before running MANOVA, let’s examine the correlations between our response variables.
# Calculate means by species to show centroids
mean_points_df <- iris_df %>% 
  group_by(species) %>% 
  summarise(
    mean_petal_length = mean(petal_length),
    mean_petal_width = mean(petal_width),
    .groups = 'drop'
  )
# Create scatterplot with centroids
iris_centroid_plot <- iris_df %>% 
  ggplot(aes(x = petal_length, y = petal_width, color = species)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_point(data = mean_points_df, 
             aes(x = mean_petal_length, y = mean_petal_width, fill = species), 
             shape = 23, color = "black", stroke = 1.2, alpha = 0.7,
             size = 6) +
  theme_minimal() +
  labs(title = "Petal Measurements with Group Centroids",
       x = "Petal Length (cm)",
       y = "Petal Width (cm)") +
  theme(legend.position = "bottom")
iris_centroid_plotStep 2: Test Assumptions
Multivariate Normality
# Extract response variables as matrix
response_matrix <- iris_df %>% 
  dplyr::select(sepal_length, sepal_width, petal_length, petal_width) %>%
  as.matrix()
# Multivariate Shapiro-Wilk test
mshapiro.test(t(response_matrix))
    Shapiro-Wilk normality test
data:  Z
W = 0.97935, p-value = 0.02342Interpretation: If p < 0.05, the assumption of multivariate normality is violated. MANOVA is fairly robust to moderate violations with large sample sizes.
Homogeneity of Covariance Matrices
# Prepare response variables
response_vars_df <- iris_df %>% 
  dplyr::select(sepal_length, sepal_width, petal_length, petal_width)
# Box's M test for homogeneity of covariance matrices
iris_box_m_model <- boxM(response_vars_df, iris_df$species)
iris_box_m_model
    Box's M-test for Homogeneity of Covariance Matrices
data:  response_vars_df
Chi-Sq (approx.) = 140.94, df = 20, p-value < 2.2e-16Interpretation: If p < 0.05, covariance matrices differ between groups. MANOVA is robust to this violation with equal sample sizes.
Visual Assessment of Normality
# Create Q-Q plots for each variable by species
iris_qq_plot <- iris_long_df %>%
  ggplot(aes(sample = measure)) +
  geom_qq() +
  geom_qq_line() +
  facet_grid(variable ~ species, scales = "free") +
  labs(title = "Q-Q Plots by Species and Variable") +
  theme_minimal()
iris_qq_plotStep 3: Fit MANOVA Model
# Fit MANOVA model
iris_manova_model <- manova(cbind(sepal_length, sepal_width, petal_length, petal_width) ~ species, 
                           data = iris_df)
# View MANOVA results
summary(iris_manova_model)           Df Pillai approx F num Df den Df    Pr(>F)    
species     2 1.1919   53.466      8    290 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Interpretation: - Pillai’s trace is the default test statistic (most robust) - Large F-value and small p-value indicate significant group differences - The null hypothesis (all species have same multivariate means) is rejected
Step 4: Alternative Test Statistics
# Wilks' Lambda
summary(iris_manova_model, test = "Wilks")           Df    Wilks approx F num Df den Df    Pr(>F)    
species     2 0.023439   199.15      8    288 < 2.2e-16 ***
Residuals 147                                              
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Hotelling-Lawley Trace
summary(iris_manova_model, test = "Hotelling-Lawley")           Df Hotelling-Lawley approx F num Df den Df    Pr(>F)    
species     2           32.477   580.53      8    286 < 2.2e-16 ***
Residuals 147                                                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Roy's Largest Root
summary(iris_manova_model, test = "Roy")           Df    Roy approx F num Df den Df    Pr(>F)    
species     2 32.192     1167      4    145 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Step 5: Follow-up Univariate ANOVAs
Since MANOVA is significant, we examine which variables contribute to group differences.
# Extract univariate ANOVA results
summary.aov(iris_manova_model) Response sepal_length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 Response sepal_width :
             Df Sum Sq Mean Sq F value    Pr(>F)    
species       2 11.345  5.6725   49.16 < 2.2e-16 ***
Residuals   147 16.962  0.1154                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 Response petal_length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 Response petal_width :
             Df Sum Sq Mean Sq F value    Pr(>F)    
species       2 80.413  40.207  960.01 < 2.2e-16 ***
Residuals   147  6.157   0.042                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Step 6: Post-hoc Comparisons
# Sepal Length ANOVA and comparisons
sepal_length_model <- aov(sepal_length ~ species, data = iris_df)
summary(sepal_length_model)             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Pairwise comparisons
sepal_length_emm <- emmeans(sepal_length_model, ~ species)
pairs(sepal_length_emm) contrast               estimate    SE  df t.ratio p.value
 setosa - versicolor      -0.930 0.103 147  -9.033  <.0001
 setosa - virginica       -1.582 0.103 147 -15.366  <.0001
 versicolor - virginica   -0.652 0.103 147  -6.333  <.0001
P value adjustment: tukey method for comparing a family of 3 estimates # Sepal Width ANOVA and comparisons
sepal_width_model <- aov(sepal_width ~ species, data = iris_df)
summary(sepal_width_model)             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  11.35   5.672   49.16 <2e-16 ***
Residuals   147  16.96   0.115                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Pairwise comparisons
sepal_width_emm <- emmeans(sepal_width_model, ~ species)
pairs(sepal_width_emm) contrast               estimate     SE  df t.ratio p.value
 setosa - versicolor       0.658 0.0679 147   9.685  <.0001
 setosa - virginica        0.454 0.0679 147   6.683  <.0001
 versicolor - virginica   -0.204 0.0679 147  -3.003  0.0088
P value adjustment: tukey method for comparing a family of 3 estimates # Petal Length ANOVA and comparisons
petal_length_model <- aov(petal_length ~ species, data = iris_df)
summary(petal_length_model)             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  437.1  218.55    1180 <2e-16 ***
Residuals   147   27.2    0.19                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Pairwise comparisons
petal_length_emm <- emmeans(petal_length_model, ~ species)
pairs(petal_length_emm) contrast               estimate     SE  df t.ratio p.value
 setosa - versicolor       -2.80 0.0861 147 -32.510  <.0001
 setosa - virginica        -4.09 0.0861 147 -47.521  <.0001
 versicolor - virginica    -1.29 0.0861 147 -15.012  <.0001
P value adjustment: tukey method for comparing a family of 3 estimates # Petal Width ANOVA and comparisons
petal_width_model <- aov(petal_width ~ species, data = iris_df)
summary(petal_width_model)             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  80.41   40.21     960 <2e-16 ***
Residuals   147   6.16    0.04                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Pairwise comparisons
petal_width_emm <- emmeans(petal_width_model, ~ species)
pairs(petal_width_emm) contrast               estimate     SE  df t.ratio p.value
 setosa - versicolor       -1.08 0.0409 147 -26.387  <.0001
 setosa - virginica        -1.78 0.0409 147 -43.489  <.0001
 versicolor - virginica    -0.70 0.0409 147 -17.102  <.0001
P value adjustment: tukey method for comparing a family of 3 estimates Step 7: Canonical Discriminant Analysis
To understand how the groups differ in multivariate space, we perform canonical discriminant analysis.
# Perform canonical discriminant analysis
iris_candisc_model <- candisc(iris_manova_model)
iris_candisc_model
Canonical Discriminant Analysis for species:
   CanRsq Eigenvalue Difference  Percent Cumulative
1 0.96987   32.19193     31.907 99.12126     99.121
2 0.22203    0.28539     31.907  0.87874    100.000
Test of H0: The canonical correlations in the 
current row and all that follow are zero
  LR test stat approx F numDF denDF   Pr(> F)    
1      0.02344  199.145     8   288 < 2.2e-16 ***
2      0.77797   13.794     3   145 5.794e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Eigenvalues
iris_candisc_model$eigenvalues[1]  3.219193e+01  2.853910e-01 -7.801056e-17 -1.398429e-15# Proportion of variance explained
prop_variance <- iris_candisc_model$eigenvalues / sum(iris_candisc_model$eigenvalues)
prop_variance[1]  9.912126e-01  8.787395e-03 -2.402001e-18 -4.305862e-17# Cumulative proportion
cumsum(prop_variance)[1] 0.9912126 1.0000000 1.0000000 1.0000000Step 8: Linear Discriminant Analysis
# Perform LDA
iris_lda_model <- lda(species ~ sepal_length + sepal_width + petal_length + petal_width, 
                      data = iris_df)
iris_lda_modelCall:
lda(species ~ sepal_length + sepal_width + petal_length + petal_width, 
    data = iris_df)
Prior probabilities of groups:
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 
Group means:
           sepal_length sepal_width petal_length petal_width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
Coefficients of linear discriminants:
                    LD1         LD2
sepal_length  0.8293776 -0.02410215
sepal_width   1.5344731 -2.16452123
petal_length -2.2012117  0.93192121
petal_width  -2.8104603 -2.83918785
Proportion of trace:
   LD1    LD2 
0.9912 0.0088 # Get predictions
iris_lda_pred <- predict(iris_lda_model)
# Create dataframe with LDA scores
lda_scores_df <- data.frame(
  LD1 = iris_lda_pred$x[, 1],
  LD2 = iris_lda_pred$x[, 2],
  species = iris_df$species
)Step 9: Visualize Canonical Space
# Calculate group centroids in canonical space
centroids_df <- lda_scores_df %>%
  group_by(species) %>%
  summarise(
    LD1_mean = mean(LD1),
    LD2_mean = mean(LD2),
    .groups = 'drop'
  )
# Create canonical plot
canonical_plot <- lda_scores_df %>%
  ggplot(aes(x = LD1, y = LD2, color = species)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_point(data = centroids_df,
             aes(x = LD1_mean, y = LD2_mean, fill = species),
             shape = 23, color = "black", size = 8, stroke = 2) +
  labs(title = "Canonical Discriminant Analysis",
       subtitle = "Iris Species in Optimal Multivariate Space",
       x = "Linear Discriminant 1",
       y = "Linear Discriminant 2") +
  theme_minimal() +
  theme(legend.position = "bottom")
canonical_plotStep 10: Effect Size
# Calculate Wilks' Lambda for effect size
manova_wilks <- summary(iris_manova_model, test = "Wilks")
wilks_lambda <- manova_wilks$stats[1, "Wilks"]
wilks_lambda[1] 0.02343863# Calculate partial eta-squared
partial_eta_squared <- 1 - wilks_lambda
partial_eta_squared[1] 0.9765614Summary Checklist for MANOVA
When conducting MANOVA, always follow these steps:
- Visualize your data - boxplots and scatterplots by groups
- Check assumptions
- Multivariate normality (Shapiro-Wilk test)
- Homogeneity of covariance matrices (Box’s M test)
- Visual assessment with Q-Q plots
 
- Fit MANOVA model - response variables ~ grouping factor
- Examine test statistics - Pillai’s, Wilks’, Hotelling-Lawley, Roy’s
- Follow-up analyses if MANOVA is significant
- Univariate ANOVAs for each variable
- Post-hoc pairwise comparisons
 
- Canonical analysis - understand multivariate patterns
- Calculate effect size - partial eta-squared from Wilks’ Lambda
- Visualize results - canonical plots showing group separation
Key Points to Remember
- MANOVA controls Type I error when testing multiple dependent variables
- More powerful than separate ANOVAs when variables are correlated
- Tests group centroids in multivariate space, not individual means
- Canonical variates show optimal linear combinations for group separation
- Effect sizes can be very large when groups are well-separated
- Assumptions are more stringent than univariate ANOVA
- Check multivariate assumptions first - normality and homogeneity of covariances
- MANOVA tests the global hypothesis - do groups differ on any combination of variables?
- Follow-up tests identify specific differences - which variables drive group separation
- Canonical analysis reveals patterns - how variables work together to discriminate groups
- Visualize in reduced space - canonical plots show multivariate relationships clearly
- Interpret effect sizes - Wilks’ Lambda tells us proportion of variance explained
- Consider biological meaning - what do the multivariate patterns tell us about the organisms?
Remember: MANOVA is ideal when you expect groups to differ on multiple correlated traits that reflect an integrated biological system!