Multivariate data vs. multivariate analysis
We’ve already seen multivariate data in multiple regression and multi-factor ANOVA. Now we’ll look at cases with multiple response variables.
Ecological MV methods can be divided into “functional” and “structural”.
Functional methods: - Clear response and predictor variables - Goal: relate Ys to Xs - Examples: regression, MANOVA, ANOSIM, PERMANOVA
Structural methods: - Find patterns/structure in the data - Often no clear predictors - Examples: PCA, NMDS, cluster analysis
Sometimes combine both…
Scaling/Ordination Methods: - Reduce dimensions with new derived variables - Summarize patterns in data - Examples: PCA, CCA
Dissimilarity-Based Methods: - Measure dissimilarity between objects - Visualize relationships between objects - Examples: NMDS, Cluster Analysis
How to think about the new values
Key concept
Eigenvalues (λ) represent the amount of variation explained by each new derived variable, while eigenvectors contain the coefficients showing how original variables contribute to each component.
Derived variables are found so that each successive variable explains as much of the remaining variation as possible and is uncorrelated with the ones before it.
Eigenvalues (latent roots) λ1, λ2, …, λp represent the amount of variation in the data explained by each of the k = 1 to p derived variables.
Eigenvalues are population parameters; they are estimated (e.g., by maximum likelihood) to give the sample statistics l1, l2, …, lp.
Eigenvectors are lists of coefficients (c) that show contribution of original variables to new, derived variables
Each new variable has an eigenvalue and an eigenvector
New variables (components) are derived from a p x p covariance or correlation matrix of original variables
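The decomposition described above can be sketched in base R on the four iris measurements (an illustration, not part of the original analysis):

```r
# Minimal sketch: eigen-decomposition of a p x p covariance matrix
X <- as.matrix(iris[, 1:4])   # n x p data matrix
S <- cov(X)                   # p x p covariance matrix
e <- eigen(S)
e$values                      # eigenvalues l1 >= l2 >= ... >= lp
e$vectors                     # eigenvectors: columns of coefficients c
e$values / sum(e$values)      # proportion of variance explained by each component
```

Each column of `e$vectors` pairs with one eigenvalue: together they define one derived variable (component).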
Dissimilarity is often represented as a dissimilarity matrix
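A minimal base-R sketch of such a matrix, using Euclidean distance on the first five flowers (illustrative only):

```r
# Pairwise Euclidean dissimilarities among five observations
d <- dist(iris[1:5, 1:4], method = "euclidean")
as.matrix(d)   # symmetric 5 x 5 matrix with zeros on the diagonal
```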
Data transformation is common and useful in MV analyses
Transformations: - Log transformation for skewed data - Root transformations for count data - Fourth-root for species abundance data
Standardization: - Centering: subtract mean (mean = 0) - Standardization: divide by SD (SD = 1) - Crucial for variables with different units - May not be appropriate for species data
Why standardize?
Standardization ensures all variables contribute equally to the analysis regardless of their original units or scales of measurement. Without it, variables with larger values or variances would dominate the results.
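Centering and standardization can be sketched in base R with `scale()` (the transformation examples use made-up values):

```r
# Centering (mean = 0) and standardizing (SD = 1)
Z <- scale(iris[, 1:4], center = TRUE, scale = TRUE)
colMeans(Z)      # ~0 after centering
apply(Z, 2, sd)  # 1 after standardizing

# Common transformations for skewed or count-like data
x <- c(1, 10, 100, 1000)
log(x + 1)  # log transform (+1 to handle zeros)
x^0.25      # fourth root, often used for species abundances
```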
Multivariate Outliers: - Objects with unusual patterns across variables - Detected with Mahalanobis distance (d²) - Test against χ² distribution with p df
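This screening procedure can be sketched in base R on the iris measurements (an illustration, not part of the original analysis):

```r
# Mahalanobis d^2 for each observation, relative to the overall
# centroid and covariance matrix
X <- as.matrix(iris[, 1:4])
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Compare d^2 against chi-square with p = 4 df (here alpha = 0.001)
cutoff <- qchisq(0.999, df = ncol(X))
which(d2 > cutoff)   # indices of candidate multivariate outliers
```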
Missing Observations: - Common approaches: - Deletion: remove affected object or variable - Imputation: estimate missing values - Maximum likelihood methods - Multiple imputation
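A minimal sketch of the simplest imputation option, mean imputation, in base R (the toy vector is hypothetical; multiple imputation, e.g. via the mice package, is usually preferable):

```r
# Replace NAs with the mean of the observed values
x <- c(5.1, NA, 4.7, 4.6, NA)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x   # NAs replaced by 4.8
```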
Morphometric measurements on n = 150 flowers
Response vars: sepal length + width, petal length + width
Predictor variable: species
Question: are there differences between species?
# packages: tidyverse (dplyr, tidyr, readr) and janitor
library(tidyverse)
library(janitor)

iris_df <- iris %>% clean_names()
iris_long_df <- iris_df %>%
  pivot_longer(cols = -species,
               names_to = "variable",
               values_to = "measure")
write_csv(iris_df, "data/iris.csv")
head(iris_df)
sepal_length sepal_width petal_length petal_width species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
One approach: a series of one-way ANOVAs, one per response variable (sepal length shown below).
But: running several separate tests inflates the family-wise Type I error rate and ignores the correlations among the responses.
Anova Table (Type III tests)
Response: sepal_length
Sum Sq Df F value Pr(>F)
(Intercept) 1253.00 1 4728.16 < 2.2e-16 ***
species 63.21 2 119.26 < 2.2e-16 ***
Residuals 38.96 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Instead of comparing means, compare centroids (multivariate means), and for all of the variables, not just the two plotted.
mean_points <- iris_df %>%
  group_by(species) %>%
  summarise(mean_length = mean(petal_length),
            mean_width = mean(petal_width),
            .groups = 'drop')

iris_plot <- iris_df %>%
  ggplot(aes(x = petal_length, y = petal_width, color = species)) +
  geom_point(alpha = 0.85, size = 2) +
  geom_point(data = mean_points,
             aes(x = mean_length, y = mean_width, fill = species),
             shape = 23, color = "black", stroke = 1.2, alpha = 0.7,
             size = 6) +
  theme_light()
iris_plot
Several test statistics can be computed: Pillai’s trace, Wilks’ lambda, Hotelling-Lawley trace, and Roy’s largest root.
MANOVA Assumptions: multivariate normality, homogeneity of variance-covariance matrices across groups, and independence of observations.
iris_manova_model <- manova(cbind(sepal_length, sepal_width, petal_length, petal_width) ~ species, data = iris_df)
summary(iris_manova_model)
Df Pillai approx F num Df den Df Pr(>F)
species 2 1.1919 53.466 8 290 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
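summary() reports Pillai’s trace by default; the other MANOVA statistics can be requested through the `test` argument. A self-contained sketch (refit here with the base iris column names so the snippet stands alone):

```r
# Fit the same MANOVA on base iris, then request each test statistic
fit <- manova(cbind(Sepal.Length, Sepal.Width,
                    Petal.Length, Petal.Width) ~ Species, data = iris)
summary(fit, test = "Wilks")
summary(fit, test = "Hotelling-Lawley")
summary(fit, test = "Roy")
```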
Assumption test
# Test multivariate normality for all data together
# mshapiro.test() is from the mvnormtest package
library(mvnormtest)
response_matrix <- iris_df %>%
  dplyr::select(sepal_length, sepal_width, petal_length, petal_width) %>%
  as.matrix()
# Multivariate Shapiro-Wilk test (mshapiro.test expects variables in rows,
# hence the transpose)
mshapiro.test(t(response_matrix))
Shapiro-Wilk normality test
data: Z
W = 0.97935, p-value = 0.02342
The p-value (0.023) suggests a mild departure from multivariate normality.
Assumption test
response_vars <- iris_df %>%
dplyr::select(sepal_length, sepal_width, petal_length, petal_width)
# Box's M test for equality of covariance matrices
# boxM() is from the heplots package
library(heplots)
box_m_result <- boxM(response_vars, iris_df$species)
print(box_m_result)
Box's M-test for Homogeneity of Covariance Matrices
data: response_vars
Chi-Sq (approx.) = 140.94, df = 20, p-value < 2.2e-16
The covariance matrices clearly differ across species, which favors the more robust Pillai’s trace over Wilks’ lambda.
Assumption test
# Create Q-Q plots for each variable by species
iris_long <- iris_df %>%
pivot_longer(cols = c(sepal_length, sepal_width, petal_length, petal_width),
names_to = "variable",
values_to = "value")
iris_long %>%
ggplot(aes(sample = value)) +
geom_qq() +
geom_qq_line() +
facet_grid(variable ~ species, scales = "free") +
labs(title = "Q-Q Plots by Species and Variable") +
theme_light()
# Univariate ANOVAs for each response variable
# Sepal Length ANOVA
sepal_length_aov <- aov(sepal_length ~ species, data = iris_df)
summary(sepal_length_aov)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Sepal Width ANOVA
sepal_width_aov <- aov(sepal_width ~ species, data = iris_df)
summary(sepal_width_aov)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 11.35 5.672 49.16 <2e-16 ***
Residuals 147 16.96 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Petal Length ANOVA
petal_length_aov <- aov(petal_length ~ species, data = iris_df)
summary(petal_length_aov)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 437.1 218.55 1180 <2e-16 ***
Residuals 147 27.2 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Petal Width ANOVA
petal_width_aov <- aov(petal_width ~ species, data = iris_df)
summary(petal_width_aov)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 80.41 40.21 960 <2e-16 ***
Residuals 147 6.16 0.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
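The code that produced the pairwise tables below is not shown; their format matches emmeans output. A base-R equivalent applies TukeyHSD() to each fitted aov (refit here with base iris names so the snippet stands alone):

```r
# Tukey-adjusted pairwise comparisons for one response variable
sl_aov <- aov(Sepal.Length ~ Species, data = iris)
TukeyHSD(sl_aov)   # all three species contrasts with adjusted p-values
```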
[1] "Sepal Length comparisons"
contrast estimate SE df t.ratio p.value
setosa - versicolor -0.930 0.103 147 -9.033 <.0001
setosa - virginica -1.582 0.103 147 -15.366 <.0001
versicolor - virginica -0.652 0.103 147 -6.333 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
[1] "Sepal Width comparisons"
contrast estimate SE df t.ratio p.value
setosa - versicolor 0.658 0.0679 147 9.685 <.0001
setosa - virginica 0.454 0.0679 147 6.683 <.0001
versicolor - virginica -0.204 0.0679 147 -3.003 0.0088
P value adjustment: tukey method for comparing a family of 3 estimates
[1] "Petal Length comparisons"
contrast estimate SE df t.ratio p.value
setosa - versicolor -2.80 0.0861 147 -32.510 <.0001
setosa - virginica -4.09 0.0861 147 -47.521 <.0001
versicolor - virginica -1.29 0.0861 147 -15.012 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
[1] "Petal Width comparisons"
contrast estimate SE df t.ratio p.value
setosa - versicolor -1.08 0.0409 147 -26.387 <.0001
setosa - virginica -1.78 0.0409 147 -43.489 <.0001
versicolor - virginica -0.70 0.0409 147 -17.102 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
# Perform canonical discriminant analysis
# candisc() is from the candisc package
library(candisc)
iris_candisc <- candisc(iris_manova_model)
# Display eigenvalues and canonical correlations
cat("Canonical Discriminant Analysis Results:\n\n")
Canonical Discriminant Analysis Results:
Canonical Discriminant Analysis for species:
CanRsq Eigenvalue Difference Percent Cumulative
1 0.96987 32.19193 31.907 99.12126 99.121
2 0.22203 0.28539 31.907 0.87874 100.000
Test of H0: The canonical correlations in the
current row and all that follow are zero
LR test stat approx F numDF denDF Pr(> F)
1 0.02344 199.145 8 288 < 2.2e-16 ***
2 0.77797 13.794 3 145 5.794e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Eigenvalues:
[1] 3.219193e+01 2.853910e-01 -7.801056e-17 -1.398429e-15
(The last two are numerically zero: with g = 3 groups and p = 4 variables there are only min(g - 1, p) = 2 canonical dimensions.)
# Calculate proportion of variance explained
# (eigenvalues extracted from the candisc fit, e.g. iris_candisc$eigenvalues)
eigenvalues <- iris_candisc$eigenvalues
prop_variance <- eigenvalues / sum(eigenvalues)
cat("\nProportion of variance explained by each canonical variate:\n")
Proportion of variance explained by each canonical variate:
[1] 9.912126e-01 8.787395e-03 -2.402001e-18 -4.305862e-17
# Cumulative proportion
cumulative_prop <- cumsum(prop_variance)
cat("\nCumulative proportion of variance explained:\n")
Cumulative proportion of variance explained:
[1] 0.9912126 1.0000000 1.0000000 1.0000000
Raw Canonical Coefficients (Eigenvectors):
Can1 Can2
sepal_length 0.8293776 0.02410215
sepal_width 1.5344731 2.16452123
petal_length -2.2012117 -0.93192121
petal_width -2.8104603 2.83918785
Standardized Canonical Coefficients:
Can1 Can2
sepal_length 0.4269548 0.01240753
sepal_width 0.5212417 0.73526131
petal_length -0.9472572 -0.40103782
petal_width -0.5751608 0.58103986
# Structure coefficients (correlations between original variables and canonical variates)
cat("\nStructure Coefficients (Variable-Canonical Variate Correlations):\n")
Structure Coefficients (Variable-Canonical Variate Correlations):
Can1 Can2
sepal_length -0.7918878 0.21759312
sepal_width 0.5307590 0.75798931
petal_length -0.9849513 0.04603709
petal_width -0.9728120 0.22290236
Multivariate Visualization
# Extract canonical scores; a simpler alternative route is
# linear discriminant analysis via MASS::lda()
iris_lda <- MASS::lda(species ~ sepal_length + sepal_width + petal_length + petal_width, data = iris_df)
lda_pred <- predict(iris_lda)
# Create dataframe with LDA scores (equivalent to canonical scores)
canonical_df_plot <- data.frame(
Can1 = lda_pred$x[, 1],
Can2 = lda_pred$x[, 2],
species = iris_df$species
)
# Calculate group centroids
centroids_plot <- canonical_df_plot %>%
group_by(species) %>%
summarise(Can1_mean = mean(Can1),
Can2_mean = mean(Can2),
.groups = 'drop')
# Create ggplot
canonical_df_plot %>%
ggplot(aes(x = Can1, y = Can2, color = species)) +
geom_point(size = 3, alpha = 0.7) +
geom_point(data = centroids_plot,
aes(x = Can1_mean, y = Can2_mean, fill = species),
shape = 23, color = "black", size = 8, stroke = 2) +
labs(title = "Canonical Discriminant Analysis",
subtitle = "Iris Species in Optimal Multivariate Space",
x = "Linear Discriminant 1",
y = "Linear Discriminant 2",
color = "Species",
fill = "Species") +
theme_light() +
theme(legend.position = "bottom")
Pillai’s Trace (1.1919): This is large, indicating substantial group differences across the multivariate space.
F-statistic (53.466): Very large F-value indicates strong evidence against the null hypothesis.
P-value: Essentially zero, meaning we reject the null hypothesis that all three species have the same multivariate means.
Conclusion: The three iris species are significantly different when considering all four morphological measurements simultaneously in multivariate space.
Wilks’ Lambda (0.023439): Very small value (close to 0) indicates:
Effect Size: Partial η² ≈ 1 - 0.023439 = 0.977 (very large effect size)
F-statistic (199.15): larger than the Pillai-based F statistic; Wilks’ lambda is often somewhat more powerful when the MANOVA assumptions are met
Conclusion: The three iris species show extremely large multivariate differences - they are very well separated in the 4-dimensional morphological space, with species explaining nearly 98% of the multivariate variance.
Wilks’ vs Pillai’s: Wilks’ Lambda is generally preferred when assumptions are met, while Pillai’s trace is more robust to assumption violations.
Meaning: Approximately 97.7% of the total multivariate variance is explained by species differences.
Effect Size Guidelines: - Small effect: η² ≈ 0.01 (1% of variance explained) - Medium effect: η² ≈ 0.06 (6% of variance explained) - Large effect: η² ≈ 0.14 (14% of variance explained) - Our result: η² = 0.977 (extremely large effect)
Practical Interpretation: - Species are almost perfectly separated in multivariate morphological space - Only 2.3% of the variation in the four measurements is due to within-species differences - Species membership explains nearly all the multivariate variation - This represents one of the strongest group separations possible in real biological data
Conclusion: The iris species show dramatically different morphological profiles - they are essentially non-overlapping in the 4-dimensional space of sepal/petal measurements. This effect size indicates that species is an extremely powerful predictor of morphological characteristics.
Element [1] = 0.9912: - First canonical variate explains 99.12% of the between-group variance - This dimension captures almost all the multivariate group differences
Element [2] = 0.0088: - Second canonical variate explains 0.88% of the between-group variance - This dimension captures the remaining small group differences
Dimensionality: The group differences are essentially one-dimensional - 99% of separation occurs along the first canonical axis.
Biological Meaning: There’s one primary “direction” in morphological space that best separates the three iris species, with a very minor secondary pattern.
Practical Implication: You could visualize almost all the group separation using just the first canonical variate, though plotting both dimensions shows the complete picture.
Group means: - Setosa: Smallest overall, widest sepals, tiny petals - Versicolor: Medium-sized in most dimensions - Virginica: Largest overall, especially in petal dimensions - Clear size progression: setosa < versicolor < virginica
Coefficients of linear discriminants: - LD1: Positive weights for sepal measurements, negative for petal measurements - Separates small-petaled from large-petaled species - LD2: Mainly contrasts sepal width vs petal width - Fine-tunes separation between versicolor and virginica
Proportion of trace: - LD1: Explains 99.12% of between-group discrimination - LD2: Explains 0.88% of between-group discrimination - Confirms the separation is essentially one-dimensional (petal vs sepal contrast)
Call:
lda(species ~ sepal_length + sepal_width + petal_length + petal_width,
data = iris_df)
Prior probabilities of groups:
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
Group means:
sepal_length sepal_width petal_length petal_width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
Coefficients of linear discriminants:
LD1 LD2
sepal_length 0.8293776 -0.02410215
sepal_width 1.5344731 -2.16452123
petal_length -2.2012117 0.93192121
petal_width -2.8104603 -2.83918785
Proportion of trace:
LD1 LD2
0.9912 0.0088
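A natural follow-up to the LDA output above is its resubstitution classification accuracy; a sketch using MASS (which ships with R):

```r
# Confusion matrix and accuracy for the fitted LDA model
library(MASS)
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit)$class
table(Predicted = pred, Actual = iris$Species)  # confusion matrix
mean(pred == iris$Species)                      # proportion classified correctly
```

Resubstitution accuracy is optimistic; cross-validation (e.g., `lda(..., CV = TRUE)`) gives an honest error estimate.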