Key point: Distance should use standardized values so all variables contribute equally
##
## Distance between plants 1 and 2 (standardized):
## 3.53 standard deviations
##
##
## Note: Standardization uses mean and SD from ALL plants,
## not just these two plants!
Standardizing the Data
Step 2: Z-score Transformation
Visualizing Standardized Data
Understanding Correlations
Why Correlations Matter for PCA
PCA works best when variables are correlated:
If variables are uncorrelated → PCA provides no benefit
If variables are highly correlated → PCA can reduce dimensions effectively
In data, expect correlations because:
Larger plants have larger measurements overall
Wing dimensions likely correlate with each other
Mass measurements correlate with size measurements
The correlation matrix shows relationships between all pairs of variables.
Strong correlations (positive or negative) suggest PCA will be useful!
Strong positive correlations (dark blue) suggest variables measure similar aspects of plant size/shape.
This is called ordination - ordering objects along axes.
The ordination
Different visualization of PCA Scores
PCA Biplot: Combining Scores and Loadings
Reading a Biplot
A biplot shows both:
Points = plants (PC scores)
Arrows = original variables (loadings)
How to interpret arrows:
Length = how important the variable is
Direction = how the variable relates to PCs
Angle between arrows = correlation
Small angle = positively correlated
90° = uncorrelated
180° = negatively correlated
How to interpret points:
Plants in the direction of an arrow are high in that variable
Plants opposite an arrow are low in that variable
The plot
Biplot Interpretation
Key Patterns in the Biplot
Variable relationships:
Most morphological and biomass variables point together (right and slightly down):
Height, mouth, masses - These are highly correlated
All contribute to PC1 (size)
Wing variables (wing1_length, wing2_length, wingspread) point more upward:
Contribute positively to PC2
Represent the “wing dimension” of shape
Site separation:
HD (blue): Lower PC1, plants are smaller overall
DG (orange): Higher PC1, plants are larger
TJH (yellow) and LEH (green): Intermediate and variable
Biological Interpretation
What determines plant morphology?
Size (PC1, 45.8% variance)
Primary source of variation
Most measurements scale together
May reflect resource availability or age
Shape trade-offs (PC2, 16.7% variance)
Tall narrow vs. short wide with large wings
May reflect different trapping strategies
Or developmental/environmental constraints
Tube dimensions (PC3, 14.8% variance)
Independent aspect of morphology
May affect prey capture efficiency
Sites differ primarily in overall size, with some shape variation.
Testing Site Differences with ANOVA
Step 4: Statistical Analysis
Now that we have PC scores, we can test if sites differ:
Result: Sites differ significantly in PC1 (F = 9.26, P < 0.001)
# Test for site differences on PC1 (size)anova_pc1 <-aov(PC1 ~ site, data = pc_scores)summary(anova_pc1)## Df Sum Sq Mean Sq F value Pr(>F) ## site 3 124.3 41.44 11.29 2.75e-06 ***## Residuals 83 304.6 3.67 ## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Calculate means by sitepc_scores %>%group_by(site) %>%summarize(mean_PC1 =mean(PC1),sd_PC1 =sd(PC1),n =n() )## # A tibble: 4 × 4## site mean_PC1 sd_PC1 n## <chr> <dbl> <dbl> <int>## 1 DG -1.02 1.38 25## 2 HD 2.63 0.857 12## 3 LEH -0.659 2.61 25## 4 TJH 0.423 1.90 25
ANOVA for PC2 and PC3
Testing Shape and Tube Differences
Results:
PC2 (shape): F = 5.36, P = 0.002 (significant)
PC3 (tube girth): F = 1.45, P = 0.234 (not significant)
# Test for site differences on PC2 (shape)anova_pc2 <-aov(PC2 ~ site, data = pc_scores)summary(anova_pc2)## Df Sum Sq Mean Sq F value Pr(>F) ## site 3 21.63 7.209 4.595 0.00502 **## Residuals 83 130.22 1.569 ## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Test for site differences on PC3 (tube girth)anova_pc3 <-aov(PC3 ~ site, data = pc_scores)summary(anova_pc3)## Df Sum Sq Mean Sq F value Pr(>F) ## site 3 43.68 14.561 13.8 2.2e-07 ***## Residuals 83 87.58 1.055 ## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# Summary statisticspc_scores %>%group_by(site) %>%summarize(mean_PC2 =mean(PC2),mean_PC3 =mean(PC3))## # A tibble: 4 × 3## site mean_PC2 mean_PC3## <chr> <dbl> <dbl>## 1 DG 0.604 0.542## 2 HD -0.863 -1.26 ## 3 LEH 0.155 -0.555## 4 TJH -0.345 0.618
Visualizing Site Differences
Box Plots of PC Scores
HD plants are significantly smaller than plants at other sites.
TJH plants have different shapes
taller and narrower with smaller wings.
Post-hoc Comparisons
Which Sites Differ?
## contrast estimate SE df t.ratio p.value
## DG - HD -3.651 0.673 83 -5.427 <.0001
## DG - LEH -0.366 0.542 83 -0.675 0.9064
## DG - TJH -1.447 0.542 83 -2.671 0.0442
## HD - LEH 3.285 0.673 83 4.883 <.0001
## HD - TJH 2.204 0.673 83 3.276 0.0082
## LEH - TJH -1.082 0.542 83 -1.996 0.1977
##
## P value adjustment: tukey method for comparing a family of 4 estimates
## site emmean SE df lower.CL upper.CL .group
## DG -1.024 0.383 83 -1.786 -0.262 a
## LEH -0.659 0.383 83 -1.421 0.103 ab
## TJH 0.423 0.383 83 -0.339 1.185 b
## HD 2.626 0.553 83 1.526 3.726 c
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
Distance Between Sites
Calculating Euclidean Distance Between Groups
Calculate the distance between site centroids (mean PC scores):
# 1. Prepare PC Scores with Sitepc_scores_for_dist <-as.data.frame(pca_result$x) %>%mutate(site = darlingtonia$site)# 2. Calculate the mean PC scores for each site (the site "centroids")site_centroids <- pc_scores_for_dist %>%group_by(site) %>%summarise(across(starts_with("PC"), mean)) %>%ungroup()# 3. Set site names as row names BEFORE calculating distancessite_centroids_matrix <- site_centroids %>%column_to_rownames("site") # This keeps site names as row names# 4. Calculate the Euclidean distance matrix between site centroidsdist_matrix <-dist(site_centroids_matrix, method ="euclidean")# Display the distance matrixdist_matrix## DG HD LEH## HD 4.346434 ## LEH 1.621198 3.659711 ## TJH 1.843120 3.021386 1.960547# Visualize the distance matrixfviz_dist(dist_matrix, gradient =list(low ="#00AFBB", high ="#FC4E07"))
Summary: PCA Results for Darlingtonia
What We Learned
Dimension Reduction Success
This makes data easier to visualize, interpret, and analyze.
Biological Insights - Three major axes of morphological variation: - PC1 (45.8%): Overall size
Larger vs. smaller plants
HD site has smallest plants - PC2 (16.7%): Shape contrast
Tall/narrow vs. short/wide with large wings
TJH site has taller, narrower plants - PC3 (14.8%): Tube girth
Fat vs. skinny tubes
No significant site differences
Statistical Findings
Sites differ significantly in size (PC1)
Sites differ significantly in shape (PC2)
Sites do not differ in tube girth (PC3)
PCA Advantages
Key Take-Home Messages
PCA good for exploratory analysis of multivariate data
Always standardize when different units/scales
Loadings reveal variables contributions to each component
Scores used in subsequent analyses (ANOVA, regression)
Biplots visualize both samples and variables simultaneously
When to Use PCA
PCA is Appropriate When:
PCA Assumptions:
Linearity: relationships between variables are linear
Adequate correlation: variables must be correlated
No severe outliers: can distort results
Sample size: need enough observations
Always Check:
Correlation matrix (are variables correlated?)
Scree plot (how many components needed?)
Loadings (biological interpretation makes sense?)
Outliers (check scores plots)
When to Consider Alternatives:
Different data types:
Categorical variables → MCA (Multiple
Correspondence Analysis) - Species composition data → CA (Correspondence Analysis)
Factor Analysis: - Similar to PCA but with different goals - Assumes latent “factors” cause observed variables - Uses rotation to improve interpretation - More common in social sciences
Robust PCA: - Less sensitive to outliers - Uses robust covariance estimators - Useful with messy ecological data
Sparse PCA: - Forces many loadings to exactly zero - Easier to interpret - Better for very high-dimensional data
Functional PCA: - For time series or spatial data - Treats curves as observations - Common in physiology
Probabilistic PCA: - Includes uncertainty estimation - Can handle missing data - Maximum likelihood framework
Kernel PCA: - Handles non-linear relationships - Uses kernel trick from machine learning - More flexible but harder to interpret
These are beyond our scope but worth knowing exist!
Practical Tips for PCA
Best Practices
Before PCA:
Explore your data
Check distributions
Identify outliers
Understand variable scales
Check correlations
Make correlation plot
Ensure variables are correlated
Consider removing redundant variables
Standardize appropriately
Use scale = TRUE for different units
Consider centering only for same units
Check sample size
Need at least 3× variables
More is better (50+ ideal)
During PCA:
Examine scree plot
Look for elbow
Consider cumulative variance
Usually keep 2-5 components
Interpret loadings
Look for patterns
Consider biological meaning
Ignore very small loadings
Check scores plots
Look for outliers
Identify patterns/groups
Consider site/treatment effects
After PCA:
Report clearly
State why you used PCA
Report variance explained
Describe component interpretation
Use results appropriately
PC scores can go into other analyses
Don’t over-interpret small components
Remember: exploratory tool!
Reporting PCA Results
What to Include in Papers/Theses
Methods Section:
“We performed principal component analysis (PCA) on 10 morphological and biomass variables measured on 89 Darlingtonia californica plants from four sites. Variables were standardized (mean = 0, SD = 1) prior to analysis to account for different measurement scales. PCA was conducted using the prcomp() function in R. We retained principal components with eigenvalues > 1 and that cumulatively explained >70% of variance.”
Results Section:
“The first three principal components explained 77.3% of variance (PC1: 45.8%, PC2: 16.7%, PC3: 14.8%). PC1 represented overall plant size, with high positive loadings for most morphological and biomass variables…”
Essential Figures:
Scree plot - shows variance explained
Loadings plot - shows variable contributions
Scores plot or biplot - shows samples in PC space
Essential Tables:
Variance explained by each PC
Loadings for retained PCs (usually PC1-PC3)
ANOVA results if testing groups
Don’t Report:
All 10 PCs if only 3 are meaningful
Tiny loadings (< 0.3)
Every individual PC score
Raw eigenvalues (report % variance instead)
Common PCA Mistakes to Avoid
Mistakes Students Make:
❌ Not standardizing data with different units
→ ✓ Always use scale = TRUE for different units
❌ Forgetting to check correlations
→ ✓ Make correlation plot first
❌ Using too many components
→ ✓ Use scree plot, keep components above elbow
❌ Over-interpreting small loadings
→ ✓ Focus on loadings > |0.3|
❌ Ignoring signs on loadings
→ ✓ Sign can flip; look at patterns
❌ Treating PC scores as original data
→ ✓ Remember PCs are derived variables
❌ Using PCA on categorical data
→ ✓ Use appropriate method (CA, MCA)
❌ Not checking for outliers
→ ✓ Plot scores, check for extreme values
❌ Assuming PCs have biological meaning
→ ✓ Interpret carefully, components are mathematical
❌ Using PCA with too few samples
→ ✓ Need n > 3p (preferably n > 50)
❌ Not reporting variance explained
→ ✓ Always report % variance for each PC
❌ Forgetting that PCA is exploratory
→ ✓ Use for patterns, not hypothesis testing
Practice Exercise
Exercise 1: Subset Analysis
Try running PCA on just the morphological variables (exclude biomass):
# Select only morphological variablesdarl_morph <- darlingtonia %>%select(height, mouth_diam, tube_diam, keel_diam, wing1_length, wing2_length, wingsprea)# Run PCApca_morph <-prcomp(darl_morph, scale. =TRUE)# Examine resultssummary(pca_morph)# How do results differ from full PCA?
Exercise 2: Site-Specific PCA
Run PCA on just one site (e.g., DG):
# Filter to one sitedarl_dg <- darlingtonia %>%filter(site =="DG") %>%select(-site)# Run PCApca_dg <-prcomp(darl_dg, scale. =TRUE)# Compare to overall PCAsummary(pca_dg)# Do the same patterns emerge?
Exercise 3: Interpretation
What does PC1 represent in each analysis?
How much variance is explained?
Do you see the same patterns?
Some other ways
# Create dataset with row namesdarl_pca_data <- darlingtonia %>%select(height, mouth_diam, tube_diam, keel_diam, wing1_length, wing2_length, wingsprea, hoodmass_g, tubemass_g)# PCApca_facto <-PCA(darl_pca_data, graph =FALSE, scale.unit =TRUE)# Create grouping factor with same lengthsite_groups <-factor(darlingtonia$site)# Plot with groupingfviz_pca_biplot(pca_facto, col.ind = site_groups, # Use col.ind insteadpalette ="jco",addEllipses =TRUE,ellipse.level =0.95,repel =TRUE)
# install.packages("RVAideMemoire")library(RVAideMemoire)# Pairwise PERMANOVApairwise.perm.manova(vegdist(darl_scaled, method ="euclidean"), darlingtonia$site,test ="Wilks",nperm =999,p.method ="bonferroni")## ## Pairwise comparisons using permutation MANOVAs on a distance matrix ## ## data: vegdist(darl_scaled, method = "euclidean") by darlingtonia$site## 999 permutations ## ## DG HD LEH ## HD 0.006 - - ## LEH 0.114 0.006 - ## TJH 0.012 0.006 0.018## ## P value adjustment method: bonferroni
results come from different statistical tests being used. Let me break down what each is doing: 1. pairwise.adonis2() - PERMANOVA on subsets What it does: For each pair, it subsets the data to just those 2 sites and runs a separate PERMANOVA Results: All p-values ≤ 0.016 (5 of 6 are p = 0.001) Interpretation: Tests if centroids (mean positions) differ between site pairs 2. pairwise.perm.manova() - Different test entirely What it does: Uses a MANOVA-based approach with permutations on the full distance matrix Results: More conservative - DG vs LEH now p = 0.078 (not significant) Interpretation: Also tests centroid differences but uses a different statistical framework
Why the differences matter: Location (Centroid) Tests:
pairwise.adonis2() and pairwise.perm.manova() both test if site means differ They give slightly different p-values due to different algorithms, but similar conclusions Use these to answer: “Are sites morphologically different?”