A common goal of MV analysis is variable reduction: can we derive new variables (based on linear combinations of the "original" variables) that explain the variation in the data?
For a data set with i = 1 to n objects and j = 1 to p original variables, we seek new variables (principal components) using the equation:
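In the usual PCA notation (a standard form, with y_ij the value of original variable j for object i and c_kj the eigenvector coefficients described below), the score of object i on the k-th derived variable is

$z_{ik} = c_{k1} y_{i1} + c_{k2} y_{i2} + \dots + c_{kp} y_{ip}, \quad k = 1, \dots, p$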
First derived variable explains most of the variation in the data
Second most of the remaining variation
And so on…
As many derived variables as original variables (p)
Derived variables are uncorrelated with each other
Review: Eigenvalues and Eigenvectors
Eigenvectors, eigenvalues and components
Eigenvalues (latent roots) represent the amount of variation in the data explained by the new k = 1 to p derived variables (λ1, λ2, …, λp).
Eigenvalues are population parameters and are estimated using maximum likelihood (ML) to get sample statistics (l1, l2, …, lp)
Eigenvectors are lists of coefficients (c) that show contribution of original variables to new, derived variables
Each new variable has an eigenvalue and an eigenvector
New variables (components) are derived from a p x p covariance or correlation matrix of original variables
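A minimal sketch of this idea in R (assuming the four numeric iris measurements as the data; eigen() on the correlation matrix returns exactly the eigenvalues and eigenvectors described above):

# p x p correlation matrix of the original variables
R <- cor(iris[, 1:4])

# Eigen-decomposition: $values are the eigenvalues (l1, l2, ..., lp),
# $vectors are the eigenvectors (the coefficients c for each component)
eig <- eigen(R)
eig$values
eig$vectors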
Lecture 17: PCA Goals and Introduction
Common goals of MV data analysis are variable reduction (finding derived variables that summarize data) and exploration of patterns in data (scaling/ordination)
Can use association (correlation/covariance) matrices (PCA) or dissimilarity measures (MDS)
In PCA: take p old variables and transform them into p “new/derived” uncorrelated variables (principal components)
Data for PCA Analysis
library(tidyverse)   # %>% pipes, mutate(), select(), pivot_longer()
library(janitor)     # clean_names()

# Load the iris dataset
iris_df <- iris %>%
  clean_names() %>%
  mutate(ind = row_number()) %>%
  mutate(species_ind = paste(species, ind, sep = "_"))

# Get the measurement values only
iris_data <- iris_df %>%
  select(-species, -ind, -species_ind)

# Keep species for later visualization
iris_species <- iris_df %>%
  select(species, ind, species_ind)

# Pivot to long format for viewing
iris_long_df <- iris_df %>%
  pivot_longer(cols = -c(species, ind, species_ind),
               names_to = "variable",
               values_to = "values")

iris_df
Without standardization, PCA would be dominated by variables with larger numbers (like petal length) simply because they have bigger values, not because they’re more important biologically
What does standardization do?
Converts each variable to have:
Mean = 0 (centered at zero)
Standard deviation = 1 (same spread)
This gives all variables equal weight in the analysis
How to interpret standardized values: Example: A sepal length of 5.1 cm might become -0.9 after standardization, meaning it’s 0.9 standard deviations below the average sepal length
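A minimal sketch of this standardization step (the Step 3 code itself is not shown in this section; this assumes base R's scale() applied to the iris_data object created earlier):

# Center each variable at mean 0 and rescale to standard deviation 1
iris_scaled <- scale(iris_data)

# Check: column means are ~0 and standard deviations are 1
round(colMeans(iris_scaled), 10)
apply(iris_scaled, 2, sd)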
Principal Component Analysis finds new variables (called components) that capture the most variation in your data. Think of it as finding the “best viewing angles” to see differences between flowers.
The mathematics (simplified):
PCA rotates your data to find the direction with maximum spread (PC1)
Then finds the next direction with maximum spread perpendicular to PC1 (PC2)
Continues until it has as many components as original variables (4 in our case)
Why center = FALSE and scale = FALSE?
We already standardized our data in Step 3, so we tell R not to do it again:
center = FALSE: Don't subtract the mean (we already did)
scale. = FALSE: Don't divide by the standard deviation (we already did)
What the summary shows:
Standard deviation: How much variation each component captures
Proportion of Variance: Percentage of total variation explained by each component
Cumulative Proportion: Running total of variance explained
# Perform PCA on standardized data
iris_pca <- prcomp(iris_scaled, center = FALSE, scale. = FALSE)
# Note: center and scale. are FALSE because we already standardized

# Alternative using the vegan package
library(vegan)
iris_pca_vegan <- rda(iris_scaled)

# Summary of PCA results
summary(iris_pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
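As a sanity check (a sketch, not part of the original code): pre-standardizing and then turning centering/scaling off should match letting prcomp() standardize the raw measurements internally.

# Equivalent call: prcomp() centers and scales the raw data itself
iris_pca_auto <- prcomp(iris_data, center = TRUE, scale. = TRUE)

# Component standard deviations agree with the standardized-data PCA above
all.equal(iris_pca$sdev, iris_pca_auto$sdev)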
Deriving Components: 2D Visualization
How are the new uncorrelated components derived? One way to think about it is in terms of axis rotation. Consider a 2-variable dataset:
Component Derivation: Axis Rotation
The goal is to "rotate the axes" around the center of the data "cloud" so that most of the variation lies along the first axis. Then find a second axis that explains the second-most variation AND is orthogonal to the first axis.
Component Derivation: Multivariate Extension
This is easy to picture in 2D (or even 3D), but harder in multivariate space. Practically, components are "extracted" from a covariance or correlation matrix among the original variables. We will extract as many principal components as original variables.
Component Information: Eigenvalues and Eigenvectors
Get two important pieces of information from PCA: eigenvectors and eigenvalues
Eigenvalues (latent roots): how much of the variation is explained by each component?
Eigenvectors: lists of coefficients for the original variables. There are p coefficients in an eigenvector and p eigenvectors.
Correlation between original variables will result in fewer components explaining more variance; variable reduction will fail if the original variables are not correlated
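A quick way to check this for the iris data (a sketch using the iris_data object from earlier): strong pairwise correlations suggest that a few components will capture most of the variance.

# Pairwise correlations among the original variables
round(cor(iris_data), 2)
# e.g., petal length and petal width are very strongly correlated,
# so we expect substantial variable reduction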
Step 4: Understanding Eigenvalues and Variance
What are eigenvalues?
Eigenvalues tell us how much variation each principal component captures.
Larger eigenvalues = more important components.
Key terms explained:
Eigenvalue: The amount of variance captured by each component (always positive)
Proportion of Variance: What percentage of total variation this component explains
Cumulative Variance: Running total - helps us decide how many components we need
How to read the results:
If PC1 has eigenvalue = 2.9, it captures 2.9 “units” of variance
If Prop_Variance = 0.728, PC1 explains 72.8% of all variation in the data
If Cumsum_Variance = 0.959 at PC2, the first 2 components together explain 95.9% of variation
Why this matters:
This table helps us decide how many components to keep.
If 2 components explain 95% of variance, we’ve successfully reduced 4 variables to 2
We lose only about 5% of the information by leaving out the other two components!
# Extract eigenvalues (variance explained by each component)
eigenvalues <- iris_pca$sdev^2
prop_variance <- eigenvalues / sum(eigenvalues)
cumsum_variance <- cumsum(prop_variance)

# Create a summary table
pca_summary <- data.frame(
  Component = paste0("PC", 1:length(eigenvalues)),
  Eigenvalue = eigenvalues,
  Prop_Variance = prop_variance,
  Cumsum_Variance = cumsum_variance
)

print("PCA Summary:")
[1] "PCA Summary:"
kable(pca_summary, digits =3)
Component   Eigenvalue   Prop_Variance   Cumsum_Variance
PC1              2.918           0.730             0.730
PC2              0.914           0.229             0.958
PC3              0.147           0.037             0.995
PC4              0.021           0.005             1.000
Step 5: Determine Number of Components - Scree Plot
What is a Scree Plot?
A scree plot shows how much variance each component explains, helping us decide how many components we need. The name comes from the geological term “scree” - loose rocks at the base of a cliff - because the plot often looks like a steep cliff followed by rubble.
How to read a Scree Plot:
Y-axis: Percentage of variance explained by each component
X-axis: Component number (PC1, PC2, etc.)
The pattern: Usually shows a steep drop followed by a leveling off
The “Elbow Method”:
Look for where the line “bends” or forms an elbow:
Components before the elbow = important (steep slope)
Components after the elbow = less important (gentle slope)
Keep components up to and including the elbow
What to look for in our plot:
If PC1 explains 70% and PC2 explains 20%, but PC3 only explains 5%, the elbow is at PC2
This suggests keeping the first 2 components
The dramatic drop from PC1 to PC2, then gentle decline after, confirms our dimension reduction worked well
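The scree plot itself is not reproduced here; a minimal ggplot2 sketch of one, using the prop_variance vector computed in Step 4 (a reconstruction, not the original plotting code):

# Scree plot: percent variance explained by each component
tibble(Component = factor(paste0("PC", 1:4), levels = paste0("PC", 1:4)),
       Variance  = prop_variance * 100) %>%
  ggplot(aes(x = Component, y = Variance, group = 1)) +
  geom_line() +
  geom_point(size = 3) +
  labs(title = "Scree Plot",
       x = "Principal Component",
       y = "Percentage of Variance Explained") +
  theme_minimal()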
# Components explaining at least 80% of variance
components_80_percent <- which(cumsum_variance >= 0.80)[1]
print(paste("Components needed for 80% variance:", components_80_percent))
[1] "Components needed for 80% variance: 2"
Step 6: Interpret the Components - Loadings
What are Component Loadings?
Loadings tell us how much each original variable contributes to each principal component. Think of them as “recipes” that show how to mix your original measurements to create the new components.
How to read the loadings table
Values range from -1 to +1 (like correlations)
Large positive values (e.g., 0.8): This variable contributes strongly in the positive direction
Large negative values (e.g., -0.8): This variable contributes strongly in the negative direction
Values near 0: This variable doesn’t contribute much to this component
Interpreting the patterns:
If all loadings have similar signs: Component represents overall size (all measurements increase/decrease together)
If loadings have mixed signs: Component represents shape or proportions (some measurements increase while others decrease)
Dominant variables: Variables with the largest absolute loadings drive that component’s meaning
Example interpretation:
If PC1 has all negative loadings around -0.5, it means:
Flowers with high PC1 scores have small values for ALL measurements
This component captures “overall flower size”
The negative sign just indicates direction (could flip signs and interpretation)
# Component loadings (how much each original variable contributes)
loadings_df <- data.frame(
  Variable = rownames(iris_pca$rotation),
  PC1 = iris_pca$rotation[, 1],
  PC2 = iris_pca$rotation[, 2],
  PC3 = iris_pca$rotation[, 3],
  PC4 = iris_pca$rotation[, 4]
)

print("Component Loadings:")
Key properties of the eigenvectors:
Unit length: Each eigenvector has length 1 (sum of squares = 1)
Orthogonal: Eigenvectors are perpendicular to each other (dot product = 0)
Ordered by importance: First eigenvector (PC1) explains most variance
The complete picture:
Eigenvectors = The directions (loadings)
Eigenvalues = The importance of each direction (variance explained)
Together they fully describe the PCA transformation
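A quick numerical check of these three properties (a sketch using the iris_pca object from above):

# Columns of the rotation matrix are the eigenvectors (loadings)
V <- iris_pca$rotation

# Unit length: sum of squared loadings is 1 for every component
colSums(V^2)

# Orthogonal: t(V) %*% V is (numerically) the identity matrix
round(t(V) %*% V, 10)

# Ordered by importance: component variances (eigenvalues) decrease
iris_pca$sdev^2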
Step 6b: Visualization of Component Loadings
What does this plot show?
This is a visual representation of the loadings table, showing how each original variable contributes to PC1 and PC2. It’s like a map of how your original measurements relate to the new principal components.
How to read the plot:
Arrows represent your original variables (sepal_length, sepal_width, etc.)
Arrow direction shows which PC the variable contributes to
Arrow length indicates the strength of contribution (longer = stronger)
Arrow color shows the overall contribution magnitude (red = highest, blue = lowest)
The circle represents the maximum possible contribution
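A plot with these features (arrows, contribution-based coloring, and the reference circle) is typically made with factoextra's fviz_pca_var(); the call below is an assumption about how it could be produced, not the original code:

library(factoextra)

# Loading plot for the first two components, arrows colored by contribution
fviz_pca_var(iris_pca,
             col.var = "contrib",                         # color arrows by contribution
             gradient.cols = c("blue", "purple", "red"),  # low -> high contribution
             repel = TRUE)                                # avoid overlapping labels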
Key interpretations from this plot:
PC1 (horizontal axis, 73% variance):
All arrows point roughly left (negative direction)
All variables contribute almost equally to PC1
This confirms PC1 represents “overall flower size”
PC2 (vertical axis, 22.9% variance):
Sepal_width points down (negative)
Other variables point slightly up (positive)
This creates a contrast: sepal width vs. everything else
PC2 captures “flower shape” - wide sepals vs. long petals
Loading Plot Interpretation
What the arrow positions tell us:
Variables pointing in same direction = positively correlated
PC1 Interpretation: Overall flower size (with a twist)
Note: The output says "all variables have similar negative loadings," but the actual values show mostly positive loadings. This is likely due to a sign flip; PCA signs are arbitrary (a quick demonstration appears after this list). Let's interpret based on the actual values shown:
Three variables (sepal length, petal length, petal width) have similar positive loadings (~0.52-0.58)
Sepal width has a negative loading (-0.269)
This means PC1 captures flowers where length and width measurements (except sepal width) vary together
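A minimal demonstration that the sign of a component is arbitrary (a sketch, not part of the original analysis): flipping the sign of both a component's loadings and its scores leaves its contribution to the data unchanged.

# PC1's contribution to the (standardized) data, as a rank-1 matrix
original_part <- iris_pca$x[, 1] %o% iris_pca$rotation[, 1]

# Flip the sign of both the scores and the loadings
flipped_part <- (-iris_pca$x[, 1]) %o% (-iris_pca$rotation[, 1])

# Identical either way, so the sign only affects how we describe the component
all.equal(original_part, flipped_part)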
cat("\n- Negative loadings: petal length and width, sepal length")
cat("\n- Higher PC2 = wider sepals relative to petal size")
cat("\n- Lower PC2 = longer/wider petals relative to sepal width")

- Negative loadings: petal length and width, sepal length
- Higher PC2 = wider sepals relative to petal size
- Lower PC2 = longer/wider petals relative to sepal width
PC2 Interpretation: Flower Shape Contrast
PC2 Interpretation: Correcting the output
Note: The output says “Positive loadings: sepal width” but the actual value is -0.923 (negative). All loadings are actually negative, with sepal width being the most strongly negative.
What PC2 actually represents:
All variables have negative loadings, but sepal width is dominant (-0.923)
Petal measurements contribute very little (-0.024 and -0.067)
This component is primarily driven by sepal width, with some contribution from sepal length
What PC2 scores mean:
Higher PC2 values = Smaller measurements overall, especially narrow sepals
Lower PC2 values = Larger measurements overall, especially wide sepals
Since sepal width has the strongest loading, PC2 primarily captures sepal width variation
Biological interpretation:
PC2 helps distinguish:
Flowers with narrow sepals and smaller overall size (positive PC2)
Flowers with wide sepals and larger overall size (negative PC2)
This dimension helps separate species that have similar PC1 scores but different sepal proportions
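A sketch of how these scores might be plotted by species (assuming the iris_pca and iris_species objects created earlier; this figure is not part of the original output):

# Attach species labels to the component scores
scores_df <- as.data.frame(iris_pca$x) %>%
  bind_cols(iris_species)

ggplot(scores_df, aes(x = PC1, y = PC2, color = species)) +
  geom_point(alpha = 0.7) +
  labs(title = "Iris flowers in PC1-PC2 space",
       x = "PC1 (overall size, 73% of variance)",
       y = "PC2 (mostly sepal width, 22.9% of variance)") +
  theme_minimal()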
Step 9: How Well Does PCA Work?
# Calculate total variance explained by first 2 components
variance_explained_2pc <- sum(prop_variance[1:2])
caption_pca <- paste("Variance explained by first 2 components:",
                     round(variance_explained_2pc * 100, 1), "%")
# This means we reduced 4 variables to 2 components while retaining most information!

# Create a summary plot showing dimension reduction success
tibble(Component = factor(paste0("PC", 1:4), levels = paste0("PC", 1:4)),
       Variance = prop_variance * 100,
       Cumulative = cumsum_variance * 100) %>%
  ggplot(aes(x = Component)) +
  geom_col(aes(y = Variance), fill = "lightblue", alpha = 0.7) +
  geom_line(aes(y = Cumulative, group = 1), color = "red", size = 1) +
  geom_point(aes(y = Cumulative), color = "red", size = 3) +
  labs(title = "PCA Dimension Reduction Success",
       subtitle = "Blue bars = individual variance, Red line = cumulative variance",
       caption = caption_pca,
       x = "Principal Component", y = "Percentage of Variance Explained") +
  theme_minimal()