PCA (Principal Component Analysis) is a technique to:
- Reduce the number of variables in your dataset
- Find patterns in high-dimensional data
- Create new uncorrelated variables (principal components) from correlated original variables
- Visualize complex relationships in multivariate data
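To make the "uncorrelated variables" point concrete, here is a minimal sketch using base R's `prcomp()`. It assumes R's built-in `iris` dataset rather than the CSV file loaded later in this tutorial, and the object names (`num_vars`, `pca_sketch`, and so on) are illustrative only.

```r
# Sketch: principal components are uncorrelated with each other.
# Assumption: uses R's built-in `iris` data, not the tutorial's CSV file.
num_vars <- iris[, 1:4]

# The original measurements are correlated with one another
orig_cor <- cor(num_vars)
round(orig_cor, 2)

# Run PCA on centered, standardized variables
pca_sketch <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Correlations between component scores are zero (up to rounding error)
score_cor <- cor(pca_sketch$x)
round(score_cor, 2)

# Proportion of total variance captured by each component
var_explained <- pca_sketch$sdev^2 / sum(pca_sketch$sdev^2)
round(var_explained, 3)
```

The off-diagonal entries of `score_cor` should all round to zero, while `var_explained` shows how much of the total variance each new variable carries.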
When to Use PCA
Use PCA when you:
- Have multiple continuous variables that may be correlated
- Have too many variables to analyze or visualize easily
- Need to reduce dimensionality while retaining most of the information
- Want to explore patterns in multivariate data
Key Assumptions of PCA
Linear relationships between variables
No extreme outliers (can distort results)
Variables should be correlated (if not, PCA won’t reduce dimensions effectively)
Adequate sample size (generally n > 50)
Consider standardization when variables have different scales
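Two of the assumptions above (correlated variables and adequate sample size) can be checked numerically before running PCA. The following is a rough sketch, again assuming R's built-in `iris` data; the names `n_obs`, `cor_mat`, and `off_diag` are illustrative.

```r
# Sketch: quick numeric checks of two PCA assumptions.
# Assumption: uses R's built-in `iris` data; object names are illustrative.
num_vars <- iris[, 1:4]

# Sample size (the rule of thumb above suggests n > 50)
n_obs <- nrow(num_vars)
n_obs

# Pairwise correlations: if most are near 0, PCA will not reduce much
cor_mat <- cor(num_vars)
round(cor_mat, 2)

# Largest absolute correlation between two different variables
off_diag <- cor_mat[upper.tri(cor_mat)]
max(abs(off_diag))
```

A correlation matrix with several large off-diagonal values suggests that a few components can summarize the data well.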
Critical First Step
Always standardize your data when variables are measured on different scales. This prevents variables with larger values from dominating the analysis.
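To see why this matters, the sketch below changes the unit of one variable and compares PCA with and without standardization. It assumes R's built-in `iris` data, and the centimeter-to-millimeter conversion is an artificial change made purely for illustration.

```r
# Sketch: why unstandardized variables can dominate PCA.
# Assumption: uses R's built-in `iris` data; the unit change is artificial.
num_vars <- iris[, 1:4]

# Express sepal length in millimeters instead of centimeters
num_vars$Sepal.Length <- num_vars$Sepal.Length * 10

# PCA without and with standardization
pca_raw    <- prcomp(num_vars, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Absolute loadings on the first component
raw_loadings    <- abs(pca_raw$rotation[, 1])
scaled_loadings <- abs(pca_scaled$rotation[, 1])
round(raw_loadings, 2)
round(scaled_loadings, 2)
```

Without standardization, the first component is driven almost entirely by the variable with the largest numeric spread. With `scale. = TRUE`, the choice of unit no longer affects the result.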
Part 1: Iris Data Analysis
Data Overview
We’ll analyze the famous iris dataset with measurements from three species:
- Iris setosa
- Iris versicolor
- Iris virginica
Each flower has 4 measurements: sepal length, sepal width, petal length, and petal width.
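# Packages used throughout (skip if already loaded)
library(tidyverse)  # read/transform/plot helpers: dplyr, tidyr, ggplot2
library(janitor)    # clean_names()

# Load the iris dataset from CSV
iris_df <- read.csv("data/iris.csv") %>%
  clean_names() %>%
  mutate(ind = row_number()) %>%
  mutate(species_ind = paste(species, ind, sep = "_"))

# View data structure
head(iris_df)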
# Get numeric values only for PCA
iris_data_df <- iris_df %>%
  dplyr::select(sepal_length, sepal_width, petal_length, petal_width)

# Keep species info for later
iris_species_df <- iris_df %>%
  dplyr::select(species, ind, species_ind)
# Create long format for visualization
iris_long_df <- iris_df %>%
  pivot_longer(
    cols = c(sepal_length, sepal_width, petal_length, petal_width),
    names_to = "variable",
    values_to = "values"
  )

# Overview plot
overview_plot <- iris_long_df %>%
  ggplot(aes(species, values, color = species)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  labs(
    title = "Iris Measurements by Species",
    x = "Species",
    y = "Measurement Value"
  ) +
  theme_minimal()

overview_plot
# Create boxplots to check for outliers
outlier_plot <- iris_data_df %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(
    title = "Check for Outliers in Iris Variables",
    x = "Variable",
    y = "Value"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

outlier_plot
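The boxplot check can be complemented with a numeric listing of the flagged values. This is a sketch using base R's `boxplot.stats()`, assuming R's built-in `iris` data in place of `iris_data_df`; `outlier_values` and `n_flagged` are illustrative names. Because `boxplot.stats()` works from hinges, the flagged points may differ slightly from those drawn by `geom_boxplot()`.

```r
# Sketch: list the values a base R boxplot would draw as outlier points.
# Assumption: uses R's built-in `iris` data in place of `iris_data_df`.
num_vars <- iris[, 1:4]

# boxplot.stats()$out returns values more than 1.5 * IQR beyond the hinges
outlier_values <- lapply(num_vars, function(x) boxplot.stats(x)$out)
outlier_values

# Number of flagged values per variable
n_flagged <- sapply(outlier_values, length)
n_flagged
```

A handful of mild outliers is usually worth inspecting rather than removing automatically. The concern raised earlier is with extreme values that distort the components.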
Step 2: Standardize the Data
Our variables span different ranges (e.g., petal width runs from 0.1 to 2.5, while sepal length runs from about 4 to 8), so we standardize them before running PCA.
# Standardize the data (mean = 0, sd = 1)
iris_scaled <- scale(iris_data_df)

# Convert back to data frame
iris_scaled_df <- as.data.frame(iris_scaled)

# Check standardization worked
colMeans(iris_scaled_df)
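Checking the column means covers only half of the standardization. The sketch below also checks the standard deviations and compares `scale()` against a manual z-score. It assumes R's built-in `iris` data in place of `iris_data_df`, and the object names are illustrative.

```r
# Sketch: confirm that scale() gives mean 0 and sd 1 in every column.
# Assumption: uses R's built-in `iris` data in place of `iris_data_df`.
num_vars <- iris[, 1:4]
scaled <- scale(num_vars)

# Column means should be zero up to floating-point error
col_means <- colMeans(scaled)
round(col_means, 10)

# Column standard deviations should all be 1
col_sds <- apply(scaled, 2, sd)
col_sds

# scale() matches a manual z-score: (x - mean) / sd
pw <- num_vars$Petal.Width
manual_z <- (pw - mean(pw)) / sd(pw)
z_matches <- all.equal(as.numeric(scaled[, "Petal.Width"]), manual_z)
z_matches
```

The means typically print as tiny values such as `1e-16` rather than exact zeros, which is expected floating-point behavior.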