# install packages -----
# install.packages("readxl")
# install.packages("tidyverse")
# we will install a few new libraries
# install.packages("skimr")
# install.packages("janitor")
# install.packages("patchwork")
02 Class Activity
In-class Activity 2: Data Visualization
Recap from Activity 1
- Collected pine needle samples from windward and leeward sides of trees
- Identified independent variable (wind exposure) and dependent variable (needle length)
- Measured needle lengths and recorded data
- Created basic visualizations
- Saved our data for further analysis
Today’s Objectives
- Implement data pipeline best practices
- Apply controlled vocabulary and naming conventions
- Create effective tables and visualizations
- Customize plots for publication quality
- Combine multiple plots into composite figures
Part 1: Setting Up Your Environment
- What data do we have?
- What is the controlled vocabulary?
- Are there units?
- What is the directory structure?
- Do we have a metadata file?
- Is the data entered in a tidy format?
- What are we missing?
Now let's create a new Quarto file
- Note: I usually use this sort of system in an r_projects directory
- I have reorganized it for the class to hold all of the term's data
- You should try making some of your own projects
In RStudio:
- Click File - Open Project and select the 2025_UMD_BioStats_Student_Code.Rproj file, or double-click on it in Finder or File Explorer.
- Your screen will now change, as RStudio knows where home is.
Note that in the upper right you will see 2025_UMD_BioStats_Student_Code, so you know you are in the right spot.
Now click File - New File - Quarto Document
Create a file that starts with 02_ and then something that will help you know what is going on, like 02_class_activity_in_class.qmd
Now this file thinks this is home.
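To confirm where "home" is before reading any data, a quick check like the one below can help. It is a minimal sketch that uses only base R; the path is the one used later in this activity, and the result depends on where your project and file actually live.

```r
# Print the folder R currently treats as "home" (the working directory)
getwd()

# Check whether the data file can be found relative to that folder
# (returns TRUE or FALSE depending on your own directory structure)
data_found <- file.exists("data/pine_needles.csv")
data_found
```

If the second call returns FALSE, the relative path in read_csv() will not resolve either, so it is worth fixing the directory structure first.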
I usually copy the header from another file, as it is just too hard to remember all this…
---
title: "Title of your file" # Title of the file
author: "Your Name" # who you are
metadata-files:
  - ../../_templates/lectures.yml
metadata-files:
  - _templates/activities.yml
format:
  html:
    freeze: false
    toc: false
    output-file: "02_02_class_activity.html"
    default: true
    embed-resources: true
    self-contained: true
    css: css/activity.css
    default: true
    toc: false
    toc-depth: 3
    number-sections: false
    highlight-style: github
    reference-doc: ms_templates/custom-reference.docx
    css: msword.css
    embed-resources: true
---
Exercise 1: Now to load the libraries
In each script you run from now on, you will load the libraries you need at the top of the script.
# Load the libraries ----
library(readxl) # reads Excel files
library(tidyverse) # data wrangling and plotting (dplyr, ggplot2, readr, ...)
library(skimr) # provides summary statistics
library(janitor) # cleans column names and data ; )
library(patchwork) # combines multiple plots into one figure
Exercise 2: Loading and Examining Data
As we did before with x and y, we will now load data from a spreadsheet stored as a CSV or Excel file.
We are going to work with the same data as in the last class.
# load file -----
# this file is in the data subdirectory
# below, put the cursor between the "" and press Tab
# this lets you select the directory
# press Tab again and select the file
p_df <- read_csv("data/pine_needles.csv") # reads in csv file
Rows: 48 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): date, group, n_s, wind
dbl (2): tree_no, length_mm
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# "<-" stores the data frame read from the quoted CSV file path in p_df
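The message printed by read_csv() suggests specifying the column types. The sketch below shows one way to do that. It writes three rows copied from this activity's output to a temporary file so it runs on its own; the object names tmp_csv and demo_df are illustrative, and reading tree_no as an integer is an assumption about how you want to store it.

```r
library(readr)

# Write a few rows (copied from the head() output in this activity) to a temporary CSV
tmp_csv <- tempfile(fileext = ".csv")
writeLines(
  c(
    "date,group,n_s,wind,tree_no,length_mm",
    "3/20/25,cephalopods,n,lee,1,20",
    "3/20/25,cephalopods,n,lee,1,21",
    "3/20/25,cephalopods,n,lee,1,23"
  ),
  tmp_csv
)

# Declare the type of every column instead of letting readr guess
demo_df <- read_csv(
  tmp_csv,
  col_types = cols(
    date = col_character(),
    group = col_character(),
    n_s = col_character(),
    wind = col_character(),
    tree_no = col_integer(),   # assumption: tree numbers are whole numbers
    length_mm = col_double()
  )
)
demo_df
```

Declaring types up front also silences the column specification message, and it makes the script fail loudly if a column ever contains something unexpected.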
Exercise 3: Examining Data
# Load the pine needle data
p_df <- read_csv("data/pine_needles.csv", na = "NA")
Rows: 48 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): date, group, n_s, wind
dbl (2): tree_no, length_mm
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Examine the data structure
glimpse(p_df)
Rows: 48
Columns: 6
$ date <chr> "3/20/25", "3/20/25", "3/20/25", "3/20/25", "3/20/25", "3/20…
$ group <chr> "cephalopods", "cephalopods", "cephalopods", "cephalopods", …
$ n_s <chr> "n", "n", "n", "n", "n", "n", "s", "s", "s", "s", "s", "s", …
$ wind <chr> "lee", "lee", "lee", "lee", "lee", "lee", "wind", "wind", "w…
$ tree_no <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ length_mm <dbl> 20, 21, 23, 25, 21, 16, 15, 16, 14, 17, 13, 15, 19, 18, 20, …
# View the first few rows
head(p_df)
# A tibble: 6 × 6
  date    group       n_s   wind  tree_no length_mm
  <chr>   <chr>       <chr> <chr>   <dbl>     <dbl>
1 3/20/25 cephalopods n     lee         1        20
2 3/20/25 cephalopods n     lee         1        21
3 3/20/25 cephalopods n     lee         1        23
4 3/20/25 cephalopods n     lee         1        25
5 3/20/25 cephalopods n     lee         1        21
6 3/20/25 cephalopods n     lee         1        16
# Get a statistical summary
summary(p_df)
     date              group               n_s                wind
 Length:48          Length:48          Length:48          Length:48
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
    tree_no       length_mm
 Min.   :1.00   Min.   :12.00
 1st Qu.:1.75   1st Qu.:15.00
 Median :2.50   Median :17.50
 Mean   :2.50   Mean   :17.67
 3rd Qu.:3.25   3rd Qu.:20.25
 Max.   :4.00   Max.   :25.00
Questions to Consider:
- What variables are in our dataset?
- What are their data types?
- Are there any missing values?
- Do the variable names follow consistent conventions?
- How might we improve the data organization?
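Several of these questions can be answered with a few lines of code. The sketch below is one possible approach, not the only one. It uses a small hypothetical data frame (demo_df, with one missing value added on purpose) that has the same columns as the class data; swap in p_df to check the real file.

```r
library(tidyverse)
library(janitor)

# Hypothetical mini dataset with the same columns as pine_needles.csv
demo_df <- tribble(
  ~date,     ~group,        ~n_s, ~wind,  ~tree_no, ~length_mm,
  "3/20/25", "cephalopods", "n",  "lee",  1,        20,
  "3/20/25", "cephalopods", "n",  "lee",  1,        21,
  "3/20/25", "cephalopods", "s",  "wind", 1,        15,
  "3/20/25", "cephalopods", "s",  "wind", 1,        NA
)

clean_df <- demo_df %>%
  clean_names() %>%                    # enforce snake_case column names
  mutate(date = lubridate::mdy(date))  # store dates as Date, not text

# Count missing values in every column
clean_df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# List the values used in a categorical column to check the controlled vocabulary
clean_df %>%
  count(wind)
```

An unexpected entry in the count(wind) output, such as "Wind" or "leeward", would signal that the controlled vocabulary was not followed during data entry.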
Part 2: Basic Data Visualization
Let’s create some simple visualizations to explore our data:
Exercise 2: Creating a Histogram
# Create a basic histogram
p_df %>%
  ggplot(aes(x = length_mm)) +
  geom_histogram(bins = 15)
# Create a histogram with color grouping
p_df %>%
  ggplot(aes(x = length_mm, fill = wind)) +
  geom_histogram(binwidth = 2, alpha = 0.7, position = "dodge")
Key Insights from Histograms:
The histogram helps us understand:
- The overall distribution of needle lengths
- Potential differences between windward and leeward needles
- Presence of any unusual values or outliers
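A grouped summary table can back up what the histogram suggests about differences between the two sides. The following is a minimal sketch using hypothetical values in demo_df; replace demo_df with p_df to summarize the class data.

```r
library(tidyverse)

# Hypothetical needle lengths (mm), four per side
demo_df <- tibble(
  wind = rep(c("lee", "wind"), each = 4),
  length_mm = c(20, 21, 23, 25, 15, 16, 14, 17)
)

# One row per side with sample size, mean, and standard deviation
wind_summary <- demo_df %>%
  group_by(wind) %>%
  summarise(
    n = n(),
    mean_mm = mean(length_mm),
    sd_mm = sd(length_mm),
    .groups = "drop"
  )
wind_summary
```

Reading the table next to the histogram makes it easier to judge whether an apparent shift between groups is large relative to the spread within each group.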
Exercise 3: Creating Multiple Plot Types
Let’s explore different ways to visualize the same data:
# Box plot
box_plot <- p_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  geom_boxplot()
# Violin plot
violin_plot <- p_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  geom_violin()
# Dot plot
dot_plot <- p_df %>%
  ggplot(aes(x = wind, y = length_mm, color = wind)) +
  geom_jitter(width = 0.2, alpha = 0.7)
# Display all plots using patchwork
box_plot + violin_plot + dot_plot
Questions to Consider:
- Which plot type best reveals patterns in our data?
- What are the advantages and disadvantages of each plot type?
- How might we combine elements from different plot types?
Part 3: Building Complex Visualizations Layer by Layer
Now let’s build more sophisticated visualizations by adding layers one at a time:
Exercise 4: Building a Layered Plot
# Start with a basic plot
p1 <- p_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind))
# Add boxplot layer
p2 <- p1 +
  geom_boxplot(alpha = 0.7)
# Add individual data points
p3 <- p2 +
  geom_jitter(width = 0.2, alpha = 0.5, color = "gray30")
# Add mean indicators
p4 <- p3 +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 5, fill = "white")
# Create a 2x2 grid of the progressive plot building
(p1 | p2) / (p3 | p4)
Discussion Points:
- How does each layer contribute to the story our data is telling?
- Why might we want to show individual data points alongside summary statistics?
- How does transparency (alpha) help when overlaying multiple elements?
Part 4: Customizing Plots for Publication
Exercise 5: Adding customization
# Create a plot with customized colors
color_plot <- p_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  # Add violin plots for distribution
  geom_violin(alpha = 0.4) +
  # Add boxplots for key statistics
  geom_boxplot(width = 0.2, alpha = 0.7, outlier.shape = NA) +
  # Add individual data points
  geom_jitter(width = 0.1, alpha = 0.5, color = "gray30", size = 2) +
  # Customize colors with a colorblind-friendly palette
  scale_fill_manual(
    values = c(
      "wind" = "#1b9e77",
      "lee" = "#d95f02"
    ),
    labels = c(
      "wind" = "Windward",
      "lee" = "Leeward"
    )
  ) +
  # Apply a clean theme
  theme_classic()
# Display the color-customized plot
color_plot
Exercise 6: Creating a Publication-Ready Plot
Let’s create a publication-quality figure by customizing colors, labels, and themes:
# Create a fully customized plot
publication_plot <- p_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  # Add violin plots for distribution
  geom_violin(alpha = 0.4) +
  # Add boxplots for key statistics
  geom_boxplot(width = 0.2, alpha = 0.7, outlier.shape = NA) +
  # Add individual data points
  geom_jitter(width = 0.1, alpha = 0.5, color = "gray30", size = 2) +
  # Add mean points
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  # Add informative labels
  labs(
    x = "Tree Side",
    y = "Needle Length (mm)"
  ) +
  # Customize colors with a colorblind-friendly palette
  scale_fill_manual(
    values = c("wind" = "#1b9e77", "lee" = "#d95f02"),
    labels = c("wind" = "Windward", "lee" = "Leeward")
  ) +
  # Apply a clean theme
  theme_classic()
# Display the publication-ready plot
publication_plot
Customization Elements:
- Plot Elements:
- Violin plots to show distribution
- Boxplots to show quartiles and median
- Individual points for transparency
- Mean indicators for central tendency
- Visual Design:
- Colorblind-friendly color palette
- Thoughtful use of transparency
- Clear, informative title and subtitle
- Professional typography and spacing
- Accessibility Considerations:
- Sufficient contrast
- Redundant encoding (position and color)
- Clear labels with units
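The plot in Exercise 6 sets axis labels only, so a title, subtitle, and saved output still have to be added separately. The sketch below shows one way to do that with labs() and ggsave(). The data values, the title wording, the file name, and the figure dimensions are all placeholders chosen for illustration.

```r
library(tidyverse)

# Hypothetical needle lengths (mm); replace demo_df with p_df for the class data
demo_df <- tibble(
  wind = rep(c("lee", "wind"), each = 4),
  length_mm = c(20, 21, 23, 25, 15, 16, 14, 17)
)

titled_plot <- demo_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Needle length by wind exposure",              # placeholder wording
    subtitle = "Illustrative values, not the class data",  # placeholder wording
    x = "Tree Side",
    y = "Needle Length (mm)",
    fill = "Exposure"
  ) +
  # Relabel the axis categories so readers see full words, not codes
  scale_x_discrete(labels = c("wind" = "Windward", "lee" = "Leeward")) +
  theme_classic(base_size = 14)

# Save at a fixed size and resolution; a temporary folder is used here
out_file <- file.path(tempdir(), "needle_length_plot.png")
ggsave(out_file, plot = titled_plot, width = 5, height = 4, dpi = 300)
```

Fixing width, height, and dpi in ggsave() keeps text sizes consistent between the figure you preview and the file you submit.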
Part 5: Creating Complex Multi-Panel Figures
Finally, let’s create a publication-ready multi-panel figure:
color_plot +
  publication_plot +
  plot_layout(ncol = 2) +
  plot_annotation(tag_levels = "A", tag_suffix = ")")
# we can add this to remove things
# why do this?
# + theme(
#     axis.text.y = element_blank(), # Removes y-axis labels
#     axis.title.y = element_blank() # Removes y-axis title
#   )
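One likely answer to "why do this?" is that side-by-side panels sharing the same y-axis do not need the axis printed twice. The sketch below applies the commented-out theme() call to the right-hand panel only. It uses hypothetical data and plot names (demo_df, left_plot, right_plot), and it forces both panels onto the same y-range, since dropping an axis is only safe when the scales match.

```r
library(tidyverse)
library(patchwork)

# Hypothetical needle lengths (mm); replace demo_df with p_df for the class data
demo_df <- tibble(
  wind = rep(c("lee", "wind"), each = 4),
  length_mm = c(20, 21, 23, 25, 15, 16, 14, 17)
)

left_plot <- demo_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  geom_boxplot()

# The right panel drops its y-axis text and title because the left panel shows them
right_plot <- demo_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  geom_violin() +
  theme(
    axis.text.y = element_blank(),  # removes y-axis labels
    axis.title.y = element_blank()  # removes y-axis title
  )

# `&` applies the shared y-range to every panel in the composite figure
combined <- left_plot + right_plot +
  plot_layout(ncol = 2) +
  plot_annotation(tag_levels = "A", tag_suffix = ")") &
  coord_cartesian(ylim = c(10, 30))
combined
```

If the two panels had different y-ranges, removing the second axis would mislead readers, so it is worth checking the limits before hiding anything.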
Summary and Key Takeaways
In this activity, we’ve learned how to:
- Load and examine data properly
- Create basic visualizations to explore patterns
- Build complex plots layer by layer using ggplot2’s grammar
- Customize plots for clear communication and visual appeal
- Add statistical information to support data interpretation
- Combine multiple plots into publication-ready figures
Best Practices for Data Visualization:
- Start simple, then add complexity as needed
- Focus on the story your data is telling
- Use appropriate plot types for your data structure
- Minimize chart junk and maximize data-ink ratio
- Create clear, informative labels
- Use color purposefully and with accessibility in mind
- Include both individual data points and summary statistics when possible
- Consider your audience when designing visualizations