Lecture 02

Author

Bill Perry

Review of Lecture 1

Covered

Inductive vs deductive reasoning
Formulating research questions
Accuracy vs precision
Data types and classifications
Setting up R projects
Installing and loading libraries
Reading files into R
Creating basic graphs

Lecture 2: Project Design & Data Visualization

The objectives:

Design a well-organized project
Implement good naming conventions
- Controlled vocabulary
- Including units in names
Create and use metadata effectively
Build tidy, well-structured spreadsheets
Understand data repositories
Create effective visualizations with ggplot2

Project Design: Step 1

Data: the raw material of science
Wide variety of formats, sizes, complexity
Data management and curation often under emphasized
Good data management: owe it to our funding agencies, colleagues, supervisors, and study systems

Lecture 2: Project Design: Step 1

Determine data types you’ll collect
Establish controlled vocabulary
- Example: do_mgl for dissolved oxygen in mg/L
- Example: drp_ugl for dissolved reactive phosphorus in μg/L
Plan your data flow from collection to analysis
Organize your project structure (folders, files)
Enter data promptly after collection
Save in multiple formats (Excel and CSV)
Ensure tidy data principles from the start

Note

See Hadley Wickham’s Tidy Data principles for best practices

Project Design: Step 2

Create a Metadata Sheet that includes:

Variable descriptions
Units of measurement
Collection methods
Instrument details
Dates and locations
Any other relevant contextual information

Practice Exercise 1: Pine Data Organization

Let’s examine our pine needle data:

- What naming conventions did you choose?
- How did you organize the data?
- How can you verify data formats (numeric vs categorical)?
- What’s your plan for organizing outputs and figures?

# Code to read and examine data
library(tidyverse)
library(patchwork)
library(flextable)

pine_df <- read_csv("data/pine_needles.csv")
pine_df

# A tibble: 48 × 6
   date    group       n_s   wind  tree_no length_mm
   <chr>   <chr>       <chr> <chr>   <dbl>     <dbl>
 1 3/20/25 cephalopods n     lee         1        20
 2 3/20/25 cephalopods n     lee         1        21
 3 3/20/25 cephalopods n     lee         1        23
 4 3/20/25 cephalopods n     lee         1        25
 5 3/20/25 cephalopods n     lee         1        21
 6 3/20/25 cephalopods n     lee         1        16
 7 3/20/25 cephalopods s     wind        1        15
 8 3/20/25 cephalopods s     wind        1        16
 9 3/20/25 cephalopods s     wind        1        14
10 3/20/25 cephalopods s     wind        1        17
# ℹ 38 more rows

Lecture 2: Data Management: Step 3

Storage and Backup Strategy:

Store raw data and metadata securely
- Save in both Excel and CSV formats
- Consider write-protecting raw data files
Implement the 3-2-1 backup rule:
- 3 total copies of data
- 2 different storage media
- 1 offsite location (cloud storage)
Establish a regular backup schedule

Lecture 2: Data Management: Step 4

Initial Data Inspection:

Examine data in the Environment tab
Run summary() and glimpse() functions
Create exploratory visualizations
Check for outliers, errors, and missing data

# Specify how to handle missing values during import
pine_df <- read_csv("data/pine_needles.csv", 
                    na = c("", "NA", "N/A", "missing", "null"))

# Get a quick summary
summary(pine_df)

Practice Exercise 2: Try plotting a histogram

Practice Exercise 2: Try plotting a histogram of your data

Create a histogram of pine needle lengths to check the distribution:

# Write your code here to make a plot
# How do you examine the data - what are the ways you think and lets try it!

Lecture 2: Data Management: Step 5

Data Cleaning:

Correct errors and inconsistencies
Replace missing values with proper NA codes
Document all changes made to raw data
Save a clean, master version (consider making read-only)
Keep notes on data cleaning procedures

Lecture 2: Data Management: Step 6

Analysis and Visualization Workflow:

Create exploratory visualizations
Summarize and transform data as needed
Document all analysis steps
Save outputs systematically

A useful way to organize script files is number them in the order they get run.

Lecture 2: Effective Data Visualization

Why make plots?

Get in a group and discuss

What is the purpose of a data visualization?
What elements are essential in an effective plot?
What characteristics define a “good” plot?
What common mistakes make plots ineffective?

Napoleon’s Disastrous Invasion of Russia Detailed in an 1869 Data Visualization: It’s Been Called “the Best Statistical Graphic Ever Drawn”

Lecture 2: Tables vs. Visualizations

How readable are tables?

We will get to what these number mean and how to make them in the next lecture.

Tables
- are they useful in a presentation?

wind	n	mean_mm	sd_mm	min_mms	max_mm
lee	24	20	2.45	16	25
wind	24	15	1.91	12	19

Lecture 2: Displaying data

how does a table compare to a plot?
Does this help?
What is this plot?
- if you don’t explain does the audience know?

Lecture 2: Principles of Effective Graphics

According to Tufte (2001), good scientific graphics:

Show the data without distortion
Maximize data-ink ratio (minimize non-data elements)
Make large datasets coherent and understandable
Encourage comparison between elements
Reveal multiple layers of information
Serve a clear purpose in telling your story
Integrate with statistical methods appropriately

# A tibble: 4 × 7
  group       mean_mm sd_mm     n se_mm conf_low conf_high
  <chr>         <dbl> <dbl> <int> <dbl>    <dbl>     <dbl>
1 cephalopods    18    3.86    12 1.11      15.5      20.5
2 crayfish       18    3.86    12 1.11      15.5      20.5
3 salmon         16.3  3.94    12 1.14      13.8      18.8
4 snail          18.3  2.27    12 0.655     16.9      19.8

Lecture 2: Creating Effective Graphics

According to Tufte (2001), good scientific graphics:

To implement these principles:
- Focus on the data, not decorative elements
- Ensure proportional representation of numbers
- Provide clear and informative labels
- Remove unnecessary elements (“chart junk”)
- Revise and refine visualizations iteratively

Lecture 2: Displaying data - Good Graphics

To make good graphics:

Above all, focus on data
Do not distort data
Graphical representation of numbers → directly proportional to numbers
Strive for clarity through labeling
Maximize data-ink ratio
- Remove non-data ink
- Reduce redundant data ink
Revise and redraw

Lecture 2: Displaying data - Poor Example

What do you think?

Does this -

Focus on data
Distort data
Is it directly proportional to numbers
Is labeling clear
Maximize data-ink ratio
- Remove non-data ink
- Reduce redundant data ink
Revise and redraw

Lecture 2: Displaying data - Better Example

What do you think?

Does this -

Focus on data
Distort data
Is it directly proportional to numbers
Is labeling clear
Maximize data-ink ratio
- Remove non-data ink
- Reduce redundant data ink
Revise and redraw

What is one of the most common plots you make all the time?

Lecture 2: Displaying data - Common Problems

Common Visualization Problems

Data distortion:
- Non-zero baselines on bar charts
- 3D effects that skew perspective
- Inappropriate scales
Excessive “chart junk”:
- Too many grid lines
- Unnecessary decorative elements
- Redundant information
Poor color choices:
- Too many colors
- Non-colorblind-friendly palettes
- Colors that don’t print well in grayscale
Misleading representations:
- Pie charts with too many categories
- Dual y-axes with different scales
- Truncated axes without clear indication

Practice Exercise 3: Basic Plots with Pine Data

Practice Exercise 3: Lets try some plots with pine data first

Lets try to make some basic plots

# Write your code here to make a dot plot or X y plot
# How do you examine the data - what are the ways you think and lets try it!
# what is missing - hwo do you tell the effect of wind?

Practice Exercise 4: Colors, Shapes, and Fills

Practice Exercise 4: OK we are closer but what about colors or shape or fills

Lets try to make some more basic plots

This is free time - we will free code this….

Below are some examples of code you will need for the future

# Write your code here to make a dot plot or X y plot
# How do you examine the data - what are the ways you think and lets try it!
# what is missing - hwo do you tell the effect of wind?

Lecture 2: Introduction to the Grammar of Graphics - ggPLOT

We will learn the anatomy of a GGplot is layers

ggplot2 uses a layered grammar of graphics approach:
1. Data: The dataset you’re visualizing
2. Aesthetics: Mapping variables to visual properties
3. Geometries: The visual elements representing data
4. Facets: Splitting visualization into subplots
5. Statistics: Statistical transformations of the data
6. Coordinates: The space in which data is plotted
7. Themes: Overall visual style of the plotWe have aesthetics

Lecture 2: Building a ggplot Visualization

Key Components:

Aesthetics (aes) map variables to visual properties:
- x and y positions
- color, fill, shape, size, alpha
- group, linetype
**Geometries (geom_*)** determine how data is displayed:
- geom_point(): Scatter plots
- geom_line(): Line graphs
- geom_boxplot(): Box-and-whisker plots
- geom_violin(): Violin plots
- geom_histogram(): Histograms
- geom_bar(): Bar charts
Position adjustments control how elements are arranged:
- position_dodge(): Side-by-side elements
- position_jitter(): Add random noise to points
- position_stack(): Stack elements on top of each other
Labels and annotations provide context:
- labs(): Title, subtitle, caption, axis labels
- annotate(): Add text, shapes, etc.

Lecture 2: Fine-tuning your visualizations

Colors, fills, and shapes:

scale_color_manual(
  values = c("wind" = "darkblue", "lee" = "darkred"),
  labels = c("wind" = "Windward", "lee" = "Leeward")
)

Coordinate systems:

coord_cartesian(ylim = c(10, 30))  # Zoom in without dropping data

Themes:

theme_minimal() +
theme(
  axis.title = element_text(size = 14),
  legend.position = "bottom"
)

Combining plots with patchwork:
```
plot1 + plot2 + plot_layout(ncol = 2)
```

Practice Exercise 5: Publication-Quality Plot

Practice Exercise 4: Creating a Publication-Quality Plot

Create a fully customized plot that would be suitable for publication:

# Create a publication-quality plot
pine_df %>%
  ggplot(aes(x = wind, y = length_mm, fill = wind)) +
  geom_violin(alpha = 0.4) +
  geom_boxplot(width = 0.2, alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.5, color = "gray30", size = 2) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  labs(
    title = "Pine Needle Length Varies with Wind Exposure",
    subtitle = "Needles on the leeward side tend to be longer",
    x = "Tree Side", 
    y = "Needle Length (mm)",
    caption = "Data collected Spring 2023") +
  scale_fill_manual(
    values = c("wind" = "#1b9e77", "lee" = "#d95f02"),
    labels = c("wind" = "Windward", "lee" = "Leeward")) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    axis.title = element_text(face = "bold"),
    legend.title = element_blank(),
    legend.position = "bottom")

Key Takeaways

Plan your data management from the beginning
- Consistent naming conventions
- Good organization
- Regular backups
Make your data tidy from the start
- One observation per row
- One variable per column
- One value per cell
Create effective visualizations by:
- Focusing on data, not decoration
- Using appropriate plot types
- Following good design principles
- Customizing for clear communication
Master the grammar of graphics to:
- Build plots layer by layer
- Communicate patterns clearly
- Tell compelling stories with data

Next Steps

Practice creating different types of plots
Learn to combine multiple plots effectively
Explore statistical transformations in ggplot2
Develop a consistent visualization style

Other Formats

Review of Lecture 1

Lecture 2: Project Design & Data Visualization

The objectives:

Project Design: Step 1

Lecture 2: Project Design: Step 1

Project Design: Step 2

Practice Exercise 1: Pine Data Organization

Lecture 2: Data Management: Step 3

Lecture 2: Data Management: Step 4

Practice Exercise 2: Try plotting a histogram

Lecture 2: Data Management: Step 5

Lecture 2: Data Management: Step 6

Lecture 2: Effective Data Visualization

Why make plots?

Get in a group and discuss

Lecture 2: Tables vs. Visualizations

Lecture 2: Displaying data

Lecture 2: Principles of Effective Graphics

Lecture 2: Creating Effective Graphics

Lecture 2: Displaying data - Good Graphics

Lecture 2: Displaying data - Poor Example

Lecture 2: Displaying data - Better Example

Lecture 2: Displaying data - Common Problems

Common Visualization Problems

Practice Exercise 3: Basic Plots with Pine Data

Practice Exercise 4: Colors, Shapes, and Fills

Lecture 2: Introduction to the Grammar of Graphics - ggPLOT

Lecture 2: Building a ggplot Visualization

Key Components:

Lecture 2: Fine-tuning your visualizations

Practice Exercise 5: Publication-Quality Plot

Key Takeaways

Next Steps