03_Class_Activity

Author

Bill Perry

In class activity 3:

What did we do last time?

Implement data pipeline best practices
Apply controlled vocabulary and naming conventions
Create effective visualizations
Customize plots for publication quality
Combine multiple plots into composite figures

ggplot(name_df, aes(x_variable, y_variable, color = categorical_variable)) +
#      dataframe, aesthetics(x and y variables, mapping of color or fill or shape) + 
  geom_point() +
# this is the geometry you want; you can add more layers, like
  geom_line()

What questions do you have, and what is unclear?
What did not work so far when you started the homework?

Objectives and goals for today

Today’s Objectives

Implement descriptive statistics in R
Calculate measures of central tendency and spread
Compare distributions of data from different groups
Create effective visualizations of descriptive statistics
Interpret the meaning of these statistics in a biological context

Part 1: Setting Up Your Environment

First, let’s load the necessary packages and import our data:

# Load required packages
library(moments)      # For calculating skewness and kurtosis
library(skimr)        # for summary stats
library(tidyverse)    # For data wrangling and visualization

Getting the data

Practice Exercise 1: Loading and Examining the Grayling Data

We’ll be working with data on arctic grayling fish from two different lakes (I3 and I8).

# Write your code here to read in the file
# How would you examine the data? List the ways you can think of, then let's try them!

# Load the grayling data
g_df <- read_csv("data/gray_I3_I8.csv")

Rows: 168 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): lake, species
dbl (3): site, length_mm, mass_g

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View the first few rows
head(g_df)

# A tibble: 6 × 5
   site lake  species         length_mm mass_g
  <dbl> <chr> <chr>               <dbl>  <dbl>
1   113 I3    arctic grayling       266    135
2   113 I3    arctic grayling       290    185
3   113 I3    arctic grayling       262    145
4   113 I3    arctic grayling       275    160
5   113 I3    arctic grayling       240    105
6   113 I3    arctic grayling       265    145

Questions to Consider:

What variables are in our dataset?
What are their data types?
How many fish were sampled from each lake?
Are there any missing values?
What is the distribution of data?

Base R way of getting some summary stats

# Summarize every column (range, quartiles, mean, and NA count for numeric columns)
summary(g_df)

      site         lake             species            length_mm    
 Min.   :113   Length:168         Length:168         Min.   :191.0  
 1st Qu.:113   Class :character   Class :character   1st Qu.:270.8  
 Median :118   Mode  :character   Mode  :character   Median :324.5  
 Mean   :116                                         Mean   :324.5  
 3rd Qu.:118                                         3rd Qu.:377.0  
 Max.   :118                                         Max.   :440.0  
                                                                    
     mass_g     
 Min.   : 53.0  
 1st Qu.:151.2  
 Median :340.0  
 Mean   :351.2  
 3rd Qu.:519.5  
 Max.   :889.0  
 NA's   :2
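
summary() shows ranges and NA counts, but it does not report how many fish came from each lake. A minimal sketch of one way to answer both questions, assuming dplyr's count() and base R's colSums(). Here toy_df is a hypothetical stand-in for g_df: the I3 rows are taken from the head(g_df) output above, and the I8 rows are made-up values for illustration only.

```r
library(dplyr)

# Toy stand-in for g_df (I3 rows from head(g_df); I8 rows are hypothetical)
toy_df <- data.frame(
  lake      = c("I3", "I3", "I3", "I8", "I8"),
  length_mm = c(266, 290, 262, 373, 401),
  mass_g    = c(135, 185, 145, 490, NA)
)

# Number of fish (rows) in each lake
fish_per_lake <- toy_df %>% count(lake)
fish_per_lake

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_df))
missing_per_column
```

Swapping toy_df for g_df should give the per-lake counts and the missing-value counts for the real file.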

skimr way of seeing summary stats

g_df %>% 
  group_by(lake) %>% 
  skim()

Data summary
Name	Piped data
Number of rows	168
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	3
________________________
Group variables	lake

Variable type: character

skim_variable	lake	n_missing	complete_rate	min	max	empty	n_unique	whitespace
species	I3	0	1	15	15	0	1	0
species	I8	0	1	15	15	0	1	0

Variable type: numeric

skim_variable	lake	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
site	I3	0	1.00	113.00	0.00	113	113.00	113	113.0	113	▁▁▇▁▁
site	I8	0	1.00	118.00	0.00	118	118.00	118	118.0	118	▁▁▇▁▁
length_mm	I3	0	1.00	265.61	28.30	191	256.00	266	280.0	320	▂▁▇▇▂
length_mm	I8	0	1.00	362.60	52.34	199	340.00	373	401.0	440	▁▂▃▇▆
mass_g	I3	0	1.00	150.50	42.22	53	130.75	147	177.5	260	▂▅▇▃▁
mass_g	I8	2	0.98	483.71	176.48	68	369.00	490	615.5	889	▂▃▇▆▂

Part 2: Visualizing Distributions

Visualizations can help us better understand the descriptive statistics we’ve calculated.

Exercise 1: Creating Histograms

One of the best ways to look at data is a histogram - and we will do it again

# Create a histogram of all fish lengths
g_df %>% ggplot(aes(x = length_mm)) +
  geom_histogram(binwidth = 15)

# Create histograms by lake
g_df  %>% ggplot(aes(x = length_mm, fill = lake)) +
  geom_histogram(binwidth = 15, position = "dodge", alpha = 0.7)

Exercise 2: Creating Box Plots

Personally I like box plots

# Create a box plot comparing fish lengths by lake
g_df  %>%  ggplot( aes(x = lake, y = length_mm, fill = lake)) +
  geom_boxplot()

Exercise 3: Creating Density Plots

Now these will be really important later on

# Create density plots by lake
g_df  %>%  ggplot(aes(x = length_mm, fill = lake)) +
  geom_density(alpha = 0.5)

Part 3: Calculating Descriptive Statistics

Let’s calculate various descriptive statistics for our data:

Practice Exercise 4: Measures of Central Tendency

# Calculate the mean and median fish length
mean(g_df$length_mm)

[1] 324.494

median(g_df$length_mm)

[1] 324.5

# Calculate mean and median by lake
g_df %>%
  group_by(lake) %>%
  summarise(
    mean_length = mean(length_mm),
    median_length = median(length_mm)
  )

# A tibble: 2 × 3
  lake  mean_length median_length
  <chr>       <dbl>         <dbl>
1 I3           266.           266
2 I8           363.           373

Summarizing data - two ways

Let's say we want to summarize the data and need n, the mean, the standard deviation, and the standard error.

We could do the following. If length_mm had missing values, the first call below would return NA instead of a number:

mean(g_df$length_mm)

[1] 324.494

mean(g_df$length_mm, na.rm = TRUE) # removes missing values

[1] 324.494

length(g_df$length_mm)

[1] 168

length() counts missing and non-missing values alike.
However, this would get old if we had to do it for every variable and then again for each grouping.
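
The grayling lengths have no missing values, so the calls above cannot show what na.rm actually changes. A small sketch with a hypothetical vector x (not from the grayling file) that contains one NA:

```r
# Hypothetical lengths with one missing value
x <- c(266, 290, NA, 275)

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 277: mean of the three observed values
length(x)               # 4: counts the NA as well
sum(!is.na(x))          # 3: counts only the observed values
```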

We need to learn to pipe

The pipe (%>%) passes the result on its left to the function on its right as that function's first argument, and so on down the chain.

the dataframe -> pipe feeds the dataframe into -> the next command
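
To see what the pipe buys us, here is a sketch comparing the same summary written as nested calls and as a pipeline. toy_df is a hypothetical four-row stand-in for g_df; the I8 lengths are illustrative values, not real fish.

```r
library(dplyr)

# Hypothetical stand-in for g_df
toy_df <- data.frame(
  lake      = c("I3", "I3", "I8", "I8"),
  length_mm = c(266, 290, 373, 401)
)

# Nested form: has to be read from the inside out
nested <- summarize(group_by(toy_df, lake), mean_length = mean(length_mm))

# Piped form: reads top to bottom, one step per line
piped <- toy_df %>%
  group_by(lake) %>%
  summarize(mean_length = mean(length_mm))

# Both forms produce the same table
all.equal(nested, piped)
```

The piped version is usually easier to extend, since adding a step means adding a line rather than wrapping another function around the whole expression.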

g_df %>% 
  summarize(mean_length = mean(length_mm, na.rm = TRUE))

# A tibble: 1 × 1
  mean_length
        <dbl>
1        324.

What is cool is we can do a lot of different things now

g_df %>% 
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length = sd(length_mm, na.rm = TRUE),
    n_length = n())

# A tibble: 1 × 3
  mean_length sd_length n_length
        <dbl>     <dbl>    <int>
1        324.      65.0      168

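Super cool code in case there are missing values: n() counts every row, while sum(!is.na(length_mm)) counts only the rows where length_mm is not missing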

g_df %>% 
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length = sd(length_mm, na.rm = TRUE),
    n_length = sum(!is.na(length_mm)))

# A tibble: 1 × 3
  mean_length sd_length n_length
        <dbl>     <dbl>    <int>
1        324.      65.0      168
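
The standard error has not been calculated yet. A minimal sketch, assuming the usual definition SE = s / sqrt(n) with n taken as the number of non-missing values, computed per lake. toy_df is a hypothetical stand-in for g_df (I3 values from head(g_df); the I8 values and the NA are made up for illustration).

```r
library(dplyr)

# Hypothetical stand-in for g_df, with one missing length in I8
toy_df <- data.frame(
  lake      = c("I3", "I3", "I3", "I8", "I8", "I8"),
  length_mm = c(266, 290, 262, 373, NA, 401)
)

se_by_lake <- toy_df %>%
  group_by(lake) %>%
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length   = sd(length_mm, na.rm = TRUE),
    n_length    = sum(!is.na(length_mm)),      # non-missing count only
    se_length   = sd_length / sqrt(n_length)   # standard error of the mean
  )
se_by_lake
```

Using n() here instead of the non-missing count would make the standard error for I8 too small, because the missing fish would be counted in n but not in the standard deviation.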

Now for Spread…

Practice Exercise 5: Measures of Spread

# Write your code here
# Calculate standard deviation and variance
mean_length <- mean(g_df$length_mm, na.rm = TRUE)
sd_length <- sd(g_df$length_mm, na.rm = TRUE)
var_length <- var(g_df$length_mm, na.rm = TRUE)
mean_length

[1] 324.494

sd_length

[1] 65.00659

var_length

[1] 4225.856

Exercise 6: Calculate Quartiles and Percentiles

# Calculate quartiles for overall data
quartiles <- quantile(g_df$length_mm, probs = c(0.25, 0.5, 0.75))
quartiles

   25%    50%    75% 
270.75 324.50 377.00

# Calculate a more comprehensive set of percentiles
percentiles <- quantile(g_df$length_mm, 
                        probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
percentiles

   10%    25%    50%    75%    90% 
251.10 270.75 324.50 377.00 408.60

Note: you could add a box plot by lake to see these quartiles if you wanted.
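
The quartiles above pool both lakes. A sketch of the same calculation by lake, adding the interquartile range (IQR = Q3 - Q1) with base R's IQR(). toy_df is a hypothetical stand-in for g_df: the I3 lengths are the six rows shown by head(g_df), and the I8 lengths reuse the five quantile values (p0 to p100) that skim() reported for I8 as stand-in observations, not real fish.

```r
library(dplyr)

# Hypothetical stand-in for g_df
toy_df <- data.frame(
  lake      = c(rep("I3", 6), rep("I8", 5)),
  length_mm = c(266, 290, 262, 275, 240, 265,   # I3: rows from head(g_df)
                199, 340, 373, 401, 440)        # I8: stand-in values
)

iqr_by_lake <- toy_df %>%
  group_by(lake) %>%
  summarize(
    q25           = quantile(length_mm, 0.25, na.rm = TRUE),
    median_length = median(length_mm, na.rm = TRUE),
    q75           = quantile(length_mm, 0.75, na.rm = TRUE),
    iqr_length    = IQR(length_mm, na.rm = TRUE)   # q75 - q25
  )
iqr_by_lake
```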

Exercise 7: Calculate the Coefficient of Variation

The coefficient of variation (CV) is the standard deviation expressed as a percentage of the mean:

\[CV = \frac{s}{\bar{Y}} \times 100\%\]

# Calculate coefficient of variation
sd_length / mean_length * 100

[1] 20.03321

# Calculate by lake
g_df %>%
  group_by(lake) %>%
  summarise(
    mean_length = mean(length_mm),
    sd_length = sd(length_mm),
    cv_length = sd_length / mean_length * 100
  )

# A tibble: 2 × 4
  lake  mean_length sd_length cv_length
  <chr>       <dbl>     <dbl>     <dbl>
1 I3           266.      28.3      10.7
2 I8           363.      52.3      14.4

Questions to Consider:

How do the means and medians compare within each lake? What might this tell you about the distribution?
Which lake has more variable fish lengths? How can you tell?
Why might the coefficient of variation be useful when comparing variability between different measurements (e.g., length vs. mass)?
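
As a starting point for the last question, here is a sketch that wraps the CV formula in a small helper and applies it to two variables measured in different units. The helper name cv() is hypothetical (it is not a base R function), and the values are only the six I3 fish shown by head(g_df), so the numbers will differ from the full data set.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100
}

# The six I3 fish shown by head(g_df)
length_mm <- c(266, 290, 262, 275, 240, 265)
mass_g    <- c(135, 185, 145, 160, 105, 145)

# Standard deviations are in mm and g, so they cannot be compared directly
sd(length_mm)
sd(mass_g)

# CVs are unitless percentages, so they can be compared
cv(length_mm)
cv(mass_g)
```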

Questions to Consider:

Which visualization best shows the differences in fish lengths between lakes?
What can you learn from the violin plots that might not be apparent from the box plots?
How would you interpret the cumulative frequency distribution?
What patterns or insights can you identify from these visualizations?
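
The questions above mention violin plots and a cumulative frequency distribution, which the exercises so far have not drawn. A minimal sketch of both, assuming ggplot2's geom_violin() and stat_ecdf(). toy_df is a hypothetical stand-in for g_df (I3 lengths from head(g_df); I8 lengths are stand-in values taken from the skim() quantiles, not real fish).

```r
library(ggplot2)

# Hypothetical stand-in for g_df
toy_df <- data.frame(
  lake      = c(rep("I3", 6), rep("I8", 5)),
  length_mm = c(266, 290, 262, 275, 240, 265,
                199, 340, 373, 401, 440)
)

# Violin plot: shows the shape of each distribution, with a narrow box plot inside
violin_plot <- ggplot(toy_df, aes(x = lake, y = length_mm, fill = lake)) +
  geom_violin(alpha = 0.5) +
  geom_boxplot(width = 0.1)

# Empirical cumulative distribution: proportion of fish at or below each length
ecdf_plot <- ggplot(toy_df, aes(x = length_mm, color = lake)) +
  stat_ecdf(geom = "step") +
  labs(y = "Cumulative proportion of fish")

violin_plot
ecdf_plot
```

On the cumulative plot, a curve that sits further to the right indicates a lake whose fish are longer across the whole distribution.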

Part 4: Interpreting the Results

Based on our analysis, we can make the following observations:

Lake Differences: Fish from Lake I8 are generally larger than those from Lake I3, both in length and mass.
Variability: Lake I8 shows greater variability in fish lengths and masses than Lake I3, as indicated by higher standard deviations and IQRs.
Distribution Shape:
- Lake I3 fish lengths are more symmetrically distributed.
- Lake I8 fish lengths show a slight negative skew, suggesting a few smaller fish pulling the distribution to the left.
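
Skewness can be checked numerically as well as by eye. The sketch below computes the moment-based skewness by hand on two hypothetical vectors; skew() is an illustrative helper, not a package function. The skewness() function in the moments package loaded in Part 1 is, as far as I know, based on the same definition, so something like moments::skewness(g_df$length_mm) should be the packaged equivalent.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
skew <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / m2^(3 / 2)
}

symmetric   <- c(1, 2, 3, 4, 5)     # hypothetical, evenly spread
left_tailed <- c(1, 8, 9, 10, 10)   # hypothetical, one small value

skew(symmetric)     # 0: no skew
skew(left_tailed)   # negative: the small value pulls the tail to the left

# In the left-tailed vector the mean falls below the median
mean(left_tailed)
median(left_tailed)
```

This is the same pattern seen in Lake I8, where the mean length (about 363 mm) is below the median (373 mm).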
