# Load required packages
library(moments) # For calculating skewness and kurtosis
library(skimr) # for summary stats
library(tidyverse) # For data wrangling and visualization
03_Class_Activity
In class activity 3:
What did we do last time?
Implement data pipeline best practices
Apply controlled vocabulary and naming conventions
Create effective visualizations
Customize plots for publication quality
Combine multiple plots into composite figures
ggplot(name_df, aes(x_variable, y_variable, color = categorical_variable)) + # data frame, then aesthetics (x and y variables; mapping of color, fill, or shape)
  geom_point() # this is the geometry you want; you can add more layers, such as geom_line()
What questions do you have, and what is unclear?
What did not work so far when you started the homework?
Objectives and goals for today
Today’s Objectives
- Implement descriptive statistics in R
- Calculate measures of central tendency and spread
- Compare distributions of data from different groups
- Create effective visualizations of descriptive statistics
- Interpret the meaning of these statistics in a biological context
Part 1: Setting Up Your Environment
First, let’s load the necessary packages and import our data:
Getting the data
We’ll be working with data on arctic grayling fish from two different lakes (I3 and I8).
# Write your code here to read in the file
# How would you examine the data? Think of a few ways, then let's try them!
# Load the grayling data
g_df <- read_csv("data/gray_I3_I8.csv")
Rows: 168 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): lake, species
dbl (3): site, length_mm, mass_g
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows
head(g_df)
# A tibble: 6 × 5
site lake species length_mm mass_g
<dbl> <chr> <chr> <dbl> <dbl>
1 113 I3 arctic grayling 266 135
2 113 I3 arctic grayling 290 185
3 113 I3 arctic grayling 262 145
4 113 I3 arctic grayling 275 160
5 113 I3 arctic grayling 240 105
6 113 I3 arctic grayling 265 145
Questions to Consider:
- What variables are in our dataset?
- What are their data types?
- How many fish were sampled from each lake?
- Are there any missing values?
- What is the distribution of data?
Base R way of getting some summary stats
# Summarize every column (note: this does not yet tell us how many fish came from each lake)
summary(g_df)
site lake species length_mm
Min. :113 Length:168 Length:168 Min. :191.0
1st Qu.:113 Class :character Class :character 1st Qu.:270.8
Median :118 Mode :character Mode :character Median :324.5
Mean :116 Mean :324.5
3rd Qu.:118 3rd Qu.:377.0
Max. :118 Max. :440.0
mass_g
Min. : 53.0
1st Qu.:151.2
Median :340.0
Mean :351.2
3rd Qu.:519.5
Max. :889.0
NA's :2
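The `summary()` output does not show how many fish came from each lake, and it reports missing values only for numeric columns. A minimal sketch of two follow-up checks is below. It uses a small made-up stand-in data frame (`demo_df` is hypothetical and its values are not the class data) so that it runs on its own; in class you would use `g_df` instead.

```r
library(dplyr)

# Hypothetical stand-in with the same column names as g_df (values are made up)
demo_df <- data.frame(
  lake      = c("I3", "I3", "I3", "I8", "I8"),
  length_mm = c(266, 290, 262, 373, 401),
  mass_g    = c(135, 185, 145, 490, NA)
)

# Count rows (fish) per lake
fish_per_lake <- demo_df %>% count(lake)
fish_per_lake

# Count missing values in each column
missing_per_column <- colSums(is.na(demo_df))
missing_per_column
```

`count(lake)` is shorthand for `group_by(lake) %>% summarize(n = n())`.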
The skimr way of seeing summary stats
g_df %>%
  group_by(lake) %>%
  skim()
Name | Piped data |
Number of rows | 168 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 3 |
________________________ | |
Group variables | lake |
Variable type: character
skim_variable | lake | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|---|
species | I3 | 0 | 1 | 15 | 15 | 0 | 1 | 0 |
species | I8 | 0 | 1 | 15 | 15 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | lake | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
site | I3 | 0 | 1.00 | 113.00 | 0.00 | 113 | 113.00 | 113 | 113.0 | 113 | ▁▁▇▁▁ |
site | I8 | 0 | 1.00 | 118.00 | 0.00 | 118 | 118.00 | 118 | 118.0 | 118 | ▁▁▇▁▁ |
length_mm | I3 | 0 | 1.00 | 265.61 | 28.30 | 191 | 256.00 | 266 | 280.0 | 320 | ▂▁▇▇▂ |
length_mm | I8 | 0 | 1.00 | 362.60 | 52.34 | 199 | 340.00 | 373 | 401.0 | 440 | ▁▂▃▇▆ |
mass_g | I3 | 0 | 1.00 | 150.50 | 42.22 | 53 | 130.75 | 147 | 177.5 | 260 | ▂▅▇▃▁ |
mass_g | I8 | 2 | 0.98 | 483.71 | 176.48 | 68 | 369.00 | 490 | 615.5 | 889 | ▂▃▇▆▂ |
Part 2: Visualizing Distributions
Visualizations can help us better understand the descriptive statistics we’ve calculated.
One of the best ways to look at a distribution is a histogram, so we will make one again.
# Create a histogram of all fish lengths
g_df %>% ggplot(aes(x = length_mm)) +
  geom_histogram(binwidth = 15)
# Create histograms by lake
g_df %>% ggplot(aes(x = length_mm, fill = lake)) +
  geom_histogram(binwidth = 15, position = "dodge", alpha = 0.7)
Personally I like box plots
# Create a box plot comparing fish lengths by lake
g_df %>% ggplot(aes(x = lake, y = length_mm, fill = lake)) +
  geom_boxplot()
Density plots will be really important later on
# Create density plots
g_df %>% ggplot(aes(x = length_mm, fill = lake)) +
  geom_density(alpha = 0.5)
Part 3: Calculating Descriptive Statistics
Let’s calculate various descriptive statistics for our data:
# Calculate the mean and median fish length
mean(g_df$length_mm)
[1] 324.494
median(g_df$length_mm)
[1] 324.5
# Calculate mean and median by lake
g_df %>%
  group_by(lake) %>%
  summarise(
    mean_length = mean(length_mm),
    median_length = median(length_mm)
  )
# A tibble: 2 × 3
lake mean_length median_length
<chr> <dbl> <dbl>
1 I3 266. 266
2 I8 363. 373
Summarizing data - two ways
Let's say we want to summarize the data and need n, the mean, the standard deviation, and the standard error.
We could do the following. If the column had missing values, the first call below would return NA rather than a number, which is why the second call adds na.rm = TRUE.
mean(g_df$length_mm)
[1] 324.494
mean(g_df$length_mm, na.rm = TRUE) # removes missing values
[1] 324.494
length(g_df$length_mm)
[1] 168
Note that length() counts both missing and non-missing values.
However, this would get old if we had to do it for every variable and then again for each grouping.
We need to learn to pipe
The pipe (%>%) passes the result on its left into the command on its right, and so on down the chain:
- the dataframe –> the pipe feeds the dataframe into –> the next command
g_df %>%
  summarize(mean_length = mean(length_mm, na.rm = TRUE))
# A tibble: 1 × 1
mean_length
<dbl>
1 324.
What is cool is we can do a lot of different things now
g_df %>%
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length = sd(length_mm, na.rm = TRUE),
    n_length = n())
# A tibble: 1 × 3
mean_length sd_length n_length
<dbl> <dbl> <int>
1 324. 65.0 168
Super cool code in case there are missing values
g_df %>%
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length = sd(length_mm, na.rm = TRUE),
    n_length = sum(!is.na(length_mm)))
# A tibble: 1 × 3
mean_length sd_length n_length
<dbl> <dbl> <int>
1 324. 65.0 168
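The summaries above cover n, the mean, and the standard deviation, but not the standard error or the per-lake grouping mentioned earlier. One possible way to get all four by lake is sketched below. It assumes the usual standard error of the mean, sd / sqrt(n), and uses a made-up stand-in data frame (`demo_df` is hypothetical; swap in `g_df` for the real analysis).

```r
library(dplyr)

# Hypothetical stand-in data (values are made up, including one missing length)
demo_df <- data.frame(
  lake      = c("I3", "I3", "I3", "I8", "I8", "I8"),
  length_mm = c(266, 290, 262, 373, 401, NA)
)

length_summary <- demo_df %>%
  group_by(lake) %>%
  summarize(
    n_length    = sum(!is.na(length_mm)),        # count only non-missing values
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length   = sd(length_mm, na.rm = TRUE),
    se_length   = sd_length / sqrt(n_length)     # standard error of the mean
  )
length_summary
```

Using `sum(!is.na(length_mm))` rather than `n()` keeps the standard error from being computed with a sample size that includes missing values.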
Now for Spread…
# Calculate standard deviation and variance
mean_length <- mean(g_df$length_mm, na.rm = TRUE)
sd_length <- sd(g_df$length_mm)
var_length <- var(g_df$length_mm)
mean_length
[1] 324.494
sd_length
[1] 65.00659
var_length
[1] 4225.856
# Calculate quartiles for overall data
quartiles <- quantile(g_df$length_mm, probs = c(0.25, 0.5, 0.75))
quartiles
25% 50% 75%
270.75 324.50 377.00
# Calculate a more comprehensive set of percentiles
percentiles <- quantile(g_df$length_mm,
                        probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
percentiles
10% 25% 50% 75% 90%
251.10 270.75 324.50 377.00 408.60
Note: a box plot by lake would show these quartiles visually, if you wanted to see them.
The coefficient of variation (CV) is the standard deviation expressed as a percentage of the mean:
\[CV = \frac{s}{\bar{Y}} \times 100\%\]
# Calculate coefficient of variation
sd_length / mean_length * 100
[1] 20.03321
# Calculate by lake
g_df %>%
  group_by(lake) %>%
  summarise(
    mean_length = mean(length_mm),
    sd_length = sd(length_mm),
    cv_length = sd_length / mean_length * 100
  )
# A tibble: 2 × 4
lake mean_length sd_length cv_length
<chr> <dbl> <dbl> <dbl>
1 I3 266. 28.3 10.7
2 I8 363. 52.3 14.4
Questions to Consider:
- How do the means and medians compare within each lake? What might this tell you about the distribution?
- Which lake has more variable fish lengths? How can you tell?
- Why might the coefficient of variation be useful when comparing variability between different measurements (e.g., length vs. mass)?
Questions to Consider:
- Which visualization best shows the differences in fish lengths between lakes?
- What can you learn from the violin plots that might not be apparent from the box plots?
- How would you interpret the cumulative frequency distribution?
- What patterns or insights can you identify from these visualizations?
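These questions mention violin plots and a cumulative frequency distribution, neither of which is drawn in the code shown here. One possible way to make them with ggplot2 is sketched below, using `geom_violin()` and `stat_ecdf()`. The data frame `demo_df` is a made-up stand-in (hypothetical values, not the class data); replace it with `g_df` to plot the real fish.

```r
library(ggplot2)

# Hypothetical stand-in data (values are made up)
demo_df <- data.frame(
  lake      = rep(c("I3", "I8"), each = 5),
  length_mm = c(240, 262, 265, 275, 290, 199, 340, 373, 401, 440)
)

# Violin plot: the width of each shape shows where lengths are concentrated
violin_plot <- ggplot(demo_df, aes(x = lake, y = length_mm, fill = lake)) +
  geom_violin(alpha = 0.5)

# Cumulative frequency (ECDF): proportion of fish at or below each length
ecdf_plot <- ggplot(demo_df, aes(x = length_mm, color = lake)) +
  stat_ecdf() +
  labs(y = "Cumulative proportion")

violin_plot
ecdf_plot
```

On the cumulative plot, a curve that sits further to the right belongs to the group with longer fish, because a given proportion of that group is reached only at a larger length.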
Part 4: Interpreting the Results
Based on our analysis, we can make the following observations:
Lake Differences: Fish from Lake I8 are generally larger than those from Lake I3, both in length and mass.
Variability: Lake I8 shows greater variability in fish lengths and masses than Lake I3, as indicated by higher standard deviations and IQRs.
Distribution Shape:
Lake I3 fish lengths are more symmetrically distributed.
Lake I8 fish lengths show a slight negative skew, suggesting a few smaller fish pulling the distribution to the left.
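The observations above refer to IQRs and skew, which are not calculated in the code shown, even though the moments package is loaded at the top. A minimal sketch of how they could be computed by lake follows, using base R's `IQR()` and the `skewness()` and `kurtosis()` functions from moments. The data frame `demo_df` is a made-up stand-in (hypothetical values, not the class data); use `g_df` to check the statements above against the real fish.

```r
library(dplyr)
library(moments)

# Hypothetical stand-in data (values are made up)
demo_df <- data.frame(
  lake      = rep(c("I3", "I8"), each = 5),
  length_mm = c(240, 262, 265, 275, 290, 199, 340, 373, 401, 440)
)

shape_summary <- demo_df %>%
  group_by(lake) %>%
  summarize(
    iqr_length  = IQR(length_mm, na.rm = TRUE),       # spread of the middle 50%
    skew_length = skewness(length_mm, na.rm = TRUE),  # negative = longer left tail
    kurt_length = kurtosis(length_mm, na.rm = TRUE)   # Pearson kurtosis (normal = 3)
  )
shape_summary
```

A skewness near zero suggests a roughly symmetric distribution, while a negative value means a few small fish are pulling the left tail out.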