# Load required packages
library(knitr) # For creating tables
library(moments) # For calculating skewness and kurtosis
library(skimr) # for summary stats
library(flextable) # for tables if you want - now tinytable
library(tidyverse) # For data wrangling and visualization
# Set a consistent theme for our plots
theme_set(theme_minimal(base_size = 12))
03_Class_Activity
In class activity 3:
What did we do last time?
Implement data pipeline best practices
Apply controlled vocabulary and naming conventions
Create effective visualizations
Customize plots for publication quality
Combine multiple plots into composite figures
ggplot(name_df, aes(x_variable, y_variable, color = categorical_variable)) + # data frame, then aesthetics: x and y variables, plus a mapping of color, fill, or shape
  geom_point() # this is the geometry you want; you can add more layers with +, such as geom_line()
What questions do you have, and what is unclear?
What did not work so far when you started the homework?
Objectives and goals for today
Today’s Objectives
- Implement descriptive statistics in R
- Calculate measures of central tendency and spread
- Compare distributions of data from different groups
- Create effective visualizations of descriptive statistics
- Interpret the meaning of these statistics in a biological context
Part 1: Setting Up Your Environment
First, let’s load the necessary packages and import our data:
Getting the data
We’ll be working with data on arctic grayling fish from two different lakes (I3 and I8).
# Write your code here to read in the file
# How would you examine the data? Think of a few ways, then let's try them!
# Load the grayling data
g_df <- read_csv("data/gray_I3_I8.csv")
Rows: 168 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): lake, species
dbl (3): site, length_mm, mass_g
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows
head(g_df)
# A tibble: 6 × 5
site lake species length_mm mass_g
<dbl> <chr> <chr> <dbl> <dbl>
1 113 I3 arctic grayling 266 135
2 113 I3 arctic grayling 290 185
3 113 I3 arctic grayling 262 145
4 113 I3 arctic grayling 275 160
5 113 I3 arctic grayling 240 105
6 113 I3 arctic grayling 265 145
# Examine the data structure
glimpse(g_df)
Rows: 168
Columns: 5
$ site <dbl> 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, 113, …
$ lake <chr> "I3", "I3", "I3", "I3", "I3", "I3", "I3", "I3", "I3", "I3", …
$ species <chr> "arctic grayling", "arctic grayling", "arctic grayling", "ar…
$ length_mm <dbl> 266, 290, 262, 275, 240, 265, 265, 253, 246, 203, 289, 239, …
$ mass_g <dbl> 135, 185, 145, 160, 105, 145, 150, 130, 130, 71, 179, 108, 1…
# Get a statistical summary
summary(g_df)
site lake species length_mm
Min. :113 Length:168 Length:168 Min. :191.0
1st Qu.:113 Class :character Class :character 1st Qu.:270.8
Median :118 Mode :character Mode :character Median :324.5
Mean :116 Mean :324.5
3rd Qu.:118 3rd Qu.:377.0
Max. :118 Max. :440.0
mass_g
Min. : 53.0
1st Qu.:151.2
Median :340.0
Mean :351.2
3rd Qu.:519.5
Max. :889.0
NA's :2
# How many fish do we have from each lake?
g_df %>%
  count(lake)
# A tibble: 2 × 2
lake n
<chr> <int>
1 I3 66
2 I8 102
Questions to Consider:
- What variables are in our dataset?
- What are their data types?
- Are there any missing values?
- What is the range of fish lengths in our dataset?
- How many fish were sampled from each lake?
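Several of these questions can be answered with one-line checks. The following is a minimal sketch, assuming a small made-up data frame (`toy_df`, a hypothetical stand-in for `g_df`) so that it runs without the course data file; the same calls should work on `g_df`.

```r
# Hypothetical toy data standing in for g_df (values are made up)
toy_df <- data.frame(
  lake      = c("I3", "I3", "I8", "I8", "I8"),
  length_mm = c(266, 290, 350, 372, 401),
  mass_g    = c(135, 185, NA, 520, 610)
)

# Count missing values in every column
na_counts <- colSums(is.na(toy_df))
na_counts

# Smallest and largest length
length_range <- range(toy_df$length_mm)
length_range

# Number of fish per lake
fish_per_lake <- table(toy_df$lake)
fish_per_lake
```

In the real data, `summary(g_df)` already reports 2 missing values in `mass_g`; `colSums(is.na(g_df))` would show the same count for every column at once.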
Part 2: Calculating Descriptive Statistics
Let’s calculate various descriptive statistics for our data:
We will keep using the g_df data frame that is already loaded, starting with the mean and median of fish length.
# Write your code here
# How would you summarize the data? Think of a few ways, then let's try them!
# Calculate the mean and median fish length
mean(g_df$length_mm)
[1] 324.494
median(g_df$length_mm)
[1] 324.5
# Calculate mean and median by lake
g_df %>%
  group_by(lake) %>%
  summarise(
    mean_length = mean(length_mm),
    median_length = median(length_mm)
  )
# A tibble: 2 × 3
lake mean_length median_length
<chr> <dbl> <dbl>
1 I3 266. 266
2 I8 363. 373
Summarizing data - two ways
Let's say we want to summarize the data and need n, the mean, the standard deviation, and the standard error.
We could do the following. Note that if length_mm had missing values, the first call below would return NA rather than a number, which is why na.rm = TRUE matters.
mean(g_df$length_mm)
[1] 324.494
mean(g_df$length_mm, na.rm = TRUE) # removes missing values
[1] 324.494
length(g_df$length_mm)
[1] 168
length() counts both missing and non-missing values.
However, this would get old if we had to do it for every variable and then again for each group (here, lakes I3 and I8).
We need to learn to pipe
The pipe passes the result of one step to the next command, and so on:
- the data frame -> the pipe (%>%), which feeds the data frame into -> the next command
g_df %>% summarize(mean_length = mean(length_mm, na.rm = TRUE))
# A tibble: 1 × 1
mean_length
<dbl>
1 324.
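To make the pipe less mysterious, here is a small sketch showing that a piped call and an ordinary nested call give the same result. The data frame `toy_df` and its values are made up for illustration.

```r
library(dplyr)

toy_df <- tibble(length_mm = c(200, 250, 300))  # made-up values

# Ordinary call: the data frame is the first argument
nested <- summarize(toy_df, mean_length = mean(length_mm, na.rm = TRUE))

# Piped call: %>% passes toy_df in as that first argument
piped <- toy_df %>% summarize(mean_length = mean(length_mm, na.rm = TRUE))

# Compare the two results
all.equal(nested, piped)
```

The piped form pays off once several steps are chained, because the code then reads top to bottom in the order the steps happen.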
What is cool is we can do a lot of different things now
g_df %>%
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length = sd(length_mm, na.rm = TRUE),
    n_length = n())
# A tibble: 1 × 3
mean_length sd_length n_length
<dbl> <dbl> <int>
1 324. 65.0 168
Super cool code in case there are missing values: n() counts every row, while sum(!is.na(length_mm)) counts only the rows with a non-missing length
g_df %>%
  summarize(
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length = sd(length_mm, na.rm = TRUE),
    n_length = sum(!is.na(length_mm)))
# A tibble: 1 × 3
mean_length sd_length n_length
<dbl> <dbl> <int>
1 324. 65.0 168
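Because length_mm has no missing values, the two versions above give the same n. The sketch below uses a made-up data frame (`toy_df`, hypothetical values) with one missing length to show where they differ, and adds the standard error mentioned earlier, computed here as the standard deviation divided by the square root of the number of observed values.

```r
library(dplyr)

# Hypothetical toy data with one missing length (values are made up)
toy_df <- tibble(
  lake      = c("I3", "I3", "I3", "I8", "I8", "I8"),
  length_mm = c(260, 270, NA, 350, 360, 370)
)

mean(toy_df$length_mm)                # NA, because one value is missing
mean(toy_df$length_mm, na.rm = TRUE)  # mean of the five observed values

toy_summary <- toy_df %>%
  group_by(lake) %>%
  summarize(
    n_rows      = n(),                     # counts every row, including the NA
    n_length    = sum(!is.na(length_mm)),  # counts only observed lengths
    mean_length = mean(length_mm, na.rm = TRUE),
    sd_length   = sd(length_mm, na.rm = TRUE),
    se_length   = sd_length / sqrt(n_length)
  )
toy_summary
```

Using `n_length` rather than `n_rows` in the standard error keeps the denominator in step with the values that actually went into the standard deviation.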
Now for Spread…
# Write your code here to read in the file
# Calculate standard deviation and variance
mean_length <- mean(g_df$length_mm, na.rm = TRUE)
sd_length <- sd(g_df$length_mm)
var_length <- var(g_df$length_mm)
sd_length
[1] 65.00659
var_length
[1] 4225.856
# Calculate quartiles for overall data
quartiles <- quantile(g_df$length_mm, probs = c(0.25, 0.5, 0.75))
# cat("First quartile (Q1):", quartiles[1], "mm\n")
# cat("Second quartile (Median):", quartiles[2], "mm\n")
# cat("Third quartile (Q3):", quartiles[3], "mm\n")
# Calculate a more comprehensive set of percentiles
percentiles <- quantile(g_df$length_mm,
                        probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
# Display the percentiles in a data frame
data.frame(
Percentile = c("10th", "25th (Q1)", "50th (Median)", "75th (Q3)", "90th"),
Value = percentiles
)
Percentile Value
10% 10th 251.10
25% 25th (Q1) 270.75
50% 50th (Median) 324.50
75% 75th (Q3) 377.00
90% 90th 408.60
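The interquartile range (IQR) is the distance between the first and third quartiles. From the table above, that is 377.00 - 270.75 = 106.25 mm for all fish combined. The sketch below shows the same calculation two ways on a made-up vector (`toy_lengths` is hypothetical), assuming R's default quantile method.

```r
toy_lengths <- c(10, 20, 30, 40, 50)  # made-up values, not the grayling data

# By hand: Q3 minus Q1
toy_quartiles <- quantile(toy_lengths, probs = c(0.25, 0.75))
iqr_by_hand <- unname(toy_quartiles[2] - toy_quartiles[1])

# Built-in helper
iqr_builtin <- IQR(toy_lengths)

iqr_by_hand   # 20
iqr_builtin   # 20
```

On the real data, `IQR(g_df$length_mm)` should reproduce the 106.25 mm figure.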
Note: you could add a box plot by lake to see this if you wanted.
g_df %>%
  ggplot(aes(lake, length_mm, color = lake)) +
  geom_boxplot()
The coefficient of variation (CV) is the standard deviation expressed as a percentage of the mean:
\[CV = \frac{s}{\bar{Y}} \times 100\%\]
# Calculate coefficient of variation
sd_length / mean_length * 100
[1] 20.03321
# Calculate by lake
g_df %>%
  group_by(lake) %>%
  summarise(
    mean_length = mean(length_mm),
    sd_length = sd(length_mm),
    cv_length = sd_length / mean_length * 100
  ) %>%
  flextable()
lake | mean_length | sd_length | cv_length |
---|---|---|---|
I3 | 265.6061 | 28.30378 | 10.65630 |
I8 | 362.5980 | 52.33901 | 14.43444 |
Questions to Consider:
- How do the means and medians compare within each lake? What might this tell you about the distribution?
- Which lake has more variable fish lengths? How can you tell?
- Why might the coefficient of variation be useful when comparing variability between different measurements (e.g., length vs. mass)?
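One way to explore the last question: because the CV divides the standard deviation by the mean, the units cancel, so it can be compared across measurements on different scales. The sketch below uses made-up length and mass values and a hypothetical helper, `cv_percent()`, that is not part of any package.

```r
# Made-up values for illustration only
toy_length_mm <- c(250, 260, 270, 280, 290)
toy_mass_g    <- c(120, 140, 165, 190, 220)

# Hypothetical helper: CV as a percentage of the mean
cv_percent <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

cv_length <- cv_percent(toy_length_mm)
cv_mass   <- cv_percent(toy_mass_g)

# Changing units (mm to cm) rescales sd and mean equally, so the CV is unchanged
cv_length_cm <- cv_percent(toy_length_mm / 10)

c(cv_length = cv_length, cv_mass = cv_mass, cv_length_cm = cv_length_cm)
```

The raw standard deviations of length (in mm) and mass (in g) cannot be compared directly, but the two CV values can.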
Part 3: Visualizing Distributions
Visualizations can help us better understand the descriptive statistics we’ve calculated.
One of the best ways to look at data is a histogram - and we will do it again
# Create a histogram of all fish lengths
g_df %>% ggplot(aes(x = length_mm)) +
  geom_histogram(bins = 15)
# Create histograms by lake
g_df %>% ggplot(aes(x = length_mm, fill = lake)) +
  geom_histogram(bins = 15, position = "dodge", alpha = 0.7)
Personally I like box plots
# Create a box plot comparing fish lengths by lake
g_df %>% ggplot(aes(x = lake, y = length_mm, fill = lake)) +
  geom_boxplot()
Now these will be really important later on
# Create density plots
g_df %>% ggplot(aes(x = length_mm, fill = lake)) +
  geom_density(alpha = 0.5)
Questions to Consider:
- Which visualization best shows the differences in fish lengths between lakes?
- What can you learn from the violin plots that might not be apparent from the box plots?
- How would you interpret the cumulative frequency distribution?
- What patterns or insights can you identify from these visualizations?
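The questions above mention violin plots and the cumulative frequency distribution, which are not drawn in the code so far. Here is a minimal sketch of both, assuming simulated lengths (`toy_df` is hypothetical and only loosely mimics the two lakes); swapping in `g_df` should give the real plots.

```r
library(ggplot2)

# Simulated lengths for illustration only (not the grayling data)
set.seed(1)
toy_df <- data.frame(
  lake      = rep(c("I3", "I8"), each = 50),
  length_mm = c(rnorm(50, mean = 265, sd = 28), rnorm(50, mean = 360, sd = 52))
)

# Violin plot: compares groups like a box plot but also shows each distribution's shape
violin_plot <- ggplot(toy_df, aes(x = lake, y = length_mm, fill = lake)) +
  geom_violin(alpha = 0.7)

# Cumulative frequency: proportion of fish at or below each length
ecdf_plot <- ggplot(toy_df, aes(x = length_mm, color = lake)) +
  stat_ecdf() +
  labs(y = "Cumulative proportion")

violin_plot
ecdf_plot
```

On the cumulative plot, a curve that sits further to the right belongs to the group with generally longer fish, and the length where a curve crosses 0.5 is that group's median.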
Part 4: Interpreting the Results
Based on our analysis, we can make the following observations:
Lake Differences: Fish from Lake I8 are generally larger than those from Lake I3, both in length and mass.
Variability: Lake I8 shows greater variability in fish lengths and masses than Lake I3, as indicated by higher standard deviations and IQRs.
Distribution Shape:
- Lake I3 fish lengths are more symmetrically distributed.
- Lake I8 fish lengths show a slight negative skew, suggesting a few smaller fish pulling the distribution to the left.
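Skewness has not been calculated in the code above, so here is a hedged sketch of how it could be checked with `skewness()` from the moments package loaded at the top. The two vectors are made up to show what symmetric and left-skewed samples look like numerically.

```r
library(moments)

# Made-up values for illustration only
symmetric_lengths   <- c(240, 250, 260, 270, 280)
left_skewed_lengths <- c(200, 340, 350, 360, 370)  # one small fish stretches the left tail

skew_symmetric <- skewness(symmetric_lengths)
skew_left      <- skewness(left_skewed_lengths)

skew_symmetric  # 0 for perfectly symmetric data
skew_left       # negative: the longer tail is on the left

# In a left-skewed sample the mean sits below the median
mean(left_skewed_lengths) < median(left_skewed_lengths)
```

For the grayling data, the same function could be used inside `group_by(lake) %>% summarise(...)` to get one skewness value per lake. The by-lake table earlier shows the I8 mean (about 363) below its median (373), which is consistent with a negative skew.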
Length-Mass Relationship: Both lakes show a strong positive correlation between fish length and mass, following an approximately cubic relationship (mass increases with the cube of length).
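The length-mass relationship is not computed in the code above. One common way to check a power-law claim is a regression on the log scale: if mass = a * length^b, then log(mass) = log(a) + b * log(length), and the slope estimates b. The sketch below uses simulated data that follow a cubic relationship exactly, with a made-up constant, so the slope comes out as 3; real fish will only approximate this.

```r
# Simulated data that follow mass = a * length^3 exactly (a is a made-up constant)
toy_length_mm <- c(200, 250, 300, 350, 400)
toy_mass_g    <- 1e-5 * toy_length_mm^3

# On the log scale a power law becomes a straight line,
# so the slope of this fit estimates the exponent b
fit <- lm(log(toy_mass_g) ~ log(toy_length_mm))
exponent <- unname(coef(fit)[2])
exponent

# Correlation between length and mass on the original scale
length_mass_cor <- cor(toy_length_mm, toy_mass_g)
length_mass_cor
```

With `g_df`, the two rows with missing `mass_g` are dropped by `lm()` by default, and `cor()` would need `use = "complete.obs"` to avoid returning NA.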
Guided Questions for Deeper Understanding of descriptive statistics
Biological Interpretation: What ecological factors might explain the differences in fish size between the two lakes?
Statistical Reasoning: Why might we prefer to use the median and IQR instead of the mean and standard deviation in some cases?
Data Visualization: Which visualization method was most effective for comparing the two lakes? Why?
Scientific Communication: How would you concisely summarize these findings in a scientific paper?
Further Analysis: What additional analyses might be useful to better understand this dataset?
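As a starting point for the Statistical Reasoning question, the sketch below compares the mean and standard deviation with the median and IQR when a single extreme value is added. Both vectors are made up for illustration.

```r
toy_lengths  <- c(250, 260, 270, 280, 290)  # made-up values
with_outlier <- c(250, 260, 270, 280, 900)  # same sample with one extreme value

# Center: the mean shifts, the median does not
c(mean = mean(toy_lengths),  median = median(toy_lengths))
c(mean = mean(with_outlier), median = median(with_outlier))

# Spread: the standard deviation inflates, the IQR does not
c(sd = sd(toy_lengths),  IQR = IQR(toy_lengths))
c(sd = sd(with_outlier), IQR = IQR(with_outlier))
```

One extreme fish moves the mean from 270 to 392 while the median stays at 270, which is why the median and IQR are often preferred for skewed data or data with outliers.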