These are variables - do you know what they mean?
We’ll use a dataset on grayling (gray_I3_I8.csv) from two different lakes to explore these concepts, just as you did in the homework.
Before we dive into descriptive statistics, let’s clarify some fundamental concepts:
Types of populations:
Sampling involves measuring a subset of the population and using it to draw conclusions about the whole population.
It’s important to distinguish between:
For example:
Let’s explore our grayling dataset and identify the types of variables it contains.
Biology is fundamentally different from fields like physics/chemistry in that:
Statistics helps us understand biological processes in this variable world by:
Practice Exercise 1: Can you open the fish data gray_I3_I8.csv, look at its structure, and make a histogram?
Practice Exercise 2: Can you do this for the pine data we have collected?
Let’s examine the different variables and determine what type each one is.
# Write your code here to read in the file
# How would you examine the data? What approaches can you think of? Let's try them!
# Load the tidyverse (readr, dplyr, ggplot2, ...) and the grayling data
library(tidyverse)
grayling_df <- read_csv("data/gray_I3_I8.csv")
# Take a look at the first few rows
head(grayling_df)
# A tibble: 6 × 5
site lake species length_mm mass_g
<dbl> <chr> <chr> <dbl> <dbl>
1 113 I3 arctic grayling 266 135
2 113 I3 arctic grayling 290 185
3 113 I3 arctic grayling 262 145
4 113 I3 arctic grayling 275 160
5 113 I3 arctic grayling 240 105
6 113 I3 arctic grayling 265 145
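To go a bit further with the practice exercise, here is a minimal sketch for examining the structure of the data and drawing a quick histogram (the binwidth and plotting choices are illustrative):

# Examine the structure of the data
glimpse(grayling_df)
summary(grayling_df)

# Quick histogram of fish length
ggplot(grayling_df, aes(x = length_mm)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Length (mm)", y = "Count") +
  theme_minimal()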
When taking biological measurements, understanding measurement quality is essential:
Accuracy is a function of both precision and bias.
For statisticians, BIAS is usually a more serious problem than low precision because:
Practice Exercise 1: What are potential sources of error in fish data?
For our grayling data, potential sources of measurement error might include:
The two most common measures of central tendency are the mean and the median.
The Arithmetic Mean

The arithmetic mean is the average of a set of measurements:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where \(x_i\) are the individual measurements and \(n\) is the number of measurements.
The Median

The median is the middle value of the ordered data: half the measurements fall below it and half above.
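As a quick illustration, here is a minimal sketch computing both measures for the grayling lengths, overall and by lake:

# Mean and median fish length, overall and by lake
mean(grayling_df$length_mm)
median(grayling_df$length_mm)

grayling_df %>%
  group_by(lake) %>%
  summarise(mean_length = mean(length_mm),
            median_length = median(length_mm))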
The spread of a distribution tells us how variable the measurements are.
The variance is the average squared deviation of the measurements from the mean:

\[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\]
The standard deviation is the square root of the variance: \(s = \sqrt{s^2}\).
# Calculate variance and standard deviation of fish length
var_length <- var(grayling_df$length_mm)
sd_length <- sd(grayling_df$length_mm)
# Calculate by lake
grayling_df %>%
group_by(lake) %>%
summarise(
var_length = var(length_mm),
sd_length = sd(length_mm) )
# A tibble: 2 × 3
lake var_length sd_length
<chr> <dbl> <dbl>
1 I3 801. 28.3
2 I8 2739. 52.3
For a bell-shaped (approximately normal) distribution, the area within 2 standard deviations on either side of the mean includes about 95% of the data, so only about 2.5% of the data lies beyond this range in each tail (5% in total).
I3 Lake Fish Length Summary:
Number of fish: 66
Mean length: 265.61 mm
Standard Deviation: 28.3 mm
Range for ±2 SD: 209 to 322.21 mm
Percentage within ±2 SD: 90.91 %
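A minimal sketch of how a summary like the one above could be produced (the exact wording and rounding of the original output are assumptions):

# ±2 SD coverage for Lake I3 fish lengths
i3_lengths <- grayling_df %>% filter(lake == "I3") %>% pull(length_mm)
m <- mean(i3_lengths)
s <- sd(i3_lengths)
lower <- m - 2 * s
upper <- m + 2 * s

cat("I3 Lake Fish Length Summary:\n",
    "Number of fish:", length(i3_lengths), "\n",
    "Mean length:", round(m, 2), "mm\n",
    "Standard Deviation:", round(s, 1), "mm\n",
    "Range for ±2 SD:", round(lower, 2), "to", round(upper, 2), "mm\n",
    "Percentage within ±2 SD:",
    round(100 * mean(i3_lengths >= lower & i3_lengths <= upper), 2), "%\n")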
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:

\[CV = \frac{s}{\bar{x}} \times 100\%\]
This is useful for comparing the variability of measurements with different units or vastly different scales.
Coefficient of variation: 10.7 %
# A tibble: 2 × 2
lake cv_length
<chr> <dbl>
1 I3 10.7
2 I8 14.4
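A minimal sketch of how the by-lake CVs could be computed, assuming the CV is calculated directly from sd() and mean():

# Coefficient of variation (%) of fish length by lake
grayling_df %>%
  group_by(lake) %>%
  summarise(cv_length = 100 * sd(length_mm) / mean(length_mm))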
The interquartile range (IQR) is the range of the middle 50% of the data:
\[IQR = Q_3 - Q_1\]
Where \(Q_1\) is the first quartile (25th percentile) and \(Q_3\) is the third quartile (75th percentile).
First quartile: 270.75 mm
Third quartile: 377 mm
Interquartile range: 106.25 mm
# A tibble: 2 × 4
lake q1 q3 iqr
<chr> <dbl> <dbl> <dbl>
1 I3 256 280 24
2 I8 340 401 61
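A sketch of the quartile calculations using R's default quantile() method (the original may have used a different quantile type):

# Quartiles and IQR for all fish, then by lake
quantile(grayling_df$length_mm, probs = c(0.25, 0.75))
IQR(grayling_df$length_mm)

grayling_df %>%
  group_by(lake) %>%
  summarise(q1 = quantile(length_mm, 0.25),
            q3 = quantile(length_mm, 0.75),
            iqr = q3 - q1)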
Percentiles are values that divide a dataset into 100 equal parts.
The standard deviation and interquartile range both measure spread, but:
Standard deviation: Sensitive to outliers
Interquartile range: Robust against outliers
When the data are approximately normal, IQR ≈ 1.35 × standard deviation.
# A tibble: 2 × 4
lake sd iqr ratio_iqr_sd
<chr> <dbl> <dbl> <dbl>
1 I3 28.3 24 0.848
2 I8 52.3 61 1.17
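A minimal sketch reproducing the comparison above:

# Compare IQR to SD; for roughly normal data the ratio is about 1.35
grayling_df %>%
  group_by(lake) %>%
  summarise(sd = sd(length_mm),
            iqr = IQR(length_mm),
            ratio_iqr_sd = iqr / sd)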
Biological data are often skewed (asymmetrical), which can make the arithmetic mean less representative of central tendency. Data transformations can help address this issue.
The logarithmic transformation is one of the most common for right-skewed biological data:
When data are log-normally distributed, the geometric mean often provides a better measure of central tendency than the arithmetic mean.
But the log transformation has drawbacks and is not always appropriate:
Analyses on log-transformed data detect differences in geometric means, not arithmetic means
Can’t handle zeros without adding arbitrary constants (log(x+1) transformations), which can bias results
Arithmetic mean of original data: 265.6 mm
Geometric mean (back-transformed mean of logs): NA mm
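A minimal sketch of the geometric mean as the back-transformed mean of logs; the NA above most likely reflects values that need explicit handling (missing or non-positive), which is an assumption about the original computation:

# Geometric mean: back-transform the mean of the logged values
x <- grayling_df$length_mm
exp(mean(log(x), na.rm = TRUE))   # na.rm avoids NA; log() still fails for zeros or negatives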
To transform data to a more “normal” distribution we can use the following transformations…
In a reflection, the constant \(k\) is typically chosen to be a value larger than the maximum value of the data.
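A minimal sketch of some common transformations; applying them to length_mm and the particular choice of k are illustrative assumptions:

# Common transformations for skewed data (illustrative)
x <- grayling_df$length_mm
log_x     <- log(x)                      # log transformation, for right skew
sqrt_x    <- sqrt(x)                     # square-root transformation, for mild right skew
k         <- max(x, na.rm = TRUE) + 1    # reflection constant, larger than the maximum value
reflect_x <- log(k - x)                  # reflect then transform, for left skew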
Histograms
Histograms show the frequency distribution of our data.
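For example, a minimal ggplot2 histogram of the grayling lengths by lake (binwidth is an illustrative choice):

# Histogram of fish length, coloured by lake
ggplot(grayling_df, aes(x = length_mm, fill = lake)) +
  geom_histogram(binwidth = 10, alpha = 0.7) +
  labs(x = "Length (mm)", y = "Count") +
  theme_minimal()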
Box Plots
Box plots show the median, quartiles, and potential outliers.
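And a matching sketch of a box plot by lake:

# Box plot of fish length by lake
ggplot(grayling_df, aes(x = lake, y = length_mm, fill = lake)) +
  geom_boxplot(alpha = 0.7) +
  labs(x = "Lake", y = "Length (mm)") +
  theme_minimal()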
The mean and median measure different aspects of a distribution:
Mean: Center of gravity of the distribution
Median: Middle value of the data
When a distribution is symmetric, the mean and median are similar. When it’s skewed or has outliers, they can differ significantly.
# A tibble: 2 × 6
lake mean median sd iqr skewness
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 I3 266. 266 28.3 24 -0.883
2 I8 363. 373 52.3 61 -1.09
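A sketch of how a table like the one above could be built; using skewness() from the moments package is an assumption about the original code:

# Mean, median, spread, and skewness by lake
library(moments)   # assumed source of skewness()

grayling_df %>%
  group_by(lake) %>%
  summarise(mean = mean(length_mm),
            median = median(length_mm),
            sd = sd(length_mm),
            iqr = IQR(length_mm),
            skewness = skewness(length_mm))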
Let’s examine how missing values affect our descriptive statistics by looking at the mass variable, which has some missing data.
[1] 2
Mean mass without handling NAs: NA g
Mean mass with na.rm=TRUE: 351.2289 g
# A tibble: 2 × 5
lake mean_mass median_mass sd_mass n_missing
<chr> <dbl> <dbl> <dbl> <int>
1 I3 150. 147 42.2 0
2 I8 484. 490 176. 2
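A minimal sketch of the missing-value checks behind the output above:

# Count missing mass values, then compare means with and without na.rm
sum(is.na(grayling_df$mass_g))
mean(grayling_df$mass_g)                 # NA, because missing values propagate
mean(grayling_df$mass_g, na.rm = TRUE)   # drop the NAs before averaging

grayling_df %>%
  group_by(lake) %>%
  summarise(mean_mass = mean(mass_g, na.rm = TRUE),
            median_mass = median(mass_g, na.rm = TRUE),
            sd_mass = sd(mass_g, na.rm = TRUE),
            n_missing = sum(is.na(mass_g)))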
Now that we have estimates from the sample, we need to relate them to the population.
In reality, we rarely know the true population parameters. When studying fish in lakes I3 and I8:
Let’s demonstrate how different samples from the same population can give different estimates.
If we could sample all fish in the lake, we would know the true mean length. But that’s usually impossible in ecology!
Let’s take several random samples from Lake I3 and see how the sample means vary:
# Filter for Lake I3
i3_data <- grayling_df %>% filter(lake == "I3")
# Function to take a random sample and calculate the mean
sample_mean <- function(data, sample_size) {
sample_data <- sample_n(data, sample_size)
return(mean(sample_data$length_mm))
}
# Take 50 different samples of size 15 from Lake I3
set.seed(123) # For reproducibility
sample_size <- 15
sample_means <- replicate(50, sample_mean(i3_data, sample_size))
# Create a data frame with sample numbers and means
samples_df <- data.frame(
sample_number = 1:50,
sample_mean = sample_means
)
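A quick sketch of one way to visualize these sample means against the full-sample mean (the plotting choices are illustrative):

# Plot each sample mean with the Lake I3 full-sample mean as a reference line
ggplot(samples_df, aes(x = sample_number, y = sample_mean)) +
  geom_point() +
  geom_hline(yintercept = mean(i3_data$length_mm),
             linetype = "dashed", color = "red") +
  labs(x = "Sample number", y = "Sample mean length (mm)") +
  theme_minimal()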
Notice how each sample’s mean differs from the overall mean. This demonstrates sampling variation.
The standard error of the mean (SEM) measures the precision of a sample mean as an estimate of the population mean.
Formula: \(SE_{\bar{x}} = \frac{s}{\sqrt{n}}\)
Where \(s\) is the sample standard deviation and \(n\) is the sample size.
The standard error tells us:
Remember:
# Calculate mean, SD, and SE for each lake
grayling_stats <- grayling_df %>%
group_by(lake) %>%
summarize(
mean_length = mean(length_mm),
sd_length = sd(length_mm),
n = n(),
se_length = sd_length / sqrt(n)
)
# Display the statistics
grayling_stats
# A tibble: 2 × 5
lake mean_length sd_length n se_length
<chr> <dbl> <dbl> <int> <dbl>
1 I3 266. 28.3 66 3.48
2 I8 363. 52.3 102 5.18
The sampling distribution of the mean is the theoretical distribution of all possible sample means of a given sample size from a population.
Important properties:
The larger the sample size:
Let’s simulate taking many samples from Lake I3 to visualize the sampling distribution:
# Filter for Lake I3
i3_data <- grayling_df %>% filter(lake == "I3")
# Number of samples to simulate
num_simulations <- 1000
sample_size <- 20 # change the number and examine the range of values
# Simulate many samples and calculate means
set.seed(46) # For reproducibility
simulated_means <- replicate(num_simulations, sample_mean(i3_data, sample_size))
# Calculate the mean and standard deviation of the simulated means
mean_of_means <- mean(simulated_means)
sd_of_means <- sd(simulated_means)
# Create a data frame with the simulated means
simulated_df <- data.frame(sample_mean = simulated_means)
# Plot the sampling distribution
ggplot(simulated_df, aes(x = sample_mean)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
geom_vline(xintercept = mean(i3_data$length_mm),
linetype = "dashed", color = "red", linewidth = 1) +
annotate("text", x = mean(i3_data$length_mm) + 2, y = 50,
label = "Full sample mean", color = "red") +
labs(title = "Simulated Sampling Distribution of the Mean",
subtitle = paste("Based on", num_simulations, "samples of size", sample_size),
x = "Sample Mean (mm)",
y = "Frequency") +
theme_minimal()
Notice that the simulated sampling distribution:
Let’s see how the standard error changes with different sample sizes:
# Compare empirical and theoretical standard errors across several sample sizes
# (the sample sizes and 1000 replicates here are illustrative choices)
sample_sizes <- c(5, 10, 20, 40, 60)
set.seed(46)
results <- map_dfr(sample_sizes, function(n) {
  sims <- replicate(1000, sample_mean(i3_data, n))
  tibble(sample_size = n,
         empirical_se = sd(sims),                            # SD of the simulated sample means
         theoretical_se = sd(i3_data$length_mm) / sqrt(n))   # s / sqrt(n)
})
# Display the results
results
# Plot how SE changes with sample size
results_long <- pivot_longer(results,
                             cols = c(empirical_se, theoretical_se),
                             names_to = "se_type",
                             values_to = "standard_error")
ggplot(results_long, aes(x = sample_size, y = standard_error, color = se_type)) +
geom_line() +
geom_point(size = 3) +
scale_x_continuous(breaks = sample_sizes) +
labs(title = "Standard Error vs. Sample Size",
subtitle = "Standard error decreases as sample size increases",
x = "Sample Size",
y = "Standard Error",
color = "SE Type") +
theme_minimal()
A confidence interval is a range of values that is likely to contain the true population parameter.
The 95% confidence interval for the mean is approximately:
\(\bar{x} \pm 2 \times SE_{\bar{x}}\)
This “2 SE rule of thumb” means:
Confidence intervals provide a way to express the precision of our estimates.
Let’s calculate and visualize the 95% confidence intervals for the mean fish length in each lake:
# Calculate 95% confidence intervals
grayling_ci <- grayling_df %>%
group_by(lake) %>%
summarize(
mean_length = mean(length_mm),
sd_length = sd(length_mm),
n = n(),
se_length = sd_length / sqrt(n),
ci_lower = mean_length - 2 * se_length,
ci_upper = mean_length + 2 * se_length
)
# Display the confidence intervals
grayling_ci
# A tibble: 2 × 7
lake mean_length sd_length n se_length ci_lower ci_upper
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 I3 266. 28.3 66 3.48 259. 273.
2 I8 363. 52.3 102 5.18 352. 373.
# Plot with confidence intervals
ggplot(grayling_ci, aes(x = lake, y = mean_length, fill = lake)) +
geom_bar(stat = "identity", alpha = 0.7) +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
width = 0.2) +
labs(title = "Mean Fish Length by Lake with 95% Confidence Intervals",
subtitle = "Error bars represent 95% confidence intervals",
x = "Lake",
y = "Mean Length (mm)") +
theme_minimal()
Let’s compare different ways of displaying uncertainty in our estimates:
# Calculate statistics for different types of error bars
grayling_error_bars <- grayling_df %>% group_by(lake) %>%
summarize(mean_length = mean(length_mm),
sd_length = sd(length_mm), n = n(),
se_length = sd_length / sqrt(n),
ci_lower = mean_length - 1.96 * se_length,
ci_upper = mean_length + 1.96 * se_length,
one_sd_lower = mean_length - sd_length,
one_sd_upper = mean_length + sd_length)
# Create a data frame for plotting different error types
lake_i3 <- grayling_error_bars %>% filter(lake == "I3")
error_types <- data.frame(
error_type = c("Standard Deviation", "Standard Error", "95% Confidence Interval"),
lower = c(lake_i3$one_sd_lower,
lake_i3$mean_length - lake_i3$se_length,
lake_i3$ci_lower),
upper = c(lake_i3$one_sd_upper,
lake_i3$mean_length + lake_i3$se_length,
lake_i3$ci_upper))
# Plot the comparison
ggplot() +
geom_point(data = lake_i3, aes(x = "Mean",
y = mean_length), size = 4) +
geom_errorbar(data = error_types,
aes(x = error_type, ymin = lower,
ymax = upper, color = error_type),
width = 0.2, linewidth = 1) +
labs(title = "Different Types of Error Bars for Lake I3",
subtitle = "Comparing SD, SE, and 95% CI",
x = "",
y = "Length (mm)",
color = "Error Bar Type") +
theme_minimal() +
theme(legend.position = "none")
In this lecture, we’ve explored:
These tools form the foundation of statistical analysis and will be essential as we move forward to more complex statistical methods.