Practice Exercise 1: Exploring the Grayling Dataset
Let’s explore the Arctic grayling data from lakes I3 and I8. Use the grayling_df
data frame to create basic summary statistics.
# Write your code here to explore the basic structure of the data
# Also note that plotting a box plot is really useful
str(grayling_df)
spc_tbl_ [168 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ site : num [1:168] 113 113 113 113 113 113 113 113 113 113 ...
$ lake : chr [1:168] "I3" "I3" "I3" "I3" ...
$ species : chr [1:168] "arctic grayling" "arctic grayling" "arctic grayling" "arctic grayling" ...
$ length_mm: num [1:168] 266 290 262 275 240 265 265 253 246 203 ...
$ mass_g : num [1:168] 135 185 145 160 105 145 150 130 130 71 ...
- attr(*, "spec")=
.. cols(
.. site = col_double(),
.. lake = col_character(),
.. species = col_character(),
.. length_mm = col_double(),
.. mass_g = col_double()
.. )
- attr(*, "problems")=<externalptr>
site lake species length_mm
Min. :113 Length:168 Length:168 Min. :191.0
1st Qu.:113 Class :character Class :character 1st Qu.:270.8
Median :118 Mode :character Mode :character Median :324.5
Mean :116 Mean :324.5
3rd Qu.:118 3rd Qu.:377.0
Max. :118 Max. :440.0
mass_g
Min. : 53.0
1st Qu.:151.2
Median :340.0
Mean :351.2
3rd Qu.:519.5
Max. :889.0
NA's :2
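A box plot by lake is a quick complement to the summaries above. The sketch below is illustrative only: it builds a small stand-in data frame (`toy_df`, a hypothetical name) so that it runs on its own. The I3 lengths are the first five values shown in the `str()` output; the I8 lengths are invented for the example and are not data from the study. With the real data, the same calls would use `grayling_df`.

```r
# Stand-in data so the example is self-contained (I8 values are invented)
toy_df <- data.frame(
  lake = rep(c("I3", "I8"), each = 5),
  length_mm = c(266, 290, 262, 275, 240,   # first five I3 lengths from str()
                340, 365, 410, 330, 385)   # hypothetical I8 lengths
)

# Mean length per lake
lake_means <- tapply(toy_df$length_mm, toy_df$lake, mean)
lake_means

# Side-by-side box plots of length for each lake
boxplot(length_mm ~ lake, data = toy_df,
        xlab = "Lake", ylab = "Length (mm)")
```

The box plot shows the median, quartiles, and any outlying fish for each lake at a glance, which is harder to read off a numeric summary.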
The standard normal distribution is crucial for understanding statistical inference:
Z-scores allow us to convert any normal distribution to the standard normal distribution.
Practice Exercise 2: Calculating Z-scores
Let’s practice converting raw values to Z-scores using the Arctic grayling data.
# Calculate the mean and standard deviation of fish lengths
mean_length <- mean(grayling_df$length_mm, na.rm = TRUE)
sd_length <- sd(grayling_df$length_mm, na.rm = TRUE)
# Calculate Z-scores for fish lengths
grayling_df <- grayling_df %>%
mutate(z_score = (length_mm - mean_length) / sd_length)
# View the first few rows with Z-scores
head(grayling_df)
# A tibble: 6 × 6
site lake species length_mm mass_g z_score
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 113 I3 arctic grayling 266 135 -0.900
2 113 I3 arctic grayling 290 185 -0.531
3 113 I3 arctic grayling 262 145 -0.961
4 113 I3 arctic grayling 275 160 -0.761
5 113 I3 arctic grayling 240 105 -1.30
6 113 I3 arctic grayling 265 145 -0.915
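One way to check a Z-score calculation is to confirm that the standardized values have mean 0 and standard deviation 1. A minimal sketch, using the six lengths shown in the rows above as a stand-in vector (so these Z-scores differ from the table, which standardizes against all 168 fish); the variable names are illustrative:

```r
# Six lengths (mm) copied from the head() output above
lengths_mm <- c(266, 290, 262, 275, 240, 265)

# Z-scores computed by hand and with base R's scale()
z_manual <- (lengths_mm - mean(lengths_mm)) / sd(lengths_mm)
z_scale <- as.numeric(scale(lengths_mm))

# Standardized values should have mean 0 and standard deviation 1
mean(z_manual)
sd(z_manual)

# Both approaches should agree
all.equal(z_manual, z_scale)
```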
You want to know things about this population, like the mean length:
# A tibble: 1 × 1
mean_length
<dbl>
1 266.
Standard Normal Distribution
~68% of the curve area within +/- 1 σ of the mean,
~95% within +/- 2 σ of the mean,
~99.7% within +/- 3 σ of the mean
*remember σ = standard deviation
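These percentages can be checked with `pnorm()`, which returns the area under the standard normal curve to the left of a given z value. A short sketch (the helper name `within_k_sd` is made up for this example):

```r
# Area under the standard normal curve within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

areas <- sapply(1:3, within_k_sd)
round(areas, 4)
```

The three values come out close to 0.68, 0.95, and 0.997, matching the rule of thumb.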
Areas under curve of Standard Normal Distribution
Done by converting original data points to z-scores
So let’s do this for a fish that is 300 mm long and estimate the probability of catching something larger:
z = (300 - 265.61)/28.3 = 1.215194
i3_stats <- gray_i3_df %>%
summarize(
mean_length = round(mean(length_mm, na.rm = TRUE), 2),
sd_length = sd(length_mm, na.rm = TRUE),
n = sum(!is.na(length_mm)),
se_length = round(sd_length / sqrt(sum(!is.na(length_mm))), 2),
.groups = "drop"
)
# Display the results
i3_stats
# A tibble: 1 × 4
mean_length sd_length n se_length
<dbl> <dbl> <int> <dbl>
1 266. 28.3 66 3.48
Let’s convert a 320 mm fish to a z-score and estimate the probability of catching something larger:
z = (320 - 265.61) / 28.3 = 1.92
A z-table gives 0.9726 for z = 1.92, so about 97.3% of the area under the curve lies to the left of 320 mm.
100 - 97.3 = 2.7%, so about 2.7% of fish are expected to be longer than 320 mm.
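The table lookup can also be done with `pnorm()`. A sketch, assuming lengths in lake I3 are roughly normally distributed and using the rounded mean and standard deviation reported above (the variable names are illustrative):

```r
mean_i3 <- 265.61  # I3 mean length (mm), as reported above
sd_i3 <- 28.3      # I3 standard deviation (mm), rounded as displayed

# Convert the two lengths of interest to z-scores
z_300 <- (300 - mean_i3) / sd_i3
z_320 <- (320 - mean_i3) / sd_i3

# Upper-tail area = probability of a fish longer than the given length
p_longer_300 <- pnorm(z_300, lower.tail = FALSE)
p_longer_320 <- pnorm(z_320, lower.tail = FALSE)

round(c(z_300 = z_300, z_320 = z_320), 2)
round(c(p_longer_300 = p_longer_300, p_longer_320 = p_longer_320), 3)
```

Setting `lower.tail = FALSE` returns the area to the right of z directly, which avoids the "100 minus the table value" step.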
The standard error of the mean (SEM) tells us how precise our sample mean is as an estimate of the population mean.
Standard Error Formula: \[ SE_{\bar{Y}} = \frac{s}{\sqrt{n}} \]
Where:
Key properties:
Practice Exercise 5: Sampling Distributions
Let’s explore how sample size affects our estimates by taking samples of different sizes:
# Set seed for reproducibility
set.seed(456)
# Create samples of different sizes
small_sample <- grayling_df %>% sample_n(10)
medium_sample <- grayling_df %>% sample_n(30)
large_sample <- grayling_df %>% sample_n(100)
# Calculate mean and standard error for each sample
small_mean <- mean(small_sample$length_mm, na.rm = TRUE)
small_se <- sd(small_sample$length_mm, na.rm = TRUE) / sqrt(10)
medium_mean <- mean(medium_sample$length_mm, na.rm = TRUE)
medium_se <- sd(medium_sample$length_mm, na.rm = TRUE) / sqrt(30)
large_mean <- mean(large_sample$length_mm, na.rm = TRUE)
large_se <- sd(large_sample$length_mm, na.rm = TRUE) / sqrt(100)
# Create a data frame with the results
results <- data.frame(
Sample_Size = c(10, 30, 100),
Mean = c(small_mean, medium_mean, large_mean),
SE = c(small_se, medium_se, large_se)
)
# Display the results
results
Sample_Size Mean SE
1 10 302.000 26.607330
2 30 319.200 12.082989
3 100 323.328 6.478149
What do you observe about the standard error as sample size increases? Why does this happen?
From a single sample we can estimate the standard deviation of the sample means, called the “standard error of the sample mean.”
It answers the question: how good is your estimate of the population mean, based on the sample collected?
- It quantifies how much sample means are expected to vary from sample to sample.
- It gives an estimate of the error associated with using \(\bar{y}\) to estimate \(\mu\).
Notice that \(s_{\bar{y}} = \frac{s}{\sqrt{n}}\) depends on:
- the sample standard deviation, s
- the sample size, n
How and why?
- It decreases as the sample size n increases, because averaging more observations cancels out more of the individual variation.
- It increases as the sample standard deviation s increases, because more variable data give more variable means.
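A small simulation can make the \(s/\sqrt{n}\) relationship concrete. The sketch below is not based on the grayling data: it assumes a hypothetical normal population (mean 300 mm, standard deviation 50 mm), draws many samples of each size, and compares the standard deviation of the sample means with the theoretical value.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
pop_mean <- 300        # hypothetical population mean (mm)
pop_sd <- 50           # hypothetical population standard deviation (mm)
sample_sizes <- c(10, 30, 100)

# For each n, draw 5000 samples and take the SD of their means
sd_of_means <- sapply(sample_sizes, function(n) {
  sd(replicate(5000, mean(rnorm(n, pop_mean, pop_sd))))
})

# Compare the simulated spread of sample means with sigma / sqrt(n)
data.frame(
  n = sample_sizes,
  simulated = round(sd_of_means, 2),
  theoretical = round(pop_sd / sqrt(sample_sizes), 2)
)
```

The simulated and theoretical columns should agree closely, and both shrink as n grows.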
A confidence interval is a range of values that is likely to contain the true population parameter.
95% Confidence Interval Formula: \[\text{95% CI} = \bar{y} \pm z \cdot \frac{\sigma}{\sqrt{n}}\]
Where:
Interpretation: If we were to take many samples and calculate the 95% CI for each, about 95% of these intervals would contain the true population mean.
Common misinterpretation: “There is a 95% probability that the true mean is in this interval.”
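The "many samples" interpretation can be illustrated by simulation. This sketch assumes a hypothetical normal population with known mean and standard deviation (values chosen only for illustration), builds a 95% z-based interval from each of 2000 samples, and counts how often the interval contains the true mean.

```r
set.seed(2)     # arbitrary seed, for reproducibility only
mu <- 300       # hypothetical true population mean (mm)
sigma <- 50     # hypothetical known population standard deviation (mm)
n <- 30         # size of each sample
n_reps <- 2000  # number of repeated samples

z_crit <- qnorm(0.975)                   # about 1.96
half_width <- z_crit * sigma / sqrt(n)   # same for every sample, since sigma is known

# TRUE when the interval from one sample contains the true mean
covers <- replicate(n_reps, {
  y_bar <- mean(rnorm(n, mu, sigma))
  (y_bar - half_width <= mu) && (mu <= y_bar + half_width)
})

coverage <- mean(covers)
coverage
```

The proportion should land near 0.95. Any single interval either contains the true mean or it does not, which is why the 95% describes the procedure rather than one particular interval.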
Let’s compare the two plots side by side.
Practice Exercise 3: Calculating Standard Error and Confidence Intervals
Calculate the standard error and 95% confidence interval for the mean length of Arctic grayling in each lake.
# Calculate the standard error and confidence intervals by lake
ci_results <- grayling_df %>%
group_by(lake) %>%
summarize(
mean_length = round(mean(length_mm, na.rm = TRUE), 2),
sd_length = sd(length_mm, na.rm = TRUE),
n = sum(!is.na(length_mm)),
se_length = round(sd_length / sqrt(n), 2),
ci = round(1.96 * se_length, 2),
ci_lower = round(mean_length - 1.96 * se_length, 2),
ci_upper = round(mean_length + 1.96 * se_length, 2),
.groups = "drop"
)
# Display the results
ci_results
# A tibble: 2 × 8
lake mean_length sd_length n se_length ci ci_lower ci_upper
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 I3 266. 28.3 66 3.48 6.82 259. 272.
2 I8 363. 52.3 102 5.18 10.2 352. 373.
What do these confidence intervals tell us about the difference between lakes?
In the more typical case, we DON’T know the population σ.
Instead, we use Student’s t distribution
When sample sizes are small, the t-distribution is more appropriate than the normal distribution.
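To see how much the sample size matters, compare the 95% critical values from `qt()` with the normal value from `qnorm()`. A short sketch (the degrees of freedom listed are arbitrary examples, apart from 9 and 65, which correspond to samples of 10 and 66 fish):

```r
# Two-sided 95% critical values for several degrees of freedom
dfs <- c(4, 9, 29, 65, 1000)
t_crit <- qt(0.975, df = dfs)
data.frame(df = dfs, t_crit = round(t_crit, 3))

# Normal (z) critical value for comparison
z_crit <- qnorm(0.975)
round(z_crit, 3)
```

The t critical value is always larger than the z value, so t-based intervals are wider, and the gap closes as the degrees of freedom increase.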
Practice Exercise 4: Using the t-distribution
Let’s compare confidence intervals using the normal approximation (z) versus the t-distribution for our fish data.
# Calculate CI using both z and t distributions for a smaller subset
small_sample <- grayling_df %>%
filter(lake == "I3") %>%
slice_sample(n = 10)
# Calculate statistics
sample_mean <- mean(small_sample$length_mm)
sample_sd <- sd(small_sample$length_mm)
sample_n <- nrow(small_sample)
sample_se <- sample_sd / sqrt(sample_n)
# Calculate confidence intervals
z_ci_lower <- sample_mean - 1.96 * sample_se
z_ci_upper <- sample_mean + 1.96 * sample_se
# For t-distribution, get critical value for 95% CI with df = n-1
t_crit <- qt(0.975, df = sample_n - 1)
t_ci_lower <- sample_mean - t_crit * sample_se
t_ci_upper <- sample_mean + t_crit * sample_se
# Display results
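cat("Mean:", round(sample_mean, 1), "mm\n")
cat("Standard deviation:", round(sample_sd, 2), "mm\n")
cat("Standard error:", round(sample_se, 2), "mm\n")
cat("95% CI using z:", round(z_ci_lower, 1), "to", round(z_ci_upper, 1), "mm\n")
cat("95% CI using t:", round(t_ci_lower, 1), "to", round(t_ci_upper, 1), "mm\n")
cat("t critical value:", round(t_crit, 3), "vs z critical value: 1.96\n")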
Mean: 255.3 mm
Standard deviation: 26.26 mm
Standard error: 8.31 mm
95% CI using z: 239 to 271.6 mm
95% CI using t: 236.5 to 274.1 mm
t critical value: 2.262 vs z critical value: 1.96
To calculate CI for sample from “unknown” population:
Where:
Here is a t-table
One-tailed questions: area of the distribution to the left (or right) of a certain value
Two-tailed questions: area between certain values
Let’s calculate CIs again:
Use two-sided test
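For reference, the t-based interval computed in Practice Exercise 4 can be written out as follows. The subscript notation for the critical value is an assumption made here (upper-tail area 0.025 with n - 1 degrees of freedom); the numbers are the rounded values printed in that exercise.

```latex
\[
\text{95\% CI} = \bar{y} \pm t_{0.025,\, n-1} \cdot \frac{s}{\sqrt{n}}
\]

% Worked example with the Exercise 4 sample (n = 10, df = 9):
\[
255.3 \pm 2.262 \times 8.31 = 255.3 \pm 18.8
\quad\Rightarrow\quad (236.5,\ 274.1)\ \text{mm}
\]
```

This has the same form as the z-based interval, with s replacing σ and the t critical value replacing 1.96.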
Hypothesis testing is a systematic way to evaluate research questions using data.
Key components:
Null hypothesis (H₀): Typically assumes “no effect” or “no difference”
Alternative hypothesis (Hₐ): The claim we’re trying to support
Statistical test: Method for evaluating evidence against H₀
P-value: Probability of observing our results (or more extreme) if H₀ is true
Significance level (α): Threshold for rejecting H₀, typically 0.05
Decision rule: Reject H₀ if p-value < α
Practice Exercise 5: Let’s practice a One-Sample t-Test
Let’s perform a one-sample t-test to determine if the mean fish length in lake I3 differs from 260 mm:
# get only lake I3
i3_df <- grayling_df %>% filter(lake=="I3")
# what is the mean
i3_mean <- mean(i3_df$length_mm, na.rm=TRUE)
cat("Mean:", round(i3_mean, 1), "mm\n")
Mean: 265.6 mm
# Perform a one-sample t-test
t_test_result <- t.test(i3_df$length_mm, mu = 260)
# View the test results
t_test_result
One Sample t-test
data: i3_df$length_mm
t = 1.6091, df = 65, p-value = 0.1124
alternative hypothesis: true mean is not equal to 260
95 percent confidence interval:
258.6481 272.5640
sample estimates:
mean of x
265.6061
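The test statistic and p-value reported by `t.test()` can be reproduced by hand from the summary statistics. A sketch using the values shown earlier; because the standard deviation is taken from the rounded display (28.3), the results match the output above only approximately.

```r
y_bar <- 265.6061  # sample mean (mm), from the t.test output
s <- 28.3          # sample standard deviation (mm), rounded as displayed earlier
n <- 66            # number of fish measured in lake I3
mu_0 <- 260        # hypothesized mean under H0 (mm)

# t = (sample mean - hypothesized mean) / standard error
se <- s / sqrt(n)
t_stat <- (y_bar - mu_0) / se

# Two-sided p-value: area in both tails beyond |t| with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

round(c(t = t_stat, p = p_value), 3)
```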
Interpret this test result by answering these questions:
Practice Exercise 6: Formulating Hypotheses
For the following research questions about Arctic grayling, write the null and alternative hypotheses:
# Let's test one of these hypotheses: Are fish in Lake I8 longer than fish in Lake I3?
# Perform an independent two-sample t-test (t.test() uses Welch's version by default)
t_test_result <- t.test(length_mm ~ lake, data = grayling_df,
alternative = "less") # H₀: μ_I3 ≥ μ_I8, Hₐ: μ_I3 < μ_I8
# Display the results
t_test_result
Welch Two Sample t-test
data: length_mm by lake
t = -15.532, df = 161.63, p-value < 2.2e-16
alternative hypothesis: true difference in means between group I3 and group I8 is less than 0
95 percent confidence interval:
-Inf -86.66138
sample estimates:
mean in group I3 mean in group I8
265.6061 362.5980
Based on this t-test, what can we conclude about the difference in fish length between the two lakes?
A p-value is the probability of observing the sample result (or something more extreme) if the null hypothesis is true.
Common interpretations:
- p < 0.05: Strong evidence against H₀
- 0.05 ≤ p < 0.10: Moderate evidence against H₀
- p ≥ 0.10: Insufficient evidence against H₀
Common misinterpretations:
- p-value is NOT the probability that H₀ is true
- p-value is NOT the probability that results occurred by chance
- Statistical significance ≠ practical significance
When making decisions based on hypothesis tests, two types of errors can occur:
Type I Error (False Positive)
- Rejecting H₀ when it’s actually true
- Probability = α (significance level)
- “Finding an effect that isn’t real”
Type II Error (False Negative)
- Failing to reject H₀ when it’s actually false
- Probability = β
- “Missing an effect that is real”
Statistical Power = 1 - β
- Probability of correctly rejecting a false H₀
- Increases with:
  - Larger sample size
  - Larger effect size
  - Lower variability
  - Higher α level
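The claim that the Type I error probability equals α can be checked by simulation. In this sketch the null hypothesis is true by construction: every sample is drawn from a hypothetical normal population whose mean really is 260 mm (the sample size and standard deviation are arbitrary illustrative choices), so every rejection is a false positive.

```r
set.seed(3)     # arbitrary seed, for reproducibility only
alpha <- 0.05
n_reps <- 2000  # number of simulated studies

# Run a one-sample t-test on each simulated sample and keep the p-value
p_values <- replicate(n_reps, {
  y <- rnorm(30, mean = 260, sd = 28)  # H0 (mu = 260) is true here
  t.test(y, mu = 260)$p.value
})

# Proportion of tests that (wrongly) reject H0
type1_rate <- mean(p_values < alpha)
type1_rate
```

The rejection rate should be close to 0.05, showing that α sets how often a true null hypothesis is rejected.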
Practice Exercise 6: Interpreting P-values and Errors
Given the following scenarios, identify whether a Type I or Type II error might have occurred:
A researcher concludes that a new fishing regulation increased grayling size, when in fact it had no effect.
A study fails to detect a real decline in grayling population due to warming water, concluding there was no effect.
Let’s calculate the power of our t-test to detect a 30 mm difference in length between lakes:
# Calculate power for detecting a 30 mm difference
# First determine parameters
lake_I3 <- grayling_df %>% filter(lake == "I3")
lake_I8 <- grayling_df %>% filter(lake == "I8")
n1 <- nrow(lake_I3)
n2 <- nrow(lake_I8)
sd_pooled <- sqrt((var(lake_I3$length_mm) * (n1-1) +
var(lake_I8$length_mm) * (n2-1)) /
(n1 + n2 - 2))
# Calculate power
effect_size <- 30 / sd_pooled # Cohen's d
df <- n1 + n2 - 2
alpha <- 0.05
power <- power.t.test(n = min(n1, n2),
delta = effect_size,
sd = 1, # Using standardized effect size
sig.level = alpha,
type = "two.sample",
alternative = "two.sided")
# Display results
power
Two-sample t test power calculation
n = 66
delta = 0.6741298
sd = 1
sig.level = 0.05
power = 0.9702076
alternative = two.sided
NOTE: n is number in *each* group
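`power.t.test()` can also be turned around to ask how many fish per lake would be needed to reach a target power. A sketch, reusing the standardized effect size from the output above and assuming a conventional target of 80% power (that target is an assumption, not something set in this exercise):

```r
d <- 0.6741298  # standardized effect size (30 mm / pooled SD), from the output above

# Leave n unspecified and supply power: the function solves for n per group
n_for_80 <- power.t.test(delta = d, sd = 1,
                         sig.level = 0.05, power = 0.80,
                         type = "two.sample",
                         alternative = "two.sided")$n
ceiling(n_for_80)  # round up to whole fish per lake

# Power at the sample size used above (n = 66 per group), for comparison
power_at_66 <- power.t.test(n = 66, delta = d, sd = 1,
                            sig.level = 0.05)$power
round(power_at_66, 3)
```

Exactly one of `n`, `delta`, `sd`, `sig.level`, and `power` is left unspecified, and the function solves for that one.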
Key concepts covered:
mean ± 1.96 × SE