What we will cover today:
We will always cover parametric tests
Then we will cover non-parametric approaches
Later on we will explore approaches that use the appropriate underlying distribution when it is not normal, such as Poisson or others
Let's work with the lake trout data, as the weights are pretty cool and the assumptions may or may not hold
This workflow translates easily to any of the other data frames you might want to use
lake trout
# Load the lake trout data (assumes the tidyverse is already loaded for read_csv())
# The path is relative to the project root; here::here() could be used instead
lt_df <- read_csv("data/lake_trout.csv")
# Examine the first few rows
head(lt_df)
# A tibble: 6 × 5
  sampling_site species    length_mm mass_g lake 
  <chr>         <chr>          <dbl>  <dbl> <chr>
1 I8            lake trout       515   1400 I8   
2 I8            lake trout       468   1100 I8   
3 I8            lake trout       527   1550 I8   
4 I8            lake trout       525   1350 I8   
5 I8            lake trout       517   1300 I8   
6 I8            lake trout       607   2100 I8   
# I had accidentally asked you to compute the mode in HW2 without telling you how...
# Here is one approach
lt_df %>%
  filter(!is.na(mass_g)) %>%
  group_by(lake, mass_g) %>%
  summarise(count = n(), .groups = "drop_last") %>%
  arrange(desc(count)) %>%
  slice(1) %>%
  select(-count) %>%
  rename(mode_mass = mass_g)
# A tibble: 6 × 2
# Groups:   lake [6]
  lake        mode_mass
  <chr>           <dbl>
1 I8               1000
2 Island Lake      2200
3 N 01             1000
4 NE 12              90
5 NE 14            1150
6 Toolik            340
T-tests are parametric tests
Basic assumptions of parametric t-tests:
Random sampling
Normality
Equal variance (or use Welch's t-test)
No outliers
Random sampling: samples are randomly collected from populations; this is part of the experimental design
Necessary for sample -> population inference
Let's check the assumptions for one lake, NE 12, as if we were going to do a one-sample t-test
ne12_data
Normality: samples come from a normally distributed population
“The null hypothesis is that the data are normally distributed”
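A minimal sketch of how this normality check can be run with shapiro.test() from R's built-in stats package. The data are simulated: sim_length_mm is a hypothetical stand-in, not the lake trout data, so the numbers will differ from the lake trout output.

```r
# Simulated stand-in for fish lengths (hypothetical values, not the lake trout data)
set.seed(42)
sim_length_mm <- rlnorm(200, meanlog = 5.5, sdlog = 0.5)  # right-skewed on purpose

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw_result <- shapiro.test(sim_length_mm)
sw_result

# A small p-value is evidence against normality
sw_result$p.value < 0.05
```

For the real data the call would be shapiro.test(ne12_data$length_mm), which matches the data: line in the lake trout output.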
    Shapiro-Wilk normality test
data:  ne12_data$length_mm
W = 0.94528, p-value = 1.56e-09
Equal variance: samples are from populations with a similar degree of variability
Welch's t-test:
Common “robust” test for comparing the means of two populations
Robust to violation of the equal variance assumption; deals better with unequal sample sizes
Parametric test (assumes normal distribution)
Calculates a t statistic but recalculates the degrees of freedom (df) based on the sample sizes and standard deviations (s)
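For reference, this is the standard form of Welch's statistic and the Welch-Satterthwaite degrees of freedom. The notation is an assumption (the slides only name df, sample sizes, and s): sample means are \bar{x}_i, sample standard deviations are s_i, and sample sizes are n_i.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

When one group is small and highly variable, df is pulled toward that group's n - 1, which is why the Welch df can be far smaller than the pooled n_1 + n_2 - 2.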
Let's compare a standard parametric t-test to Welch's t-test
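A hedged sketch of how the two tests can be run side by side with t.test(). The data frame sim_df and its columns are simulated and hypothetical, chosen only to have unequal group sizes and unequal spread; they are not the lake trout data.

```r
# Simulated two-group data (hypothetical; unequal sizes and unequal spread)
set.seed(1)
sim_df <- data.frame(
  group = rep(c("A", "B"), times = c(10, 300)),
  mass_g = c(rnorm(10, mean = 3000, sd = 1500),
             rnorm(300, mean = 500, sd = 300))
)

# Standard two-sample t-test: assumes equal variances (pooled)
student_res <- t.test(mass_g ~ group, data = sim_df, var.equal = TRUE)

# Welch's t-test: the default in t.test(), does not assume equal variances
welch_res <- t.test(mass_g ~ group, data = sim_df, var.equal = FALSE)

# Compare degrees of freedom: pooled df is n1 + n2 - 2, Welch df is adjusted
student_res$parameter
welch_res$parameter
```

The only difference between the two calls is var.equal, so switching between the tests is a one-argument change.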
[1] "Standard t-test results for mass_g:"
    Two Sample t-test
data:  mass_g by lake
t = 14.181, df = 330, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Island Lake and group NE 12 is not equal to 0
95 percent confidence interval:
 2266.304 2996.360
sample estimates:
mean in group Island Lake       mean in group NE 12 
                3165.0000                  533.6677 
[1] "Welch's t-test results for mass_g:"
    Welch Two Sample t-test
data:  mass_g by lake
t = 5.1368, df = 9.0578, p-value = 0.0006016
alternative hypothesis: true difference in means between group Island Lake and group NE 12 is not equal to 0
95 percent confidence interval:
 1473.676 3788.989
sample estimates:
mean in group Island Lake       mean in group NE 12 
                3165.0000                  533.6677 
Rank-based tests: no assumption of a particular distribution (non-parametric)
Ranks of data: observations are assigned ranks, and the sums of ranks (and their signs, for paired tests) are compared between groups
Mann-Whitney U test is a common alternative to the independent samples t-test
Wilcoxon signed-rank test is an alternative to the paired t-test
Assumptions: similar distribution shapes for groups, equal variance
Less power than parametric tests when the parametric assumptions hold
Best when the normality assumption cannot be met by transformation (weird distribution) or when there are large outliers
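A minimal sketch of the Mann-Whitney U test using wilcox.test() from the stats package. The data are simulated and the names sim_df, group, and mass_g are hypothetical stand-ins, not the lake trout data.

```r
# Simulated skewed data for two groups (hypothetical, not the lake trout data)
set.seed(7)
sim_df <- data.frame(
  group = rep(c("A", "B"), each = 30),
  mass_g = c(rexp(30, rate = 1 / 500), rexp(30, rate = 1 / 1500))
)

# Mann-Whitney U test; R runs it under the name Wilcoxon rank sum test
mw_res <- wilcox.test(mass_g ~ group, data = sim_df)
mw_res

# The W statistic is based on ranks, so a monotonic transformation leaves it unchanged
mw_log <- wilcox.test(log10(mass_g) ~ group, data = sim_df)
all.equal(mw_res$statistic, mw_log$statistic)
```

For paired data, wilcox.test(x, y, paired = TRUE) runs the Wilcoxon signed-rank test instead.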
[1] "Mann-Whitney U test results mass_g:"
    Wilcoxon rank sum test with continuity correction
data:  mass_g by lake
W = 3205.5, p-value = 9.506e-08
alternative hypothesis: true location shift is not equal to 0
perm package
library(perm) 
# Prepare data for permutation test
ne12_perm_data <- isl_ne12_df %>% 
  filter(lake == "NE 12") %>% 
  pull(length_mm)
# Randomly sample exactly 25 observations from NE 12 (set seed for reproducibility)
set.seed(123)
ne12_perm_data <- sample(ne12_perm_data, size = 25, replace = FALSE)
island_perm_data <- isl_ne12_df %>% 
  filter(lake == "Island Lake") %>% 
  pull(length_mm)
# Calculate the observed difference in means
observed_diff <- mean(ne12_perm_data, na.rm = TRUE) - mean(island_perm_data, na.rm = TRUE)
# Perform permutation test for difference in means using perm package
permTS(ne12_perm_data, island_perm_data, 
       alternative = "two.sided", 
       method = "exact.mc", 
       control = permControl(nmc = 10000))
    Exact Permutation Test Estimated by Monte Carlo
data:  GROUP 1 and GROUP 2
p-value = 2e-04
alternative hypothesis: true mean GROUP 1 - mean GROUP 2 is not equal to 0
sample estimates:
mean GROUP 1 - mean GROUP 2 
                    -333.08 
p-value estimated from 10000 Monte Carlo replications
99 percent confidence interval on p-value:
 0.000000000 0.001059383 
When assumptions aren’t met, transformations may help normalize data:
log10(x) - Useful for right-skewed data, multiplicative effects
sqrt(x) - Useful for count data, moderately right-skewed distributions
Strengths:
Weaknesses:
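The log10(x) and sqrt(x) transformations mentioned above can be tried as in this sketch, which re-runs the Shapiro-Wilk test on each scale. The vector sim_mass_g is simulated and hypothetical, not the lake trout data.

```r
# Simulated right-skewed masses (hypothetical values, not the lake trout data)
set.seed(99)
sim_mass_g <- rlnorm(150, meanlog = 6, sdlog = 0.8)

# Apply the two transformations
log_mass <- log10(sim_mass_g)
sqrt_mass <- sqrt(sim_mass_g)

# Re-check normality on each scale with Shapiro-Wilk
p_raw <- shapiro.test(sim_mass_g)$p.value
p_log <- shapiro.test(log_mass)$p.value
p_sqrt <- shapiro.test(sqrt_mass)$p.value
c(raw = p_raw, log10 = p_log, sqrt = p_sqrt)

# Visual check: histograms before and after
hist(sim_mass_g, main = "Raw")
hist(log_mass, main = "log10")
```

Note that log10(x) is undefined at zero, so data containing zeros (such as counts) would need a different transformation or an offset before taking logs.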
Statistical tests have different strengths and assumptions. The choice should be guided by your data characteristics, not just convenience. Always visualize your data before deciding on the appropriate test.