A lot of the data was modified from:

Leach, TH, LA Winslow, FW Acker, JA Bloomfield, CW Boylen, PA Bukaveckas, DF Charles, RA Daniels, CT Driscoll, LW Eichler, JL Farrell, CS Funk, CA Goodrich, TM Michelena, SA Nierzwicki-Bauer, KM Roy, WH Shaw, JW Sutherland, MW Swinton, DA Winkler, KC Rose. Long-term dataset on aquatic responses to concurrent climate change and recovery from acidification. 2018. Scientific Data. online. https://doi.org/10.1038/sdata.2018.59

Load Libraries

Again, we load these libraries in almost every script.

# Load Libraries ----
# this is done each time you run a script
library(readxl) # read in excel files
library(tidyverse) # dplyr and piping and ggplot etc
library(lubridate) # dates and times
library(scales) # scales on ggplot axes
library(skimr) # quick summary stats
library(janitor) # clean up excel imports
library(patchwork) # multipanel graphs

So far we have seen how to look at the data.

What if we wanted to modify the data in terms of columns or rows?

# let's read in a new file to add some complexity for fun
lakes.df <- read_csv("data/reduced_lake_long_genus_species.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   permanent_id = col_double(),
##   lake_name = col_character(),
##   date = col_date(format = ""),
##   group = col_character(),
##   genus_species = col_character(),
##   org_l = col_double(),
##   year = col_double()
## )

Mutate

If you want to modify variables, you can change them with mutate().

# Mutate - log
lakes_modified.df <- lakes.df %>%
  mutate(log_org_l = log10(org_l + 1))
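
The + 1 in the log transform matters because densities can be zero, and log10(0) is -Inf in R. A minimal sketch of the difference, using a made-up toy.df (hypothetical values, not from the lake dataset):

```r
library(dplyr)

# Hypothetical densities, including a zero (not from the lake data)
toy.df <- tibble(org_l = c(0, 9, 99))

toy.df <- toy.df %>%
  mutate(log_raw   = log10(org_l),      # the zero becomes -Inf
         log_plus1 = log10(org_l + 1))  # the zero stays at 0
```

Adding 1 keeps zero counts at zero on the log scale, which is usually easier to plot and summarize.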

# Mutate and mean ----
lakes_modified.df <- lakes.df %>%
  mutate(mean_org_l = mean(org_l, na.rm=TRUE))

# Mean by group ------
lakes_modified.df <- lakes.df %>%
  group_by(group) %>%
  mutate(mean_org_l = mean(org_l, na.rm=TRUE))

# how would you modify this to do the mean by group and lake?
lakes_modified.df <- lakes.df %>%
  group_by(group) %>%
  mutate(mean_org_l = mean(org_l, na.rm=TRUE))
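
One possible answer, sketched on a small made-up data frame (toy.df and its values are hypothetical): list both variables in group_by(). Adding ungroup() at the end is optional, but it stops later steps from being grouped by accident.

```r
library(dplyr)

# Hypothetical data: two lakes, one zooplankton group
toy.df <- tibble(
  lake_name = c("A", "A", "B", "B"),
  group     = c("Copepod", "Copepod", "Copepod", "Copepod"),
  org_l     = c(1, 3, 10, 20)
)

# Mean by group and lake
toy_modified.df <- toy.df %>%
  group_by(group, lake_name) %>%
  mutate(mean_org_l = mean(org_l, na.rm = TRUE)) %>%
  ungroup()
```

The same group_by(group, lake_name) call should work on lakes.df, since it has both columns.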

# Mean and Standard Error -----
# there is no built-in standard error function, and n() would count
# NA rows too, so we count the non-missing values with sum(!is.na())
lakes_modified.df <- lakes.df %>%
  group_by(group) %>%
  mutate(mean_org_l = mean(org_l, na.rm = TRUE),
         se_org_l = sd(org_l, na.rm = TRUE) / sqrt(sum(!is.na(org_l))))
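
To see why the standard error divides by sum(!is.na(org_l)) rather than the number of rows, here is a small base R sketch with a made-up vector x (hypothetical values) that contains one NA:

```r
# Hypothetical values with one missing observation
x <- c(2, 4, 6, NA)

n_rows  <- length(x)       # counts the NA as well
n_valid <- sum(!is.na(x))  # counts only the non-missing values

# Standard error = standard deviation / sqrt(number of observations used)
se_x <- sd(x, na.rm = TRUE) / sqrt(n_valid)
```

Dividing by sqrt(n_rows) here would make the standard error look smaller than it should be, because the NA contributes nothing to sd().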

So mutate is a key function we will use a lot in the future, but it only adds a new column; the number of rows stays the same.
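
A quick way to see the difference between adding a column and building a summary is to compare row counts. This is a sketch on a made-up toy.df (hypothetical values), not the lake data:

```r
library(dplyr)

# Hypothetical data: two groups, two rows each
toy.df <- tibble(
  group = c("Cladoceran", "Cladoceran", "Copepod", "Copepod"),
  org_l = c(1, 3, 10, 20)
)

# mutate keeps every row and repeats the group mean on each one
toy_mutate.df <- toy.df %>%
  group_by(group) %>%
  mutate(mean_org_l = mean(org_l, na.rm = TRUE))

# summarize collapses the data to one row per group
toy_summary.df <- toy.df %>%
  group_by(group) %>%
  summarize(mean_org_l = mean(org_l, na.rm = TRUE))

nrow(toy_mutate.df)   # same number of rows as toy.df
nrow(toy_summary.df)  # one row per group
```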

Summarize data

What if we wanted a summary dataset rather than adding a new column?

# there are two ways...
# the first is to do all of the math manually
lakes_summary.df <- lakes.df %>%
  group_by(lake_name, group) %>%
  summarize(mean_org_l = mean(org_l, na.rm = TRUE),
            se_org_l = sd(org_l, na.rm = TRUE) / sqrt(sum(!is.na(org_l))))

## `summarise()` has grouped output by 'lake_name'. You can override using the `.groups` argument.
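
The message above is informational: after summarizing by two variables, the result is still grouped by the first one. If you would rather get an ungrouped result and no message, one option is the .groups argument. A sketch on a made-up toy.df (hypothetical values):

```r
library(dplyr)

# Hypothetical data: two lakes, two groups
toy.df <- tibble(
  lake_name = c("A", "A", "B", "B"),
  group     = c("Cladoceran", "Copepod", "Cladoceran", "Copepod"),
  org_l     = c(1, 3, 10, 20)
)

# .groups = "drop" returns an ungrouped data frame
toy_summary.df <- toy.df %>%
  group_by(lake_name, group) %>%
  summarize(mean_org_l = mean(org_l, na.rm = TRUE),
            .groups = "drop")
```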

The other way to do this is to use skimr to look at summary data.

lakes.df %>% group_by(lake_name, group) %>% skim(org_l)

Data summary
Name	Piped data
Number of rows	1368
Number of columns	7
_______________________
Column type frequency:
numeric	1
________________________
Group variables	lake_name, group

Variable type: numeric

skim_variable	lake_name	group	complete_rate	mean	sd	p50	p75	p100	hist
org_l	Grass	Cladoceran	1	1.50	3.12	0.08	1.63	19.48	▇▁▁▁▁
org_l	Grass	Copepod	1	4.88	9.47	0.00	4.51	46.06	▇▁▁▁▁
org_l	Indian	Cladoceran	1	2.58	7.19	0.00	0.80	56.20	▇▁▁▁▁
org_l	Indian	Copepod	1	3.21	6.72	0.07	1.96	34.22	▇▁▁▁▁
org_l	South	Cladoceran	1	1.44	5.00	0.00	0.62	55.60	▇▁▁▁▁
org_l	South	Copepod	1	3.27	7.76	0.00	2.04	56.03	▇▁▁▁▁
org_l	Willis	Cladoceran	1	1.87	5.91	0.00	1.06	48.35	▇▁▁▁▁
org_l	Willis	Copepod	1	2.35	7.00	0.00	0.73	57.34	▇▁▁▁▁

# this can be saved to a data frame as well
skim.df <- lakes.df %>% dplyr::group_by(group) %>% skim(org_l)
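
Once the skim output is saved, it can be used like any other data frame. In the saved object the numeric summaries carry a numeric. prefix (for example numeric.mean), even though the printed table shows them as mean and sd. A sketch on a made-up toy.df (hypothetical values), assuming a current version of skimr:

```r
library(dplyr)
library(skimr)

# Hypothetical data: two groups, two rows each
toy.df <- tibble(
  group = c("a", "a", "b", "b"),
  org_l = c(1, 3, 10, 20)
)

toy_skim.df <- toy.df %>%
  group_by(group) %>%
  skim(org_l)

# Convert to a plain data frame, then keep only the columns we want
toy_means.df <- toy_skim.df %>%
  as.data.frame() %>%
  select(group, numeric.mean, numeric.sd)
```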

There are a lot of things we can do with mutate, and the possibilities are endless. What would you like to see done?

Modifying variables - mutate and summarize

Bill Perry

2019/10/26
