Coding workshop: Week 5

Analysis of variance

tidyverse
lterdatasampler
rstatix
aov
summary
TukeyHSD
eta_squared
read_csv
pipe operators
|>
select
rename
ggplot
geom_histogram
geom_qq
geom_qq_line
facet_wrap
group_by
summarize
as_factor
mutate
Author
Affiliation
Modified

February 6, 2026

Workshop date: Friday 6 February

1. Summary

Packages

  • tidyverse
  • lterdatasampler
  • rstatix

Operations

New functions

  • make sure a variable is a factor using as_factor()
  • do an ANOVA using aov()
  • look for more information from model results using summary()
  • do post-hoc Tukey test using TukeyHSD()
  • calculate effect size for ANOVA using rstatix::eta_squared()

Review

  • read in data using read_csv()
  • chain functions together using |>
  • modify columns using mutate()
  • select columns using select()
  • rename columns using rename()
  • visualize data using ggplot()
  • create histograms using geom_histogram()
  • visualize QQ plots using geom_qq() and geom_qq_line()
  • create multi-panel plots using facet_wrap()
  • group data using group_by()
  • summarize data using summarize()

Data sources

The Plum Island Ecosystem fiddler crab data is from lterdatasampler (data info here). Crab carapace width is measured in mm.

2. Code

1. Packages

library(tidyverse) # general use
library(lterdatasampler) # data package
library(rstatix) # effect size calculations

data(pie_crab) # displaying data in Environment

2. Parametric tests

a. Cleaning and wrangling

  • Create a new object called pie_crab_clean.
  • Filter to only include the following sites: Cape Cod, Virginia Coastal Reserve LTER, and Zeke’s Island NERR.
  • Make sure site and name are set as factors.
  • Select the columns of interest: site, name, and code.
  • Rename the column name to site_name and code to site_code.
pie_crab_clean <- pie_crab |> # start with the pie_crab dataset
  filter(site %in% c("CC", "ZI", "VCR")) |> # filter for Cape Cod, Zeke's Island, Virginia Coastal
  mutate(name = as_factor(name), # make sure site name is being read in as a factor
         site = as_factor(site)) |> 
  select(site, name, size) |> # select columns of interest
  rename(site_code = site, # rename site to be site_code
         site_name = name) # rename name to be site_name

# display some rows from the data frame
# useful for showing a small part of the data frame
slice_sample(pie_crab_clean, # data frame
             n = 10) # number of rows to show
# A tibble: 10 × 3
   site_code site_name                      size
   <fct>     <fct>                         <dbl>
 1 VCR       Virginia Coastal Reserve LTER 15.1 
 2 VCR       Virginia Coastal Reserve LTER 19.2 
 3 ZI        Zeke's Island NERR             8.05
 4 ZI        Zeke's Island NERR             9.81
 5 ZI        Zeke's Island NERR             9.86
 6 VCR       Virginia Coastal Reserve LTER 20.8 
 7 ZI        Zeke's Island NERR            12.7 
 8 CC        Cape Cod                      14.4 
 9 ZI        Zeke's Island NERR            13.6 
10 ZI        Zeke's Island NERR            11.1 

b. Quick summary

  • Create a new object called pie_crab_summary.
  • Calculate the mean, variance, and sample size. Display the object.
# creating a new object called pie_crab_summary
pie_crab_summary <- pie_crab_clean |> # starting with clean data frame
  group_by(site_name) |> # group by site
  summarize(mean = mean(size), # calculate mean size
            var = var(size), # calculate variance of size
            n = length(size)) |> # calculate number of observations per site (sample size)
  ungroup() # ungrouping

# display pie_crab_summary
pie_crab_summary
# A tibble: 3 × 4
  site_name                      mean   var     n
  <fct>                         <dbl> <dbl> <int>
1 Zeke's Island NERR             12.1  4.04    35
2 Virginia Coastal Reserve LTER  16.3  8.63    30
3 Cape Cod                       16.8  4.22    27

c. Exploring the data

Create a jitter plot with the mean crab carapace width for each site.

# base layer: ggplot
ggplot(pie_crab_clean, # use the clean data set
       aes(x = site_name, # x-axis
           y = size)) + # y-axis
  # first layer: jitter (each point is an individual crab)
  geom_jitter(height = 0, # don't jitter points vertically
              width = 0.15) + # narrower jitter (easier to see)
  # second layer: point representing mean
  stat_summary(geom = "point", # geometry being plotted
               fun = mean, # function (calculating the mean)
               color = "red", # color to make it easier to see
               size = 5) + # larger point size
  theme_minimal() + # cleaner plot theme
  labs(x = "Site name",
       y = "Carapace width (mm)")

Is there a difference in mean crab carapace width between the three sites?

Yes, and Zeke’s Island NERR crabs tend to be smaller than those from Cape Cod or Virginia Coastal Reserve LTER.

d. Check 1: normally distributed variable

  • Create a histogram of crab size.
  • Facet your histogram so that you have 3 panels, with one panel for each site.
# base layer: ggplot
ggplot(data = pie_crab_clean, 
       aes(x = size)) + # x-axis
  # first layer: histogram
  geom_histogram(bins = 6, # bins from Rice Rule (approximate)
                 fill = "orange",
                 color = "black") + 
  # faceting by site_name: creating 3 different panels
  facet_wrap(~ site_name) +
  theme_minimal() +
  labs(x = "Crab carapace width (mm)",
       y = "Frequency")

  • Create a QQ plot of crab size.
  • Facet your QQ plot so that you have 3 panels, with one panel for each site.
# base layer: ggplot
ggplot(data = pie_crab_clean, 
       aes(sample = size)) + # y-axis
  # first layer: QQ plot reference line
  geom_qq_line() + 
  # second layer: QQ plot points (actual observations)
  geom_qq(color = "orange") + 
  # faceting by site_name
  facet_wrap(~ site_name, 
             scales = "free") + # let axes vary between panels
  theme_minimal() +
  labs(x = "Theoretical (standard normal) quantiles",
       y = "Sample quantiles")

What are the outcomes of your visual checks?

Not perfect (crab carapace width from VCR LTER seems not normally distributed) but with large sample sizes for each group, this might not matter too much.

e. Check 2: equal variances

Do a gut check: is the ratio of the largest variance to the smallest variance less than 3?

8.63/4.04
[1] 2.136139

What are the outcomes of your variance check?

By the rule of thumb (ratio of largest variance to smallest variance < 3), the variances are close enough; the largest variance is about 2.1 times larger than the smallest variance.

f. ANOVA

# creating an object called crab_anova
crab_anova <- aov(size ~ site_name, # formula
                  data = pie_crab_clean) # data

# test object
crab_anova
Call:
   aov(formula = size ~ site_name, data = pie_crab_clean)

Terms:
                site_name Residuals
Sum of Squares   442.2073  497.4372
Deg. of Freedom         2        89

Residual standard error: 2.364145
Estimated effects may be unbalanced
# gives more information
summary(crab_anova)
            Df Sum Sq Mean Sq F value  Pr(>F)    
site_name    2  442.2  221.10   39.56 5.1e-13 ***
Residuals   89  497.4    5.59                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summarize results: is there a difference in mean crab carapace width between the three sites?

There is a difference in mean crab carapace width between Cape Cod, Virginia Coastal Reserve LTER, and Zeke’s Island NERR (one-way ANOVA, F(2, 89) = 39.6, p < 0.001, \(\alpha\) = 0.05).

g. Post-hoc: Tukey HSD

TukeyHSD(crab_anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = size ~ site_name, data = pie_crab_clean)

$site_name
                                                      diff       lwr      upr
Virginia Coastal Reserve LTER-Zeke's Island NERR 4.2719524  2.869912 5.673993
Cape Cod-Zeke's Island NERR                      4.7514709  3.308099 6.194843
Cape Cod-Virginia Coastal Reserve LTER           0.4795185 -1.015317 1.974354
                                                     p adj
Virginia Coastal Reserve LTER-Zeke's Island NERR 0.0000000
Cape Cod-Zeke's Island NERR                      0.0000000
Cape Cod-Virginia Coastal Reserve LTER           0.7255994

Which pairwise comparisons are actually different? Which ones are not different?

Zeke’s Island NERR and Cape Cod crabs are different, and Zeke’s Island NERR and Virginia Coastal Reserve LTER crabs are different. Virginial Coastal Reserve LTER and Cape Cod crabs are not different.

h. effect size

Using eta_squared() from rstatix

eta_squared(crab_anova)
site_name 
0.4706113 

What is the magnitude of the differences between sites in crab size?

There is a large difference in crab size between sites.

i. Putting everything together

We found a large (\(\eta^2\) = 0.47) difference between sites in mean crab carapace width (one-way ANOVA, F(2, 89) = 39.6, p < 0.001, \(\alpha\) = 0.05). The smallest crabs were from Zeke’s Island NERR, which were on average 12.1 mm. Zeke’s Island crabs were 4.8 mm (95% CI: [3.3, 6.2] mm) smaller than crabs from Cape Cod (Tukey’s HSD: p < 0.001) and 4.3 mm (95% CI: [2.9, 5.7] mm) smaller than crabs from Virginia Coastal Reserve LTER (Tukey’s HSD: p < 0.001).

3. Captions

Figure 1.

Figure 1. Crab carapace width differs across sites. Small black points represent measurements of crab carapace widths at three sites: Zeke’s Island National Estuarine Research Reserve (NERR), Virginia Coastal Reserve Long-Term Ecological Research (LTER) site, and Cape Cod. Large red points represent mean crab carapace width. Data originally from: Johnson, D. 2019. Fiddler crab body size in salt marshes from Florida to Massachusetts, USA at PIE and VCR LTER and NOAA NERR sites during summer 2016. ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/4c27d2e778d3325d3830a5142e3839bb. Accessed via lterdatasampler package.

Figure 2.

Figure 2. Histograms of crab carapace width. Columns represent counts of individual crabs. Distributions of crab carapace width are represented in three panels: Zeke’s Island NERR (left), Virginia Coastal Reserve LTER (center), and Cape Cod (right). Data originally from: Johnson, D. 2019. Fiddler crab body size in salt marshes from Florida to Massachusetts, USA at PIE and VCR LTER and NOAA NERR sites during summer 2016. ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/4c27d2e778d3325d3830a5142e3839bb. Accessed via lterdatasampler package.

Alternative option if you have a figure made using the same dataset as a previous figure:

Figure 2. Histograms of crab carapace widths. Bars represent counts of crab sizes. Distributions are shown for three sites: Zeke’s Island NERR (left), Virginia Coastal Reserve (center), and Cape Cod (right). Data source and access as in Figure 1.

Figure 3.

Figure 3. Quantile-Quantile (QQ) plots of crab carapace width. Points represent individual observations of crab carapace width. Lines represent reference line relating theoretical quantiles (x-axis) and sample quantiles (y-axis). Panels represent different sites: Zeke’s Island NERR (left), Virginia Coastal Reserve LTER (center), and Cape Cod (right). Data originally from: Johnson, D. 2019. Fiddler crab body size in salt marshes from Florida to Massachusetts, USA at PIE and VCR LTER and NOAA NERR sites during summer 2016. ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/4c27d2e778d3325d3830a5142e3839bb. Accessed via lterdatasampler package.

Alternative option if you have a figure made using the same dataset as a previous figure:

Figure 3. Quantile-quantile (QQ) plots of crab carapace widths. Points represent individual crab carapace width measurements. Lines represent reference quantiles of sample (y-axis) and standard normal theoretical (x-axis) distributions. Panels, data source, and access as in Figure 2.

END OF WORKSHOP 5

Extra stuff

More applications of stat_summary()

stat_summary() in the ggplot2 package is a useful function to do quick summary visualizations within a plot.

In class, we used stat_summary() to plot mean carapace widths at each site in red (see the “Exploring the data” section above).

You can also use stat_summary() to visualize means and bars (representing standard deviation, standard error, confidence intervals, etc.).

For example, here is a plot using stat_summary() to visualize means and standard errors:

# base layer: ggplot
ggplot(pie_crab_clean, # use the clean data set
       aes(x = site_name, # x-axis
           y = size)) + # y-axis
  # first layer: jitter (each point is an individual crab)
  geom_jitter(height = 0, # don't jitter points vertically
              width = 0.15) + # narrower jitter (easier to see)
  # second layer: point representing mean
  stat_summary(geom = "pointrange", # geometry being plotted
               fun.data = mean_sdl, # function to calculate mean and sd (note new argument name)
               color = "red", # color to make it easier to see
               size = 1) + # larger point size
  theme_minimal() + # cleaner plot theme
  labs(x = "Site name",
       y = "Carapace width (mm)",
       caption = "Mean \U00B1 standard deviation")

And standard error:

# base layer: ggplot
ggplot(pie_crab_clean, # use the clean data set
       aes(x = site_name, # x-axis
           y = size)) + # y-axis
  # first layer: jitter (each point is an individual crab)
  geom_jitter(height = 0, # don't jitter points vertically
              width = 0.15) + # narrower jitter (easier to see)
  # second layer: point representing mean
  stat_summary(geom = "pointrange", # geometry being plotted
               fun.data = mean_se, # function to calculate mean and SE (note new argument name)
               color = "red", # color to make it easier to see
               size = 1) + # larger point size
  theme_minimal() + # cleaner plot theme
  labs(x = "Site name",
       y = "Carapace width (mm)",
       caption = "Mean \U00B1 standard error")

And a 95% confidence interval:

# base layer: ggplot
ggplot(pie_crab_clean, # use the clean data set
       aes(x = site_name, # x-axis
           y = size)) + # y-axis
  # first layer: jitter (each point is an individual crab)
  geom_jitter(height = 0, # don't jitter points vertically
              width = 0.15) + # narrower jitter (easier to see)
  # second layer: point representing mean
  stat_summary(geom = "pointrange", # geometry being plotted
               fun.data = mean_cl_normal, # function to calculate range for confidence interval (note new argument name)
               color = "red", # color to make it easier to see
               size = 1) + # larger point size
  theme_minimal() + # cleaner plot theme
  labs(x = "Site name",
       y = "Carapace width (mm)",
       caption = "Means with 95% confidence interval")