10  Terrain ruggedness

Show the code
state_boundaries_sf <- us_states() |>
  # `dams` data uses `state_abb` rather than 'state_abbr`, which `dam` gets from R's built-in `state.abb` data frame.
  # next 3 lines are a workaround since "rename(state_abb = state_abbr)" doesn't work
  #   "Error in rename.sf(us_states(), state_abb = state_abbr) : internal error: can't find `agr` columns"
  mutate(state_abb = state_abbr) |>
  relocate(state_abb, .after = state_abbr) |>
  select(-state_abbr) |>
  filter(!state_abb %in% c("DC", "HI", "AK", "PR")) |>
  arrange(state_abb) |>
  st_transform("NAD83") # US Government standard CRS

nc_boundary <- state_boundaries_sf |>
  filter(state_abb == "NC")
Show the code
###### Does dam TRI data exist?

states_with_dams <- sort(unique(dams$state_abb))

for(i in seq_len(length(states_with_dams))) {
  
  state_dam_tri_summary_fname <- paste0(state_dam_tri_summary_pname, "/", states_with_dams[i], ".rds")
  
  if(!file.exists(state_dam_tri_summary_fname)) {
    
    errorCondition(
      glue("This chapter ('ruggedness.qmd') needs dam terrain ruggedness (TRI) data created in the 'Exploring ruggedness' chapter.", 
    "\nRerender this chapter after rendering that chapter (from a terminal command line: 'quarto render ruggedness-exploring.qmd')",
    "\nAfter doing that files will exist for each continental state similar to\n{state_dam_tri_summary_fname}"
      )
    )
    
    knitr::knit_exit()
    
  }
}

###### Does state sample TRI data exist?

states_with_points <- states_with_dams

for(i in seq_len(length(states_with_points))) {
  
  state_tri_circle_samples_fname <- paste0(state_tri_circle_samples_pname, "/", states_with_points[i], ".rds")
  
  if(!file.exists(state_tri_circle_samples_fname)) {
    
    errorCondition(
      glue("This chapter ('ruggedness.qmd') needs state sample terrain ruggedness (TRI) data created in the 'Exploring ruggedness' chapter.", 
    "\nRerender this chapter after rendering that chapter (from a terminal command line: 'quarto render ruggedness-exploring.qmd').",
    "\nAfter doing that files will exist for each continental state similar to\n{state_tri_circle_samples_fname}",
    "\nThen you can rerender 'ruggedness.qmd' from a terminal command line: 'quarto render' or 'quarto rend ruggedness.qmd')."
      )
    )
    
    knitr::knit_exit()
    
  }
}
Show the code
states_with_dams <- sort(unique(dams$state_abb))
n_states_with_dams <- length(states_with_dams)

states_and_regions <- tibble(
  state_region = state.region,
  state_abb = state.abb) |>
  filter(state_abb %in% states_with_dams) |>
  mutate(n_states = n(),
         region_label = glue("{state_region} (n={n_states})"),
         .by = state_region)

###### Get state samples TRI data

state_samples_tri_summary_df <- map(dir(state_tri_circle_samples_pname, full.names = TRUE), 
                                    read_rds) |>
  bind_rows() |>
  left_join(
    states_and_regions,
    by = "state_abb"
  ) |>
  rename(tri_n_sample = tri_n,
         tri_mean_sample = tri_mean,
         tri_median_sample = tri_median,
         tri_sd_sample = tri_sd)


###### Get dam TRI data

# state_dam_tri_summary_fname <- paste0(state_dam_tri_summary_pname, "/", states_with_dams[i], ".rds")

state_dam_tri_summary_df <- map(dir(state_dam_tri_summary_pname, full.names = TRUE), 
             read_rds) |>
  bind_rows() |>
  left_join(
    states_and_regions,
    by = "state_abb"
  ) |>
  rename(tri_n_dam = tri_n,
         tri_mean_dam = tri_mean,
         tri_median_dam = tri_median,
         tri_sd_dam = tri_sd)

dams_with_tri <- dams |>
  inner_join(state_dam_tri_summary_df |>
               select(-c(state_region:region_label)),
             by = join_by(nidId, state_abb)) |>
  filter(nidHeightId != "Undetermined")

n_dams_with_tri <- state_dam_tri_summary_df |>
  nrow()

min_tri_n_dam_state <- count(state_dam_tri_summary_df, state_abb, wt = tri_n_dam) |>
  filter(min(n) == n)

min_dams_n_state <- count(state_dam_tri_summary_df, state_abb) |>
  filter(min(n) == n)

max_dams_n_state <- count(state_dam_tri_summary_df, state_abb) |>
  filter(max(n) == n)


###### Get Blue Ridge Parkway multiline

fname <- here("data/processed/blue-ridge-parkway.rds")
if(!file.exists(fname)) {
  
  results <- st_bbox(nc_boundary) |>
    opq() %>%
    add_osm_feature(key = "name", 
                    value = "Blue Ridge Parkway") %>%
    osmdata_sf()
  
  parkway_lines <- results$osm_multilines |>
    st_transform("NAD83")
  
  parkway_lines_nc <- st_intersection(parkway_lines, nc_boundary)
  
  write_rds(parkway_lines_nc, fname)
  
} else {
  
  parkway_lines_nc <- read_rds(fname)
  
}

10.2 Methodology

10.2.1 Key metric

I calculate terrain ruggedness index (TRI) from digital elevation model (DEM) data I downloaded and saved in GeoTIFF files. The terra::terrain() function calculates TRI for me from the elevation data using eight-neighbor “classic” TRI per Wilson et al. (2007).2

From the terra::terrain() help:

“TRI (Terrain Ruggedness Index) is the mean of the absolute differences between the value of a cell and its 8 surrounding cells.”

Note that TRI values are sensitive to the resolution of the digital elevation model, since differences in elevation differ significantly over various distances.

10.2.2 Data sources

I get elevation data from elevatr::get_elev_raster(), which returns Shuttle Radar Topography Mission (SRTM)3 data via Open Topography. The SRTM’s Digital Elevation Model (DEM) is a Digital Surface Model (DSM) rather than a Digital Terrain Model (DTM). In other words, the radar reflections used to calculate elevations in some cases are from the tops of buildings and partway through the tree canopy; elevation values have not been adjusted to estimate ground level.4 For my purposes, and because I have so many points to work with, this noise is not an issue.

The SRTM digital elevation model (DEM) has a range over continental North America south of 60° N latitude. This range is fine, since the northernmost point in the continental USA is the Northwest Angle Inlet in Lake of the Woods, MN at 49.4°).5

I use zoom level 9 (elevatr::get_elev_raster(z = 9)), which provides data at the resolution of about 125 m per point in North Carolina and greater distances between points in lower latitudes. For more information see Section 11.1.4.1 Resolution of elevation data varies by latitude.

For state boundaries, I use USAboundaries::us_states()

For drawing the Blue Ridge Parkway in North Carolina, I use shapes from OpenStreetMap, which I acquire via the osmdata package.

10.2.3 Data prep

  1. Dam data: Using data the from the National Inventory of Dams, I prepare a data set with 86351 dams that include nidHeightId (and thus nidHeight) along with state abbreviation (state_abb) and \((longitude, latitude)\) location.

  2. State boundaries, etc.: I use USAboundaries::us_states() to get state boundaries and use osmdata::opq() to download OpenStreetMap shapes for the Blue Ridge Parkway in North Carolina.

  3. State-level elevation data: Using state boundaries, I downloaded DEM and saved them as GeoTIFF files for use in the following steps.

  4. Local terrain ruggedness around dams: I define a circle of radius r = 5 km around each dam location, get the elevations for every point in the circle, and calculate the TRI for the points. Then I create summary statistics for each dam’s circle (tri_mean_dam, tri_median_dam, tri_sd_dam, tri_n_dam points).

  5. Local terrain ruggedness around random points: The number of dams vary by state. So for each state, I randomly sample the same number of points as there are dams in that state, define circles of radius r = 5 km, get the elevations for every point in the circle, and calculate the TRI for the points. Then I create summary statistics for each sample’s circle (tri_mean_sample, tri_median_sample, tri_sd_sample, tri_n_sample points).

For more information about data preparation, questions I asked, and trade-offs I made along the way, see Chapter 11 Preparing terrain ruggedness.

10.2.4 Exploration

To explore \(H_1:\) in Section 10.3:

  • I plot the distribution of \(tri\_mean_{dam}\) and \(tri\_mean_{sample}\) by state_abb, state_region, and nidHeightId.

  • I sort the dam and state samples in order of \(tri\_mean_{dam}\) and \(tri\_mean_{sample}\) respectively, and I pair them in linear regressions with various models to check which models explain the most variance. I plot some of the model parameter (a.k.a. term) estimates.

To explore \(H_2:\)in Section 10.4:

  • I plot the distributions of \(surfaceArea\), \(nidStorage\), and \(\frac{surfaceArea}{nidStorage}\) aggregated by state_abb and nidHeigthId.

  • I sort the dam and state samples in order of \(tri\_mean_{dam}\) and \(tri\_mean_{sample}\) respectively, and I pair them in linear regressions predicting \(tri\_mean_{sample}\) using various models to check which models explain the most variance.

To explore \(H_3:\)in Section 10.5:

  • I plot the distributions of \(surfaceArea\), \(nidStorage\), and \(\frac{surfaceArea}{nidStorage}\) aggregated by state_abb and nidHeigthId.

  • I use the same data set as in \(H_2\) and run linear regressions predicting \(tri\_mean_{dam}\) using various models to check which models explain the most variance.


10.3 Exploring H1 : Local terrain ruggedness in the area around dams is greater than the ruggedness of areas around randomly sampled points in the contentinental US states

The area around a dam has greater terrain ruggedness than its paired state sample if

\[tri\_mean_{dam} - tri\_mean_{sample} > 0\]

When the differences in \(tri\_mean_{dam}\) and \(tri\_mean_{sample}\) are aggregated by state, region, or other property, their distributions and differences in distribution medians allow us to make some generalizations:

\[{median(tri\_mean_{dam} - tri\_mean_{sample})} > 0\]

for some aggregation.

10.3.1 First: dam and sample distributions

Of note re: Figure 10.1:

  1. There is a lot of variation in the distributions and median tri_means of \(tri\_mean_{dam}\) and \(tri\_mean_{sample}\) by state. And I didn’t expect West Virginia to have the highest median \(tri\_mean_{sample}\).

  2. While there are states that have high median \(tri\_mean_{dam}\) for the local areas around some dams, there is no concentration in the distributions at these higher values. Ruggedness varies a lot more than flatness.

  3. There is wide variation in number of points in each state, proportional to the number of dams in the state. DE has the fewest points: 471,661, which is more than enough for my purposes.

Show the code
dta_for_plot <- state_dam_tri_summary_df |>
  mutate(state_mean_dam = weighted.mean(tri_mean_dam, w = tri_n_dam),
         state_median_dam = median(tri_mean_dam),
         .by = c(state_abb, state_region, region_label)) |>
  mutate(state_abb = fct_reorder(state_abb, state_median_dam))

p1_dam_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(tri_mean_dam, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = state_dam_tri_summary_df |>
      mutate(
        state_mean_dam = weighted.mean(tri_mean_dam, w = tri_n_dam),
        state_median_dam = median(tri_mean_dam),
        .by = state_abb
      ) |>
      distinct(state_abb, state_median_dam),
    aes(state_median_dam, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = guide_legend(position = "inside")) +
  theme(legend.position.inside = c(0.75, 0.2)) +
  coord_cartesian(xlim = c(0, 28)) +
  labs(
    subtitle = "A. Distribution of tri_mean<sub>dam</sub><br>",
    y = NULL
  )

p2 <- dta_for_plot |>
  ggplot() +
  geom_col(
    aes(tri_n_dam, state_abb, fill = region_label),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  scale_x_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  labs(
    subtitle = "B. Number of TRI values<br>",
    y = NULL
  )

p1_dam_state + p2 +
  plot_annotation(
    title = "Distribution of *tri_mean<sub>dam</sub>* for local areas around dams",
    subtitle = "Red indicator is median tri_mean<sub>dam</sub><br>",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.1: Distributions of tri_meandam in areas around dams in each state


Of note re: Figure 10.2

  • The Northeast and West regions have the highest median \(tri\_mean_{dam}\) (panel A).

  • The median \(tri\_mean_{dam}\) is higher for every higher category of nidHeightId (panel B).

Show the code
dta_for_plot <- state_dam_tri_summary_df |>
  mutate(region_mean_dam = weighted.mean(tri_mean_dam, w = tri_n_dam),
         region_median_dam = median(tri_mean_dam),
         .by = c(region_label))

dta_for_plot_p1_region_medians <- dta_for_plot |>
  distinct(region_label, .keep_all = TRUE)

p1_dam_region <- dta_for_plot |>
  ggplot() +
  geom_density(
    aes(tri_mean_dam, fill = region_label),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot_p1_region_medians,
    aes(region_median_dam, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  geom_text(
    data = dta_for_plot_p1_region_medians,
    aes(x= I(0.7), y = I(0.8), 
        label = glue("Median: {round(region_median_dam, digits = 1)}")
        )
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  facet_wrap(~region_label, ncol = 1) +
  coord_cartesian(xlim = c(0, 55)) +
  labs(
    subtitle = "A. Distribution of tri_mean<sub>dam</sub>\nby region<br>",
    y = NULL
  )

dta_for_plot_p2 <- dams_with_tri |>
  st_drop_geometry() |>
  filter(nidHeightId != "Undetermined") |>
  select(nidId, state_abb, nidHeight, nidHeightId, starts_with("tri")) |>
  mutate(nidHeightId_tri_mean_dam = weighted.mean(tri_mean_dam, w = tri_n_dam),
         nidHeightId_tri_median_dam = median(tri_mean_dam),
         .by = nidHeightId)

dta_for_plot_p2_nidHeightId_medians <- dta_for_plot_p2 |>
  distinct(nidHeightId, .keep_all = TRUE)

p2_dam_nidHeightId <- dta_for_plot_p2 |>
  ggplot() +
  geom_density(
    aes(tri_mean_dam),
    linewidth = 0.1,
    fill = "grey80",
    color = NA,
  ) +
  geom_text(
    data = dta_for_plot_p2_nidHeightId_medians,
    aes(nidHeightId_tri_median_dam, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  geom_text(
    data = dta_for_plot_p2_nidHeightId_medians,
    aes(x= I(0.7), y = I(0.8), 
        label = glue("Median: {round(nidHeightId_tri_median_dam, digits = 1)}")
        )
  ) +
  facet_wrap(~nidHeightId, ncol = 1) +
  coord_cartesian(xlim = c(0, 55)) +
  labs(
    subtitle = "B. Distribution of tri_mean<sub>dam</sub>\nby nidHeightId<br>",
    y = NULL
  )

p1_dam_region + p2_dam_nidHeightId +
  plot_annotation(
    title = "Distribution of *tri_mean<sub>dam</sub>* for local areas around dams",
    subtitle = "Red indicator is median tri_mean<sub>dam</sub><br>",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.2: Distribution of tri_meandam of areas around dams by nidHeightId and region


Of note re: Figure 10.3:

  • States with large flat areas as well as mountains have a distribution with a mode in low TRI values and a long tail. See for example, CO, NC, OH, TX (panel A).
Show the code
dta_for_plot <- state_samples_tri_summary_df |>
  mutate(state_mean_sample = weighted.mean(tri_mean_sample, w = tri_n_sample),
         state_median_sample = median(tri_mean_sample),
         .by = state_abb) |>
  mutate(state_abb = fct_reorder(state_abb, state_median_sample))
  
dta_for_plot2 <- dta_for_plot |>
  mutate(region_mean_sample = weighted.mean(tri_mean_sample, w = tri_n_sample),
         region_median_sample = median(tri_mean_sample),
         .by = region_label) #|>
  # FYI: need to adjust factors to match regions in panel A if I use the fct_reorder() line below
  # mutate(region_label = fct_reorder(region_label, region_mean)) 

dta_for_plot2_region_medians <- dta_for_plot2 |>
  distinct(region_median_sample, .keep_all = TRUE)

p1_sample_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(tri_mean_sample, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = state_samples_tri_summary_df |>
      mutate(
        state_mean_sample = weighted.mean(tri_mean_sample, w = tri_n_sample),
        state_median_sample = median(tri_mean_sample),
        .by = state_abb
      ) |>
      distinct(state_abb, state_median_sample),
    aes(state_median_sample, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_x_continuous(breaks = 0:10 * 5) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = guide_legend(position = "inside")) +
  theme(legend.position.inside = c(0.75, 0.15)) +
  coord_cartesian(xlim = c(0, 30)) +
  labs(
    subtitle = "A. Distribution of tri_mean<sub>sample</sub><br>",
    y = NULL
  )

p2 <- dta_for_plot |>
  ggplot() +
  geom_col(
    aes(tri_n_sample, state_abb, fill = region_label),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  scale_x_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  labs(
    subtitle = "B. Number of TRI values<br>",
    y = NULL
  )

p1_sample_state + p2 +
  plot_annotation(
    title = "Distribution of *tri_mean<sub>sample</sub>* for local area<br>around sampled points in each state",
    subtitle = "Red indicator is median tri_mean<sub>sample</sub><br>",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.3: Distribution of tri_meansample for areas around randomly sampled points in each state


Of note re: fig-state-local-area-tri-summary-density-non-state:

  • Similar to Figure 10.2, there is more variation in \(tri\_mean_{sample}\) in the Northeast and West, which have higher median \(tri\_mean_{sample}\) than the North Central and South regions (panel A).

  • North Central looks by eye to be the region with a distribution of \(tri\_mean_{sample}\) closest to the overall distribution (panel B), however Northeast region has the closest median \(tri\_mean_{sample}\).

Show the code
dta_for_plot <- state_samples_tri_summary_df |>
  mutate(region_mean_sample = weighted.mean(tri_mean_sample, w = tri_n_sample),
         region_median_sample = median(tri_mean_sample),
         .by = c(region_label))

dta_for_plot_p1_region_medians <- dta_for_plot |>
  distinct(region_label, .keep_all = TRUE)

p1_sample_region <- dta_for_plot |>
  ggplot() +
  geom_density(
    aes(tri_median_sample, fill = region_label),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot_p1_region_medians,
    aes(region_mean_sample, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  geom_text(
    data = dta_for_plot_p1_region_medians,
    aes(x= I(0.7), y = I(0.8), 
        label = glue("Median: {round(region_median_sample, digits = 1)}")
        )
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  facet_wrap(~region_label, ncol = 1) +
  coord_cartesian(xlim = c(0, 55)) +
  labs(
    subtitle = "A. Distribution of tri_mean<sub>sample</sub> by region<br>",
    y = NULL
  )

dta_for_plot_p1_overall_median <- dta_for_plot |>
  summarize(overall_tri_median_sample = mean(tri_mean_sample))

p2_sample_overall <- dta_for_plot |>
  ggplot() +
  geom_density(
    aes(tri_mean_sample),
    linewidth = 0.1,
    fill = "grey80",
    color = NA,
    # alpha = 0.3
  ) +
  geom_text(
    data = dta_for_plot_p1_overall_median,
    aes(overall_tri_median_sample, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  geom_text(
    data = dta_for_plot_p1_overall_median,
    aes(x= I(0.7), y = I(0.8), 
        label = glue("Median: {round(overall_tri_median_sample, digits = 1)}")
        )
  ) +
  # facet_wrap(~nidHeightId, ncol = 1) +
  coord_cartesian(xlim = c(0, 55)) +
  labs(
    subtitle = "B. Distribution of tri_mean<sub>sample</sub><br>",
    y = NULL
  )

p1_sample_region + p2_sample_overall +
  plot_annotation(
    title = "Distribution of *tri_mean<sub>sample</sub>* for local areas around random points",
    subtitle = "Red indicator is median tri_mean<sub>sample</sub><br>",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.4: Distribution of tri_meansample for areas around randomly sampled points in each state


10.3.2 Second: difference in tri_meandam - tri_meansample

Show the code
tri_dam_tmp <- dams_with_tri |>
      st_drop_geometry() |>
      select(nidId, state_abb, nidHeight, nidHeightId, nidStorage, surfaceArea, starts_with("tri_")) |>
  arrange(state_abb, tri_mean_dam)

n_tri_dam_tmp <- tri_dam_tmp |>
  count(state_abb, name = "n_dams")

tri_sample_tmp <- state_samples_tri_summary_df

check_n <- left_join(
  dams_with_tri |>
    st_drop_geometry() |>
    count(state_abb, name = "n_dam"),
  state_dam_tri_summary_df |>
    count(state_abb, name = "n_sample"),
  by = "state_abb"
) |>
  mutate(
    n_diff = n_sample - n_dam
  )

# Need n_diff to be zero for all states to pair tri_mean_dam and tri_mean_sample in dta_for_model

if(sum(check_n$n_diff) != 0) {
  # We have some work to do
  
  tri_sample_tmp2 <- tri_sample_tmp |>
    left_join(check_n,
              by = "state_abb")
  
  if(any(check_n$n_diff) < 0) {
    stop("Code block define-tri-diff-model-data: at least one state has fewer rows of sample points than dams. Fix this manually.")
  } 
  
  tri_sample_tmp3 <- tri_sample_tmp2 |>
    mutate(
      test_n_sample = n_sample - n_diff,
      test_tri_mean_sample = if_else(row_number() <= n_diff,
                              NA_real_,
                              tri_mean_sample),
      .by = state_abb
    ) |>
    filter(!is.na(test_tri_mean_sample)) |>
    mutate(#test_tri_n_sample = test_tri_mean_sample,
           n_diff2 = test_n_sample - n_dam
    ) |>
    select(-starts_with("test")) |>
    # TODO: solve the following kludge at source and remove this filter
    filter(!(state_abb == "TX" & row_number() <=3) &
             !(state_abb == "VT" & row_number() == 1) &
             !(state_abb == "WV" & row_number() == 1),
           .by = state_abb)
  
  n_tri_sample_tmp3 <- tri_sample_tmp3 |>
    count(state_abb, name = "tri_sample_n2")
  
  tri_check_tmp <- 
    inner_join(
      n_tri_sample_tmp3,
      n_tri_dam_tmp,
      by = join_by(state_abb)
    ) |>
    mutate(dif = tri_sample_n2 - n_dams) |>
    arrange(desc(dif))
  
  state_abb_levels <- tri_dam_tmp |>
    mutate(mean_tri_mean_dam = mean(tri_mean_dam),
           .by = state_abb
    ) |>
    arrange(mean_tri_mean_dam) |>
    distinct(state_abb, mean_tri_mean_dam) |>
    mutate(state_abb = as_factor(state_abb)) # keeps current order
  
} else {
  # no adjustments needed
  
  state_tri_tmp3 <- tri_dam_tmp #
  
}

dta_for_model <- 
  bind_cols(
    tri_sample_tmp3 |>
      group_by(state_abb) |>
      arrange(tri_mean_sample,
              .by_group = TRUE) |>
      ungroup(),
    tri_dam_tmp |>
      group_by(state_abb) |>
      arrange(tri_mean_dam,
              .by_group = TRUE) |>
      ungroup() |>
      select(-state_abb) # already in state_tri_tmp
  ) |>
  mutate(tri_diff_dam_sample = tri_mean_dam - tri_mean_sample,
         area_to_volume_dam = surfaceArea / nidStorage,
         state_abb = factor(state_abb, levels = state_abb_levels$state_abb)
         ) |>
  filter(nidId != "NM00039", # surfaceArea seems to be an error (or very major outlier)
         nidId != "MI00650") # Lake Superior adjustments as noted in an earlier chapter

n_dams_by_nidHeightId <- dta_for_model |>
  count(nidHeightId) |>
  mutate(
    supports_h1 = nidHeightId %in% c("Less than 25 feet", "25-50 feet")
  )

n_dams_support_h1 <- sum(n_dams_by_nidHeightId[n_dams_by_nidHeightId$supports_h1, ]$n)
pct_dams_support_h1 <- n_dams_support_h1 / sum(n_dams_by_nidHeightId$n)

states_support_h1 <- dta_for_model |>
  mutate(
    supports_h1 = tri_mean_dam > tri_mean_sample 
  ) |>
  reframe(
    n_dams_state = n(),
    n_dams_state_support_h1 = sum(supports_h1),
    .by = state_abb
  ) |>
  mutate(
    pct_dams_in_state_support_h1 = n_dams_state_support_h1 / n_dams_state
  )

n_states_support_h1 <- states_support_h1 |>
  filter(pct_dams_in_state_support_h1 > 0.5) |>
  nrow()

Of note re: Figure 10.5 and related data:

Aggregations in which \(median({tri\_mean_{dam} - tri\_mean_{sample}}) > 0\) have a higher terrain ruggedness than the state in general. The panels in Figure 10.5 reveal the following:

  • Panel A: there is a lot of variation in the distributions by state.

    • 30 of 48 states (62%) have positive \({median(tri\_mean_{dam} - tri\_mean_{sample})} > 0\), supporting \(H_1\).

    • in flatter states \(H_1\) holds true. In mountainous states it does not.

  • Panel B: the median for all dams and samples is slightly positive, supporting \(H_1\).

  • Panel C: North Central and South regions have a (mildly) positive median. These, the flattest regions, support \(H_1\).

  • Panel D:nidHeightId categories < 50 ft have positive means, which includes most dams

    • 79305 of 85854 (92%) support \(H_1\).
Show the code
data_for_plot <- dta_for_model |>
  mutate(state_abb = fct_reorder(state_abb, tri_diff_dam_sample))

p1_medians <- data_for_plot |>
  summarize(state_median = median(tri_diff_dam_sample),
         .by = state_abb)

p1_diff_dam_sample <- data_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(tri_diff_dam_sample, state_abb, fill = region_label),
    rel_min_height = 0.005,
    quantile_lines = TRUE,
    quantiles = 2,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_vline(
    xintercept = 0,
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5) +
  geom_text(
    data = p1_medians,
    aes(state_median, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = guide_legend(position = "inside")) +
  theme(legend.position.inside = c(0.25, 0.85)) +
  coord_cartesian(xlim = c(-13, 6)) +
  labs(
    subtitle = "A. By state",
    y = NULL,
    fill = "Region (n=states)"
  )

p2_median <- data_for_plot |>
  summarize(state_median = median(tri_diff_dam_sample))

p2_diff_all <- data_for_plot |>
  ggplot() +
  geom_density(
    aes(tri_diff_dam_sample),
    linewidth = 0.1,
    color = "grey80",
    fill = "grey80",
    # alpha = 0.5
  ) +
  geom_vline(
    xintercept = 0,
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5) +
  geom_text(
    data = p2_median,
    aes(state_median, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  annotate("text", x= I(0.3), y = I(0.8), label = glue("Median: {round(p2_median$state_median, digits = 2)}")) +
  labs(
    subtitle = "B. All in one"
  )

p3_medians <- data_for_plot |>
  summarize(median_tri_diff_dam_state = median(tri_diff_dam_sample),
            .by = region_label)

p3_diff_region <- data_for_plot |>
  ggplot() +
  geom_density(
    aes(tri_diff_dam_sample, fill = region_label),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_vline(
    xintercept = 0,
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5) +
  geom_text(
    data = p3_medians,
    aes(median_tri_diff_dam_state, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  geom_text(
    data = p3_medians,
    aes(x= I(0.3), y = I(0.8), 
        label = glue("Median: {round(median_tri_diff_dam_state, digits = 1)}")
        )
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  facet_wrap(~region_label, ncol = 2) +
  labs(
    subtitle = "C. By region",
    y = NULL
  )

p4_medians <- data_for_plot |>
  summarize(median_tri_diff_dam_state = median(tri_diff_dam_sample),
            .by = nidHeightId)

p4_diff_nidHeightId <- data_for_plot |>
  ggplot() +
  geom_density(
    aes(tri_diff_dam_sample, fill = nidHeightId),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_vline(
    xintercept = 0,
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5) +
  geom_text(
    data = p4_medians,
    aes(median_tri_diff_dam_state, 0, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  geom_text(
    data = p4_medians,
    aes(x= I(0.3), y = I(0.8), 
        label = glue("Median: {round(median_tri_diff_dam_state, digits = 2)}")
        )
  ) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  facet_wrap(~nidHeightId, ncol = 2) +
  labs(
    subtitle = "D. By nidHeightId",
    y = NULL
  )

p1_diff_dam_sample + (p2_diff_all / p3_diff_region / p4_diff_nidHeightId) +
  plot_annotation(
    title = "Distributions of *tri_mean<sub>dam</sub> - tri_mean<sub>sample</sub>*<br>",
    subtitle = glue("TRI means calculated from 5 km circles around each of {n_dams} dams",
                    " and same number of random samples in each state.",
                    "\nRed indicators are median(tri_mean<sub>dam</sub> - tri_mean<sub>sample</sub>)<br>"),
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.5: Distribution of dam tri_meandam and tri_meanstate sample by state


Of note re: fig-three-tri-mean-distributions-by-date:

  • This plot of distributions is a recap of earlier plots. I place them next to each other for easy comparison.

  • Where the distribution of \(tri\_mean_{dam}\) (panel A) differs from \(tri\_mean_{sample}\) (panel B), it tells us the ruggedness around dams is not representative of the state. For example, Colorado doesn’t seem to have many dams in the flat, eastern part of the state–or the dams in the eastern part are in relatively rare hilly areas. I also note that more distributions in panel B seem to have long tails of high \(tri\_mean_{sample}\) values.

  • Panel C is ordered by the size of the difference \(tri\_mean_{dam} - tri\_mean_{sample}\) for each state (a different ordering than panel A or B). States with a positive difference support \(H_1\); those near zero or negative challenge it.

Show the code
(p1_dam_state + 
   labs(subtitle = "A. Distribution of tri_mean_dam")) + 
  (p1_sample_state + 
     labs(subtitle = "B. Distribution of tri_mean_sample")) + 
  (p1_diff_dam_sample + 
     labs(subtitle = "C. Distribution of tri_diff_dam_sample")) +
  plot_annotation(
    title = "tri_mean distributions",
    subtitle = glue("Red indicator is median tri_mean for each variable. Ordered by descending median.",
                    "\nNote that each subplot orders the states differently."),
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple()) &
  guides(fill = "none")
Figure 10.6: Distribution of TRI means in each state


Of note re: Figure 10.7:

  • Plots in panels A1 and B1 appear to show that \(tri\_mean_{sample}\) exceeds \(tri\_mean_{dam}\) in most cases, which would disprove \(H_1\).

  • However, panels A1 and B1 hide most of the data, which have low \(tri\_mean\) values. The log-log plots in panels A2 and B2 show that in most cases, \(tri\_mean_{dam}\) exceeds \(tri\_mean_{sample}\).

  • In other words, in flatter areas \(H_1\) holds true. In mountainous areas it does not.

Show the code
p1 <- dta_for_model |>
  filter(tri_mean_dam > 0,
         tri_mean_sample > 0) |>
  ggplot(aes(tri_mean_dam, tri_mean_sample, color = state_abb, group = state_abb)) +
  geom_line(
    linewidth = 0.25,
    alpha = 0.8
  ) +
  geom_abline(
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5
  ) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA),
                  ylim = c(0.01, NA)) +
  labs(
    subtitle = "A1. Line plot",
    x = "tri_mean_dam",
    y = "tri_mean_sample"
  )

p1_log_log <- dta_for_model |>
  filter(tri_mean_dam > 0,
         tri_mean_sample > 0) |>
  ggplot(aes(tri_mean_dam, tri_mean_sample, color = state_abb, group = state_abb)) +
  geom_line(
    linewidth = 0.25,
    alpha = 0.8
  ) +
  geom_abline(
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5
  ) +
  scale_x_log10() +
  scale_y_log10() +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA),
                  ylim = c(0.01, NA)) +
  labs(
    subtitle = "A2. Line plot (log10 scales)",
    x = "tri_mean_dam (log10 scale)",
    y = "tri_mean_sample (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(tri_mean_dam > 0,
         tri_mean_sample > 0) |>
  ggplot(aes(tri_mean_dam, tri_mean_sample, 
             color = state_abb, group = state_abb)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.2,
    alpha = 0.8,
    se = FALSE) +
  geom_abline(
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5
  ) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA),
                  ylim = c(0.01, NA)) +
  guides(color = "none") +
  labs(
    subtitle = "B1. Regression line plot (same data)",
    x = "tri_mean_dam",
    y = "tri_mean_sample"
  )

p2_log_log <- dta_for_model |>
  filter(tri_mean_dam > 0,
         tri_mean_sample > 0) |>
  ggplot(aes(tri_mean_dam, tri_mean_sample, 
             color = state_abb, group = state_abb)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.2,
    alpha = 0.8,
    se = FALSE) +
  geom_abline(
    lty = 2,
    linewidth = 0.25,
    alpha = 0.5
  ) +
  scale_x_log10() +
  scale_y_log10() +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA),
                  ylim = c(0.01, NA)) +
  guides(color = "none") +
  labs(
    subtitle = "B2. Regression line plot (log10 scales)",
    x = "tri_mean_dam (log10 scale)",
    y = "tri_mean_sample (log10 scale)"
  )

p1 + p1_log_log + p2 + p2_log_log +
  plot_annotation(
    title = "tri_mean<sub>sample</sub> by tri_mean<sub>dam</sub> for each state",
    subtitle = "In most states, tri_mean<sub>sample</sub> is larger than tri_mean<sub>dam</sub>, disproving H1<br>",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.7: Distribution of TRI means in each state


10.3.3 Regression models

I ordered \(tri\_mean_{dam}\) and \(tri\_mean_{sample}\) from smallest to largest in each state before combining them in dta_for_model. The data set includes 85,854 pairs in 48 states. DE has the least dams: 83. TX has the most dams: 7276.

Thus, when model mod.h1.1 \(tri\_mean_{dam} \sim \beta_1 tri\_mean_{sample} + \epsilon\) has \(\beta_1 > 1.0\) , it would that \(tri\_mean_{dam}\) in general is larger than \(tri\_mean_{sample}\) , which would consistent with the conclusions from Figure 10.5 in the prior section in supporting \(H_1\).

10.3.3.1 Model mod.h1.1

mod.h1.1 explains most of the variation: \(R^2 = 0.87\).

Show the code
mod.h1.1 <- dta_for_model |>
  lm(tri_mean_dam ~ tri_mean_sample,
     data = _)

summary(mod.h1.1)

Call:
lm(formula = tri_mean_dam ~ tri_mean_sample, data = dta_for_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.3141  -0.4044  -0.1503   0.4948  20.5014 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.2566466  0.0071688   175.3   <2e-16 ***
tri_mean_sample 0.6809898  0.0008868   768.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.641 on 85852 degrees of freedom
Multiple R-squared:  0.8729,    Adjusted R-squared:  0.8729 
F-statistic: 5.898e+05 on 1 and 85852 DF,  p-value: < 2.2e-16


\(\beta_1\) in \(\beta_1 tri\_mean_{dam}\) is 0.68, meaning \(tri\_mean_{dam}\) is on average 68% the size of \(tri\_mean_{sample}\), apparently disproving \(H_1\).

Note there is an implicit \(\left\{\beta_1,..\beta_n\right\}\) before each term \(\left\{term_1,..term_n\right\}\) in the regression formulas (and \(+ \epsilon\)) at the end. For the sake of brevity and readability, I omit them from the formulas in the remainder of this chapter.


10.3.3.2 Model mod.h1.2

mod.h1.2 explains more than a third of the variance using only state means: \(R^2 = 0.43\). This is a base model useful when considering the variance explained by other models.

\[mod.h1.2: tri\_mean_{dam} \sim 1 | state\]

Show the code
mod.h1.2 <- dta_for_model |>
  mutate(#tri_mean_sample_scaled = scale(tri_mean_sample),
         tri_mean_dam_scaled = scale(tri_mean_dam)
         ) |>
  lmer(tri_mean_dam_scaled ~ 1 | state_abb,
     data = _)

summary(mod.h1.2)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_dam_scaled ~ 1 | state_abb
   Data: mutate(dta_for_model, tri_mean_dam_scaled = scale(tri_mean_dam))

REML criterion at convergence: 207681.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5983 -0.4175 -0.1013  0.2144 14.4448 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 0.5590   0.7477  
 Residual              0.6552   0.8095  
Number of obs: 85854, groups:  state_abb, 48

Fixed effects:
            Estimate Std. Error t value
(Intercept)   0.2026     0.1080   1.876
Show the code
r_squared <- rsq(mod.h1.2)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 3)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 3)}", 
     " and random effects: {round(r_squared$random, digits = 3)}"
  )


For this model, which includes only intercepts for each state: r_squared: 0.395 consisting of fixed effects: -0.041 and random effects: 0.436.


10.3.3.3 Model mod.h1.3

mod.h1.3 explains more than a third of the variance using only state means: \(R^2 = 0.39\). This is a base model useful when considering the variance explained by other models.

\[mod.h1.3: tri\_mean_{sample} \sim 1 | state\]

Show the code
mod.h1.3 <- dta_for_model |>
  mutate(tri_mean_sample_scaled = scale(tri_mean_sample),
         # tri_mean_dam_scaled = scale(tri_mean_dam)
         ) |>
  lmer(tri_mean_sample_scaled ~ 1 | state_abb,
     data = _)

summary(mod.h1.3)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_sample_scaled ~ 1 | state_abb
   Data: 
mutate(dta_for_model, tri_mean_sample_scaled = scale(tri_mean_sample),      )

REML criterion at convergence: 208106

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4398 -0.3894 -0.1136  0.2037 10.8430 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 0.5491   0.7410  
 Residual              0.6585   0.8115  
Number of obs: 85854, groups:  state_abb, 48

Fixed effects:
            Estimate Std. Error t value
(Intercept)   0.1903     0.1070   1.778
Show the code
r_squared <- rsq(mod.h1.3)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 3)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 3)}", 
     " and random effects: {round(r_squared$random, digits = 3)}"
  )


For this model, which includes only intercepts for each state: r_squared: 0.392 consisting of fixed effects: -0.036 and random effects: 0.429.

Show the code
as.data.frame(ranef(mod.h1.3, condVar = FALSE)) |>
  rename(state_abb = grp) |>
  select(-grpvar) |>
  left_join(
    states_and_regions,
    by = join_by(state_abb)
  ) |>
  mutate(state_abb = fct_reorder(state_abb, condval)) |>
  ggplot(aes(condval, state_abb, fill = region_label)) +
  geom_col() +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = guide_legend(position = "inside")) +
  theme(legend.position.inside = c(0.75, 0.2)) +
  labs(
    title = "mod.h1.3 random effect\nconditional means indicate\nrelative level of mean ruggedness",
    subtitle = "West Virginia has the highest tri_mean, and Delaware has the lowest.",
    x = "Conditional mean",
    y = NULL,
    caption = my_caption_nid_opentopography
  )
Figure 10.8: mod.h1.3 random effect condidtional means


10.3.3.4 Model mod.h1.4

Better is a linear mixed model adjusting for states. mod.h1.3 explains even more of the variation: \(R^2 = 0.95\).

\[mod.h1.4: tri\_mean_{dam} \sim tri\_mean_{sample} | state\]

Show the code
mod.h1.4 <- dta_for_model |> # dams_with_tri |>
  mutate(tri_mean_dam_scaled = scale(tri_mean_dam), 
         tri_mean_sample_scaled = scale(tri_mean_sample))|>
  lmer(tri_mean_dam_scaled ~ tri_mean_sample_scaled + state_abb + (1 | state_abb),
       data = _)

summary(mod.h1.4)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_dam_scaled ~ tri_mean_sample_scaled + state_abb + (1 |  
    state_abb)
   Data: mutate(dta_for_model, tri_mean_dam_scaled = scale(tri_mean_dam),  
    tri_mean_sample_scaled = scale(tri_mean_sample))

REML criterion at convergence: 13306.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-7.5624 -0.2065  0.0302  0.1899 19.8787 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 0.08446  0.2906  
 Residual              0.06809  0.2609  
Number of obs: 85854, groups:  state_abb, 48

Fixed effects:
                        Estimate Std. Error t value
(Intercept)            -0.130513   0.292026  -0.447
tri_mean_sample_scaled  0.944270   0.001098 860.165
state_abbFL             0.068811   0.412070   0.167
state_abbMI             0.013371   0.412073   0.032
state_abbLA             0.112335   0.412109   0.273
state_abbIL             0.069854   0.412043   0.170
state_abbMN             0.151334   0.412066   0.367
state_abbND             0.064370   0.412088   0.156
state_abbMS             0.076474   0.412005   0.186
state_abbTX             0.018921   0.412003   0.046
state_abbKS             0.154197   0.412029   0.374
state_abbWI             0.084184   0.412079   0.204
state_abbOK             0.056549   0.412009   0.137
state_abbNE             0.110183   0.412020   0.267
state_abbRI             0.154448   0.412364   0.375
state_abbIN             0.267030   0.412068   0.648
state_abbSD             0.163901   0.412025   0.398
state_abbSC             0.297093   0.412026   0.721
state_abbAL             0.047262   0.412030   0.115
state_abbIA             0.301910   0.412012   0.733
state_abbMO             0.094925   0.412008   0.230
state_abbGA             0.237632   0.412007   0.577
state_abbAR             0.047341   0.412058   0.115
state_abbNJ             0.378236   0.412097   0.918
state_abbOH             0.219358   0.412051   0.532
state_abbME             0.077980   0.412143   0.189
state_abbMD             0.309765   0.412198   0.751
state_abbMA             0.142192   0.412061   0.345
state_abbMT            -0.611956   0.412023  -1.485
state_abbTN            -0.005935   0.412062  -0.014
state_abbVA            -0.175889   0.412025  -0.427
state_abbCT             0.251949   0.412061   0.611
state_abbNC             0.443372   0.412017   1.076
state_abbWY            -0.065306   0.412050  -0.158
state_abbNM             0.407111   0.412202   0.988
state_abbNH            -0.089691   0.412125  -0.218
state_abbNY             0.401771   0.412038   0.975
state_abbNV             0.101763   0.412160   0.247
state_abbAZ             0.252024   0.412222   0.611
state_abbKY             0.109801   0.412073   0.266
state_abbPA             0.260837   0.412052   0.633
state_abbWA            -0.692934   0.412107  -1.681
state_abbID            -0.180944   0.412219  -0.439
state_abbVT             0.502196   0.412233   1.218
state_abbOR             0.047552   0.412095   0.115
state_abbCO             0.537532   0.412038   1.305
state_abbUT             0.985488   0.412096   2.391
state_abbCA             0.640588   0.412056   1.555
state_abbWV             0.522318   0.412178   1.267
Show the code
r_squared <- rsq(mod.h1.4)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


For this model r_squared: 0.95 consisting of fixed effects: 0.93 and random effects: 0.02.


While residuals are mostly small, a plot of residuals by predicted (Figure 10.9) is not what we’d hope for:

Show the code
# following pattern at 
# https://mspeekenbrink.github.io/sdam-r-companion/linear-mixed-effects-models.html#visually-assessing-model-assumptions

tdat <- tibble(predicted = predict(mod.h1.4), 
               residual = residuals(mod.h1.4), 
               state_abb = dta_for_model$state_abb)

p0 <- tdat |>
  ggplot(aes(abs(residual))) + 
  geom_histogram(binwidth = 0.1, 
                 alpha = 0.5) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA)) +
  labs(
    subtitle = "A. Histogram (residuals)",
    x = "abs(residual)",
  )

p1 <- tdat |>
  ggplot(aes(x = predicted, y = residual,
             color = state_abb)) + 
  geom_point(size = 0.25,
             alpha = 0.25) +
  geom_hline(yintercept = 0, 
             lty = 2,
             linewidth = 0.5,
             alpha = 0.5) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA)) +
  labs(
    subtitle = "A. Scatter plot"
  )

p1b <- tdat |>
  ggplot(aes(x = predicted, y = residual,
             color = state_abb)) + 
  geom_line(linewidth = 0.2,
             alpha = 0.8) + 
  geom_hline(yintercept = 0, 
             lty = 2,
             linewidth = 0.5,
             alpha = 0.5) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA)) +
  labs(
    subtitle = "A2. Line plot (same data)",
  )

p2 <- tdat |>
  ggplot(aes(abs(residual),
             color = state_abb)) + 
  stat_ecdf(linewidth = 0.2,
             alpha = 0.8,
            pad = FALSE) +
  scale_y_continuous(labels = label_percent()) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA)) +
  labs(
    subtitle = "B1. ECDF plot (residuals)",
    x = "abs(residual)",
    y = "Percent of residuals"
  )

p2b <- tdat |>
  filter(residual > 0) |> # because scale_x_log10()
  ggplot(aes(abs(residual),
             color = state_abb)) + 
  stat_ecdf(linewidth = 0.2,
             alpha = 0.8,
            pad = FALSE) +
  stat_ecdf(data = tdat |>
              filter(residual > 0),
            aes(abs(residual), color = state_abb),
            linewidth = 1,
            color = "blue",
             alpha = 0.8,
            pad = FALSE) +
  scale_x_log10() +
  scale_y_continuous(labels = label_percent()) +
  guides(color = "none") +
  coord_cartesian(xlim = c(0.01, NA)) +
  labs(
    subtitle = "B2. ECDF plot (residuals on log10 scale)",
    x = "abs(residual) (log10 scale)",
    y = "Percent of residuals"
  )

my_layout <- 
c("
AA
BC
DE")

p0 + p1 + p1b + p2 + p2b +
  plot_annotation(
    title = "mod.h1.4 residuals are small but far from the random distribution\none looks for in a well-described model",
    subtitle = 'Each "thread" is the residuals for a state',
    caption = my_caption_nid_opentopography
  ) +
  plot_layout(design = my_layout)
Figure 10.9: Residuals of model mod.h1.4 are mostly small but not randomly distributed


Show the code
ggplot(tdat, aes(sample = residual)) +
  stat_qq() + 
  stat_qq_line(
    lty = 2,
    linewidth = 0.5,
    alpha = 0.5
  ) +
  labs(
    title = "mod.h1.4 QQ Plot residuals are not normal",
    subtitle = "Tails are heavier than a normal distribution would produce",
    caption = my_caption_nid_opentopography
  )
Figure 10.10: mod.h1.4 QQ Plot shows the model produces heavy non-normal tails


10.3.4 Conclusion: H1

From the plots:

  • Figure 10.5 is quite helpful. All four panels support \(H_1\).
  • Figure 10.7 is also helpful. The log-log plots in panels A2 and B2 show that in most cases, \(tri\_mean_{dam}\) exceeds \(tri\_mean_{sample}\).

In flatter areas \(H_1\) is true. In mountainous areas it is not.

From the regressions:

  • Section 10.3.3.4 @mod.h1.4 best describes the variation in the data. Conditional means of the state effects are also aligned with the states’ \(median(tri\_mean_{sample})\).

  • The regressions don’t make a strong case for \(H_1\).


10.4 Exploring H2 : In states with higher terrain ruggedness, dams’ \(\frac{surfaceArea}{nidStorage}\) will be lower

As a reminder:

Where dams are in more rugged areas, it seems likely dams’ lakes will be deeper and thus have smaller ratios \(\frac{surfaceArea}{nidStorage}\). In other words, considering the states in the continental US, there is an inverse proportional relationship:

\[tri\_mean_{sample} \propto \frac{1}{\left(\frac{surfaceArea}{nidStorage}\right)}\]

I look at two regressions:

  1. lm(tri_mean_sample ~ ratio)
  • How much does this simple mode capture the variance in the data? Does it perform better than the base model looking only at state effects Section 10.3.3.3?
  1. lmer(tri_mean_sample ~ ratio + (1 | state_abb)
  • I assume this is the model likely to explain more variance than the other models.

where \(ratio = \frac{surfaceArea}{nidStorage}\).

10.4.1 First: surfaceArea and nidStorage distributions

10.4.1.1 Data availability

The NID records are mostly complete with nidHeight. In comparison, there are more gaps for nidStorage (water volume) and water surfaceArea. For most states, the nidStorage missing rate is less than 4%. The exception is Connecticut, where the missing rate is 35%

Show the code
missing_nidStorage_by_state <- dta_for_model |>
  select(state_abb, nidStorage) |>
  mutate(no_nidStorage = is.na(nidStorage)) |>
  reframe(n_dams = n(),
          n_missing_nidStorage = sum(no_nidStorage),
          pct_missing_nidStorage = n_missing_nidStorage / n_dams,
          .by = state_abb) |>
  arrange(desc(pct_missing_nidStorage)) |>
  filter(pct_missing_nidStorage > 0) |>
  adorn_totals(where = "row") |>
  mutate(rowid = row_number(),
         pct_missing_nidStorage = n_missing_nidStorage / n_dams) # Do this again to pick up total row

missing_nidStorage_by_state |>
  gt() |>
  tab_header(md("**Percent of dams missing nidStorage values**")) |>
  fmt_percent(columns = starts_with("pct_"),
              decimals = 2)
Percent of dams missing nidStorage values
state_abb n_dams n_missing_nidStorage pct_missing_nidStorage rowid
CT 1211 425 35.09% 1
MD 402 15 3.73% 2
RI 222 8 3.60% 3
NJ 787 23 2.92% 4
IN 1081 25 2.31% 5
NC 3346 72 2.15% 6
IL 1606 28 1.74% 7
NM 395 5 1.27% 8
UT 825 9 1.09% 9
WV 473 5 1.06% 10
CA 1430 10 0.70% 11
NY 1850 11 0.59% 12
ME 547 3 0.55% 13
MO 5375 23 0.43% 14
OR 854 3 0.35% 15
KY 1063 3 0.28% 16
AZ 362 1 0.28% 17
PA 1462 4 0.27% 18
WA 789 2 0.25% 19
NV 499 1 0.20% 20
MI 1014 2 0.20% 21
MN 1113 2 0.18% 22
TN 1199 2 0.17% 23
NH 634 1 0.16% 24
LA 704 1 0.14% 25
FL 1059 1 0.09% 26
AL 2175 2 0.09% 27
MA 1213 1 0.08% 28
AR 1253 1 0.08% 29
SD 2530 2 0.08% 30
MT 2958 2 0.07% 31
CO 1948 1 0.05% 32
KS 2226 1 0.04% 33
GA 5356 2 0.04% 34
VA 2695 1 0.04% 35
Total 52656 698 1.33% 36
Table 10.1: Missing nidStorage by state

Only the smallest dams have a missing 1 rate greater than 1%:

Show the code
missing_nidStorage_by_nidHeightId <- dta_for_model |>
  select(nidHeightId, nidStorage) |>
  mutate(no_nidStorage = is.na(nidStorage)) |>
  reframe(n_dams = n(),
          n_missing_nidStorage = sum(no_nidStorage),
          pct_missing_nidStorage = n_missing_nidStorage / n_dams,
          .by = nidHeightId) |>
  # arrange(desc(pct_missing_nidStorage)) |>
  filter(pct_missing_nidStorage > 0) |>
  adorn_totals(where = "row") |>
  mutate(rowid = row_number(),
         pct_missing_nidStorage = n_missing_nidStorage / n_dams) # Do this again to pick up total row

missing_nidStorage_by_nidHeightId |>
  gt() |>
  tab_header(md("**Percent of dams missing nidStorage values**")) |>
  fmt_percent(columns = starts_with("pct_"),
              decimals = 2) |>
  cols_align(columns = nidHeightId,
             align = "left")
Percent of dams missing nidStorage values
nidHeightId n_dams n_missing_nidStorage pct_missing_nidStorage rowid
Less than 25 feet 43011 598 1.39% 1
51-100 feet 5023 15 0.30% 2
25-50 feet 36294 80 0.22% 3
Greater than 100 feet 1526 5 0.33% 4
Total 85854 698 0.81% 5
Table 10.2: Missing nidStorage by nidHeightId


For surfaceArea the gaps are worse. For example, Utah and Montana, South Dakota, and Alabama are missing surfaceArea for more than 75% of their dams. Among all dams, 20% are missing this data.

Show the code
missing_surfaceArea_by_state <- dta_for_model |>
  select(state_abb, surfaceArea) |>
  mutate(no_surfaceArea = is.na(surfaceArea)) |>
  reframe(n_dams = n(),
          n_missing_surfaceArea = sum(no_surfaceArea),
          pct_missing_surfaceArea = n_missing_surfaceArea / n_dams,
          .by = state_abb) |>
  arrange(desc(pct_missing_surfaceArea)) |>
  filter(pct_missing_surfaceArea > 0) |>
  adorn_totals(where = "row") |>
  mutate(rowid = row_number(),
         pct_missing_surfaceArea = n_missing_surfaceArea / n_dams) # Do this again to pick up total row

missing_surfaceArea_by_state |>
  gt() |>
  tab_header(md("**Percent of dams missing surfaceArea values**")) |>
  fmt_percent(columns = starts_with("pct_"),
              decimals = 2)
Percent of dams missing surfaceArea values
state_abb n_dams n_missing_surfaceArea pct_missing_surfaceArea rowid
UT 825 790 95.76% 1
MT 2958 2445 82.66% 2
SD 2530 2007 79.33% 3
AL 2175 1642 75.49% 4
IL 1606 938 58.41% 5
MS 6088 2344 38.50% 6
AR 1253 471 37.59% 7
MN 1113 380 34.14% 8
VA 2695 918 34.06% 9
TX 7270 1787 24.58% 10
NM 395 90 22.78% 11
AZ 362 68 18.78% 12
WV 473 75 15.86% 13
ND 860 127 14.77% 14
OK 4968 708 14.25% 15
KY 1063 141 13.26% 16
WA 789 104 13.18% 17
OR 854 112 13.11% 18
PA 1462 186 12.72% 19
KS 2226 278 12.49% 20
MD 402 49 12.19% 21
NC 3346 370 11.06% 22
MA 1213 105 8.66% 23
WI 953 77 8.08% 24
CA 1430 94 6.57% 25
GA 5356 309 5.77% 26
SC 2399 133 5.54% 27
OH 1393 64 4.59% 28
RI 222 10 4.50% 29
TN 1199 49 4.09% 30
CT 1211 49 4.05% 31
IN 1081 43 3.98% 32
DE 83 3 3.61% 33
ME 547 18 3.29% 34
MI 1014 27 2.66% 35
CO 1948 47 2.41% 36
NV 499 12 2.40% 37
WY 1471 31 2.11% 38
LA 704 14 1.99% 39
NY 1850 36 1.95% 40
NJ 787 15 1.91% 41
NE 2948 53 1.80% 42
IA 4042 70 1.73% 43
NH 634 6 0.95% 44
ID 375 2 0.53% 45
FL 1059 4 0.38% 46
MO 5375 9 0.17% 47
Total 85506 17310 20.24% 48
Table 10.3: Missing surfaceArea by state


The biggest gaps in surfaceArea data are for the smallest dams, where a quarter are missing.

Show the code
missing_surfaceArea_by_nidHeightId <- dta_for_model |>
  select(nidHeightId, surfaceArea) |>
  mutate(no_surfaceArea = is.na(surfaceArea)) |>
  reframe(n_dams = n(),
          n_missing_surfaceArea = sum(no_surfaceArea),
          pct_missing_surfaceArea = n_missing_surfaceArea / n_dams,
          .by = nidHeightId) |>
  # arrange(desc(pct_missing_surfaceArea)) |>
  filter(pct_missing_surfaceArea > 0) |>
  adorn_totals(where = "row") |>
  mutate(rowid = row_number(),
         pct_missing_surfaceArea = n_missing_surfaceArea / n_dams) # Do this again to pick up total row

missing_surfaceArea_by_nidHeightId |>
  gt() |>
  tab_header(md("**Percent of dams missing surfaceArea values**")) |>
  fmt_percent(columns = starts_with("pct_"),
              decimals = 2) |>
  cols_align(columns = nidHeightId,
             align = "left")
Percent of dams missing surfaceArea values
nidHeightId n_dams n_missing_surfaceArea pct_missing_surfaceArea rowid
Less than 25 feet 43011 10887 25.31% 1
51-100 feet 5023 514 10.23% 2
25-50 feet 36294 5719 15.76% 3
Greater than 100 feet 1526 190 12.45% 4
Total 85854 17310 20.16% 5
Table 10.4: Missing surfaceArea by nidHeightId


Let’s proceed with exploring \(H_2\), recognizing that for some states the results may not be useful.

10.4.1.2 Plots

In this section I look at the distributions and relationship between surfaceArea and nidStorage before exploring their associations with \(tri\_mean_{sample}\) in the next section.

Of note in Figure 10.11:

  • The distributions of surfaceArea, nidStorage, and \(\frac{surfaceArea}{nidStorage}\) differ in many states. I’m not sure what to make of this fact.
Show the code
dta_for_plot <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  mutate(state_median_surfaceArea = median(surfaceArea),
         state_median_nidStorage = median(nidStorage),
         state_median_area_to_vol = median(area_to_volume_dam),
         .by = state_abb) |>
  mutate(state_abb = fct_reorder(state_abb, state_median_surfaceArea))

p1_surfaceArea_by_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(surfaceArea, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot |>
      distinct(state_abb, state_median_surfaceArea),
    aes(state_median_surfaceArea, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  # guides(fill = guide_legend(position = "inside")) +
  guides(fill = "none") + #guide_legend(position = "inside")) +
  theme(legend.position.inside = c(0.75, 0.15)) +
  labs(
    subtitle = "A. surfaceArea<br>",
    y = NULL
  )

p2_nidStorage_by_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(nidStorage, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot |>
      distinct(state_abb, state_median_nidStorage),
    aes(state_median_nidStorage, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  # scale_x_continuous(label = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") + #guide_legend(position = "inside")) +
  # theme(legend.position.inside = c(0.75, 0.15)) +
  coord_cartesian(xlim = c(1, 1e6)) +
  labs(
    subtitle = "B. nidStorage<br>",
    y = NULL
  )

p3_surface_area_nidStorage_by_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(area_to_volume_dam, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot |>
      distinct(state_abb, state_median_area_to_vol),
    aes(state_median_area_to_vol, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  # scale_x_continuous(label = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") + #guide_legend(position = "inside")) +
  # theme(legend.position.inside = c(0.75, 0.15)) +
  coord_cartesian(xlim = c(0.001, 1)) +
  labs(
    subtitle = "C. surfaceArea / nidStorage<br>",
    y = NULL
  )

p4_n_dams_by_state <- dta_for_plot |>
  reframe(n = n(),
          .by = c(state_abb, region_label)) |>
  ggplot() +
  geom_col(
    aes(n, state_abb, fill = region_label),
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  scale_x_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") +
  labs(
    subtitle = "D. Number of dams",
    x = "n dams with surfaceArea\nand nidStorage values",
    y = NULL
  )

my_layout <-
c("
ABCD")

p1_surfaceArea_by_state + p2_nidStorage_by_state + 
  p3_surface_area_nidStorage_by_state + p4_n_dams_by_state +
  plot_annotation(
    title = "Distributions by state: surfaceArea, nidStorage, and surfaceArea / nidStorage",
    subtitle = glue("Red indicator is median for each subplot. All rows sorted by median surfaceArea.",
                    "\nX axis in panels A-C are on log10 scale.<br>"),
    caption = my_caption_nid_opentopography
  ) +
  plot_layout(design = my_layout) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.11: Distributions by state: surfaceArea, nidStorage, and surfaceArea / nidStorage


Of note re: Figure 10.12:

Faceting by nidHeightId shows the distributions of the largest dams “Greater than 100 ft” have the most variation in the distributions. This is due in part to the much larger nidHeight range in this category (~101 to ~800 ft).

Show the code
dta_for_plot <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  mutate(state_median_surfaceArea = median(surfaceArea),
         state_median_nidStorage = median(nidStorage),
         state_median_area_to_vol = median(area_to_volume_dam),
         .by = state_abb) |>
  mutate(state_abb = fct_reorder(state_abb, state_median_surfaceArea))

p1_surfaceArea_by_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(surfaceArea, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot |>
      distinct(state_abb, state_median_surfaceArea),
    aes(state_median_surfaceArea, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  # guides(fill = guide_legend(position = "inside")) +
  guides(fill = "none") + #guide_legend(position = "inside")) +
  theme(legend.position.inside = c(0.75, 0.15),
        strip.text = element_text(size = rel(0.75))) +
  facet_wrap(~ nidHeightId, nrow = 1) +
  labs(
    subtitle = "A. surfaceArea<br>",
    y = NULL
  )

p2_nidStorage_by_state <- dta_for_plot |>
  ggplot() +
  geom_density_ridges(
    aes(nidStorage, state_abb, fill = region_label),
    rel_min_height = 0.005,
    linewidth = 0.1,
    color = NA,
    alpha = 0.5
  ) +
  geom_text(
    data = dta_for_plot |>
      distinct(state_abb, state_median_nidStorage),
    aes(state_median_nidStorage, state_abb, label = "^"),
    size = 4,
    alpha = 0.75,
    color = "firebrick"
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  # scale_x_continuous(label = label_number(scale_cut = cut_short_scale())) +
  scale_fill_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(fill = "none") + #guide_legend(position = "inside")) +
  # theme(legend.position.inside = c(0.75, 0.15)) +
  theme(strip.text = element_text(size = rel(0.75))) +
  facet_wrap(~ nidHeightId, nrow = 1) +
  coord_cartesian(xlim = c(1, 1e6)) +
  labs(
    subtitle = "B. nidStorage<br>",
    y = NULL
  )

my_layout <-
c("
AB")

p1_surfaceArea_by_state + p2_nidStorage_by_state + 
  plot_annotation(
    title = "Distribution of surfaceArea and nidStorage by nidHeightId - continental states",
    subtitle = glue("Red indicator is median for all nidHeightId. All rows sorted by median surfaceArea.",
                    " X axes are on log10 scale.<br>"),
    caption = my_caption_nid_opentopography
  ) +
  plot_layout(design = my_layout) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.12: Distribution of surfaceArea and nidStorage in each state


Of note re: Figure 10.13:

  • This plot is helpful in exploring the relationship between surfaceArea and nidStorage.

  • There appear to be data errors, data entry errors, or estimation heuristics at play at small surfaceArea values around 1 acre (see horizontal lines in panel A).

  • Otherwise, there appears to be near-linear relationships between surfaceArea and nidStorage, implying that dams generally have a similar “shape”: they grow in surfaceArea as nidStorage grows.

Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(nidStorage, surfaceArea, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  labs(
    subtitle = "A. storage_area / nidStorage<br>",
    x = "nidStorage (log10 scale)",
    y = "surfaceArea (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(nidStorage, surfaceArea, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    # size = 0.25,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.35, 0.85)) +
  labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "nidStorage (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "surfaceArea by nidStorage - all dams",
    # subtitle = "REPLACE ME<br>",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.13: surfaceArea by nidStorage - all dams


10.4.1.3 Model mod.h2.1

The model mod.h2.1 has some explanatory power. \(R^2\) = 0.38.

\[mod.h2.1: surfaceArea \sim nidStorage\]

Show the code
mod.h2.1 <- dta_for_model |>
  lm(surfaceArea ~ nidStorage,
       data = _)

summary(mod.h2.1)

Call:
lm(formula = surfaceArea ~ nidStorage, data = dta_for_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-145078    -151    -146    -132  399225 

Coefficients:
                Estimate   Std. Error t value Pr(>|t|)    
(Intercept) 155.48337258  12.43326056   12.51   <2e-16 ***
nidStorage    0.01017373   0.00004963  205.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3240 on 68054 degrees of freedom
  (17798 observations deleted due to missingness)
Multiple R-squared:  0.3818,    Adjusted R-squared:  0.3818 
F-statistic: 4.203e+04 on 1 and 68054 DF,  p-value: < 2.2e-16


10.4.1.4 Model mod.h2.2

The model mod.h2.2 has some explanatory power. \(R^2\) = 0.39 (which is less improvement over mod.h2.1 than I expected). So there isn’t really a state-level effect on the ratio \(surfaceArea::nidStorage\).

\[mod.h2.2: surfaceArea \sim nidStorage | state\]

Show the code
mod.h2.2 <- dta_for_model |>
  mutate(surfaceArea_scaled = scale(surfaceArea),
         nidStorage_scaled = scale(nidStorage)) |>
  lmer(surfaceArea ~ nidStorage_scaled + state_abb + (1 | state_abb),
       data = _)

summary(mod.h2.2)
Linear mixed model fit by REML ['lmerMod']
Formula: surfaceArea ~ nidStorage_scaled + state_abb + (1 | state_abb)
   Data: mutate(dta_for_model, surfaceArea_scaled = scale(surfaceArea),  
    nidStorage_scaled = scale(nidStorage))

REML criterion at convergence: 1292501

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-44.635  -0.038  -0.016  -0.008 123.305 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept)     3713   60.93 
 Residual              10454667 3233.37 
Number of obs: 68056, groups:  state_abb, 48

Fixed effects:
                  Estimate Std. Error t value
(Intercept)        141.797    366.601   0.387
nidStorage_scaled 2296.989     11.217 204.777
state_abbFL        314.508    384.745   0.817
state_abbMI        211.262    385.632   0.548
state_abbLA        504.886    391.486   1.290
state_abbIL        249.938    392.183   0.637
state_abbMN       1410.151    390.375   3.612
state_abbND        -78.014    390.353  -0.200
state_abbMS        -29.856    375.369  -0.080
state_abbTX         76.693    374.188   0.205
state_abbKS         18.274    378.782   0.048
state_abbWI        764.930    387.355   1.975
state_abbOK         43.983    374.918   0.117
state_abbNE        -20.414    376.458  -0.054
state_abbRI         19.966    433.194   0.046
state_abbIN          6.842    385.174   0.018
state_abbSD        530.707    397.678   1.335
state_abbSC         82.017    377.787   0.217
state_abbAL        496.076    397.146   1.249
state_abbIA         -4.798    375.155  -0.013
state_abbMO        -14.619    374.252  -0.039
state_abbGA          4.605    374.407   0.012
state_abbAR        717.161    389.205   1.843
state_abbNJ          8.135    389.746   0.021
state_abbOH         21.453    382.068   0.056
state_abbME       1045.453    397.426   2.631
state_abbMD          3.124    410.381   0.008
state_abbMA         22.566    384.116   0.059
state_abbMT        157.714    398.124   0.396
state_abbTN        152.525    383.668   0.398
state_abbVA         -3.916    379.463  -0.010
state_abbCT          6.443    389.202   0.017
state_abbNC          3.552    376.362   0.009
state_abbWY         55.390    381.273   0.145
state_abbNM        258.783    415.748   0.622
state_abbNH        175.393    393.391   0.446
state_abbNY        169.048    379.331   0.446
state_abbNV       -229.166    399.539  -0.574
state_abbAZ       -234.985    416.789  -0.564
state_abbKY        201.683    386.604   0.522
state_abbPA         17.651    382.504   0.046
state_abbWA        259.276    391.631   0.662
state_abbID        821.693    407.604   2.016
state_abbVT         30.464    410.063   0.074
state_abbOR        304.713    390.128   0.781
state_abbCO        234.956    378.961   0.620
state_abbUT        802.745    660.980   1.214
state_abbCA        147.673    382.018   0.387
state_abbWV         51.198    405.764   0.126
Show the code
r_squared <- rsq(mod.h2.2)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


10.4.2 Second: exploring the relationship among tri_meansample, sufaceArea, and nidStorage

10.4.2.1 Plots

Of note re: Figure 10.14:

  • The near-vertical nature of the categories in panel A indicate that there is no strong relationship between \(tri\_mean_{sample}\) and surfaceArea.

  • One could tell a story from panel B if the model explained a lot of the variability (which is not likely given panel A). Regression sec-h3-mod1 Model 1 confirms this. Only when we adjust for state does the model capture some of the variability: sec-h3-mod4 Model 4.

    • The more surfaceArea, the lower the \(tri\_mean_{sample}\) (panel B). This makes some sense, since lakes behind dams are flat, resulting in lower \(tri\_mean_{sample}\).

    • The regression lines are somewhat parallel: in general, for any surfaceArea value, the higher the nidHeightId, the higher the \(tri\_mean_{sample}\).

Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(surfaceArea, tri_mean_sample, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  # scale_y_continuous(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "A. Scatter plot<br>",
    x = "surface_area (log10 scale)",
    y = "tri_mean_sample (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(surfaceArea, tri_mean_sample, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    # size = 0.25,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.75, 0.2),
        legend.background = element_rect(fill = "transparent")) +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "surface_area (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "tri_mean<sub>sample</sub> by surfaceArea<br>for local areas around dams",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.14: tri_meansample by surfaceArea


Of note re: Figure 10.15:

  • It looks like there is a data entry artifact at nidStorage = 100 among small dams (the vertical line in panel A). The people estimating nidStorage probably use 100 acre-feet as a handy guestimate.

  • Like with surfaceArea in plot Figure 10.14, the near-vertical nature of the categories in panel A indicate that there is no strong relationship between \(tri\_mean_{sample}\) and surfaceArea.

  • One could tell a story from panel B if the model explained a lot of the variability. Regression sec-h3-mod2 Model 2 confirms this. Only when we adjust for state does the model capture some of the variability: sec-h3-mod5 Model 5

    • The more nidStorage, the lower the \(tri\_mean_{sample}\) (panel B). This makes some sense, since surfaceArea grows as nidStorage grows (Figure 10.13), and lakes behind dams are flat, resulting in lower \(tri\_mean_{sample}\).

    • The regression lines are somewhat parallel: in general, for any surfaceArea value, the higher the nidHeightId, the higher the \(tri\_mean_{sample}\).

Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage),
         nidStorage >= 1) |>
  ggplot(aes(nidStorage, tri_mean_sample, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  coord_cartesian(ylim = c(0.1, 100)) +labs(
    subtitle = "A. Scatter plot",
    x = "nidStorage (log10 scale)",
    y = "tri_mean_sample (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage),
         nidStorage >= 1) |>
  ggplot(aes(nidStorage, tri_mean_sample, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.75, 0.2),
        legend.background = element_rect(fill = "transparent")) +
  coord_cartesian(ylim = c(0.1, 100)) +labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "nidStorage (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "tri_mean<sub>sample</sub> by nidStorage<br>for local areas around dams",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.15: tri_meansample by nidStorage
for local areas around dams


Of note re: Figure 10.16 :

The plots below using \(surfaceArea::nidStorage\) are similar to those above, and the same conclusions apply.

  • Panel A suggests no meaningful relationship between \(tri\_mean_{sample}\) and \(\frac{surfaceArea}{nidStorage}\)

  • One could tell a story from panel B if the model explained a lot of the variability (which is not likely given panel A). Regression in Section 10.5.2.3 Model 3 confirm this.

    • The regression lines are somewhat parallel: in general, for any \(surfaceArea::nidStorage\) ratio, the higher the nidHeightId, the higher the \(tri\_mean_{sample}\).
Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(area_to_volume_dam, tri_mean_sample, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "A. Scatter plot<br>",
    x = "surface_area::nidStorage (log10 scale)",
    y = "tri_mean_saample (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(area_to_volume_dam, tri_mean_sample, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.75, 0.2),
        legend.background = element_rect(fill = "transparent")) +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "surface_area::nidStorage (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "tri_mean<sub>sample</sub> by surfaceArea::nidStorage<br>for local areas around dams",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.16: tri_meansample by surfaceArea::nidStorage
for local areas around dams


10.4.2.2 Regression models

Given my observations about Figure 10.16, and the fact that all three of the above plots have “vertical” features in the scatter plots, doubt there is a real relationship with \(tri\_mean_{sample}\). I explore only the best-performing model used in \(H_3\): Section 10.5.2.5 Model 5, which has \(R^2\) = 0.50 (but that’s not very good given Section 10.5.2.6 Model 6 using only state_abb has \(R^2\) = 0.42).

10.4.2.2.1 Model mod.h2.3

Model mod.h2.3 \(R^2\) = 0.49 (which is better than I expected).

\[mod.h2.3: tri\_mean_{dam} \sim nidstorage | state\]

Show the code
mod.h2.3 <- dta_for_model |> 
  mutate(tri_mean_sample_scaled = scale(tri_mean_sample),
         nidStorage_scaled = scale(nidStorage)) |>
  lmer(tri_mean_sample_scaled ~ nidStorage_scaled + state_abb + (1 | state_abb),
     data = _)

summary(mod.h2.3)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_sample_scaled ~ nidStorage_scaled + state_abb + (1 |  
    state_abb)
   Data: 
mutate(dta_for_model, tri_mean_sample_scaled = scale(tri_mean_sample),  
    nidStorage_scaled = scale(nidStorage))

REML criterion at convergence: 206380.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4308 -0.3894 -0.1134  0.2036 10.8448 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 0.342    0.5848  
 Residual              0.659    0.8118  
Number of obs: 85156, groups:  state_abb, 48

Fixed effects:
                   Estimate Std. Error t value
(Intercept)       -0.703919   0.591588  -1.190
nidStorage_scaled  0.015223   0.002785   5.465
state_abbFL        0.030041   0.832248   0.036
state_abbMI        0.170992   0.832265   0.205
state_abbLA        0.068816   0.832437   0.083
state_abbIL        0.144136   0.832124   0.173
state_abbMN        0.132261   0.832230   0.159
state_abbND        0.229857   0.832334   0.276
state_abbMS        0.266060   0.831938   0.320
state_abbTX        0.335930   0.831928   0.404
state_abbKS        0.233193   0.832051   0.280
state_abbWI        0.375901   0.832289   0.452
state_abbOK        0.417555   0.831953   0.502
state_abbNE        0.365198   0.832008   0.439
state_abbRI        0.354044   0.833722   0.425
state_abbIN        0.235095   0.832248   0.282
state_abbSD        0.368236   0.832030   0.443
state_abbSC        0.288565   0.832038   0.347
state_abbAL        0.597051   0.832056   0.718
state_abbIA        0.336578   0.831971   0.405
state_abbMO        0.578288   0.831947   0.695
state_abbGA        0.434063   0.831947   0.522
state_abbAR        0.670377   0.832190   0.806
state_abbNJ        0.339538   0.832392   0.408
state_abbOH        0.509397   0.832158   0.612
state_abbME        0.722752   0.832601   0.868
state_abbMD        0.504976   0.832896   0.606
state_abbMA        0.776572   0.832200   0.933
state_abbMT        1.611265   0.832007   1.937
state_abbTN        1.107426   0.832204   1.331
state_abbVA        1.293486   0.832020   1.555
state_abbCT        0.857129   0.832377   1.030
state_abbNC        0.741665   0.831994   0.891
state_abbWY        1.333745   0.832143   1.603
state_abbNM        0.901844   0.832888   1.083
state_abbNH        1.480325   0.832499   1.778
state_abbNY        1.084699   0.832089   1.304
state_abbNV        1.417800   0.832669   1.703
state_abbAZ        1.261012   0.832971   1.514
state_abbKY        1.474209   0.832247   1.771
state_abbPA        1.567438   0.832145   1.884
state_abbWA        2.747030   0.832377   3.300
state_abbID        2.258452   0.832929   2.711
state_abbVT        1.645687   0.833011   1.976
state_abbOR        2.138093   0.832339   2.569
state_abbCO        1.723418   0.832077   2.071
state_abbUT        1.749662   0.832359   2.102
state_abbCA        2.179372   0.832152   2.619
state_abbWV        2.828031   0.832719   3.396
Show the code
r_squared <- rsq(mod.h2.3)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


10.4.3 Conclusion: H2

The plots in Section 10.4.2, in particular the near-vertical nature of the categories in the scatter plots indicate that there is no strong relationship between \(tri\_mean_{sample}\) or \(tri\_mean_{dam}\)) and either surfaceArea, nidStorage, or \(surfaceArea::nidStorage\)

For this reason I minimized my time investment by modeling only the one regression with the most promise (after exploring it in \(H_3\)). It wasn’t helpful.

In other words, \(H_2\) is not supported by the data.


10.5 Exploring H3 : Compared to H2 there is a stronger inverse proportional relationship between ruggeness in the local areas around dams and surfaceArea::nidStorage

This seems intuitive when compared to \(H_2\): dams are built where the topography provides good places to hold water, and where terrain is more rugged and thus the hills steeper, the basins holding water are likely to be deeper.

10.5.1 Plots

Of note re: Figure 10.17:

  • Both panels show that dams with higher surface area tend to have lower \(tri\_mean_{dam}\), and the higher the nidHeightId, the higher the \(tri\_mean_{dam}\).

  • Compared to \(H_2\)’s Figure 10.14, Figure 10.17 does seem to show some relationship in panel A. See regression Section 10.5.2.1 Model 1.

Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(surfaceArea, tri_mean_dam, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "A. Scatter plot<br>",
    x = "surface_area (log10 scale)",
    y = "tri_mean_dam (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(surfaceArea, tri_mean_dam, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    # size = 0.25,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.75, 0.2)) +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "surface_area (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "tri_mean<sub>dam</sub> by surfaceArea<br>for local areas around dams",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.17: tri_meandam by surfaceArea


Of note re: Figure 10.18:

  • There is a similar story for nidStorage: dams with higher nidStorage tend to have lower \(tri\_mean_{dam}\), and the higher the nidHeightId, the higher the \(tri\_mean_{dam}\).

  • It looks like there is a data entry artifact at nidStorage = 100 among small dams (the vertical line in panel A). People estimating nidStorage probably use 100 acre-feet as a handy guestimate.

  • Compared to \(H_2\)’s Figure 10.15, Figure 10.18 does seem to show some relationship in panel A. See regression Section 10.5.2.2 Model 2.

Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage),
         nidStorage >= 1) |>
  ggplot(aes(nidStorage, tri_mean_dam, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  coord_cartesian(ylim = c(0.1, 100)) +labs(
    subtitle = "A. Scatter plot",
    x = "nidStorage (log10 scale)",
    y = "tri_mean_dam (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage),
         nidStorage >= 1) |>
  ggplot(aes(nidStorage, tri_mean_dam, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    # size = 0.25,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.75, 0.2)) +
  coord_cartesian(ylim = c(0.1, 100)) +labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "nidStorage (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "tri_mean<sub>dam</sub> by nidStorage<br>for local areas around dams",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.18: tri_meandam by nidStorage
for local areas around dams


Of note re: Figure 10.19:

Results are the same as when using \(tri\_mean_{sample}\) in Figure 10.16 above to explore \(H_2\).

  • Panel A suggests there is no meaningful relationship between \(tri\_mean_{dam}\) and \(\frac{surfaceArea}{nidStorage}\)

  • One could tell a story from panel B if the model explained a lot of the variability (which is not likely given panel A). Regressions in sec-h3-mod6 Model 6 confirm this.

    • The regression lines are somewhat parallel: in general, for any surfaceArea::nidStorage ratio, the higher the nidHeightId, the higher the \(tri\_mean_{sample}\).
Show the code
p1 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(area_to_volume_dam, tri_mean_dam, color = nidHeightId, group = nidHeightId)) +
  geom_point(
    size = 0.25,
    alpha = 0.25
  ) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = "none") +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "A. Scatter plot<br>",
    x = "surface_area::nidStorage (log10 scale)",
    y = "tri_mean_dam (log10 scale)"
  )

p2 <- dta_for_model |>
  filter(!is.na(surfaceArea),
         !is.na(nidStorage)) |>
  ggplot(aes(area_to_volume_dam, tri_mean_dam, color = nidHeightId, group = nidHeightId)) +
  geom_smooth(
    method = 'gam',
    formula = y ~ s(x, bs = "cs"),
    linewidth = 0.5,
    alpha = 0.8,
    se = FALSE) +
  scale_x_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(label = label_number(scale_cut = cut_short_scale())) +
  scale_color_manual(values = c("firebrick", "darkslategrey", "blue", "skyblue")) +
  guides(color = guide_legend(position = "inside",
                              override.aes = list(linewidth = 3))) +
  theme(legend.position.inside = c(0.75, 0.2)) +
  coord_cartesian(ylim = c(0.1, 100)) +
  labs(
    subtitle = "B. Regression line plot (same data)<br>",
    x = "surface_area::nidStorage (log10 scale)",
    y = NULL
  )

p1 + p2 + 
  plot_annotation(
    title = "tri_mean<sub>dam</sub> by surfaceArea::nidStorage<br>for local areas around dams",
    caption = my_caption_nid_opentopography
  ) &
  theme(plot.title = element_textbox_simple(),
        plot.subtitle = element_textbox_simple())
Figure 10.19: tri_meandam by surfaceArea::nidStorage
for local areas around dams


10.5.2 Regression models

10.5.2.1 Model mod.h3.1

The model mod.h3.1 has no explanatory power. \(R^2\) is effectively zero.

\[mod.h3.1: tri\_mean_{dam} \sim surfaceArea\]

Show the code
mod.h3.1 <- dta_for_model |>
  lm(tri_mean_dam ~ surfaceArea,
     data = _)

summary(mod.h3.1)

Call:
lm(formula = tri_mean_dam ~ surfaceArea, data = dta_for_model)

Residuals:
   Min     1Q Median     3Q    Max 
-7.923 -2.550 -1.379  0.508 58.869 

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)    
(Intercept) 4.776714814 0.017544845 272.257  < 2e-16 ***
surfaceArea 0.000011097 0.000004263   2.603  0.00925 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.584 on 68542 degrees of freedom
  (17310 observations deleted due to missingness)
Multiple R-squared:  9.882e-05, Adjusted R-squared:  8.423e-05 
F-statistic: 6.774 on 1 and 68542 DF,  p-value: 0.009251


10.5.2.2 Model mod.h3.2

Similarly, the model mod.h3.2 has no explanatory power. \(R^2\) is effectively zero.

\[mod.h3.2: tri\_mean_{dam} \sim nidStorage\]

Show the code
mod.h3.2 <- dta_for_model |>
  lm(tri_mean_dam ~ nidStorage,
     data = _)

summary(mod.h3.2)

Call:
lm(formula = tri_mean_dam ~ nidStorage, data = dta_for_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.932  -2.485  -1.359   0.405  58.967 

Coefficients:
                 Estimate    Std. Error t value        Pr(>|t|)    
(Intercept) 4.67615140184 0.01578323542 296.273         < 2e-16 ***
nidStorage  0.00000046983 0.00000006975   6.736 0.0000000000164 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.602 on 85154 degrees of freedom
  (698 observations deleted due to missingness)
Multiple R-squared:  0.0005325, Adjusted R-squared:  0.0005208 
F-statistic: 45.37 on 1 and 85154 DF,  p-value: 0.00000000001642


10.5.2.3 Model mod.h3.3

The model mod.h3.3 has no explanatory power. \(R^2\) is effectively zero and, accordingly, the \(p\) value is very high.

\[mod.h3.3: tri\_mean_{dam} \sim ratio\]

where \(ratio = \frac{surfaceArea}{nidStorage}\).

Show the code
mod.h3.3 <- dta_for_model |>
  lm(tri_mean_dam ~ area_to_volume_dam,
     data = _)

summary(mod.h3.3)

Call:
lm(formula = tri_mean_dam ~ area_to_volume_dam, data = dta_for_model)

Residuals:
   Min     1Q Median     3Q    Max 
-4.744 -2.550 -1.384  0.494 58.877 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.77071    0.01766 270.201   <2e-16 ***
area_to_volume_dam  0.01656    0.01205   1.374    0.169    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.594 on 68054 degrees of freedom
  (17798 observations deleted due to missingness)
Multiple R-squared:  2.775e-05, Adjusted R-squared:  1.306e-05 
F-statistic: 1.889 on 1 and 68054 DF,  p-value: 0.1694


10.5.2.4 Model mod.h3.4

Adjusting for state in a linear mixed model offers more explanatory power: mod.h3.4 \(R^2\) = 0.47.

\[mod.h3.4: tri\_mean_{dam} \sim surfaceArea | state\]

Show the code
mod.h3.4 <- dta_for_model |> # dams_with_tri |>
  mutate(tri_mean_dam_scaled = scale(tri_mean_dam),
         surfaceArea_scaled = scale(surfaceArea)) |>
  lmer(tri_mean_dam_scaled ~ surfaceArea_scaled + state_abb + (1 | state_abb),
     data = _)

summary(mod.h3.4)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_dam_scaled ~ surfaceArea_scaled + state_abb + (1 | state_abb)
   Data: mutate(dta_for_model, tri_mean_dam_scaled = scale(tri_mean_dam),  
    surfaceArea_scaled = scale(surfaceArea))

REML criterion at convergence: 164361.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.7335 -0.4130 -0.0943  0.2242 14.5480 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 2.188    1.4791  
 Residual              0.642    0.8012  
Number of obs: 68544, groups:  state_abb, 48

Fixed effects:
                    Estimate Std. Error t value
(Intercept)        -0.796668   1.481781  -0.538
surfaceArea_scaled  0.002789   0.003069   0.909
state_abbFL         0.099150   2.093784   0.047
state_abbMI         0.180159   2.093794   0.086
state_abbLA         0.179884   2.093861   0.086
state_abbIL         0.215946   2.093869   0.103
state_abbMN         0.227510   2.093849   0.109
state_abbND         0.278619   2.093848   0.133
state_abbMS         0.330694   2.093680   0.158
state_abbTX         0.328485   2.093667   0.157
state_abbKS         0.375448   2.093718   0.179
state_abbWI         0.403496   2.093814   0.193
state_abbOK         0.450294   2.093675   0.215
state_abbNE         0.456201   2.093692   0.218
state_abbRI         0.488730   2.094362   0.233
state_abbIN         0.496939   2.093787   0.237
state_abbSD         0.428591   2.093932   0.205
state_abbSC         0.567741   2.093707   0.271
state_abbAL         0.490383   2.093927   0.234
state_abbIA         0.622551   2.093678   0.297
state_abbMO         0.643281   2.093668   0.307
state_abbGA         0.645353   2.093669   0.308
state_abbAR         0.698400   2.093835   0.334
state_abbNJ         0.702580   2.093838   0.336
state_abbOH         0.707544   2.093754   0.338
state_abbME         0.759280   2.093929   0.363
state_abbMD         0.789333   2.094073   0.377
state_abbMA         0.890187   2.093777   0.425
state_abbMT         1.635832   2.093938   0.781
state_abbTN         1.061460   2.093772   0.507
state_abbVA         1.209061   2.093725   0.577
state_abbCT         1.054984   2.093771   0.504
state_abbNC         1.180449   2.093691   0.564
state_abbWY         1.201428   2.093745   0.574
state_abbNM         1.202664   2.094142   0.574
state_abbNH         1.304317   2.093883   0.623
state_abbNY         1.440652   2.093724   0.688
state_abbNV         1.438821   2.093954   0.687
state_abbAZ         1.396055   2.094161   0.667
state_abbKY         1.226042   2.093805   0.586
state_abbPA         1.769214   2.093759   0.845
state_abbWA         1.945266   2.093863   0.929
state_abbID         1.958452   2.094050   0.935
state_abbVT         2.057331   2.094080   0.982
state_abbOR         2.149127   2.093846   1.026
state_abbCO         2.112241   2.093720   1.009
state_abbUT         3.774710   2.098016   1.799
state_abbCA         2.737416   2.093754   1.307
state_abbWV         3.009690   2.094024   1.437
Show the code
r_squared <- rsq(mod.h3.4)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


10.5.2.5 Model mod.h3.5

A similar mod.h3.5 has slightly better results: \(R^2\) = 0.50.

\[mod.h3.5: tri\_mean_{dam} \sim nidstorage | state\]

Show the code
mod.h3.5 <- dta_for_model |> # dams_with_tri |>
  mutate(tri_mean_dam_scaled = scale(tri_mean_dam),
         nidStorage_scaled = scale(nidStorage)) |>
  lmer(tri_mean_dam_scaled ~ nidStorage_scaled + state_abb + (1 | state_abb),
     data = _)

summary(mod.h3.5)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_dam_scaled ~ nidStorage_scaled + state_abb + (1 | state_abb)
   Data: mutate(dta_for_model, tri_mean_dam_scaled = scale(tri_mean_dam),  
    nidStorage_scaled = scale(nidStorage))

REML criterion at convergence: 205879.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5830 -0.4164 -0.1007  0.2139 14.4515 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 1.0557   1.0275  
 Residual              0.6551   0.8094  
Number of obs: 85156, groups:  state_abb, 48

Fixed effects:
                   Estimate Std. Error t value
(Intercept)       -0.795199   1.031310  -0.771
nidStorage_scaled  0.014489   0.002777   5.217
state_abbFL        0.097349   1.455997   0.067
state_abbMI        0.174747   1.456007   0.120
state_abbLA        0.177532   1.456104   0.122
state_abbIL        0.206033   1.455927   0.142
state_abbMN        0.276087   1.455987   0.190
state_abbND        0.281396   1.456046   0.193
state_abbMS        0.327705   1.455821   0.225
state_abbTX        0.336124   1.455815   0.231
state_abbKS        0.374391   1.455885   0.257
state_abbWI        0.439131   1.456020   0.302
state_abbOK        0.450830   1.455830   0.310
state_abbNE        0.455028   1.455861   0.313
state_abbRI        0.489115   1.456835   0.336
state_abbIN        0.489100   1.455997   0.336
state_abbSD        0.511593   1.455873   0.351
state_abbSC        0.569573   1.455878   0.391
state_abbAL        0.610712   1.455888   0.419
state_abbIA        0.619730   1.455840   0.426
state_abbMO        0.640944   1.455826   0.440
state_abbGA        0.647461   1.455826   0.445
state_abbAR        0.680338   1.455964   0.467
state_abbNJ        0.703274   1.456079   0.483
state_abbOH        0.700364   1.455946   0.481
state_abbME        0.760266   1.456198   0.522
state_abbMD        0.788313   1.456366   0.541
state_abbMA        0.875513   1.455970   0.601
state_abbMT        0.909781   1.455860   0.625
state_abbTN        1.039303   1.455972   0.714
state_abbVA        1.045716   1.455868   0.718
state_abbCT        1.063890   1.456071   0.731
state_abbNC        1.143199   1.455853   0.785
state_abbWY        1.194106   1.455937   0.820
state_abbNM        1.257300   1.456361   0.863
state_abbNH        1.308047   1.456140   0.898
state_abbNY        1.425883   1.455907   0.979
state_abbNV        1.440460   1.456236   0.989
state_abbAZ        1.442600   1.456408   0.991
state_abbKY        1.502459   1.455997   1.032
state_abbPA        1.740860   1.455939   1.196
state_abbWA        1.903134   1.456070   1.307
state_abbID        1.951624   1.456384   1.340
state_abbVT        2.056167   1.456431   1.412
state_abbOR        2.064702   1.456049   1.418
state_abbCO        2.164664   1.455900   1.487
state_abbUT        2.636034   1.456060   1.810
state_abbCA        2.698780   1.455943   1.854
state_abbWV        3.194126   1.456265   2.193
Show the code
r_squared <- rsq(mod.h3.5)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


10.5.2.6 Model mod.h3.6

The model mod.h3.6 has some explanatory power. \(R^2\) = 0.42, however, it explains less variation than the two models above using surfaceArea and nidStorage as separate predictors.

\[mod.h3.6: tri\_mean_{dam} \sim ratio | state\]

where \(ratio = \frac{surfaceArea}{nidStorage}\)

Show the code
# mod6 <- dta_for_model |>
#   lmer(tri_mean_dam ~ area_to_volume_dam + (1 | state_abb),
#      data = _)

mod.h3.6 <- dta_for_model |>
  mutate(tri_mean_dam_scaled = scale(tri_mean_dam),
         area_to_volume_dam_scaled = scale(area_to_volume_dam)) |>
  lmer(tri_mean_dam_scaled ~ area_to_volume_dam_scaled + state_abb + (1 | state_abb),
       data = _)

summary(mod.h3.6)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_dam_scaled ~ area_to_volume_dam_scaled + state_abb +  
    (1 | state_abb)
   Data: mutate(dta_for_model, tri_mean_dam_scaled = scale(tri_mean_dam),  
    area_to_volume_dam_scaled = scale(area_to_volume_dam))

REML criterion at convergence: 163453

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.7186 -0.4121 -0.0943  0.2231 14.5186 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 0.09228  0.3038  
 Residual              0.64444  0.8028  
Number of obs: 68056, groups:  state_abb, 48

Fixed effects:
                           Estimate Std. Error t value
(Intercept)               -0.796721   0.316759  -2.515
area_to_volume_dam_scaled -0.003785   0.003079  -1.229
state_abbFL                0.099587   0.439578   0.227
state_abbMI                0.180016   0.439626   0.409
state_abbLA                0.180261   0.439945   0.410
state_abbIL                0.216103   0.439983   0.491
state_abbMN                0.228196   0.439884   0.519
state_abbND                0.278764   0.439882   0.634
state_abbMS                0.330563   0.439078   0.753
state_abbTX                0.328454   0.439016   0.748
state_abbKS                0.375324   0.439259   0.854
state_abbWI                0.404142   0.439719   0.919
state_abbOK                0.450162   0.439054   1.025
state_abbNE                0.456021   0.439135   1.038
state_abbRI                0.490155   0.442348   1.108
state_abbIN                0.498546   0.439601   1.134
state_abbSD                0.429141   0.440286   0.975
state_abbSC                0.567735   0.439206   1.293
state_abbAL                0.490743   0.440257   1.115
state_abbIA                0.622374   0.439067   1.417
state_abbMO                0.642044   0.439019   1.462
state_abbGA                0.645237   0.439027   1.470
state_abbAR                0.698930   0.439820   1.589
state_abbNJ                0.706553   0.439849   1.606
state_abbOH                0.707418   0.439434   1.610
state_abbME                0.761086   0.440273   1.729
state_abbMD                0.792350   0.441005   1.797
state_abbMA                0.890169   0.439544   2.025
state_abbMT                1.636419   0.440311   3.717
state_abbTN                1.061525   0.439520   2.415
state_abbVA                1.209035   0.439295   2.752
state_abbCT                1.064990   0.439820   2.421
state_abbNC                1.178392   0.439130   2.683
state_abbWY                1.201424   0.439391   2.734
state_abbNM                1.194106   0.441314   2.706
state_abbNH                1.304429   0.440049   2.964
state_abbNY                1.439915   0.439288   3.278
state_abbNV                1.440967   0.440390   3.272
state_abbAZ                1.396704   0.441372   3.164
state_abbKY                1.224942   0.439678   2.786
state_abbPA                1.768951   0.439457   4.025
state_abbWA                1.946315   0.439953   4.424
state_abbID                1.959219   0.440846   4.444
state_abbVT                2.057327   0.440986   4.665
state_abbOR                2.149651   0.439870   4.887
state_abbCO                2.111321   0.439268   4.806
state_abbUT                3.776444   0.459380   8.221
state_abbCA                2.737617   0.439431   6.230
state_abbWV                3.014099   0.440741   6.839
Show the code
r_squared <- rsq(mod.h3.6)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


10.5.2.7 Model mod.h3.7

mod.h3.7 uses only state effects. \(R^2\) = 0.40.

Show the code
mod.h3.7 <- dta_for_model |>
  mutate(tri_mean_dam_scaled = scale(tri_mean_dam)) |>
  lmer(tri_mean_dam_scaled ~  1 | state_abb,
       data = _)

summary(mod.h3.7)
Linear mixed model fit by REML ['lmerMod']
Formula: tri_mean_dam_scaled ~ 1 | state_abb
   Data: mutate(dta_for_model, tri_mean_dam_scaled = scale(tri_mean_dam))

REML criterion at convergence: 207681.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5983 -0.4175 -0.1013  0.2144 14.4448 

Random effects:
 Groups    Name        Variance Std.Dev.
 state_abb (Intercept) 0.5590   0.7477  
 Residual              0.6552   0.8095  
Number of obs: 85854, groups:  state_abb, 48

Fixed effects:
            Estimate Std. Error t value
(Intercept)   0.2026     0.1080   1.876
Show the code
r_squared <- rsq(mod.h3.7)

rsq_comment <- glue(
  "r_squared: {round(r_squared$model, digits = 2)}", 
     "\nconsisting of fixed effects: {round(r_squared$fixed, digits = 2)}", 
     " and random effects: {round(r_squared$random, digits = 2)}"
  )


10.5.3 Conclusion: H3

Plots re: \(tri\_mean_{dam}\) are similar to those when using \(tri\_mean_{sample}\) in Figure 10.16 above to explore \(H_2\). See Figure 10.19.

The near-vertical nature of the categories in the scatter plots indicate that there is no strong relationship between \(tri\_mean_{sample}\) or \(tri\_mean_{dam}\)) and either surfaceArea, nidStorage, or \(surfaceArea::nidStorage\).

I explored seven regression models. None strenghen the case for \(H_3\).

In other words, \(H_3\) is not supported by the data.


10.6 Summary: insights from ruggedness

\(H_1\) is the only one of the three hypotheses supported by the data–and that only on a qualified basis. In flatter regions it’s true that the areas around dams are more rugged than the median ruggedness. This is not true for mountainous regions, since dams are placed in valleys, which are flatter than the surrounding mountains. Additionally large dams create flatness by virtue of their lakes.

\(H_2\) and \(H_3\) are disproved by the data. While surfaceArea and nidStorage are related (Figure 10.13), the ratio \(surfaceArea::nidStorage\) does not appear to have an association with ruggedness–either of the areas around dams (\(tri\_mean_{dam}\)) or around random points in the state (\(tri\_mean_{sample}\)).



  1. And in the case of dams for mine tailings and waste products from coal-based electricity generation: other liquids and semi-liquids.↩︎

  2. Wilson et al 2007, Multiscale Terrain Analysis of Multibeam Bathymetry Data for Habitat Mapping on the Continental Slope. Marine Geodesy 30:3-35. See also the original paper defining TRI by Riley, DeGloia, and Elliot (1999): A Terrain Ruggedness Index that Quantifies Topographic Heterogeneity. Authors: Shawn J. Riley, Stephen D. DeGloria, Robert Elliot. Intermountain Journal of Sciences, Vol. 5, No. 1-4, 1999. https://arc.lib.montana.edu/ojs/index.php/IJS/article/view/1794/1457 which is available for download from the Archives and Special Collections of Montana State University.↩︎

  3. https://www.earthdata.nasa.gov/data/instruments/srtm ↩︎

  4. See https://geoservice.dlr.de/web/dataguide/srtm/ and https://gis.stackexchange.com/questions/110755/is-srtm-3-second-arc-free-of-non-terrain-features ↩︎

  5. https://en.wikipedia.org/wiki/List_of_extreme_points_of_the_United_States ↩︎