3  Modeling failure rate

Show the code
data_for_model <- dta_working_set |>
  # express distance in units of 10,000 km so coefficients are per 10K km
  mutate(median_km_driven_10K = median_km_driven / 10000) |>
  # keep brands with at least one model year of 2015 or later
  filter(any(model_year >= 2015),
         .by = brand) |>
  # require at least three model years per brand
  filter(n_model_years_brand >= 3) |>
  # order brand levels by mean failure rate for plots and tables
  mutate(brand = fct_reorder(brand, failure_rate, mean))

3.1 Unfair average failure rates

Because failure_rate is influenced by distance driven and vehicle age (see Section A.1.2 Causal graph), a simple list of average failure rates by brand (Table 3.1) can be misleading. For example:

  1. Brands with relatively more vehicles in older model years will have “unfairly” higher failure rates (see the numeric sketch after this list).

  2. The lowest quality vehicles may be withdrawn from service relatively earlier than higher quality cars, “unfairly” reducing the lowest quality brands’ failure rates.

  3. If brands mostly offer expensive sports cars that are driven less than average, then these brands will have “unfairly” low failure rates.
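To make the first point concrete, here is a small, purely hypothetical sketch: the brand names, rates, and counts below are invented for illustration. Both brands fail 10% of inspections at age 4 and 30% at age 10, yet their simple weighted averages differ because their fleets have different age mixes.

library(dplyr)
library(tibble)

# Hypothetical data, invented only to illustrate point 1: identical
# age-specific failure rates, different fleet-age mixes.
toy <- tribble(
  ~brand,    ~vehicle_age, ~failure_rate, ~inspection_count,
  "Brand A",  4,           0.10,          8000,  # mostly newer vehicles
  "Brand A", 10,           0.30,          2000,
  "Brand B",  4,           0.10,          2000,  # mostly older vehicles
  "Brand B", 10,           0.30,          8000
)

toy |>
  summarize(avg_failure_rate = weighted.mean(failure_rate, w = inspection_count),
            .by = brand)
#> Brand A: 0.14, Brand B: 0.26 -- same underlying quality, different averages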

Show the code
##| column: page-right

data_for_model |>
  summarize(avg_failure_rate = weighted.mean(failure_rate, w = inspection_count),
            model_years_in_data = paste0(min(model_year), "-", max(model_year)),
            .by = brand) |>
  arrange(avg_failure_rate) |>
  mutate(rank = rank(avg_failure_rate)) |>
  gt() |>
  tab_header(md("**Brands ranked by average failure rate over all model years**")) |>
  fmt_number(columns = avg_failure_rate,
             decimals = 2)
Table 3.1: Brands ranked by average failure rate (not a fair ranking)
Brands ranked by average failure rate over all model years
brand avg_failure_rate model_years_in_data rank
Porsche 0.09 2007-2018 1
Lexus 0.11 2007-2017 2
Suzuki 0.13 2007-2018 3
Toyota 0.13 2007-2018 4
Audi 0.14 2007-2018 5
Mini 0.15 2007-2018 6
Seat 0.16 2007-2018 7
Skoda 0.16 2007-2018 8
Honda 0.16 2007-2018 9
Tesla Motors 0.16 2015-2017 10
Mitsubishi 0.17 2007-2018 11
Volvo 0.17 2007-2018 12
VW 0.19 2007-2018 13
Opel 0.20 2007-2018 14
Subaru 0.21 2007-2018 15
Hyundai 0.21 2007-2018 16
Ford 0.21 2007-2018 17
BMW 0.22 2007-2018 18
MB 0.22 2007-2018 19
Kia 0.23 2007-2018 20
Jaguar Land Rover 0.23 2007-2017 21
Mazda 0.24 2007-2018 22
Nissan 0.27 2007-2018 23
Peugeot 0.28 2007-2018 24
Dacia 0.29 2010-2018 25
Renault 0.29 2007-2018 26
Citroen 0.30 2007-2018 27
Alfa Romeo 0.30 2007-2017 28
Fiat 0.31 2007-2017 29

3.2 Linear regressions

If we accept that inspection failure rate is a good proxy for brand quality, we can estimate relative brand quality by accounting for vehicle age and distance driven using regression analysis.

3.2.1 The two simplest models

Show the code
mod1 <- data_for_model |>
  lm(failure_rate ~ vehicle_age,
     data = _)

mod2 <- data_for_model |>
  lm(failure_rate ~ median_km_driven_10K,
     data = _)

According to Table 3.2 (a quick check of this interpretation follows the list):

  • For every one-year increase in vehicle_age, the failure_rate increased by about 2.5 percentage points (0.025).

  • For every additional 10K km driven (median_km_driven_10K), the failure_rate increased by about 1 percentage point (0.010).
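As a quick check of this reading (my addition, using mod1 from the chunk above): for a linear model, the difference between predictions at ages one year apart equals the fitted vehicle_age coefficient.

# Difference between predictions one year apart equals the slope, ~0.025
preds <- predict(mod1, newdata = data.frame(vehicle_age = c(5, 6)))
diff(preds)
coef(mod1)[["vehicle_age"]]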

Show the code
p1 <- bind_cols(
  tidy(mod1,
       conf.int = TRUE,
       conf.level = my_ci) |>
  select(term, estimate, std.error, p.value, conf.low, conf.high),
  glance(mod1) |>
  select(adj.r.squared, sigma, logLik, AIC, BIC, nobs)
) |> mutate(model = "mod1",
            formula = "lm(failure_rate ~ vehicle_age)")

p2 <- bind_cols(
  tidy(mod2,
       conf.int = TRUE,
       conf.level = my_ci) |>
  select(term, estimate, std.error, p.value, conf.low, conf.high),
  glance(mod2) |>
  select(adj.r.squared, sigma, logLik, AIC, BIC, nobs)
) |> mutate(model = "mod2",
            formula = "lm(failure_rate ~ median_km_driven_10K)")

bind_rows(p1, p2) |>
  relocate(c(model, formula), .before = term) |>
  filter(term != "(Intercept)") |>
  gt() |>
  tab_options(table.font.size = 10) |>
  fmt_number(columns = c(estimate, std.error, p.value, conf.low, conf.high, adj.r.squared, sigma),
             decimals = 3) |>
  fmt_number(columns = c( logLik, AIC, BIC),
             decimals = 0)
Table 3.2: Simple linear models compared
model formula term estimate std.error p.value conf.low conf.high adj.r.squared sigma logLik AIC BIC nobs
mod1 lm(failure_rate ~ vehicle_age) vehicle_age 0.025 0.001 0.000 0.024 0.026 0.557 0.079 1,730 −3,453 −3,437 1539
mod2 lm(failure_rate ~ median_km_driven_10K) median_km_driven_10K 0.010 0.000 0.000 0.009 0.011 0.368 0.094 1,456 −2,907 −2,891 1539
Show the code
  # relatively higher R^2 and logLik indicate better fit
  # relatively lower AIC indicates better fit
  # relatively lower BIC indicates better fit

Regressing on vehicle_age provides a better fit than median_km_driven_10K:

  • Relatively higher \(R^2\) and log likelihood
  • Relatively lower AIC and BIC

This is because vehicle_age captures failures due to age as well as much of the information about distance driven. See Section A.1.2 Causal graph and Table 1.3 High correlation among vehicle_age, median_km_driven and average_km_driven.
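Two quick checks of these claims (my additions, assuming mod1, mod2, and data_for_model from the chunks above): a direct information-criterion comparison, and the correlation between the two predictors.

# Lower AIC/BIC favors mod1 (vehicle_age)
AIC(mod1, mod2)
BIC(mod1, mod2)

# vehicle_age carries much of the distance-driven information
data_for_model |>
  summarize(cor_age_km = cor(vehicle_age, median_km_driven_10K))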

3.2.2 Relative brand quality in Finland

Show the code
my_lm_vehicle_age <- function(x) {
  lm(failure_rate ~ vehicle_age,
     data = x)
}

mod1_set <- data_for_model |>
  filter(n_model_years_brand >= brands_cutoff_modeling_min) |>
  nest(data = c(failure_rate, vehicle_age),
       .by = brand) |>
  mutate(mod1 = map(data, my_lm_vehicle_age),
         tidyinfo = map(mod1, tidy,
                        conf.int = TRUE,
                        conf.level = my_ci)
         )

mod1_set_summary <- mod1_set |>
  select(brand, tidyinfo) |>
  unnest(tidyinfo) |>
  select(-statistic)

my_lm_km_driven <- function(x) {
  lm(failure_rate ~ median_km_driven_10K,
     data = x)
}

mod2_set <- data_for_model |>
  filter(n_model_years_brand >= brands_cutoff_modeling_min) |>
  nest(data = c(failure_rate, median_km_driven_10K),
       .by = brand) |>
  mutate(mod2 = map(data, my_lm_km_driven),
         tidyinfo = map(mod2, tidy,
                        conf.int = TRUE,
                        conf.level = my_ci)
         )

mod2_set_summary <- mod2_set |>
  select(brand, tidyinfo) |>
  unnest(tidyinfo) |>
  select(-statistic)

For each brand separately, I fit lm(failure_rate ~ vehicle_age) and lm(failure_rate ~ median_km_driven_10K); the results are presented in Figure 3.1. I excluded brands that did not have at least 30 data points.

Given the wide 90% confidence intervals, relative differences in brand quality can be determined from the vehicle_age regression only for the highest- and lowest-quality brands (where the intervals do not overlap in panel A). The narrower confidence intervals in the median_km_driven_10K regression (panel B) allow a larger set of brands to be distinguished as relatively higher or lower quality.
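A rough way to quantify that difference (my addition, using mod2_set_summary from the chunk above; substitute mod1_set_summary for panel A): count, for each brand, how many other brands' 90% intervals its own interval does not overlap.

# For each brand, count the other brands whose CIs do not overlap its own
slopes <- mod2_set_summary |>
  filter(term != "(Intercept)")

slopes |>
  rowwise() |>
  mutate(n_distinguishable =
           sum(conf.low > slopes$conf.high | conf.high < slopes$conf.low)) |>
  ungroup() |>
  arrange(desc(n_distinguishable)) |>
  select(brand, estimate, n_distinguishable)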

Show the code
p1 <- mod1_set_summary |>
  filter(term != "(Intercept)") |>
  mutate(brand = fct_reorder(brand, -estimate)) |>
  ggplot() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = brand),
                 height = 0.3,
                 linewidth = 0.5, alpha = 0.4,
                 show.legend = FALSE) +
  geom_point(aes(x = estimate, y = brand),
             size = 1, color = "firebrick",
             show.legend = FALSE) +
  labs(
    subtitle = "Increment: one year of vehicle_age",
    y = NULL,
    tag = "A"
  )

p2 <- mod2_set_summary |>
  filter(term != "(Intercept)") |>
  mutate(brand = fct_reorder(brand, -estimate)) |>
  ggplot() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = brand),
                 height = 0.3,
                 linewidth = 0.5, alpha = 0.4,
                 show.legend = FALSE) +
  geom_point(aes(x = estimate, y = brand),
             size = 1, color = "firebrick",
             show.legend = FALSE) +
  labs(
    subtitle = "Increment: 10K km driven",
    y = NULL,
    tag = "B"
  )

p1 + p2 + 
  plot_annotation(
    title = "Relative brand quality in Finland",
    subtitle = glue("For each increment, the brand adds `estimate` to the failure rate",
                    "\nwhen inspected in Finland. Includes brands with at least {brands_cutoff_modeling_min} data points.",
                    " {percent(my_ci)} CI."),
    caption = my_caption
  )

Figure 3.1: Relative brand quality determined using linear regression


The tables below include the estimates and confidence intervals plotted in Figure 3.1.

Show the code
mod1_set_summary |>
  filter(term != "(Intercept)") |>
  mutate(plus_minus = (conf.high - conf.low) / 2) |>
  select(-term) |>
  gt() |>
  fmt_number(columns = c(estimate:plus_minus),
             decimals = 3) |>
  tab_source_note(md(glue("*For each brand, calculated separately: lm(failure_rate ~ vehicle_age)* at {my_ci} CI")))
mod2_set_summary |>
  filter(term != "(Intercept)") |>
  mutate(plus_minus = (conf.high - conf.low) / 2) |>
  select(-term) |>
  gt() |>
  fmt_number(columns = c(estimate:plus_minus),
             decimals = 3) |>
  tab_source_note(md(glue("*For each brand, calculated separately: lm(failure_rate ~ median_km_driven_10K)* at {my_ci} CI")))

Table 3.3: Linear models for each brand

(a) term is vehicle_age
brand estimate std.error p.value conf.low conf.high plus_minus
Audi 0.020 0.001 0.000 0.018 0.022 0.002
BMW 0.025 0.001 0.000 0.022 0.027 0.002
Citroen 0.025 0.003 0.000 0.021 0.030 0.004
Ford 0.024 0.002 0.000 0.021 0.028 0.004
Honda 0.024 0.002 0.000 0.021 0.028 0.003
Hyundai 0.027 0.003 0.000 0.022 0.032 0.005
Kia 0.032 0.002 0.000 0.029 0.035 0.003
Mazda 0.035 0.003 0.000 0.030 0.039 0.004
MB 0.020 0.002 0.000 0.016 0.024 0.004
Mitsubishi 0.029 0.003 0.000 0.024 0.035 0.005
Nissan 0.029 0.003 0.000 0.025 0.034 0.005
Opel 0.030 0.002 0.000 0.026 0.034 0.004
Peugeot 0.029 0.003 0.000 0.025 0.034 0.005
Renault 0.029 0.003 0.000 0.024 0.033 0.005
Seat 0.030 0.002 0.000 0.027 0.033 0.003
Skoda 0.027 0.001 0.000 0.025 0.029 0.002
Suzuki 0.023 0.002 0.000 0.019 0.027 0.004
Toyota 0.019 0.002 0.000 0.017 0.022 0.002
VW 0.025 0.001 0.000 0.023 0.028 0.002
Volvo 0.024 0.002 0.000 0.021 0.027 0.003
For each brand, calculated separately: lm(failure_rate ~ vehicle_age) at 0.9 CI
(b) term is median_km_driven_10K
brand estimate std.error p.value conf.low conf.high plus_minus
Audi 0.009 0.001 0.000 0.008 0.010 0.001
BMW 0.012 0.001 0.000 0.011 0.013 0.001
Citroen 0.013 0.002 0.000 0.011 0.016 0.003
Ford 0.010 0.001 0.000 0.008 0.012 0.002
Honda 0.013 0.001 0.000 0.010 0.015 0.002
Hyundai 0.019 0.001 0.000 0.016 0.021 0.002
Kia 0.019 0.001 0.000 0.017 0.021 0.002
Mazda 0.021 0.002 0.000 0.018 0.024 0.003
MB 0.008 0.001 0.000 0.007 0.009 0.001
Mitsubishi 0.017 0.001 0.000 0.015 0.019 0.002
Nissan 0.019 0.001 0.000 0.017 0.020 0.002
Opel 0.016 0.002 0.000 0.013 0.018 0.003
Peugeot 0.017 0.002 0.000 0.014 0.020 0.003
Renault 0.015 0.002 0.000 0.011 0.019 0.004
Seat 0.015 0.001 0.000 0.013 0.017 0.002
Skoda 0.012 0.001 0.000 0.011 0.014 0.002
Suzuki 0.015 0.002 0.000 0.011 0.019 0.004
Toyota 0.007 0.001 0.000 0.006 0.009 0.002
VW 0.010 0.001 0.000 0.009 0.011 0.001
Volvo 0.011 0.001 0.000 0.009 0.013 0.002
For each brand, calculated separately: lm(failure_rate ~ median_km_driven_10K) at 0.9 CI