Subsampling can be a helpful approach to dealing with classification data where one or more classes occur very infrequently. Often, models will overfit to the majority class and produce very good statistics for the frequently occurring class while the minority classes have poor performance.

Consider a two-class problem where the first class has a very low rate of occurrence. The caret package has a function that can simulate such data:

library(caret)

set.seed(244)
imbal_data <- twoClassSim(1000, intercept = 10)
table(imbal_data$Class)
#> 
#> Class1 Class2 
#>     47    953

If “Class1” is the event of interest, it is very likely that a classification model would be able to achieve very good specificity since almost all of the data are the second class. Sensitivity will often be poor since the models will optimize accuracy (or other loss functions) by predicting everything to be the majority class.

When there are two classes, the result is that the default probability cutoff of 50% is inappropriate; a different, more extreme cutoff might be able to achieve good performance.

One way to alleviate this issue is to subsample the data. There are a number of ways to do this but the simplest is to sample down the majority class until it occurs with roughly the same frequency as the minority class. While counterintuitive, throwing out a large percentage of the data can be effective. In some cases, this means that the overall performance of the model is better (e.g., improved area under the ROC curve). More importantly, subsampling almost always produces models that are better calibrated, meaning that the distributions of the class probabilities are better behaved. As a result, the default 50% cutoff is much more likely to produce good sensitivity and specificity values than it would otherwise.

To demonstrate this, step_downsample will be used in a recipe for the simulated data. In terms of workflow:

  • It is extremely important that subsampling occurs inside of resampling. Otherwise, the resampling process can produce poor estimates of model performance.
  • The subsampling process should only be applied to the analysis set. The assessment set should reflect the event rates seen “in the wild” and, for this reason, the skip argument to step_downsample is defaulted to TRUE.

Here is a simple recipe:

library(recipes)
imbal_rec <- 
  recipe(Class ~ ., data = imbal_data) %>%
  step_downsample(Class)
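
As a quick illustration (for exposition only; example_rec is an arbitrary name), prepping the recipe shows that down-sampling only affects the data used to train the recipe: juice() returns the retained, balanced training set, while baking new data leaves the original event rate alone because the step has skip = TRUE.

example_rec <- prep(imbal_rec, training = imbal_data, retain = TRUE)

# The retained training data are down-sampled to equal class frequencies
table(juice(example_rec)$Class)

# Since `skip = TRUE`, baking new data keeps the original class frequencies
table(bake(example_rec, imbal_data)$Class)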

Basic cross-validation is used to resample the model:

library(rsample)
set.seed(5732)
cv_folds <- vfold_cv(imbal_data, strata = "Class", repeats = 5)

An additional column is added to the data that contains the trained recipes for each resample:
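
One possible way to do this is with purrr::map() and rsample::prepper(), which preps the recipe on the analysis set of each split (the column name recipes is assumed by the code later on):

library(purrr)
library(dplyr)

cv_folds <- 
  cv_folds %>%
  mutate(
    # prepper() calls prep() with the analysis data of each split
    recipes = map(splits, prepper, recipe = imbal_rec, retain = TRUE)
  )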

The model that will be used to demonstrate subsampling is quadratic discriminant analysis via the MASS package. A function will be used to train the model and to produce class probabilities as well as hard class predictions using the default 50% cutoff. When a recipe is passed to the function, down-sampling will be applied. If no recipe is given, the data are used to fit the model as-is:
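
A sketch of such a function is shown below (the name assess_res and its exact structure are illustrative rather than a fixed API; it uses MASS::qda() along with juice()/bake() from recipes and analysis()/assessment() from rsample):

library(MASS)

assess_res <- function(split, rec = NULL, ...) {
  # With a recipe, fit to the down-sampled analysis data; otherwise use it as-is
  if (!is.null(rec))
    mod_data <- juice(rec)
  else
    mod_data <- analysis(split)
  
  mod_fit <- qda(Class ~ ., data = mod_data)
  
  # The assessment set is never down-sampled (skip = TRUE in the recipe)
  if (!is.null(rec))
    eval_data <- bake(rec, assessment(split))
  else
    eval_data <- assessment(split)
  
  preds <- predict(mod_fit, eval_data)
  
  # MASS masks dplyr::select(), so the namespace is given explicitly
  eval_data %>%
    mutate(
      pred = preds$class,                 # hard class call at the 50% cutoff
      prob = preds$posterior[, "Class1"]  # probability of the event of interest
    ) %>%
    dplyr::select(Class, pred, prob)
}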

For example:
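
These calls are illustrative and use only the first resample:

# No recipe: the model is fit to the imbalanced analysis set
assess_res(cv_folds$splits[[1]]) %>% head()

# With the prepped recipe: the model is fit to the down-sampled analysis set
assess_res(cv_folds$splits[[1]], cv_folds$recipes[[1]]) %>% head()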

To measure model effectiveness, two metrics are used:

  • The area under the ROC curve is an overall assessment of performance across all possible cutoffs. Values near one indicate very good results while values near 0.50 imply that the model is no better than random guessing.
  • The J index (a.k.a. Youden’s J statistic) is sensitivity + specificity - 1. Values near one are once again best.

If a model is poorly calibrated, the ROC curve value might not show diminished performance. However, the J index would be lower for models with pathological distributions for the class probabilities. The yardstick package will be used to compute these metrics.

Now, we train the models and generate the predictions. These are stored in list columns where each list element is a data frame of the predictions on the assessment data:
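
One way to do this with purrr (the column names sampled_pred and normal_pred are assumed by the metric code below):

cv_folds <- 
  cv_folds %>%
  mutate(
    # QDA fit to the down-sampled analysis data via the prepped recipes
    sampled_pred = map2(splits, recipes, assess_res),
    # QDA fit to the analysis data as-is
    normal_pred  = map(splits, assess_res)
  )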

Now, the performance metrics are computed:

library(yardstick)
cv_folds <- 
  cv_folds %>%
  mutate(
    sampled_roc = 
      map_dfr(sampled_pred, roc_auc, Class, prob) %>% 
      pull(".estimate"),
    
    normal_roc =  
      map_dfr(normal_pred,  roc_auc, Class, prob) %>% 
      pull(".estimate"),  
    
    sampled_J =   
      map_dfr(sampled_pred, j_index, Class, pred) %>% 
      pull(".estimate"),
    
    normal_J =    
      map_dfr(normal_pred,  j_index, Class, pred) %>% 
      pull(".estimate")       
  )

What do the ROC values look like? A Bland-Altman plot, which graphs the difference between the two estimates against their average, can be used to show how the results compare over their observed range:
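
One way to draw this with ggplot2 (a sketch; the horizontal line marks no difference between the two approaches):

library(ggplot2)

cv_folds %>%
  mutate(
    roc_diff = sampled_roc - normal_roc,
    roc_mean = (sampled_roc + normal_roc) / 2
  ) %>%
  ggplot(aes(x = roc_mean, y = roc_diff)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "green")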

It doesn’t appear that subsampling had much of an effect on this metric. The average difference is -0.015, which is fairly small.

For the J statistic, the results show a different story:
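
The same kind of plot can be made for the J index by swapping in those columns:

cv_folds %>%
  mutate(
    j_diff = sampled_J - normal_J,
    j_mean = (sampled_J + normal_J) / 2
  ) %>%
  ggplot(aes(x = j_mean, y = j_diff)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "green")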

Almost all of the differences are greater than zero. We can use tidyposterior to do a more formal analysis:

library(tidyposterior)

# Remove all columns except the resample info and the J indices,
# then fit the Bayesian model
j_mod <- 
  cv_folds %>% 
  dplyr::select(-recipes, -matches("pred$"), -matches("roc$")) %>% 
  perf_mod(seed = 62378, iter = 5000)
#> 
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.000276 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.76 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 1: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 1: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 1: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 1: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 1.355 seconds (Warm-up)
#> Chain 1:                1.11739 seconds (Sampling)
#> Chain 1:                2.47239 seconds (Total)
#> Chain 1: 
#> 
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
#> Chain 2: 
#> Chain 2: Gradient evaluation took 2.5e-05 seconds
#> Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
#> Chain 2: Adjust your expectations accordingly!
#> Chain 2: 
#> Chain 2: 
#> Chain 2: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 2: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 2: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 2: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 2: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 2: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 2: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 2: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 2: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 2: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 2: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 2: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 2: 
#> Chain 2:  Elapsed Time: 1.40265 seconds (Warm-up)
#> Chain 2:                1.87085 seconds (Sampling)
#> Chain 2:                3.2735 seconds (Total)
#> Chain 2: 
#> 
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
#> Chain 3: 
#> Chain 3: Gradient evaluation took 2.7e-05 seconds
#> Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
#> Chain 3: Adjust your expectations accordingly!
#> Chain 3: 
#> Chain 3: 
#> Chain 3: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 3: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 3: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 3: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 3: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 3: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 3: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 3: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 3: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 3: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 3: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 3: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 3: 
#> Chain 3:  Elapsed Time: 1.59503 seconds (Warm-up)
#> Chain 3:                0.940936 seconds (Sampling)
#> Chain 3:                2.53596 seconds (Total)
#> Chain 3: 
#> 
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
#> Chain 4: 
#> Chain 4: Gradient evaluation took 2.6e-05 seconds
#> Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
#> Chain 4: Adjust your expectations accordingly!
#> Chain 4: 
#> Chain 4: 
#> Chain 4: Iteration:    1 / 5000 [  0%]  (Warmup)
#> Chain 4: Iteration:  500 / 5000 [ 10%]  (Warmup)
#> Chain 4: Iteration: 1000 / 5000 [ 20%]  (Warmup)
#> Chain 4: Iteration: 1500 / 5000 [ 30%]  (Warmup)
#> Chain 4: Iteration: 2000 / 5000 [ 40%]  (Warmup)
#> Chain 4: Iteration: 2500 / 5000 [ 50%]  (Warmup)
#> Chain 4: Iteration: 2501 / 5000 [ 50%]  (Sampling)
#> Chain 4: Iteration: 3000 / 5000 [ 60%]  (Sampling)
#> Chain 4: Iteration: 3500 / 5000 [ 70%]  (Sampling)
#> Chain 4: Iteration: 4000 / 5000 [ 80%]  (Sampling)
#> Chain 4: Iteration: 4500 / 5000 [ 90%]  (Sampling)
#> Chain 4: Iteration: 5000 / 5000 [100%]  (Sampling)
#> Chain 4: 
#> Chain 4:  Elapsed Time: 1.53407 seconds (Warm-up)
#> Chain 4:                1.05475 seconds (Sampling)
#> Chain 4:                2.58882 seconds (Total)
#> Chain 4:

A simple plot of the posterior distributions of the J indices for each model shows that there is a real difference; subsampling the data prior to modeling produced better calibrated models:
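
A sketch of such a plot is below. It assumes that tidy() on the perf_mod object returns posterior draws in columns named model and posterior (the column names may differ across tidyposterior versions):

j_mod %>%
  tidy() %>%
  ggplot(aes(x = posterior, col = model)) +
  geom_density()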