Introduction

The tune package helps optimize the modeling process. Users can tag arguments in recipes and model objects for optimization. The search routines in tune can discover these arguments and evaluate candidate values until a combination with good performance is found.

As an example, let’s model the Ames housing data:

library(tidymodels)

data(ames)

set.seed(4595)
data_split <- ames %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test  <- testing(data_split)

For simplicity, the sale price of a house will be modeled as a function of its geo-location. These predictors appear to have nonlinear relationships with the outcome:

ames_train %>% 
  dplyr::select(Sale_Price, Longitude, Latitude) %>% 
  tidyr::pivot_longer(cols = c(Longitude, Latitude), 
                      names_to = "predictor", values_to = "value") %>% 
  ggplot(aes(x = value, Sale_Price)) + 
  geom_point(alpha = .2) + 
  geom_smooth(se = FALSE) + 
  facet_wrap(~ predictor, scales = "free_x")
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

These two predictors could be modeled using natural splines in conjunction with a linear model. The amount of “wiggliness” in these splines is determined by the degrees of freedom. An appropriate value of this parameter cannot be analytically determined from the data, so it is a tuning parameter (a.k.a. a hyper-parameter). A common approach is to use resampling to estimate model performance over different values of these parameters and use these results to set reasonable values.

We can tag these parameters for optimization using the tune() function:

ames_rec <- 
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>% 
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_ns(Longitude, Latitude, deg_free = tune())

The package can detect these values and optimize them.
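
For example, the tagged arguments can be listed directly (a quick illustration; the printed output is omitted here):

# List the recipe arguments that have been tagged with tune()
tune_args(ames_rec)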

However, based on the plot above, the potential amount of nonlinearity between the sale price and each predictor might be different. For example, longitude might require more flexibility than latitude. The recipe above would constrain the nonlinearity of the two predictors to be the same. We can probably do better than that.

To accomplish this, individual step_ns() terms can be added to the recipe for each predictor. However, we want these to be identifiable; using the same syntax as above, we can’t tell the difference between the two deg_free parameters.

tune() has an option to provide a text annotation so that each tuning parameter has a unique identifier:

ames_rec <- 
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>% 
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_ns(Longitude, deg_free = tune("long df")) %>% 
  step_ns(Latitude,  deg_free = tune("lat df"))

The function dials::parameters() can detect and collect the parameters that have been flagged for tuning:

parameters(ames_rec)
#> Collection of 2 parameters for tuning
#> 
#>  identifier     type    object
#>     long df deg_free nparam[+]
#>      lat df deg_free nparam[+]

The dials package has default ranges for many parameters. The generic parameter function for deg_free has a fairly small range:

deg_free()
#> Degrees of Freedom (quantitative)
#> Range: [1, 5]

but there is a dials function that is more appropriate for splines:

spline_degree()
#> Piecewise Polynomial Degree (quantitative)
#> Range: [1, 10]

The parameter objects can be easily changed using the update() function:

ames_param <- 
  ames_rec %>% 
  parameters() %>% 
  update(
    `long df` = spline_degree(), 
    `lat df` = spline_degree()
  )
ames_param
#> Collection of 2 parameters for tuning
#> 
#>  identifier     type    object
#>     long df deg_free nparam[+]
#>      lat df deg_free nparam[+]
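
To double-check that the update took effect, the underlying dials objects can be pulled from the object list-column of the parameter set (a quick check; this output is not shown in the original text):

# Each parameter definition lives in the `object` column; after the update,
# this should print the spline_degree() range of [1, 10]
ames_param$object[[1]]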

Model Optimization

Instead of a linear regression, a nonlinear model might provide good performance. A K-nearest-neighbor model will also be optimized; for this example, the number of neighbors and the distance weighting function are the tuning parameters:

# requires the kknn package
knn_mod <- 
  nearest_neighbor(neighbors = tune(), weight_func = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("regression")

The easiest approach to optimize the pre-processing and model parameters is to bundle these objects into a workflow:

library(workflows)
knn_wflow <- 
  workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(ames_rec)

From this, the parameter set can be used to modify the range and values of the parameters being optimized³:

knn_param <- 
  knn_wflow %>% 
  parameters() %>% 
  update(
    `long df` = spline_degree(c(2, 18)), 
    `lat df` = spline_degree(c(2, 18)),
    neighbors = neighbors(c(3, 50)),
    weight_func = weight_func(values = c("rectangular", "inv", "gaussian", "triangular"))
  )

This parameter collection can be passed to the grid functions via the param_info argument.
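
For instance, a space-filling grid could be drawn from this parameter set (a brief sketch; the seed and grid size of 10 are arbitrary):

# Illustrative only: sample 10 candidate combinations from the updated
# ranges and values using a Latin hypercube design.
set.seed(7014)
knn_grid <- grid_latin_hypercube(knn_param, size = 10)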

Instead of using grid search, an iterative method called Bayesian optimization can be used. This takes an initial set of results and tries to predict the next tuning parameters to evaluate.

Although no grid is required, the process requires a few additional pieces of information:

  • A description of the search space. At a minimum, this would consist of ranges for numeric parameters and a list of possible values for categorical tuning parameters.

  • An acquisition function that helps score potential tuning parameter values.

  • A model for analyzing and making predictions of the best tuning parameter values. A Gaussian process model is typical and is used here.
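
The search also needs a resampling object. The cv_splits object used below is not created in this excerpt; a plausible definition, assuming 10-fold cross-validation stratified on the outcome, would be:

# Assumed setup (not shown above): 10-fold cross-validation of the training
# set, stratified on the logged sale price.
set.seed(2453)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)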

The code to conduct the search is:

ctrl <- control_bayes(verbose = TRUE)
set.seed(8154)
knn_search <- tune_bayes(knn_wflow, resamples = cv_splits, initial = 5, iter = 20,
                         param_info = knn_param, control = ctrl)
#> 
#> ❯  Generating a set of 5 initial parameter results
#> ✓ Initialization complete
#> 
#> Optimizing rmse using the expected improvement
#> 
#> ── Iteration 1 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08576 (@iter 0)
#> i Gaussian process model
#> ! The Gaussian process model is being fit using 7 features but only has 5
#>   data points to do so. This may cause errors or a poor model fit.
#> ✓ Gaussian process model
#> i Generating 4773 candidates
#> i Predicted candidates
#> i neighbors=3, weight_func=triangular, long df=2, lat df=18
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.09449 (+/-0.00136)
#> 
#> ── Iteration 2 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08576 (@iter 0)
#> i Gaussian process model
#> ! The Gaussian process model is being fit using 7 features but only has 6
#>   data points to do so. This may cause errors or a poor model fit.
#> ✓ Gaussian process model
#> i Generating 4769 candidates
#> i Predicted candidates
#> i neighbors=10, weight_func=triangular, long df=7, lat df=10
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results:    rmse=0.08265 (+/-0.00208)
#> 
#> ── Iteration 3 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ! The Gaussian process model is being fit using 7 features but only has 7
#>   data points to do so. This may cause errors or a poor model fit.
#> ✓ Gaussian process model
#> i Generating 4776 candidates
#> i Predicted candidates
#> i neighbors=6, weight_func=triangular, long df=6, lat df=18
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08724 (+/-0.00175)
#> 
#> ── Iteration 4 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ! The Gaussian process model is being fit using 7 features but only has 8
#>   data points to do so. This may cause errors or a poor model fit.
#> ✓ Gaussian process model
#> i Generating 4765 candidates
#> i Predicted candidates
#> i neighbors=44, weight_func=gaussian, long df=7, lat df=2
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08992 (+/-0.0016)
#> 
#> ── Iteration 5 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4787 candidates
#> i Predicted candidates
#> i neighbors=9, weight_func=triangular, long df=7, lat df=2
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08438 (+/-0.00148)
#> 
#> ── Iteration 6 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4792 candidates
#> i Predicted candidates
#> i neighbors=4, weight_func=inv, long df=7, lat df=10
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08509 (+/-0.00215)
#> 
#> ── Iteration 7 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4775 candidates
#> i Predicted candidates
#> i neighbors=32, weight_func=triangular, long df=7, lat df=8
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08643 (+/-0.00195)
#> 
#> ── Iteration 8 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4769 candidates
#> i Predicted candidates
#> i neighbors=3, weight_func=rectangular, long df=7, lat df=13
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08874 (+/-0.00189)
#> 
#> ── Iteration 9 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4771 candidates
#> i Predicted candidates
#> i neighbors=17, weight_func=inv, long df=7, lat df=9
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08438 (+/-0.0019)
#> 
#> ── Iteration 10 ────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4742 candidates
#> i Predicted candidates
#> i neighbors=14, weight_func=triangular, long df=7, lat df=9
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08311 (+/-0.00202)
#> 
#> ── Iteration 11 ────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4752 candidates
#> i Predicted candidates
#> i neighbors=9, weight_func=gaussian, long df=7, lat df=16
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08704 (+/-0.00225)
#> 
#> ── Iteration 12 ────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rmse=0.08265 (@iter 2)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 4779 candidates
#> i Predicted candidates
#> i neighbors=9, weight_func=triangular, long df=7, lat df=13
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    rmse=0.08397 (+/-0.002)
#> ! No improvement for 10 iterations; returning current results.

Visually, the performance gain was:

autoplot(knn_search, type = "performance", metric = "rmse")

The best results here were:

collect_metrics(knn_search) %>% 
  dplyr::filter(.metric == "rmse") %>% 
  arrange(mean)
#> # A tibble: 17 x 11
#>    neighbors weight_func `long df` `lat df` .metric .estimator   mean     n
#>        <int> <chr>           <int>    <int> <chr>   <chr>       <dbl> <int>
#>  1        10 triangular          7       10 rmse    standard   0.0827    10
#>  2        14 triangular          7        9 rmse    standard   0.0831    10
#>  3         9 triangular          7       13 rmse    standard   0.0840    10
#>  4        17 inv                 7        9 rmse    standard   0.0844    10
#>  5         9 triangular          7        2 rmse    standard   0.0844    10
#>  6         4 inv                 7       10 rmse    standard   0.0851    10
#>  7         5 triangular          8       14 rmse    standard   0.0858    10
#>  8        32 triangular          7        8 rmse    standard   0.0864    10
#>  9         9 gaussian            7       16 rmse    standard   0.0870    10
#> 10         6 triangular          6       18 rmse    standard   0.0872    10
#> 11         3 rectangular         7       13 rmse    standard   0.0887    10
#> 12        36 triangular          9        6 rmse    standard   0.0889    10
#> 13        44 gaussian            7        2 rmse    standard   0.0899    10
#> 14        28 inv                16       10 rmse    standard   0.0919    10
#> 15         3 triangular          2       18 rmse    standard   0.0945    10
#> 16        18 rectangular        13        3 rmse    standard   0.0947    10
#> 17        47 inv                 3       16 rmse    standard   0.0978    10
#> # … with 3 more variables: std_err <dbl>, .config <chr>, .iter <int>

With this intrinsically nonlinear model, there is less reliance on the nonlinear terms created by the recipe.
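
From here, a natural follow-up (not part of the original analysis) is to select the numerically best parameter combination and finalize the workflow with it:

# A hedged sketch: pick the best settings by RMSE, plug them into the
# workflow, and fit on the full training set.
best_knn <- select_best(knn_search, metric = "rmse")
final_knn_fit <- 
  knn_wflow %>% 
  finalize_workflow(best_knn) %>% 
  fit(data = ames_train)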


  1. A simple R model formula could have been used here, such as Sale_Price ~ log10(Gr_Liv_Area) + Longitude + Latitude. A recipe is not required.

  2. tune has default measures of performance that it uses if none are specified. Here the RMSE and R² are estimated. This can be changed using the metrics option.

  3. One of the tuning parameters (weight_func) is categorical and, by default, has 10 unique values. The model used to predict new test parameters is a Gaussian process model, and this can become slow to fit when the number of tuning parameters is large or when a categorical parameter generates many dummy variables. We've reduced the number of categories for this parameter to speed things up a bit.