The Ames housing data will be used to to demonstrate how regression models can be made using parsnip. We’ll create the data set and create a simple training/test set split:

library(AmesHousing)
ames <- make_ames()

library(tidymodels)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> ── Attaching packages ───────────────────────────────── tidymodels 0.0.2 ──
#> ✔ broom     0.5.1       ✔ recipes   0.1.4
#> ✔ dials     0.0.2       ✔ rsample   0.0.4
#> ✔ dplyr     0.8.0.1     ✔ tibble    2.1.1
#> ✔ infer     0.4.0       ✔ yardstick 0.0.2
#> ✔ purrr     0.3.2
#> ── Conflicts ──────────────────────────────────── tidymodels_conflicts() ──
#> ✖ dials::margin()  masks ggplot2::margin(), randomForest::margin()

set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", p = 0.75)

ames_train <- training(data_split)
ames_test  <- testing(data_split)

# Random Forests

We’ll start by fitting a random forest model to a small set of parameters. Let’s say that the model would include predictors: Longitude, Latitude, Lot_Area, Neighborhood, and Year_Sold. A simple random forest model can be specified via

library(parsnip)

rf_defaults <- rand_forest(mode = "regression")
rf_defaults
#> Random Forest Model Specification (regression)

The model will be fit with the ranger package. Since we didn’t add any extra arguments to fit, many of the arguments will be set to their defaults from the specific function that is used by ranger::ranger. The help pages for the model function describes the changes to the default parameters that are made and the translate function can also be used.

parsnip gives two different interfaces to the models: the formula and non-formula interfaces. Let’s start with the non-formula interface:

preds <- c("Longitude", "Latitude", "Lot_Area", "Neighborhood", "Year_Sold")

rf_xy_fit <-
rf_defaults %>%
set_engine("ranger") %>%
fit_xy(
x = ames_train[, preds],
y = log10(ames_train\$Sale_Price)
)
rf_xy_fit
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#>  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type:                             Regression
#> Number of trees:                  500
#> Sample size:                      2199
#> Number of independent variables:  5
#> Mtry:                             2
#> Target node size:                 5
#> Variable importance mode:         none
#> Splitrule:                        variance
#> OOB prediction error (MSE):       0.00859
#> R squared (OOB):                  0.72

The non-formula interface doesn’t do anything to the predictors before giving it to the underlying model function. This particular model does not require indicator variables to be create prior to the model (note that the output shows “Number of independent variables: 5”).

For regression models, the basic predict method can be used and returns a tibble with a column named .pred:

test_results <- ames_test %>%
select(Sale_Price) %>%
mutate(Sale_Price = log10(Sale_Price)) %>%
bind_cols(
predict(rf_xy_fit, new_data = ames_test[, preds])
)
test_results
#> # A tibble: 731 x 2
#>    Sale_Price .pred
#>         <dbl> <dbl>
#>  1       5.02  5.20
#>  2       5.39  5.26
#>  3       5.25  5.26
#>  4       5.32  5.32
#>  5       5.06  5.16
#>  6       4.94  5.01
#>  7       5.18  5.11
#>  8       5.58  5.51
#>  9       5.34  5.51
#> 10       5.41  5.53
#> # … with 721 more rows

# summarize performance
test_results %>% metrics(truth = Sale_Price, estimate = .pred)
#> # A tibble: 3 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0935
#> 2 rsq     standard      0.738
#> 3 mae     standard      0.0660

Note that:

• If the model required indicator variables, we would have to create them manually prior to using fit (perhaps using the recipes package).
• We had to manually log the outcome prior to modeling.

Now, for illustration, let’s use the formula method using some new parameter values:

rand_forest(mode = "regression", mtry = 3, trees = 1000) %>%
set_engine("ranger") %>%
fit(
log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
)
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = ~3, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1))
#>
#> Type:                             Regression
#> Number of trees:                  1000
#> Sample size:                      2199
#> Number of independent variables:  5
#> Mtry:                             3
#> Target node size:                 5
#> Variable importance mode:         none
#> Splitrule:                        variance
#> OOB prediction error (MSE):       0.00866
#> R squared (OOB):                  0.718

Suppose that there was some feature in the randomForest package that we’d like to evaluate. To do so, the only part of the syntaxt that needs to change is the set_engine argument:

rand_forest(mode = "regression", mtry = 3, trees = 1000) %>%
set_engine("randomForest") %>%
fit(
log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
)
#> parsnip model object
#>
#>
#> Call:
#>  randomForest(x = as.data.frame(x), y = y, ntree = ~1000, mtry = ~3)
#>                Type of random forest: regression
#>                      Number of trees: 1000
#> No. of variables tried at each split: 3
#>
#>           Mean of squared residuals: 0.0127
#>                     % Var explained: 58.6

Look at the formula code that was printed out, one function uses the argument name ntree and the other uses num.trees. parsnip doesn’t require you to know the specific names of the main arguments.

Now suppose that we want to modify the value of mtry based on the number of predictors in the data. Usually, the default value would be floor(sqrt(num_predictors)). To use a pure bagging model would require an mtry value equal to the total number of parameters. There may be cases where you may not know how many predictors are going to be present (perhaps due to the generation of indicator variables or a variable filter) so that might be difficult to know exactly.

When the model it being fit by parsnip, data descriptors are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using rand_forest), the values of the arguments that you give it are immediately evaluated… unless you delay them. To delay the evaluation of any argument, you can used rlang::expr to make an expression.

Two relevant descriptors for what we are about to do are:

• .preds(): the number of predictor variables in the data set that are associated with the predictors prior to dummy variable creation.
• .cols(): the number of predictor columns after dummy variables (or other encodings) are created.

Since ranger won’t create indicator values, .preds() would be appropriate for using mtry for a bagging model.

For example, let’s use an expression with the .preds() descriptor to fit a bagging model:

rand_forest(mode = "regression", mtry = .preds(), trees = 1000) %>%
set_engine("ranger") %>%
fit(
log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
)
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = ~.preds(),      num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1))
#>
#> Type:                             Regression
#> Number of trees:                  1000
#> Sample size:                      2199
#> Number of independent variables:  5
#> Mtry:                             5
#> Target node size:                 5
#> Variable importance mode:         none
#> Splitrule:                        variance
#> OOB prediction error (MSE):       0.00888
#> R squared (OOB):                  0.711

# Penalized Logistic Regression

A linear model might work here too. The linear_reg model can be used. To use regularization/penalization, there are two engines that can do that here: the glmnet and sparklyr packages. The former will be used here and it only implements the non-formula method. parsnip will allow either to be used though.

When regularization is used, the predictors should first be centered and scaled before given to the model. The formula method won’t do that so some other methods will be required. We’ll use recipes package for that (more information here).

norm_recipe <-
recipe(
Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
) %>%
step_other(Neighborhood) %>%
step_dummy(all_nominal()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_log(Sale_Price, base = 10) %>%
# estimate the means and standard deviations
prep(training = ames_train, retain = TRUE)

# Now let's fit the model using the processed version of the data

glmn_fit <-
linear_reg(penalty = 0.001, mixture = 0.5) %>%
set_engine("glmnet") %>%
fit(Sale_Price ~ ., data = juice(norm_recipe))
glmn_fit
#> parsnip model object
#>
#>
#> Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian",      alpha = ~0.5, lambda = ~0.001)
#>
#>      Df  %Dev Lambda
#> [1,] 11 0.387  0.001

If penalty were not specified, all of the lambda values would be computed.

To get the predictions for this specific value of lambda (aka penalty):

# First, get the processed version of the test set predictors:
test_normalized <- bake(norm_recipe, new_data = ames_test, all_predictors())

test_results <-
test_results %>%
rename(random forest = .pred) %>%
bind_cols(
predict(glmn_fit, new_data = test_normalized) %>%
rename(glmnet = .pred)
)
test_results
#> # A tibble: 731 x 3
#>    Sale_Price random forest glmnet
#>         <dbl>           <dbl>  <dbl>
#>  1       5.02            5.20   5.18
#>  2       5.39            5.26   5.17
#>  3       5.25            5.26   5.26
#>  4       5.32            5.32   5.26
#>  5       5.06            5.16   5.18
#>  6       4.94            5.01   5.18
#>  7       5.18            5.11   5.18
#>  8       5.58            5.51   5.50
#>  9       5.34            5.51   5.49
#> 10       5.41            5.53   5.50
#> # … with 721 more rows

test_results %>% metrics(truth = Sale_Price, estimate = glmnet)
#> # A tibble: 3 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard       0.142
#> 2 rsq     standard       0.399
#> 3 mae     standard       0.104

test_results %>%
gather(model, prediction, -Sale_Price) %>%
ggplot(aes(x = prediction, y = Sale_Price)) +
geom_abline(col = "green", lty = 2) +
geom_point(alpha = .4) +
facet_wrap(~model) +
coord_fixed()