Regression.Rmd
The Ames housing data will be used to to demonstrate how regression models can be made using parsnip
. We’ll create the data set and create a simple training/test set split:
library(AmesHousing)
ames <- make_ames()
library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 0.0.1.9000 ──
#> ✔ broom 0.5.0.9001 ✔ purrr 0.2.5
#> ✔ dials 0.0.1.9000 ✔ recipes 0.1.3.9002
#> ✔ dplyr 0.7.7 ✔ rsample 0.0.2
#> ✔ infer 0.3.1 ✔ tibble 1.4.2
#> ✔ probably 0.0.0.9000 ✔ yardstick 0.0.1.9000
#> ── Conflicts ──────────────────────────────────── tidymodels_conflicts() ──
#> ✖ probably::as.factor() masks base::as.factor()
#> ✖ probably::as.ordered() masks base::as.ordered()
#> ✖ dplyr::combine() masks randomForest::combine()
#> ✖ purrr::discard() masks scales::discard()
#> ✖ rsample::fill() masks tidyr::fill()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ ggplot2::margin() masks randomForest::margin()
#> ✖ rsample::prepper() masks recipes::prepper()
#> ✖ recipes::step() masks stats::step()
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", p = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
We’ll start by fitting a random forest model to a small set of parameters. Let’s say that the model would include predictors: Longitude
, Latitude
, Lot_Area
, Neighborhood
, and Year_Sold
. A simple random forest model can be specified via
library(parsnip)
rf_defaults <- rand_forest(mode = "regression")
rf_defaults
#> Random Forest Model Specification (regression)
The model will be fit with the ranger
package. Since we didn’t add any extra arguments to fit
, many of the arguments will be set to their defaults from the specific function that is used by ranger::ranger
. The help pages for the model function describes the changes to the default parameters that are made and the translate
function can also be used.
parsnip
gives two different interfaces to the models: the formula and non-formula interfaces. Let’s start with the non-formula interface:
preds <- c("Longitude", "Latitude", "Lot_Area", "Neighborhood", "Year_Sold")
rf_xy_fit <-
rf_defaults %>%
set_engine("ranger") %>%
fit_xy(
x = ames_train[, preds],
y = log10(ames_train$Sale_Price)
)
rf_xy_fit
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 2199
#> Number of independent variables: 5
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.00866
#> R squared (OOB): 0.727
The non-formula interface doesn’t do anything to the predictors before giving it to the underlying model function. This particular model does not require indicator variables to be create prior to the model (note that the output shows “Number of independent variables: 5”).
For regression models, the basic predict
method can be used and returns a tibble with a column named .pred
:
test_results <- ames_test %>%
select(Sale_Price) %>%
mutate(Sale_Price = log10(Sale_Price)) %>%
bind_cols(
predict(rf_xy_fit, new_data = ames_test[, preds])
)
test_results
#> # A tibble: 731 x 2
#> Sale_Price .pred
#> <dbl> <dbl>
#> 1 5.02 5.20
#> 2 5.24 5.24
#> 3 5.39 5.26
#> 4 5.33 5.42
#> 5 5.27 5.25
#> 6 5.26 5.28
#> 7 5.73 5.56
#> 8 5.18 5.19
#> 9 5.06 5.12
#> 10 4.94 5.00
#> # ... with 721 more rows
# summarize performance
test_results %>% metrics(truth = Sale_Price, estimate = .pred)
#> # A tibble: 3 x 2
#> .metric .estimate
#> <chr> <dbl>
#> 1 rmse 0.0919
#> 2 rsq 0.719
#> 3 mae 0.0639
Note that:
fit
(perhaps using the recipes
package).Now, for illustration, let’s use the formula method using some new parameter values:
rand_forest(mode = "regression", mtry = 3, trees = 1000) %>%
set_engine("ranger") %>%
fit(
log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
)
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, mtry = ~3, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 1000
#> Sample size: 2199
#> Number of independent variables: 5
#> Mtry: 3
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.00858
#> R squared (OOB): 0.73
Suppose that there was some feature in the randomForest
package that we’d like to evaluate. To do so, the only part of the syntaxt that needs to change is the set_engine
argument:
rand_forest(mode = "regression", mtry = 3, trees = 1000) %>%
set_engine("randomForest") %>%
fit(
log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
)
#> parsnip model object
#>
#>
#> Call:
#> randomForest(x = as.data.frame(x), y = y, ntree = ~1000, mtry = ~3)
#> Type of random forest: regression
#> Number of trees: 1000
#> No. of variables tried at each split: 3
#>
#> Mean of squared residuals: 0.0128
#> % Var explained: 59.8
Look at the formula code that was printed out, one function uses the argument name ntree
and the other uses num.trees
. parsnip
doesn’t require you to know the specific names of the main arguments.
Now suppose that we want to modify the value of mtry
based on the number of predictors in the data. Usually, the default value would be floor(sqrt(num_predictors))
. To use a pure bagging model would require an mtry
value equal to the total number of parameters. There may be cases where you may not know how many predictors are going to be present (perhaps due to the generation of indicator variables or a variable filter) so that might be difficult to know exactly.
When the model it being fit by parsnip
, data descriptors are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using rand_forest
), the values of the arguments that you give it are immediately evaluated… unless you delay them. To delay the evaluation of any argument, you can used rlang::expr
to make an expression.
Two relevant descriptors for what we are about to do are:
.preds()
: the number of predictor variables in the data set that are associated with the predictors prior to dummy variable creation..cols()
: the number of predictor columns after dummy variables (or other encodings) are created.Since ranger
won’t create indicator values, .preds()
would be appropriate for using mtry
for a bagging model.
For example, let’s use an expression with the .preds()
descriptor to fit a bagging model:
rand_forest(mode = "regression", mtry = .preds(), trees = 1000) %>%
set_engine("ranger") %>%
fit(
log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
)
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, mtry = ~.preds(), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 1000
#> Sample size: 2199
#> Number of independent variables: 5
#> Mtry: 5
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.00876
#> R squared (OOB): 0.724
A linear model might work here too. The linear_reg
model can be used. To use regularization/penalization, there are two engines that can do that here: the glmnet
and sparklyr
packages. The former will be used here and it only implements the non-formula method. parsnip
will allow either to be used though.
When regularization is used, the predictors should first be centered and scaled before given to the model. The formula method won’t do that so some other methods will be required. We’ll use recipes
package for that (more information here).
norm_recipe <- recipe(
Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
data = ames_train
) %>%
step_dummy(all_nominal()) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_log(Sale_Price, base = 10) %>%
# estimate the means and standard deviations
prep(training = ames_train, retain = TRUE)
# Now let's fit the model using the processed version of the data
glmn_fit <-
linear_reg(penalty = 0.001, mixture = 0.5) %>%
set_engine("glmnet") %>%
fit(Sale_Price ~ ., data = juice(norm_recipe))
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:tidyr':
#>
#> expand
#> Loading required package: foreach
#>
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#>
#> accumulate, when
#> Loaded glmnet 2.0-16
glmn_fit
#> parsnip model object
#>
#>
#> Call: glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0.5, lambda = ~0.001)
#>
#> Df %Dev Lambda
#> [1,] 29 0.608 0.001
If penalty
were not specified, all of the lambda
values would be computed.
To get the predictions for this specific value of lambda
(aka penalty
):
# First, get the processed version of the test set predictors:
test_normalized <- bake(norm_recipe, newdata = ames_test, all_predictors())
test_results <-
test_results %>%
rename(`random forest` = .pred) %>%
bind_cols(
predict(glmn_fit, new_data = test_normalized) %>%
rename(glmnet = .pred)
)
test_results
#> # A tibble: 731 x 3
#> Sale_Price `random forest` glmnet
#> <dbl> <dbl> <dbl>
#> 1 5.02 5.20 5.16
#> 2 5.24 5.24 5.17
#> 3 5.39 5.26 5.16
#> 4 5.33 5.42 5.45
#> 5 5.27 5.25 5.25
#> 6 5.26 5.28 5.25
#> 7 5.73 5.56 5.60
#> 8 5.18 5.19 5.18
#> 9 5.06 5.12 5.16
#> 10 4.94 5.00 5.02
#> # ... with 721 more rows
test_results %>% metrics(truth = Sale_Price, estimate = glmnet)
#> # A tibble: 3 x 2
#> .metric .estimate
#> <chr> <dbl>
#> 1 rmse 0.112
#> 2 rsq 0.584
#> 3 mae 0.0807
test_results %>%
gather(model, prediction, -Sale_Price) %>%
ggplot(aes(x = prediction, y = Sale_Price)) +
geom_abline(col = "green", lty = 2) +
geom_point(alpha = .4) +
facet_wrap(~model) +
coord_fixed()