One issue with different functions available in R that do the same thing is that they can have different interfaces and arguments. For example, to fit a random forest classification model, we might have:
```r
# From randomForest
rf_1 <- randomForest(x, y, mtry = 12, ntree = 2000, importance = TRUE)

# From ranger
rf_2 <- ranger(
  y ~ .,
  data = dat,
  mtry = 12,
  num.trees = 2000,
  importance = 'impurity'
)

# From sparklyr
rf_3 <- ml_random_forest(
  dat,
  intercept = FALSE,
  response = "y",
  features = names(dat)[names(dat) != "y"],
  col.sample.rate = 12,
  num.trees = 2000
)
```
Note that the model syntax is very different and that the argument names (and formats) are also different. This is a pain if you go between implementations.
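In this example, the type of model is a random forest, the mode is classification, and the computational engine is the R package (or other backend, such as Spark) that fits the model.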
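The idea of parsnip is to decouple the model specification from the implementation, so that users work with a general model function rather than calling `ranger::ranger` or other specific packages directly, and to harmonize argument names (e.g. `trees`) so that users can remember a single name. This will help across model types too, so that `trees` will be the same argument across random forest as well as boosting or bagging.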
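As a small illustration of the harmonized argument name, the sketch below builds a random forest and a boosted tree specification with the same `trees` argument. The object names `rf_spec` and `bt_spec` are made up for this example, and no model is fitted.

```r
library(parsnip)

# Two different model types, both using the same `trees` argument name
rf_spec <- rand_forest(trees = 500)
bt_spec <- boost_tree(trees = 500)

# Both specifications store the value under the same name
"trees" %in% names(rf_spec$args)
"trees" %in% names(bt_spec$args)
```

The engine-specific name (for example `num.trees` in ranger) only matters once an engine is chosen.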
Using the example above, the parsnip approach would be:
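A minimal sketch of that approach follows. It assumes the ranger package is installed and uses a simulated data frame `dat` with a factor outcome `y`, invented here only so the code runs on its own. The mode is set explicitly as a precaution rather than relying on it being inferred at fit time.

```r
library(parsnip)

# Simulated stand-in for `dat`: a two-class outcome and 12 numeric predictors
set.seed(1)
dat <- data.frame(
  y = factor(sample(c("a", "b"), 100, replace = TRUE)),
  matrix(rnorm(100 * 12), ncol = 12)
)

# Model type and main arguments, then the engine and its own arguments
rf_spec <- rand_forest(mtry = 12, trees = 2000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Fitting is a separate step from the specification
rf_fit <- fit(rf_spec, y ~ ., data = dat)
```

Here `importance = "impurity"` is an engine-specific option, so it is passed through `set_engine()` rather than to `rand_forest()`.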
The engine can be easily changed and the mode can be determined when `fit` is called. To use Spark, the change is simple:
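A sketch of the Spark variant is shown below. Only the specification is built here. Actually fitting it would additionally require the sparklyr package and a Spark connection with the data copied to it, which this example does not set up.

```r
library(parsnip)

# Same model type and arguments as before; only the engine changes
spark_spec <- rand_forest(mtry = 12, trees = 2000) %>%
  set_engine("spark") %>%
  set_mode("classification")

# Fitting is not run here: it needs a Spark data frame
# (e.g. one created via sparklyr), not a local data frame.
```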
To install it, use:
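The original command is not preserved here. Assuming the released version on CRAN is wanted, installation would look like this:

```r
# Install the released version of parsnip from CRAN (assumed source),
# skipping the download if the package is already available
if (!requireNamespace("parsnip", quietly = TRUE)) {
  install.packages("parsnip")
}
```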