Developer functions for creating recipes steps — developer

This page provides a comprehensive list of the exported functions for creating recipes steps and guidance on how to use them.

Creating steps

add_step() and add_check() are required when creating a new step. The output of add_step() should be the return value of all steps and should have the following format:

step_example <- function(recipe,
                         ...,
                         role = NA,
                         trained = FALSE,
                         skip = FALSE,
                         id = rand_id("example")) {
  add_step(
    recipe,
    step_example_new(
      terms = enquos(...),
      role = role,
      trained = trained,
      skip = skip,
      id = id
    )
  )
}

rand_id() should be used in the arguments of step_example() to specify the argument, as we see in the above example.

recipes_pkg_check() should be used in step_example() functions together with required_pkgs() to alert users that certain other packages are required. The standard way of using this function is the following format:

recipes_pkg_check(required_pkgs.step_example())

step() and check() are used within the step_*_new() function that you use in your new step. It will be used in the following way:

step_example_new <- function(terms, role, trained, skip, id) {
  step(
    subclass = "example",
    terms = terms,
    role = role,
    trained = trained,
    skip = skip,
    id = id
  )
}

recipes_eval_select() is used within prep.step_*() functions, and are used to turn the terms object into a character vector of the selected variables.

It will most likely be used like so:

col_names <- recipes_eval_select(x$terms, training, info)

check_type() can be used within prep.step_*() functions to check that the variables passed in are the right types. We recommend that you use the types argument as it offers higher flexibility and it matches the types defined by .get_data_types(). When using types we find it better to be explicit, e.g. writing types = c("double", "integer") instead of types = "numeric", as it produces cleaner error messages.

It should be used like so:

check_type(training[, col_names], types = c("double", "integer"))

check_new_data() should be used within bake.step_*(). This function is used to make check that the required columns are present in the data. It should be one of the first lines inside the function.

It should be used like so:

check_new_data(names(object$columns), object, new_data)

check_name() should be used in bake.step_*() functions for steps that add new columns to the data set. The function throws an error if the column names already exist in the data set. It should be called before adding the new columns to the data set.

get_keep_original_cols() and remove_original_cols() are used within steps with the keep_original_cols argument. get_keep_original_cols() is used in prep.step_*() functions for steps that were created before the keep_original_cols argument was added, and acts as a way to throw a warning that the user should regenerate the recipe. remove_original_cols() should be used in bake.step_*() functions to remove the original columns. It is worth noting that remove_original_cols() can remove multiple columns at once and when possible should be put outside for loops.

new_data <- remove_original_cols(new_data, object, names_of_original_cols)

recipes_remove_cols() should be used in prep.step_*() functions, and is used to remove columns from the data set, either by using the object$removals field or by using the col_names argument.

get_case_weights() and are_weights_used() are functions that help you extract case weights and help determine if they are used or not within the step. They will typically be used within the prep.step_*() functions if the step in question supports case weights.

print_step() is used inside print.step_*() functions. This function is replacing the internally deprecated printer() function.

sel2char() is mostly used within tidy.step_*() functions to turn selections into character vectors.

names0() creates a series of num names with a common prefix. The names are numbered with leading zeros (e.g. prefix01-prefix10 instead of prefix1-prefix10). This is useful for many types of steps that produce new columns.

Interacting with recipe objects

detect_step() returns a logical indicator to determine if a given step or check is included in a recipe.

fully_trained() returns a logical indicator if the recipe is fully trained. The function is_trained() can be used to check in any individual steps are trained or not.

.get_data_types() is an S3 method that is used for selections. This method can be extended to work with column types not supported by recipes.

recipes_extension_check() is recommended to be used by package authors to make sure that all steps have prep.step_*(), bake.step_*(), print.step_*(), tidy.step_*(), and required_pkgs.step_*() methods. It should be used as a test, preferably like this:

test_that("recipes_extension_check", {
  expect_snapshot(
    recipes::recipes_extension_check(
      pkg = "pkgname"
    )
  )
})