Recipes can be different from their base R counterparts such as model.matrix. This vignette describes the different methods for encoding categorical predictors with special attention to interaction terms and contrasts.

Creating Dummy Variables

Let’s start, of course, with the iris data. This has four numeric columns and a single factor column with three levels: 'setosa', 'versicolor', and 'virginica'. Our initial recipe will have no outcome:
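The example code is not reproduced here, so the following is a minimal sketch of such a recipe, assuming the recipes package is installed. The object name iris_rec is only illustrative.

```r
library(recipes)

# A one-sided formula: no outcome, every column is treated as a predictor
iris_rec <- recipe(~ ., data = iris)

# One row per column, with its type and role
summary(iris_rec)
```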

A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.

The default approach is to create dummy variables using the “reference cell” parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns:
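A sketch of the default encoding, again assuming the recipes package is attached (object names are illustrative):

```r
library(recipes)

ref_cell <- recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  prep(training = iris)

# new_data = NULL returns the processed training set
ref_cell_names <- names(bake(ref_cell, new_data = NULL))
ref_cell_names
```

With three levels, two indicator columns (Species_versicolor and Species_virginica) are expected; 'setosa' is the reference level.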

Note that the column that was used to make the new columns (Species) is no longer there. See the section below on obtaining the entire set of C columns.

There are different types of contrasts that can be used for different types of factors. The defaults are:
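The current settings can be inspected with base R. In a fresh session they are typically contr.treatment for unordered factors and contr.poly for ordered ones, though a session that has changed the option will show something else.

```r
# Contrast functions currently in effect for unordered and ordered factors
param <- getOption("contrasts")
param
```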

Looking at ?contrast, there are other options. One alternative is the little-known Helmert contrast:

contr.helmert returns Helmert contrasts, which contrast the second level with the first, the third with the average of the first two, and so on.
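To make the description concrete, the contrast matrix for the three Species levels can be printed directly with base R:

```r
# Rows are factor levels, columns are the two numeric columns that replace Species
helmert_mat <- contr.helmert(levels(iris$Species))
helmert_mat
```

The first column compares 'versicolor' with 'setosa'; the second compares 'virginica' with the average of the first two. Each column sums to zero.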

To get this encoding, the global option for the contrasts can be changed and saved. step_dummy picks up on this and makes the correct calculations:
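A sketch of this workflow, assuming a version of recipes in which step_dummy reads the global option as described here (other versions may warn or offer a different mechanism). The previous option is saved and restored so the session is left unchanged.

```r
library(recipes)

# Switch unordered factors to Helmert contrasts and keep the old setting
old_opts <- options(
  contrasts = c(unordered = "contr.helmert", ordered = "contr.poly")
)

helmert_rec <- recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  prep(training = iris)
helmert_cols <- bake(helmert_rec, new_data = NULL, starts_with("Species"))

# Restore the previous global setting
options(old_opts)

head(helmert_cols)
```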

Note that the column names do not reference a specific level of the species variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn’t make sense.

Interactions with Dummy Variables

Creating interactions with recipes requires the use of a model formula, such as
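For instance, a sketch of an interaction between two numeric columns (assuming the recipes package is attached; the object name is illustrative):

```r
library(recipes)

num_int <- recipe(~ ., data = iris) %>%
  step_interact(~ Sepal.Width:Sepal.Length) %>%
  prep(training = iris)

# The product of the two columns is appended as a new column
num_int_names <- names(bake(num_int, new_data = NULL))
num_int_names
```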

In R model formulae, using a * between two variables expands to a*b = a + b + a:b so that the main effects are included. In step_interact, you can use *, but only the interaction terms are recorded as columns that need to be created.

One thing that recipes does differently than base R is to construct the design matrix in sequential iterations. This is relevant when thinking about interactions between continuous and categorical predictors.

For example, if you were to use the standard formula interface, the creation of the dummy variables happens at the same time as the interactions are created:
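With base R this looks like the following:

```r
# The factor is expanded and the interactions are built in a single pass
mm <- model.matrix(~ Species * Sepal.Length, data = iris)
colnames(mm)
```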

With recipes, you create them sequentially. This raises an issue: do I have to type out all of the interaction effects by their specific names when using dummy variables?
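Typing every term out by name would look something like this sketch (the dummy column names assume the default step_dummy naming):

```r
library(recipes)

by_name <- recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  # Each dummy column has to be spelled out by hand
  step_interact(~ Species_versicolor:Sepal.Length +
                  Species_virginica:Sepal.Length) %>%
  prep(training = iris)

by_name_cols <- names(bake(by_name, new_data = NULL))
by_name_cols
```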

Not only is this a pain, but it may not be obvious which dummy variables are available (especially when step_other is used).

The solution is to use a selector:
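A minimal sketch with a selector in place of the explicit names (object names are illustrative):

```r
library(recipes)

by_selector <- recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  # The selector is resolved against the columns that exist at this point
  step_interact(~ starts_with("Species"):Sepal.Length) %>%
  prep(training = iris)

by_selector_cols <- names(bake(by_selector, new_data = NULL))
by_selector_cols
```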

What happens here is that starts_with("Species") is executed on the data as they exist after the previous steps have been applied, so the dummy variable columns are present. The results of the selector are then translated into an additive expression of the selected columns. In this case, starts_with("Species") becomes (Species_versicolor + Species_virginica).

The entire interaction formula is shown here:
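The printed output is not reproduced here. One way to inspect what was recorded, sketched under the assumption that the installed version of recipes reports the resolved terms when a prepared recipe is printed, is:

```r
library(recipes)

printed_rec <- recipe(~ ., data = iris) %>%
  step_dummy(Species) %>%
  step_interact(~ starts_with("Species"):Sepal.Length) %>%
  prep(training = iris)

# Printing a prepared recipe lists each step and what it operated on;
# the exact layout depends on the installed version
print(printed_rec)
```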

For interactions between multiple sets of dummy variables, the formula could include multiple selectors (e.g. starts_with("x_"):starts_with("y_")).
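As a sketch, using a small hypothetical data frame (toy, with factors x and y and a numeric column z, none of which come from the original text):

```r
library(recipes)

# Hypothetical data: two factors and one numeric column
toy <- data.frame(
  x = factor(rep(c("a", "b"), times = 6)),
  y = factor(rep(c("c", "d", "e"), each = 4)),
  z = seq_len(12)
)

two_sets <- recipe(~ ., data = toy) %>%
  step_dummy(x, y) %>%
  # Every x dummy is crossed with every y dummy
  step_interact(~ starts_with("x_"):starts_with("y_")) %>%
  prep(training = toy)

two_sets_cols <- names(bake(two_sets, new_data = NULL))
two_sets_cols
```

Here x contributes one dummy column and y contributes two, so two interaction columns are created.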

Getting All of the Indicator Variables

As mentioned above, if there are C levels of the factor, there will be C - 1 dummy variables created. You might want to get all of them back.

Historically, C - 1 are used so that a linear dependency is avoided in the design matrix; all C dummy variables would add up row-wise to the intercept column, and the matrix inverse needed for linear regression can’t be computed. The technical term for a design matrix like this is “less than full rank”.
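The dependency can be checked numerically with base R. This sketch binds an intercept column to all three Species indicators and compares the number of columns with the matrix rank:

```r
# Intercept column plus all three Species indicators
full_set <- cbind(intercept = 1, model.matrix(~ Species - 1, data = iris))

ncol(full_set)     # four columns
qr(full_set)$rank  # rank three: one column is redundant

# The indicator columns sum to the intercept column in every row
all(rowSums(full_set[, -1]) == full_set[, "intercept"])
```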

There are models (e.g. glmnet and others) that can avoid this issue so you might want to get all of the columns. To do this, step_dummy has an option called one_hot that will make sure that all C are produced:
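A sketch of the option in use, with the default contrasts in effect (object names are illustrative):

```r
library(recipes)

one_hot_rec <- recipe(~ ., data = iris) %>%
  step_dummy(Species, one_hot = TRUE) %>%
  prep(training = iris)

# One indicator column per level, including the first
one_hot_cols <- bake(one_hot_rec, new_data = NULL, starts_with("Species"))
names(one_hot_cols)
```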

The option is named that way since this is what computer scientists call “one-hot encoding”.

Warning! (again)

The one_hot option is meant to give you the full set of indicators and, when you use the typical contrast function, it does. It might do some seemingly weird (but legitimate) things when used with other contrasts:
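A sketch that combines one_hot = TRUE with the Helmert setting, assuming a version of recipes that reads the global option. The resulting columns are not asserted here because they may differ between versions.

```r
library(recipes)

# Temporarily use Helmert contrasts for unordered factors
old_opts <- options(
  contrasts = c(unordered = "contr.helmert", ordered = "contr.poly")
)

helmert_one_hot <- recipe(~ ., data = iris) %>%
  step_dummy(Species, one_hot = TRUE) %>%
  prep(training = iris)
helmert_one_hot_cols <- bake(
  helmert_one_hot, new_data = NULL, starts_with("Species")
)

# Restore the previous global setting
options(old_opts)

head(helmert_one_hot_cols)
```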

Since this contrast doesn’t make sense using all C columns, it reverts back to the default encoding.

Novel Levels

When a recipe is used with new samples, some factors may have acquired new levels that were not present when prep was run. If step_dummy encounters this situation, a warning is issued (“There are new levels in a factor”) and the indicator variables that correspond to the factor are assigned missing values.
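A small sketch of this behavior with hypothetical data (train_dat and new_dat are made up for illustration). The bake call is expected to emit the warning described above.

```r
library(recipes)

# Hypothetical training data with levels "a" and "b" only
train_dat <- data.frame(x = factor(c("a", "b", "a", "b")), y = 1:4)
# New data containing the unseen level "c"
new_dat <- data.frame(x = factor(c("a", "c")), y = 5:6)

dummy_rec <- recipe(~ ., data = train_dat) %>%
  step_dummy(x) %>%
  prep(training = train_dat)

# The row with the unseen level gets a missing value in the indicator column
baked_new <- bake(dummy_rec, new_data = new_dat)
baked_new
```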

One way around this is to use step_other. This step can convert infrequently occurring levels to a new category (that defaults to “other”). It can also be used to convert new factor levels to “other”.
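A sketch with hypothetical data, in which the rare training level "c" is pooled into “other” and the unseen level "d" is expected to land in the same category at bake time. The threshold value is an arbitrary choice for this example.

```r
library(recipes)

# Hypothetical training data: "c" occurs in only 5% of rows
train_dat <- data.frame(
  x = factor(c(rep("a", 10), rep("b", 9), "c")),
  y = 1:20
)
# New data containing the unseen level "d"
new_dat <- data.frame(x = factor(c("a", "d")), y = c(21L, 22L))

other_rec <- recipe(~ ., data = train_dat) %>%
  step_other(x, threshold = 0.1) %>%  # pool levels below 10% into "other"
  step_dummy(x) %>%
  prep(training = train_dat)

baked_other <- bake(other_rec, new_data = new_dat)
baked_other
```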

Also, step_integer has functionality similar to scikit-learn’s LabelEncoder and encodes new values as zero.
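A sketch with hypothetical data, using the default settings of step_integer:

```r
library(recipes)

# Hypothetical training data with levels "a" and "b"
train_dat <- data.frame(x = factor(c("a", "b", "a", "b")), y = 1:4)
# New data containing the unseen level "c"
new_dat <- data.frame(x = factor(c("a", "c")), y = 5:6)

int_rec <- recipe(~ ., data = train_dat) %>%
  step_integer(x) %>%
  prep(training = train_dat)

# Known values map to their integer codes; the unseen value maps to zero
baked_int <- bake(int_rec, new_data = new_dat)
baked_int$x
```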

The embed package can also handle novel factor levels within a recipe. step_embed and step_tfembed assign a common numeric score to novel levels.