step_tokenfilter creates a *specification* of a recipe step that will convert a list of tokens into a list where the tokens are filtered based on frequency.

step_tokenfilter(recipe, ..., role = NA, trained = FALSE,
  columns = NULL, max_times = Inf, min_times = 0,
  percentage = FALSE, max_tokens = Inf, res = NULL, skip = FALSE,
  id = rand_id("tokenfilter"))

# S3 method for step_tokenfilter
tidy(x, ...)

## Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.

...: One or more selector functions to choose variables. For step_tokenfilter, this indicates the variables to be encoded into a list column. See [recipes::selections()] for more details. For the tidy method, these are not currently used.

role: Not used by this step since no new variables are created.

trained: A logical to indicate if the recipe has been baked.

columns: A list of tibble results that define the encoding. This is NULL until the step is trained by [recipes::prep.recipe()].

max_times: An integer. Maximum number of times a word can appear before it is removed.

min_times: An integer. Minimum number of times a word can appear before it is removed.

percentage: A logical. Should max_times and min_times be interpreted as percentages instead of counts?

max_tokens: An integer. Only the top max_tokens tokens are kept after the filtering done by max_times and min_times. Defaults to Inf, meaning all words in training will be used.

res: The words that will be kept are stored here once this preprocessing step has been trained by [prep.recipe()].

skip: A logical. Should the step be skipped when the recipe is baked by [recipes::bake.recipe()]? While all operations are baked when [recipes::prep.recipe()] is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id: A character string that is unique to this step to identify it.

x: A step_tokenfilter object.

## Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

## Details

This step allows you to limit the tokens you are looking at by filtering on their occurrence in the corpus. You can exclude tokens that appear too many times or too few times in the data. The thresholds can be specified as counts using max_times and min_times, or as percentages by setting percentage to TRUE. In addition, you can keep only the max_tokens most frequently used tokens.
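The filtering described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: filter_tokens and docs are hypothetical names, the bounds are treated as inclusive, and with percentage = TRUE each count is divided by the total number of tokens.

```r
# Hypothetical helper, not part of textrecipes: keep tokens whose corpus
# frequency lies within [min_times, max_times], then keep the most frequent.
filter_tokens <- function(tokens, max_times = Inf, min_times = 0,
                          percentage = FALSE, max_tokens = Inf) {
  # Count every token across all documents.
  counts <- table(unlist(tokens))
  # Optionally switch from raw counts to proportions of all tokens.
  if (percentage) counts <- counts / sum(counts)
  # Inclusive bounds are an assumption of this sketch.
  kept <- counts[counts >= min_times & counts <= max_times]
  # Keep at most max_tokens of the most frequent remaining tokens.
  kept <- sort(kept, decreasing = TRUE)
  keep_n <- min(length(kept), max_tokens)
  vocabulary <- names(kept)[seq_len(keep_n)]
  # Drop tokens outside the vocabulary from each document.
  lapply(tokens, function(doc) doc[doc %in% vocabulary])
}

# Toy corpus: "a" appears 3 times, "b" twice, "c" and "d" once each.
docs <- list(
  c("a", "b", "a", "c"),
  c("a", "b", "d")
)

# Only tokens that appear exactly twice survive.
filtered <- filter_tokens(docs, min_times = 2, max_times = 2)
```

Here only "b" falls inside the count bounds, so every other token is dropped from both documents.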

It is advised to filter before using [step_tf] or [step_tfidf] to limit the number of variables created.
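As a rough sketch of that ordering, the recipe below filters tokens before computing term frequencies. It assumes the okc_text data used in the Examples is available in the installed version of the package. The thresholds min_times = 5 and max_tokens = 100 and the name okc_tf_rec are arbitrary choices for illustration.

```r
library(recipes)
library(textrecipes)

data(okc_text)

# Filter first so that step_tf() creates at most 100 term-frequency columns.
# The threshold values are illustrative, not recommendations.
okc_tf_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_tokenfilter(essay0, min_times = 5, max_tokens = 100) %>%
  step_tf(essay0)
```

The recipe is only a specification at this point. The token vocabulary is not determined until the recipe is trained with prep().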

## See also

[step_untokenize()]

## Examples

library(recipes)
library(textrecipes)

data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_tokenfilter(essay0)

okc_obj <- okc_rec %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj, essay0) %>%
  slice(1:2)
#> # A tibble: 2 x 1
#>   essay0
#>   <list>
#> 1 <chr [184]>
#> 2 <chr [24]>

juice(okc_obj) %>%
  slice(2) %>%
  pull(essay0)
#> [[1]]
#>  [1] "i'm"      "chill"    "and"      "steady"   "br"       "i'm"
#>  [7] "a"        "teacher"  "amp"      "musician" "br"       "i"
#> [13] "like"     "playing"  "outside"  "dislike"  "school"   "nights"
#> [19] "br"       "and"      "i'm"      "very"     "very"     "lucky"
#>

tidy(okc_rec, number = 2)
#> # A tibble: 1 x 3
#>   terms  value id
#>   <chr>  <int> <chr>
#> 1 essay0    NA tokenfilter_BxbCy

tidy(okc_obj, number = 2)
#> # A tibble: 1 x 3
#>   terms          value     id
#>   <S3: quosures> <list>    <chr>
#> 1 ~essay0        <int [1]> tokenfilter_BxbCy