`step_tokenize` creates a *specification* of a recipe step that will convert a character predictor into a list of tokens.

step_tokenize(recipe, ..., role = NA, trained = FALSE,
  columns = NULL, options = list(), token = "words",
  custom_token = NULL, skip = FALSE, id = rand_id("tokenize"))

# S3 method for step_tokenize
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose variables. For `step_tokenize`, this indicates the variables to be encoded into a list column. See [recipes::selections()] for more details. For the `tidy` method, these are not currently used.
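As a minimal sketch of selector use (the `starts_with()` pattern is illustrative only, assuming several `essay*` columns as in `okc_text`):

recipe(~ ., data = okc_text) %>%
  step_tokenize(starts_with("essay"))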

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the recipe step has been trained, i.e. if the quantities used for preprocessing have been estimated by [recipes::prep.recipe()].

columns

A list of tibble results that define the encoding. This is `NULL` until the step is trained by [recipes::prep.recipe()].

options

A list of options passed to the tokenizer.
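For example, since the default word tokenizer is [tokenizers::tokenize_words()], its arguments can be forwarded through `options`; a minimal sketch (the argument values are illustrative, not recommendations):

recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, options = list(lowercase = FALSE, strip_punct = FALSE))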

token

Unit for tokenizing. Built-in options from the [tokenizers] package are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), "ptb" (Penn Treebank), and "word_stems".

custom_token

A user-supplied tokenizer. Supplying this argument overrides the `token` argument. The function must take a character vector as input and return a list of character vectors as output.
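For instance, base R's `strsplit()` already has the required shape, taking a character vector and returning a list of character vectors; a minimal sketch (the comma delimiter is illustrative only):

recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, custom_token = function(x) strsplit(x, ",", fixed = TRUE))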

skip

A logical. Should the step be skipped when the recipe is baked by [recipes::bake.recipe()]? While all operations are baked when [recipes::prep.recipe()] is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using `skip = TRUE` as it may affect the computations for subsequent operations.

id

A character string that is unique to this step, used to identify it.

x

A `step_tokenize` object.

Value

An updated version of `recipe` with the new step added to the sequence of existing steps (if any).

Details

Tokenization is the act of splitting a character string into smaller parts to be analysed further. This step uses the `tokenizers` package, which includes heuristics for splitting text into tokens such as words and paragraphs, among others. `textrecipes` keeps the tokens in a list-column, and other steps do their tasks on those list-columns before transforming them back to numeric variables.
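As an illustration of the underlying behaviour, [tokenizers::tokenize_words()] takes a character vector and returns a list with one character vector of tokens per input string:

tokenizers::tokenize_words(c("This is a sentence.", "And another one."))
#> [[1]]
#> [1] "this"     "is"       "a"        "sentence"
#> 
#> [[2]]
#> [1] "and"     "another" "one"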

Working with `textrecipes` will always start by calling `step_tokenize`, followed by modifying and filtering steps.
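A sketch of such a pipeline, where `step_stopwords`, `step_tokenfilter`, and `step_tf` stand in for whichever modifying and filtering steps a given analysis needs:

recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_stopwords(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0)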

See also

[step_untokenize]

Examples

library(recipes)
library(textrecipes)
data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0)

okc_obj <- okc_rec %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj, essay0) %>%
  slice(1:2)
#> # A tibble: 2 x 1
#>   essay0     
#>   <list>     
#> 1 <chr [184]>
#> 2 <chr [24]> 
juice(okc_obj) %>% slice(2) %>% pull(essay0)
#> [[1]]
#>  [1] "i'm"      "chill"    "and"      "steady"   "br"       "i'm"     
#>  [7] "a"        "teacher"  "amp"      "musician" "br"       "i"       
#> [13] "like"     "playing"  "outside"  "dislike"  "school"   "nights"  
#> [19] "br"       "and"      "i'm"      "very"     "very"     "lucky"   
#> 
tidy(okc_rec, number = 1)
#> # A tibble: 1 x 3
#>   terms  value id            
#>   <chr>  <chr> <chr>         
#> 1 essay0 <NA>  tokenize_bxI1x
tidy(okc_obj, number = 1)
#> # A tibble: 1 x 3
#>   terms          value id            
#>   <S3: quosures> <chr> <chr>         
#> 1 ~essay0        words tokenize_bxI1x
okc_obj_chars <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "characters") %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj_chars) %>%
  slice(2) %>%
  pull(essay0)
#> [[1]]
#>   [1] "i" "m" "c" "h" "i" "l" "l" "a" "n" "d" "s" "t" "e" "a" "d" "y" "<" "b"
#>  [19] "r" ">" "i" "m" "a" "t" "e" "a" "c" "h" "e" "r" "a" "m" "p" "m" "u" "s"
#>  [37] "i" "c" "i" "a" "n" "<" "b" "r" ">" "i" "l" "i" "k" "e" "p" "l" "a" "y"
#>  [55] "i" "n" "g" "o" "u" "t" "s" "i" "d" "e" "d" "i" "s" "l" "i" "k" "e" "s"
#>  [73] "c" "h" "o" "o" "l" "n" "i" "g" "h" "t" "s" "<" "b" "r" ">" "a" "n" "d"
#>  [91] "i" "m" "v" "e" "r" "y" "v" "e" "r" "y" "l" "u" "c" "k" "y"
#> 
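
The `token` argument accepts the other tokenizers units in the same way; as a final sketch, `"ngrams"` with an `n` forwarded through `options` (the value of `n` is illustrative only):

okc_obj_ngram <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "ngrams", options = list(n = 2)) %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj_ngram, essay0) %>%
  slice(1:2)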