step_unknown() creates a specification of a recipe step that will assign missing values in a factor to a new level, "unknown" by default.

step_unknown(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  new_level = "unknown",
  objects = NULL,
  skip = FALSE,
  id = rand_id("unknown")
)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose variables for this step. See selections() for more details.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the quantities for preprocessing have been estimated.

new_level

A single character value that will be used as the new factor level assigned to missing values.

objects

A list of objects that contain the information on factor levels that will be determined by prep.recipe().

skip

A logical. Should the step be skipped when the recipe is baked by bake.recipe()? While all operations are baked when prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

Value

An updated version of recipe with the new step added to the sequence of any existing operations.

Details

The selected variables are adjusted to have a new level (given by new_level) that is placed in the last position.

Note that if the original columns are character, they will be converted to factors by this step.
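The two points above can be checked on a small data set. The following is a minimal sketch, assuming a hypothetical toy data frame (toy, colour, and size are illustrative names, not part of the package) with one factor column and one character column, each containing a missing value:

```r
library(recipes)

# Hypothetical toy data: a factor column and a character column, each with one NA
toy <- data.frame(
  colour = factor(c("red", NA, "blue")),
  size = c("small", "large", NA),
  stringsAsFactors = FALSE
)

toy_rec <- recipe(~ colour + size, data = toy) %>%
  step_unknown(colour, size) %>%
  prep()

# new_data = NULL returns the processed training data
baked <- bake(toy_rec, new_data = NULL)

# The default new level, "unknown", should appear after the existing levels
levels(baked$colour)
levels(baked$size)

# The character column should now be a factor with no missing values left
is.factor(baked$size)
anyNA(baked$size)
```

Because the new level is appended rather than sorted in, the order of the original levels is left as it was.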

If new_level is already present in the data given to prep(), an error is thrown.
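As a rough illustration of this check, the sketch below uses a hypothetical toy column (colour) that already contains the value "unknown", and wraps prep() in tryCatch() so the failure can be inspected instead of stopping the script. The exact wording of the error message is not assumed here.

```r
library(recipes)

# Hypothetical toy data where the default new_level already exists as a level
toy <- data.frame(colour = factor(c("red", "unknown", NA)))

outcome <- tryCatch(
  {
    recipe(~ colour, data = toy) %>%
      step_unknown(colour) %>%
      prep()
    "prep succeeded"
  },
  # Reached when prep() signals an error for the clashing level
  error = function(e) "prep failed"
)

outcome
```

One way around the clash is to pick a value for new_level that does not occur in the column, as the examples further down do with "unknown diet".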

When you tidy() this step, a tibble is returned with columns terms (the columns that will be affected), value (the factor level that is used for the new value), and id.

Examples

library(recipes)
library(modeldata)
data(okc)

rec <-
  recipe(~ diet + location, data = okc) %>%
  step_unknown(diet, new_level = "unknown diet") %>%
  step_unknown(location, new_level = "unknown location") %>%
  prep()

table(bake(rec, new_data = NULL) %>% pull(diet),
      okc %>% pull(diet),
      useNA = "always") %>%
  as.data.frame() %>%
  dplyr::filter(Freq > 0)
#>                   Var1                Var2  Freq
#> 1             anything            anything  6174
#> 2                halal               halal    11
#> 3               kosher              kosher    11
#> 4      mostly anything     mostly anything 16562
#> 5         mostly halal        mostly halal    48
#> 6        mostly kosher       mostly kosher    86
#> 7         mostly other        mostly other  1004
#> 8         mostly vegan        mostly vegan   335
#> 9    mostly vegetarian   mostly vegetarian  3438
#> 10               other               other   331
#> 11   strictly anything   strictly anything  5107
#> 12      strictly halal      strictly halal    18
#> 13     strictly kosher     strictly kosher    18
#> 14      strictly other      strictly other   450
#> 15      strictly vegan      strictly vegan   227
#> 16 strictly vegetarian strictly vegetarian   874
#> 17               vegan               vegan   136
#> 18          vegetarian          vegetarian   665
#> 19        unknown diet                <NA> 24360

tidy(rec, number = 1)
#> # A tibble: 1 × 3
#>   terms value        id           
#>   <chr> <chr>        <chr>        
#> 1 diet  unknown diet unknown_ZB2lW