Methods for selecting variables in step functions
Tips for selecting columns in step functions.
When selecting variables or model terms in step functions,
dplyr-like tools are used. The selector functions
can choose variables based on their name, current role, data
type, or any combination of these. The selectors are passed as
any other argument to the step. If the variables are explicitly
named in the step function, this might look like:
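As a minimal sketch of such a call, assuming the built-in USArrests data (its four columns match the -Murder example used later on this page) and an illustrative value of num_comp = 3:

```r
library(recipes)

# Assumption: USArrests (shipped with R) stands in for the example data,
# and num_comp = 3 is an illustrative choice.
pca_rec <- recipe(~ ., data = USArrests) %>%
  step_pca(Murder, Assault, UrbanPop, Rape, num_comp = 3)

# The selectors are only resolved when the recipe is prepped.
pca_prepped <- prep(pca_rec)
pca_scores <- bake(pca_prepped, new_data = NULL)
names(pca_scores)
```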
The first four arguments indicate which variables should be
used in the PCA, while the last argument is specific to
step_pca() and sets the number of components.
These arguments are not evaluated until the
prep() function for the step is executed.
The dplyr-like syntax allows negative signs to exclude variables (e.g.
-Murder), and the set of selectors will be processed in order.
A leading exclusion in these arguments (e.g.
-Murder) has the effect of adding all variables to the list except the excluded variable(s), ignoring role information.
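Select helpers from the
tidyselect package can also be used, such as starts_with(), contains(), matches(), any_of(), and all_of().
Note that using

    recipe(Species ~ ., data = iris) %>%
      step_center(starts_with("Sepal"), -contains("Width"))

would only select Sepal.Length.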
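One way to confirm which columns a selection resolves to is to prep the recipe and inspect the step with tidy(). This is a small sketch rather than the only approach; the object names center_rec and selected are illustrative.

```r
library(recipes)

center_rec <- recipe(Species ~ ., data = iris) %>%
  step_center(starts_with("Sepal"), -contains("Width")) %>%
  prep()

# tidy() on a prepped step reports the columns the selectors resolved to.
selected <- tidy(center_rec, number = 1)$terms
print(selected)
```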
Columns of the design matrix that may not exist when the step
is coded can also be selected. For example, when using
step_pca(), the number of columns created by feature extraction
may not be known when subsequent steps are defined. In this
case, matches("^PC") will select all of the columns
whose names start with "PC" once those columns are created.
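A brief sketch of this pattern, assuming the USArrests data and an illustrative downstream step (step_center()) applied to the components:

```r
library(recipes)

# Assumption: USArrests and num_comp = 2 are illustrative choices.
late_rec <- recipe(~ ., data = USArrests) %>%
  step_pca(all_numeric(), num_comp = 2) %>%
  # The PC columns do not exist yet when this line is written;
  # the selector is resolved during prep(), after step_pca() has run.
  step_center(matches("^PC")) %>%
  prep()

pc_terms <- tidy(late_rec, number = 2)$terms
print(pc_terms)
```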
There are sets of recipes-specific functions that can be used to select
variables based on their role or type: has_role() and
has_type(). For convenience, there are also functions that are
more specific. The functions all_numeric() and all_nominal() select
based on type, with nominal variables including both character and factor;
the functions all_predictors() and all_outcomes() select based on role.
The functions all_numeric_predictors() and all_nominal_predictors()
select intersections of role and type. Any of these can be used in conjunction with
the previously described functions for selecting variables by name.
A selection like this:

    data(biomass)
    recipe(HHV ~ ., data = biomass) %>%
      step_center(all_numeric(), -all_outcomes())

is equivalent to:

    data(biomass)
    recipe(HHV ~ ., data = biomass) %>%
      step_center(all_numeric_predictors())

Both result in all the numeric predictors: carbon, hydrogen, oxygen, nitrogen, and sulfur.
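The equivalence can be checked by prepping both recipes and comparing the resolved terms. This sketch assumes the biomass data is loaded from the modeldata package, which may differ from how it is available in your installation.

```r
library(recipes)

# Assumption: biomass is provided by the modeldata package.
data(biomass, package = "modeldata")

rec_a <- recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  prep()

rec_b <- recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric_predictors()) %>%
  prep()

terms_a <- tidy(rec_a, number = 1)$terms
terms_b <- tidy(rec_b, number = 1)$terms
setequal(terms_a, terms_b)
```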
If a role for a variable has not been defined, it will never be selected using role-specific selectors.
Selectors can be used in
step_interact() in similar ways but
must be embedded in a model formula (as opposed to a sequence
of selectors). For example, the interaction specification could be
~ starts_with("Species"):Sepal.Width. This can be
useful if Species was converted to dummy variables
previously using step_dummy(). The implementation of
step_interact() is special, and is more restricted than
the other step functions. Only the selector functions from
recipes and tidyselect are allowed. User-defined selector functions
will not be recognized. Additionally, the tidyselect domain-specific
language is not recognized here, meaning that operators such as &, |, !, and -
will not work.
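A short sketch of a selector inside an interaction formula, assuming the iris data with Sepal.Length as an illustrative outcome:

```r
library(recipes)

int_rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  # Species must be converted to dummy variables first.
  step_dummy(Species) %>%
  # The selector is embedded in a formula rather than passed on its own.
  step_interact(~ starts_with("Species"):Sepal.Width) %>%
  prep()

int_cols <- names(bake(int_rec, new_data = NULL))
# Interaction columns use the default "_x_" separator in their names.
interaction_cols <- grep("_x_", int_cols, value = TRUE)
print(interaction_cols)
```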
Tips for saving recipes and filtering columns
When creating variable selections:
If you are using column filtering steps, such as
step_corr(), try to avoid hardcoding specific variable names in downstream steps in case those columns are removed by the filter. Instead, use dplyr::any_of() or dplyr::all_of().
dplyr::any_of() will be tolerant if a column has been removed.
dplyr::all_of() will fail unless all of the columns are present in the data.
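For both of these functions, if you are going to save the recipe as a binary object to use in another R session, try to avoid referring to a vector in your workspace, since that vector will not exist in the new session. Injecting the values with !! or !!! stores them in the recipe itself.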
    some_vars <- names(mtcars)[4:6]

    # No filter steps, OK for not saving the recipe
    rec_1 <-
      recipe(mpg ~ ., data = mtcars) %>%
      step_log(all_of(some_vars)) %>%
      prep()

    # No filter steps, saving the recipe
    rec_2 <-
      recipe(mpg ~ ., data = mtcars) %>%
      step_log(!!!some_vars) %>%
      prep()

    # This fails since `wt` is not in the data
    recipe(mpg ~ ., data = mtcars) %>%
      step_rm(wt) %>%
      step_log(!!!some_vars) %>%
      prep()

    ## Error in `step_log()`:
    ## Caused by error in `prep()` at recipes/R/recipe.R:437:8:
    ## ! Can't subset columns that don't exist.
    ## x Column `wt` doesn't exist.
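For contrast with the failing call, here is a sketch of the tolerant form. It combines any_of() with !! so that the names are injected into the recipe and a removed column is skipped rather than raising an error. The object names rec_3 and logged are illustrative.

```r
library(recipes)

some_vars <- names(mtcars)[4:6]  # hp, drat, wt

rec_3 <- recipe(mpg ~ ., data = mtcars) %>%
  step_rm(wt) %>%
  # any_of() skips `wt`, which was removed by the previous step;
  # !! injects the values so the recipe does not depend on `some_vars`.
  step_log(any_of(!!some_vars)) %>%
  prep()

logged <- tidy(rec_3, number = 2)$terms
print(logged)
```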