For a recipe with at least one preprocessing operation, estimate the required parameters from a training set. The estimates can later be applied to other data sets.
prep(x, ...)

# S3 method for recipe
prep(
  x,
  training = NULL,
  fresh = FALSE,
  verbose = FALSE,
  retain = TRUE,
  log_changes = FALSE,
  strings_as_factors = TRUE,
  ...
)
Argument | Description |
---|---|
x | an object |
... | further arguments passed to or from other methods (not currently used). |
training | A data frame or tibble that will be used to estimate parameters for preprocessing. |
fresh | A logical indicating whether already trained operations should be re-trained. If |
verbose | A logical that controls whether progress is reported as operations are executed. |
retain | A logical: should the preprocessed training set be saved into the |
log_changes | A logical for printing a summary for each step regarding which (if any) columns were added or removed during training. |
strings_as_factors | A logical: should character columns be converted to factors? This affects the preprocessed training set (when |
A recipe whose step objects have been updated with the required quantities (e.g. parameter estimates, model objects, etc). Also, the term_info object is likely to be modified as the operations are executed.
Given a data set, this function estimates the quantities and statistics required by the operations in the recipe. prep() returns an updated recipe with the estimates.
Note that missing data are handled within the steps; there is no global na.rm option at the recipe level or in prep().
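As a minimal sketch of per-step missing data handling, the example below uses the na_rm argument of step_center() on a small made-up data frame. The data frame df and its values are assumptions for illustration only; other steps may expose different options, or none at all.

```r
library(recipes)

# Hypothetical toy data with one missing predictor value
df <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 50))

# Missing values are dealt with by the step itself, via its own na_rm argument
rec <- recipe(y ~ x, data = df) %>%
  step_center(x, na_rm = TRUE)

trained <- prep(rec, training = df)

# The stored mean is computed from the non-missing values only
x_mean <- unname(tidy(trained, number = 1)$value)
x_mean
```

Here the centering value is estimated from the three observed values of x, because the step was told to drop missing values when computing the mean.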
Also, if a recipe has been trained using prep() and then steps are added, prep() will only update the new operations. If fresh = TRUE, all of the operations will be (re)estimated.
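The difference between the two modes can be sketched as follows. This is an illustration under stated assumptions, not output from the package documentation: it uses the built-in mtcars data, relies on the default retain = TRUE so the first call can reuse the retained training set, and passes an arbitrary 16-row subset when re-estimating.

```r
library(recipes)

rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(disp)

trained <- prep(rec, training = mtcars)

# Add a new step to the already trained recipe
extended <- trained %>% step_scale(hp)

# Default (fresh = FALSE): only the new scaling step is estimated,
# using the training set retained inside the recipe
partial <- prep(extended)

# fresh = TRUE: every step is re-estimated on the supplied training set
subset_rows <- mtcars[1:16, ]
full <- prep(extended, training = subset_rows, fresh = TRUE)

# Stored centering means: unchanged in `partial`, recomputed in `full`
mean_partial <- unname(tidy(partial, number = 1)$value)
mean_full <- unname(tidy(full, number = 1)$value)
```

In this sketch mean_partial still reflects all of mtcars, while mean_full reflects only the subset passed with fresh = TRUE.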
As the steps are executed, the training set is updated. For example, if the first step is to center the data and the second is to scale the data, the step for scaling is given the centered data.
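A small sketch can make this sequential behavior visible. It uses a log transform followed by centering instead of centering followed by scaling, because centering does not change a column's standard deviation and the dependence would not show up in the stored estimates. The mtcars data set is an assumption chosen for illustration.

```r
library(recipes)

rec <- recipe(mpg ~ disp, data = mtcars) %>%
  step_log(disp) %>%     # step 1: natural log of disp
  step_center(disp)      # step 2: center the (already logged) column

trained <- prep(rec, training = mtcars)

# Mean stored by the centering step (step number 2)
center_mean <- unname(tidy(trained, number = 2)$value)

# It matches the mean of the logged column, not of the raw column
isTRUE(all.equal(center_mean, mean(log(mtcars$disp))))
```

The centering step stores the mean of log(disp), which indicates that it was trained on the output of the preceding step and not on the original column.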
Max Kuhn
library(recipes)
library(dplyr)

data(ames, package = "modeldata")

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

ames_rec <-
  recipe(
    Sale_Price ~ Longitude + Latitude + Neighborhood + Year_Built + Central_Air,
    data = ames
  ) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 5)

prep(ames_rec, verbose = TRUE)
#> oper 1 step other [training]
#> oper 2 step dummy [training]
#> oper 3 step interact [training]
#> oper 4 step ns [training]
#> The retained training set is ~ 0.49 Mb in memory.
#>
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          5
#>
#> Training data contained 2930 data points and no missing data.
#>
#> Operations:
#>
#> Collapsing factor levels for Neighborhood [trained]
#> Dummy variables from Neighborhood, Central_Air [trained]
#> Interactions with Central_Air_Y:Year_Built [trained]
#> Natural Splines on Longitude, Latitude [trained]

prep(ames_rec, log_changes = TRUE)
#> step_other (other_hiYO6): same number of columns
#>
#> step_dummy (dummy_JgwxH):
#>  new (9): Neighborhood_College_Creek, Neighborhood_Old_Town, ...
#>  removed (2): Neighborhood, Central_Air
#>
#> step_interact (interact_Fe2kG):
#>  new (1): Central_Air_Y_x_Year_Built
#>
#> step_ns (ns_LGGmr):
#>  new (10): Longitude_ns_1, Longitude_ns_2, Longitude_ns_3, ...
#>  removed (2): Longitude, Latitude
#>
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          5
#>
#> Training data contained 2930 data points and no missing data.
#>
#> Operations:
#>
#> Collapsing factor levels for Neighborhood [trained]
#> Dummy variables from Neighborhood, Central_Air [trained]
#> Interactions with Central_Air_Y:Year_Built [trained]
#> Natural Splines on Longitude, Latitude [trained]