For a recipe with at least one preprocessing operation that has been trained by prep(), apply the computations to new data.
Usage
bake(object, ...)
# S3 method for class 'recipe'
bake(object, new_data, ..., composition = "tibble")
Arguments
- object
A trained object such as a recipe() with at least one preprocessing operation.
- ...
One or more selector functions to choose which variables will be returned by the function. See selections() for more details. If no selectors are given, the default is to use dplyr::everything().
- new_data
A data frame, tibble, or sparse matrix from the Matrix package to which the preprocessing will be applied. If NULL is given to new_data, the preprocessed training data will be returned (assuming that prep(retain = TRUE) was used). See sparse_data for more information about the use of sparse data.
- composition
Either "tibble", "matrix", "data.frame", or "dgCMatrix" for the format of the processed data set. Also, note that this argument should be called **after** any selectors and the selectors should only resolve to numeric columns if composition is set to "matrix" or "dgCMatrix". If the data contains sparse columns they will be preserved for "tibble" and "data.frame", and efficiently used for "dgCMatrix".
Value
A tibble, matrix, or sparse matrix that may have different columns than the original columns in new_data.
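As a minimal sketch of how selectors and composition interact, the following uses the built-in mtcars data and step_normalize(); both are choices made for this illustration and do not come from the examples on this page. The selector is passed before composition, and it resolves only to numeric columns, as a matrix result requires.

```r
library(recipes)

# Illustrative recipe on a built-in data set (an assumption for this sketch)
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Selectors come first; `composition` is named and comes after them
x_mat <- bake(
  rec,
  new_data = head(mtcars),
  all_numeric_predictors(),
  composition = "matrix"
)

is.matrix(x_mat)
dim(x_mat)
```

The outcome mpg is excluded by the selector, so the matrix holds only the processed predictors.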
Details
bake() takes a trained recipe and applies its operations to a data set to create a design matrix. If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually applying a recipe (see the example in recipe()).
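The workflow() approach mentioned above could look roughly like the sketch below. It assumes the parsnip and workflows packages are installed, and the mtcars data, the linear_reg() model, and the train/test split are illustrative choices rather than anything prescribed by this page. The workflow handles prep() and bake() internally when fitting and predicting.

```r
library(recipes)
library(parsnip)
library(workflows)

# Untrained recipe: the workflow trains it during fit()
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg())

# Hold out the first five rows purely for illustration
wf_fit <- fit(wf, data = mtcars[-(1:5), ])

# The recipe is applied to new_data automatically before predicting
preds <- predict(wf_fit, new_data = mtcars[1:5, ])
preds
```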
If the data set is not too large, time can be saved by using the retain = TRUE option of prep(). This stores the processed version of the training set. With this option set, bake(object, new_data = NULL) will return it for free.
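A small sketch of the retain = TRUE behaviour, again using mtcars and step_normalize() as stand-ins chosen for this illustration: the stored training set returned by new_data = NULL should match what you get by re-processing the same rows.

```r
library(recipes)

train <- mtcars[-(1:5), ]

rec <- recipe(mpg ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# Keep the processed training set inside the trained recipe
kept <- prep(rec, training = train, retain = TRUE)

from_store <- bake(kept, new_data = NULL)   # returned from the stored copy
recomputed <- bake(kept, new_data = train)  # steps applied again

# Compare one processed column from each route
all.equal(from_store$disp, recomputed$disp)
```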
Also, any steps with skip = TRUE will not be applied to the data when bake() is invoked with a data set in new_data. bake(object, new_data = NULL) will always have all of the steps applied.
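To make the effect of skip = TRUE concrete, here is a hedged sketch in which the outcome of mtcars is log-transformed by a skipped step. The data set and the choice of step_log() are assumptions made for illustration only.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(mpg, skip = TRUE) %>%
  prep()

# All steps, including the skipped one, were applied to the training set
train_out <- bake(rec, new_data = NULL)

# With a data set in new_data, the skipped step is not applied
new_out <- bake(rec, new_data = mtcars)

all.equal(train_out$mpg, log(mtcars$mpg))
all.equal(new_out$mpg, mtcars$mpg)
```

The two results therefore differ in mpg even though the input rows are the same.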
Examples
# recipes attaches dplyr, which provides mutate() and %>%
library(recipes)

data(ames, package = "modeldata")
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

# train the recipe on all rows except the first six
ames_rec <-
  recipe(Sale_Price ~ ., data = ames[-(1:6), ]) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 2) %>%
  step_zv(all_predictors()) %>%
  prep()
# return the training set (already embedded in ames_rec)
bake(ames_rec, new_data = NULL)
#> # A tibble: 2,924 × 259
#> Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area
#> <dbl> <int> <int> <int> <dbl>
#> 1 41 4920 2001 2001 0
#> 2 43 5005 1992 1992 0
#> 3 39 5389 1995 1996 0
#> 4 60 7500 1999 1999 0
#> 5 75 10000 1993 1994 0
#> 6 0 7980 1992 2007 0
#> 7 63 8402 1998 1998 0
#> 8 85 10176 1990 1990 0
#> 9 0 6820 1985 1985 0
#> 10 47 53504 2003 2003 603
#> # ℹ 2,914 more rows
#> # ℹ 254 more variables: BsmtFin_SF_1 <dbl>, BsmtFin_SF_2 <dbl>,
#> # Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, First_Flr_SF <int>,
#> # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
#> # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
#> # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
#> # Fireplaces <int>, Garage_Cars <dbl>, Garage_Area <dbl>, …
# apply processing to other data:
bake(ames_rec, new_data = head(ames))
#> # A tibble: 6 × 259
#> Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area
#> <dbl> <int> <int> <int> <dbl>
#> 1 141 31770 1960 1960 112
#> 2 80 11622 1961 1961 0
#> 3 81 14267 1958 1958 108
#> 4 93 11160 1968 1968 0
#> 5 74 13830 1997 1998 0
#> 6 78 9978 1998 1998 20
#> # ℹ 254 more variables: BsmtFin_SF_1 <dbl>, BsmtFin_SF_2 <dbl>,
#> # Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, First_Flr_SF <int>,
#> # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
#> # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
#> # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
#> # Fireplaces <int>, Garage_Cars <dbl>, Garage_Area <dbl>,
#> # Wood_Deck_SF <int>, Open_Porch_SF <int>, Enclosed_Porch <int>, …
# only return selected variables:
bake(ames_rec, new_data = head(ames), all_numeric_predictors())
#> # A tibble: 6 × 258
#> Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area
#> <dbl> <int> <int> <int> <dbl>
#> 1 141 31770 1960 1960 112
#> 2 80 11622 1961 1961 0
#> 3 81 14267 1958 1958 108
#> 4 93 11160 1968 1968 0
#> 5 74 13830 1997 1998 0
#> 6 78 9978 1998 1998 20
#> # ℹ 253 more variables: BsmtFin_SF_1 <dbl>, BsmtFin_SF_2 <dbl>,
#> # Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, First_Flr_SF <int>,
#> # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
#> # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
#> # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
#> # Fireplaces <int>, Garage_Cars <dbl>, Garage_Area <dbl>,
#> # Wood_Deck_SF <int>, Open_Porch_SF <int>, Enclosed_Porch <int>, …
bake(ames_rec, new_data = head(ames), starts_with(c("Longitude", "Latitude")))
#> # A tibble: 6 × 4
#> Longitude_ns_1 Longitude_ns_2 Latitude_ns_1 Latitude_ns_2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.570 -0.0141 0.472 0.394
#> 2 0.570 -0.0142 0.481 0.360
#> 3 0.569 -0.00893 0.484 0.348
#> 4 0.563 0.0212 0.496 0.301
#> 5 0.562 -0.212 0.405 0.634
#> 6 0.562 -0.212 0.407 0.630