The full tidyselect DSL is now allowed inside recipes step_*() functions. This includes operators such as ! and the new where() function. Additionally, the restriction preventing user-defined selectors from being used has been lifted (#572).
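As a minimal sketch of what this allows, the snippet below uses where() together with the long-standing `-` operator inside a step. The choice of mtcars and step_center() is only for illustration and is not taken from the release notes.

```r
library(recipes)

# Center every numeric column except the outcome. where() comes from the
# tidyselect DSL; all_outcomes() is a recipes selector.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(where(is.numeric), -all_outcomes())

# Train the recipe and return the processed training set.
centered <- bake(prep(rec, training = mtcars), new_data = NULL)
```

The outcome column mpg should be left untouched, while the numeric predictors should end up with a mean of approximately zero.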
If steps that drop/add variables are skipped when baking the test set, the resulting column ordering of the baked test set will now be relative to the original recipe specification rather than relative to the baked training set. This is often more intuitive.
More infrastructure work to make parallel processing on Windows less buggy with PSOCK clusters.
fully_trained() now returns FALSE when an unprepped recipe is used.
prep() gained an option to print a summary of which columns were added and/or removed during execution.
To reduce confusion between bake() and juice(), the latter is superseded in favor of using bake(object, new_data = NULL). The new_data argument now has no default, so a NULL value must be explicitly used in order to emulate the results of juice(). juice() will remain in the package (and be used internally), but most communication and training will use bake(object, new_data = NULL). (#543)
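The equivalence described here can be sketched as follows. The data set and the centering step are arbitrary choices for illustration; the point is only that both calls are expected to return the processed training set.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors())

# prep() retains the processed training set by default.
trained <- prep(rec, training = mtcars)

# Preferred spelling: new_data has no default, so NULL is passed explicitly.
from_bake <- bake(trained, new_data = NULL)

# Superseded spelling, kept in the package.
from_juice <- juice(trained)

all.equal(from_bake, from_juice)
```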
Tim Zhou added a step to use linear models for imputation (#555).
step_pls() was changed so that it uses the Bioconductor mixOmics package. Objects created with previous versions of recipes can still use bake(). With the current version, categorical outcomes can be used, but multivariate models are no longer supported. Also, the new method allows for sparse results.
Avoided partial matching on seq() arguments in internal functions.
Improved error messaging, for example when a user tries to prep() a tuneable recipe.
step_zv() now handles NA values so that variables with zero variance plus NA values are removed.
The tune package can now use recipes with check operations (but also requires tune >= 0.1.0.9000).
The tidy method for step_pca() now has an option for returning the variance statistics for each component.
Although recipes does not directly depend on dials, it has several S3 methods for generics in dials. Version 0.0.5 of dials added stricter validation for these methods, so changes were required for
step_cut() enables you to create a factor from a numeric variable based on provided breaks (contributed by Edwin Thoen).
yj_transform() to avoid conflicts.
The imputation steps do not change the data type being imputed now. Previously, if the data were integer, the data would be changed to numeric (for some step types). The change is breaking since the underlying data of imputed values are now saved as a list instead of a vector (for some step types).
The data sets were moved to the new
When using a selector that returns no columns, bake() will now return a tibble with as many rows as the original template data or the new_data respectively. This is more consistent with how selectors work in dplyr (#411).
Code was added to explicitly register tunable methods when recipes is loaded. This is required because of changes occurring in R 4.0.
check_class() checks if a variable is of the designated class. Class is either learned from the train set or provided in the check. (contributed by Edwin Thoen)
Release driven by changes in tidyr (v 1.0.0).
wdth argument has been renamed to
The use of varying() will be deprecated in favor of an upcoming function tune(). No changes are needed in this version, but subsequent versions will work with
bake if variable contains values that were not observed in the train set (contributed by Edwin Thoen)
When no outcomes are in the recipe, using juice(object, all_outcomes()) and bake(object, new_data, all_outcomes()) will return a tibble with zero rows and zero columns instead of failing (#298). This will also occur when the selectors select no columns.
step_discretize() has arguments moved out of
options too; the main arguments are now
num_breaks (instead of
min_unique. Again, deprecation messages are issued with the old argument structure.
Methods were added for a future generic called tunable(). This outlines which parameters in a step can/could be tuned.
Release driven by changes in
Since 2018, a warning has been issued when the wrong argument was used in bake(recipe, newdata). The deprecation period is over and new_data is officially required.
Previously, if step_other() did not collapse any levels, it would still add an “other” level to the factor. This would lump new factor levels into “other” when data were baked (as step_novel() does). This no longer occurs since it was inconsistent with ?step_other, which said that “If no pooling is done the data are unmodified”.
If the threshold argument of step_other is greater than one, then it specifies the minimum sample size before the levels of the factor are collapsed into the “other” category (#289).
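To illustrate the count-based threshold, here is a small sketch with a made-up factor. The data frame and the threshold of 5 are assumptions chosen so that two rare levels fall below the minimum sample size.

```r
library(recipes)

# Hypothetical data: levels "a" (10 rows), "b" (8), "c" (2), "d" (1).
df <- data.frame(
  x = factor(c(rep("a", 10), rep("b", 8), rep("c", 2), "d")),
  y = seq_len(21)
)

# A threshold above one is read as a minimum count rather than a proportion.
rec <- recipe(y ~ x, data = df) %>%
  step_other(x, threshold = 5)

pooled <- bake(prep(rec, training = df), new_data = NULL)
levels(pooled$x)
```

With these counts, "c" and "d" are expected to be pooled into "other" while "a" and "b" are kept.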
Because of changes by CRAN, step_nnmf() only works on versions of R >= 3.6.0 due to dependency issues.
Small release driven by changes in sample() in the current r-devel.
A new vignette discussing roles has been added.
To provide infrastructure for finalizing varying parameters, an update() method for recipe steps has been added. This allows users to alter information in steps that have not yet been trained.
step_interact will no longer fail if an interaction uses a column that has previously been filtered from the data. A warning is issued when this happens and no interaction terms will be created.
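A minimal sketch of update() on an untrained step follows. The use of step_pca() and its num_comp argument is an illustrative choice, and accessing the step through rec$steps relies on the recipe's internal list structure.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_pca(all_predictors(), num_comp = 2)

# Replace the untrained step with a copy whose num_comp has been changed.
rec$steps[[1]] <- update(rec$steps[[1]], num_comp = 3)

rec$steps[[1]]$num_comp
```

Because the step has not been prepped yet, the new value is what prep() will use later.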
step_corr was made more fault tolerant for cases where the data contain a zero-variance column or columns with missing values.
Set the embedded environment to NULL in
prep.step_dummy to reduce the file size of serialized recipe class objects when using
step_dummy now returns the original variable and the levels of the future dummy variables.
NA roles of existing columns (#296).
Several argument names were changed to be consistent with other tidymodels packages (e.g. dials) and the general tidyverse naming conventions.
step_knnimpute was changed to
step_isomap had the number of neighbors promoted to a main argument called
nbagg out of the options and into a main argument
step_ns has degrees of freedom promoted to a main argument with name
degree promoted to a main argument.
juice and other functions has
new_data. For this version only, using newdata will only result in a warning.
prep and a few steps had
All steps gain an id field that will be used in the future to reference other steps.
retain option to
prep is now defaulted to
verbose = TRUE, the approximate size of the data set is printed. #207
step_integer converts data to ordered integers similar to LabelEncoder (#123 and #185).
step_geodist can be used to calculate the distance between geocodes and a single reference location.
step_nnmf computes the non-negative matrix factorization for data.
prepper was moved to
step_string2factor will now accept factors and leave them as-is.
step_knnimpute now excludes missing data in the variable to be imputed from the nearest-neighbor calculation. Previously, this would have resulted in some missing data not being imputed (i.e., another missing value was returned).
step_dummy now produces a warning (instead of failing) when non-factor columns are selected. Only factor columns are used; no conversion is done for character data (issue #186).
dummy_names gained a separator argument (issue #183).
seed arguments for more control over randomness.
broom is no longer used to get the tidy generic. These are now contained in the
bake if variable range in new data is outside the range that was learned from the train set (contributed by Edwin Thoen)
step_lag can lag variables in the data set (contributed by Alex Hayes).
step_naomit removes rows with missing data for specific columns (contributed by Alex Hayes).
step_rollimpute can be used to impute data in a sequence or series by estimating their values within a moving window.
step_pls can conduct supervised feature extraction for predictors.
step_log gained an
step_log gained a signed argument (contributed by Edwin Thoen).
The internal functions
printer have been exported to enable other packages to contain steps.
When training new steps after some steps have been previously trained, the
retain = TRUE option should be set on previous invocations of
one_hot = TRUE option. Thanks to Davis Vaughan.
contrast option was removed. The step uses the global option for contrasts.
step_other will now convert novel levels of the factor to the “other” level.
step_bin2factor now has an option to choose how the values are translated to the levels (contributed by Michael Levy).
juice can now export basic data frames.
okc data were updated with two additional columns.
Edwin Thoen suggested adding validation checks for certain data characteristics. This fed into the existing notion of expanding recipes beyond steps (see the non-step steps project). A new set of operations, called checks, can now be used. These should throw an informative error when the check conditions are not met and return the existing data otherwise.
Steps now have a skip option that will not apply preprocessing when bake is used. See the article on skipping steps for more information.
check_missing will validate that none of the specified variables contain missing data.
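As a rough sketch of how a check behaves, the snippet below trains a recipe on complete data and then bakes new data that contains a missing value. The data frames are made up, and the error is caught with tryCatch() only so that the example runs to completion.

```r
library(recipes)

# Hypothetical training data without missing values.
train <- data.frame(x = c(1, 2, 3), y = c(1, 0, 1))
# Hypothetical new data with a missing value in the checked column.
new <- data.frame(x = c(1, NA), y = c(0, 1))

rec <- recipe(y ~ x, data = train) %>%
  check_missing(x)

trained <- prep(rec, training = train)

# The check is expected to throw an error when x contains NA.
result <- tryCatch(
  bake(trained, new_data = new),
  error = function(e) "check failed"
)
```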
detect_step can be used to check if a recipe contains a particular preprocessing operation.
step_num2factor can be used to convert numeric data (especially integers) to factors.
step_novel adds a new factor level to nominal variables that will be used when new data contain a level that did not exist when the recipe was prepared.
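The behavior of step_novel can be sketched with a tiny made-up example in which the new data contain a level, "z", that was absent during training. The example assumes the default name for the added level, "new".

```r
library(recipes)

# Hypothetical training data with levels "a" and "b".
train <- data.frame(x = factor(c("a", "b", "a")), y = 1:3)
# Hypothetical new data with an unseen level "z".
new <- data.frame(x = factor(c("a", "z")), y = 1:2)

rec <- recipe(y ~ x, data = train) %>%
  step_novel(x)

baked <- bake(prep(rec, training = train), new_data = new)

# The unseen level is mapped to the extra level added by the step.
as.character(baked$x)
```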
step_profile can be used to generate design matrix grids for prediction profile plots of additive models where one variable is varied over a grid and all of the others are fixed at a single value.
step_upsample can be used to change the number of rows in the data based on the frequency distributions of a factor variable in the training set. By default, this operation is only applied to the training set; bake ignores this operation.
step_naomit drops rows when specified columns contain
NA, similar to
step_lag allows for the creation of lagged predictor columns.
bake was changed from
prep is now defaulted to
step_dummy was fixed so that the correct binary variables are generated regardless of the levels or values of the incoming factor. Also, step_dummy now requires factor inputs.
step_dummy also has a new default naming function that works better for factors. However, there is an extra argument (ordinal) now to the functions that can be passed to
step_interact now allows for selectors (e.g. starts_with("prefix")) to be used in the interaction formula.
dplyr::one_of was added to the list of selectors.
step_bs adds B-spline basis functions.
step_unorder converts ordered factors to unordered factors.
step_count counts the number of instances that a pattern exists in a string.
step_factor2string can be used to move between encodings.
step_lowerimpute is for numeric data where the values cannot be measured below a specific value. For these cases, random uniform values are used for the truncated values.
tidy methods were added for recipes and many (but not all) steps.
In bake.recipe, the argument newdata is now without a default.
juice can now save the final processed data set in sparse format. Note that, as the steps are processed, a non-sparse data frame is used to store the results.
step_lincomb removes variables involved in linear combinations to resolve them.
step_regex applies a regular expression to a character or factor vector to create dummy variables.
recipe objects was changed so that pipes can be used to create the recipe with a formula.
role argument in favor of a general set of selectors. If no selector is used, all the predictors are returned.