Discretize Numeric VariablesSource:
discretize() converts a numeric vector into a factor with
bins having approximately the same number of data points (based
on a training set).
discretize(x, ...) # S3 method for default discretize(x, ...) # S3 method for numeric discretize( x, cuts = 4, labels = NULL, prefix = "bin", keep_na = TRUE, infs = TRUE, min_unique = 10, ... ) # S3 method for discretize predict(object, new_data, ...)
A numeric vector
Options to pass to
stats::quantile()that should not include
An integer defining how many cuts to make of the data.
A character vector defining the factor levels that will be in the new factor (from smallest to largest). This should have length
cuts+1and should not include a level for missing (see
A single parameter value to be used as a prefix for the factor levels (e.g.
bin2, ...). If the string is not a valid R name, it is coerced to one. If
prefix = NULLthen the factor levels will be labelled according to the output of
A logical for whether a factor level should be created to identify missing values in
keep_nais set to
na.rm = TRUEis used when calling
A logical indicating whether the smallest and largest cut point should be infinite.
An integer defining a sample size line of dignity for the binning. If (the number of unique values)
/(cuts+1)is less than
min_unique, no discretization takes place.
An object of class
A new numeric object to be binned.
discretize returns an object of class
predict.discretize returns a factor
discretize estimates the cut points from
x using percentiles. For example, if
cuts = 3, the
function estimates the quartiles of
x and uses these as
the cut points. If
cuts = 2, the bins are defined as
being above or below the median of
predict method can then be used to turn numeric
vectors into factor vectors.
keep_na = TRUE, a suffix of "_missing" is used as a
factor level (see the examples below).
infs = FALSE and a new value is greater than the
largest value of
x, a missing value will result.
data(biomass, package = "modeldata") biomass_tr <- biomass[biomass$dataset == "Training", ] biomass_te <- biomass[biomass$dataset == "Testing", ] median(biomass_tr$carbon) #>  47.1 discretize(biomass_tr$carbon, cuts = 2) #> Bins: 3 (includes missing category) #> Breaks: -Inf, 47.1, Inf discretize(biomass_tr$carbon, cuts = 2, infs = FALSE) #> Bins: 3 (includes missing category) #> Breaks: 14.61, 47.1, 97.18 discretize(biomass_tr$carbon, cuts = 2, infs = FALSE, keep_na = FALSE) #> Bins: 2 #> Breaks: 14.61, 47.1, 97.18 discretize(biomass_tr$carbon, cuts = 2, prefix = "maybe a bad idea to bin") #> Warning: The prefix 'maybe a bad idea to bin' is not a valid R name. It has been changed to 'maybe.a.bad.idea.to.bin'. #> Bins: 3 (includes missing category) #> Breaks: -Inf, 47.1, Inf carbon_binned <- discretize(biomass_tr$carbon) table(predict(carbon_binned, biomass_tr$carbon)) #> #> bin1 bin2 bin3 bin4 #> 114 115 113 114 carbon_no_infs <- discretize(biomass_tr$carbon, infs = FALSE) predict(carbon_no_infs, c(50, 100)) #>  bin4 <NA> #> Levels: bin1 bin2 bin3 bin4