RFormula
RFormula selects columns specified by an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-’. The basic operators are:
~ separate target and terms
+ concat terms, “+ 0” means removing intercept
- remove a term, “- 1” means removing intercept
: interaction (multiplication for numeric values, or binarized categorical values)
. all columns except target
Suppose a and b are double columns. The following simple examples illustrate the effect of RFormula:
y ~ a + b means model y ~ w0 + w1 * a + w2 * b where w0 is the intercept and w1, w2 are coefficients.
y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b where w1, w2, w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. As when formulas are used in R for linear regression, numeric columns are cast to doubles. String input columns are first transformed with StringIndexer, using the ordering determined by stringOrderType; the last category after ordering is dropped, and the resulting indices are then one-hot encoded.
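As a small worked illustration of the second formula, the sketch below builds the feature row that y ~ a + b + a:b - 1 implies for numeric a and b, namely [a, b, a * b], and applies coefficients without an intercept. The function names and the sample weights are hypothetical and chosen only for illustration; this is plain Python, not Spark code.

```python
def expand_terms(a, b):
    """Feature row for `y ~ a + b + a:b - 1` with numeric a and b."""
    a, b = float(a), float(b)  # numeric columns are cast to doubles
    # a:b is an interaction; for numeric columns it is their product.
    return [a, b, a * b]


def predict(features, weights):
    """Linear prediction with no intercept term (the `- 1` in the formula)."""
    return sum(w * x for w, x in zip(weights, features))


row = expand_terms(2, 3)
y_hat = predict(row, [0.5, 1.0, 2.0])  # hypothetical weights w1, w2, w3
print(row, y_hat)
```

With a = 2 and b = 3 the row is [2.0, 3.0, 6.0], and the hypothetical weights give 0.5 * 2 + 1.0 * 3 + 2.0 * 6 = 16.0.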
Suppose a string feature column contains the values {'b', 'a', 'b', 'a', 'c', 'b'}. Setting stringOrderType controls the encoding as follows:
stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula |
'frequencyDesc' | most frequent category ('b') | least frequent category ('c') |
'frequencyAsc' | least frequent category ('c') | most frequent category ('b') |
'alphabetDesc' | last alphabetical category ('c') | first alphabetical category ('a') |
'alphabetAsc' | first alphabetical category ('a') | last alphabetical category ('c') |
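The table can be reproduced with a short plain-Python sketch that orders the categories and reports which one receives index 0 and which one is dropped. This is not Spark's implementation: order_categories is a hypothetical helper, and breaking frequency ties alphabetically is an assumption of the sketch (the sample data has no ties).

```python
from collections import Counter


def order_categories(values, string_order_type="frequencyDesc"):
    """Return the distinct categories in index order (position 0 gets index 0)."""
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Tie-break on the category name: an assumption of this sketch.
        return sorted(counts, key=lambda c: (-counts[c], c))
    if string_order_type == "frequencyAsc":
        return sorted(counts, key=lambda c: (counts[c], c))
    if string_order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    if string_order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError(f"unknown stringOrderType: {string_order_type}")


values = ["b", "a", "b", "a", "c", "b"]
for order_type in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    ordered = order_categories(values, order_type)
    # The first entry is mapped to index 0; the last entry is the dropped category.
    print(order_type, "index 0:", ordered[0], "dropped:", ordered[-1])
```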
If the label column is of type string, it will be first transformed to double with StringIndexer using frequencyDesc ordering. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
Note: The ordering option stringOrderType is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in StringIndexer.
Examples
Assume that we have a DataFrame with the columns id, country, hour, and clicked:
id | country | hour | clicked |
7 | "US" | 18 | 1.0 |
8 | "CA" | 12 | 0.0 |
9 | "NZ" | 15 | 0.0 |
If we use RFormula with a formula string of clicked ~ country + hour, which indicates that we want to predict clicked based on country and hour, after transformation we should get the following DataFrame:
id | country | hour | clicked | features | label |
7 | "US" | 18 | 1.0 | [0.0, 0.0, 18.0] | 1.0 |
8 | "CA" | 12 | 0.0 | [0.0, 1.0, 12.0] | 0.0 |
9 | "NZ" | 15 | 0.0 | [1.0, 0.0, 15.0] | 0.0 |
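To show how the features column above is assembled, here is a minimal plain-Python sketch that one-hot encodes country with the last category dropped and appends hour as a double. All three countries occur exactly once, so frequency alone does not fix the index order; the order NZ, CA, US used below is read off the feature vectors in the table and is an assumption. The helper names are hypothetical, and this is not Spark code.

```python
def one_hot_drop_last(value, ordered_categories):
    """One-hot encode `value`, omitting the last category in index order."""
    kept = ordered_categories[:-1]
    return [1.0 if value == c else 0.0 for c in kept]


def transform(rows, country_order):
    """Return (features, label) pairs for the formula clicked ~ country + hour."""
    out = []
    for row in rows:
        features = one_hot_drop_last(row["country"], country_order)
        features.append(float(row["hour"]))  # numeric column cast to double
        out.append((features, float(row["clicked"])))
    return out


# Assumed index order, inferred from the table: NZ -> 0, CA -> 1, US dropped.
country_order = ["NZ", "CA", "US"]

rows = [
    {"id": 7, "country": "US", "hour": 18, "clicked": 1.0},
    {"id": 8, "country": "CA", "hour": 12, "clicked": 0.0},
    {"id": 9, "country": "NZ", "hour": 15, "clicked": 0.0},
]

for row, (features, label) in zip(rows, transform(rows, country_order)):
    print(row["id"], features, label)
```

Because US is the dropped category, its row carries zeros in both country positions, which matches the first row of the table.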