RFormula

RFormula selects columns specified by an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-’. The basic operators are:

  • ~ separate target and terms
  • + concat terms, “+ 0” means removing intercept
  • - remove a term, “- 1” means removing intercept
  • : interaction (multiplication for numeric values, or binarized categorical values)
  • . all columns except target

Suppose a and b are double columns. The following simple examples illustrate the effect of RFormula:

  • y ~ a + b means model y ~ w0 + w1 * a + w2 * b where w0 is the intercept and w1, w2 are coefficients.
  • y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b where w1, w2, w3 are coefficients.

RFormula produces a vector column of features and a double or string column of label. As when formulas are used in R for linear regression, numeric columns are cast to doubles. String input columns are first transformed with StringIndexer, using the ordering determined by stringOrderType, and the last category after ordering is dropped. The resulting indices are then one-hot encoded.
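As a rough illustration of the second formula above, the sketch below fits y ~ a + b + a:b - 1 on a small made-up DataFrame. The data values, the variable names, and the application name are assumptions for illustration only. Each feature vector should hold a, b, and the product a * b, in that order. The snippet creates a local SparkSession so it can run outside spark-shell; inside the shell, getOrCreate() returns the existing session.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RFormulaInteractionSketch") // hypothetical name
  .getOrCreate()

// Hypothetical toy data: y is the target, a and b are double columns.
val df = spark.createDataFrame(Seq(
  (1.0, 2.0, 3.0),
  (0.0, 4.0, 5.0)
)).toDF("y", "a", "b")

// a:b adds the interaction term; "- 1" removes the intercept from the model.
val interactionFormula = new RFormula().setFormula("y ~ a + b + a:b - 1")

// Collect each feature vector as a plain Seq[Double] for easy inspection.
val interactionFeatures = interactionFormula.fit(df).transform(df)
  .select("features")
  .collect()
  .map(_.getAs[Vector](0).toArray.toSeq)

interactionFeatures.foreach(println)
```

For the row with a = 2.0 and b = 3.0, the third entry of the vector is expected to be 6.0.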

Suppose a string feature column contains the values {'b', 'a', 'b', 'a', 'c', 'b'}. Setting stringOrderType controls the encoding as follows:

stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
----------------|---------------------------------------|----------------------------------
'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')
'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')
'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a')
'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')
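The dropped category for each ordering can be checked with a short sketch. It assumes the option is exposed in the Scala API through the setter setStringIndexerOrderType, and it uses a made-up numeric label column y alongside the string column s. The dropped category is the one whose one-hot encoding is the all-zero vector.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RFormulaOrderTypeSketch") // hypothetical name
  .getOrCreate()

// String feature s holds the values from the text; the label values in y are made up.
val df = spark.createDataFrame(Seq(
  ("b", 1.0), ("a", 0.0), ("b", 1.0), ("a", 0.0), ("c", 1.0), ("b", 0.0)
)).toDF("s", "y")

val orderTypes = Seq("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc")

// For each ordering, find the category that is encoded as the all-zero vector.
val dropped: Map[String, String] = orderTypes.map { orderType =>
  val model = new RFormula()
    .setFormula("y ~ s")
    .setStringIndexerOrderType(orderType)
    .fit(df)
  val droppedCategory = model.transform(df)
    .select("s", "features")
    .collect()
    .collectFirst {
      case row if row.getAs[Vector]("features").numNonzeros == 0 => row.getString(0)
    }
    .get
  orderType -> droppedCategory
}.toMap

dropped.foreach { case (orderType, category) => println(s"$orderType drops '$category'") }
```

The formula keeps the intercept on purpose, so that one category of s is dropped as the table describes.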

If the label column is of type string, it will be first transformed to double with StringIndexer using frequencyDesc ordering. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.

Note: The ordering option stringOrderType is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in StringIndexer.
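A minimal sketch of this behavior, assuming a made-up DataFrame with a string target column named outcome: even with the feature ordering set to alphabetDesc, the label should still be indexed by descending frequency, so the more frequent value "no" should map to 0.0 and "yes" to 1.0. The column names and data are hypothetical.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RFormulaStringLabelSketch") // hypothetical name
  .getOrCreate()

// Hypothetical toy data: "no" appears twice, "yes" once.
val df = spark.createDataFrame(Seq(
  ("no", 18), ("yes", 12), ("no", 15)
)).toDF("outcome", "hour")

// alphabetDesc is set deliberately; it applies to string feature columns only.
val labelFormula = new RFormula()
  .setFormula("outcome ~ hour")
  .setStringIndexerOrderType("alphabetDesc")

// Map each original string label to the double label produced by RFormula.
val labelOf: Map[String, Double] = labelFormula.fit(df).transform(df)
  .select("outcome", "label")
  .collect()
  .map(row => row.getString(0) -> row.getDouble(1))
  .toMap

println(labelOf)
```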

Examples

Assume that we have a DataFrame with the columns id, country, hour, and clicked:

id | country | hour | clicked
---|---------|------|--------
7  | "US"    | 18   | 1.0
8  | "CA"    | 12   | 0.0
9  | "NZ"    | 15   | 0.0

If we use RFormula with a formula string of clicked ~ country + hour, which indicates that we want to predict clicked based on country and hour, after transformation we should get the following DataFrame:

id | country | hour | clicked | features         | label
---|---------|------|---------|------------------|------
7  | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
8  | "CA"    | 12   | 0.0     | [1.0, 0.0, 12.0] | 0.0
9  | "NZ"    | 15   | 0.0     | [0.0, 1.0, 15.0] | 0.0

import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

/*
Output:
+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0,18.0]|  1.0|
|[1.0,0.0,12.0]|  0.0|
|[0.0,1.0,15.0]|  0.0|
+--------------+-----+

*/
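The same data can also illustrate the ‘.’ and ‘-’ operators. The sketch below assumes that the formula clicked ~ . - id expands to every column except the target and then removes id, leaving country and hour as features. The variable names and the application name are hypothetical.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RFormulaDotSketch") // hypothetical name
  .getOrCreate()

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

// "." stands for all columns except the target; "- id" then removes the id term.
val dotFormula = new RFormula().setFormula("clicked ~ . - id")

val dotOutput = dotFormula.fit(dataset).transform(dataset)

// Two one-hot slots for country (one category dropped) plus one slot for hour.
val featureSizes = dotOutput.select("features").collect().map(_.getAs[Vector](0).size)

dotOutput.select("features", "label").show()
```

Under these assumptions, each feature vector should have three entries, matching the layout produced by clicked ~ country + hour.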
