# RFormula

RFormula selects columns specified by an R model formula. Currently we support a limited subset of the R operators, including `~`, `.`, `:`, `+`, and `-`. The basic operators are:

* `~` separates target and terms
* `+` concatenates terms; "+ 0" means removing the intercept
* `-` removes a term; "- 1" means removing the intercept
* `:` interaction (multiplication for numeric values, or binarized categorical values)
* `.` all columns except the target

Suppose a and b are double columns. The following simple examples illustrate the effect of RFormula:

* `y ~ a + b` means the model `y ~ w0 + w1 * a + w2 * b`, where w0 is the intercept and w1, w2 are coefficients.
* `y ~ a + b + a:b - 1` means the model `y ~ w1 * a + w2 * b + w3 * a * b`, where w1, w2, w3 are coefficients.

RFormula produces a vector column of features and a double or string column of label. As when formulas are used in R for linear regression, numeric columns are cast to doubles. String input columns are first transformed with StringIndexer, using the ordering determined by stringOrderType, and the last category after ordering is dropped; the resulting doubles are then one-hot encoded.
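The interaction formula can be tried on numeric columns with a minimal sketch like the one below. It assumes a `spark-shell` session where `spark` is already defined; the column values and the names `numeric`, `interaction`, and `interacted` are made up for illustration. Under the semantics described above, each feature vector would be expected to hold a, b, and the product a * b.

```scala
import org.apache.spark.ml.feature.RFormula

// Hypothetical toy data: y is the target, a and b are double columns.
val numeric = spark.createDataFrame(Seq(
  (1.0, 2.0, 3.0),
  (0.0, 4.0, 5.0)
)).toDF("y", "a", "b")

// Main effects a and b, their interaction a:b, and no intercept (- 1).
val interaction = new RFormula()
  .setFormula("y ~ a + b + a:b - 1")

// With no explicit column names, the defaults "features" and "label" are used.
val interacted = interaction.fit(numeric).transform(numeric)
interacted.select("a", "b", "features", "label").show(false)
```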

Suppose a string feature column contains the values {'b', 'a', 'b', 'a', 'c', 'b'}. The stringOrderType setting controls the encoding:

| stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula      |
| --------------- | ------------------------------------- | --------------------------------- |
| 'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')     |
| 'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')      |
| 'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a') |
| 'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')  |
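As a rough sketch of how one row of this table could be selected in code: the ordering is set on RFormula through `setStringIndexerOrderType` (the text above refers to the same setting as stringOrderType). The example assumes a `spark-shell` session with `spark` defined; the column names `s` and `y`, the label values, and the names `categories`, `orderedFormula`, and `ordered` are invented for illustration.

```scala
import org.apache.spark.ml.feature.RFormula

// Hypothetical toy data: the string feature s holds the values from the
// text above; the label y is arbitrary.
val categories = spark.createDataFrame(Seq(
  ("b", 1.0), ("a", 0.0), ("b", 1.0), ("a", 0.0), ("c", 1.0), ("b", 0.0)
)).toDF("s", "y")

// Request reverse alphabetical indexing of string feature columns.
// Per the table, 'c' is then mapped to index 0 and 'a' is the dropped category.
val orderedFormula = new RFormula()
  .setFormula("y ~ s")
  .setStringIndexerOrderType("alphabetDesc")

val ordered = orderedFormula.fit(categories).transform(categories)
ordered.select("s", "features").show(false)
```

Swapping in 'frequencyDesc', 'frequencyAsc', or 'alphabetAsc' should change which category ends up as the all-zeros vector, following the last column of the table.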

If the label column is of type string, it will be first transformed to double with StringIndexer using frequencyDesc ordering. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.

Note: The ordering option stringOrderType is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in StringIndexer.
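The label handling can be illustrated with a small sketch, again assuming a `spark-shell` session with `spark` defined. The string label values "yes" and "no" and the names `labeled`, `labelFormula`, and `indexed` are hypothetical. Because "no" is the more frequent value here, the descending frequency rule described above would be expected to map it to 0.0.

```scala
import org.apache.spark.ml.feature.RFormula

// Hypothetical toy data with a string-typed label column "clicked".
val labeled = spark.createDataFrame(Seq(
  (18, "yes"),
  (12, "no"),
  (15, "no")
)).toDF("hour", "clicked")

// The string label is indexed into a double column named "label".
val labelFormula = new RFormula()
  .setFormula("clicked ~ hour")
  .setLabelCol("label")

val indexed = labelFormula.fit(labeled).transform(labeled)
indexed.select("clicked", "label").show()
```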

## Examples

Assume that we have a DataFrame with the columns id, country, hour, and clicked:

| id | country | hour | clicked |
| -- | ------- | ---- | ------- |
| 7  | "US"    | 18   | 1.0     |
| 8  | "CA"    | 12   | 0.0     |
| 9  | "NZ"    | 15   | 0.0     |

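If we use RFormula with the formula string `clicked ~ country + hour`, which indicates that we want to predict clicked based on country and hour, the transformation should produce a DataFrame like the following. Each country appears exactly once, so the one-hot positions depend on how StringIndexer orders equally frequent values; the output recorded in the code below assigns them differently from this table: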

| id | country | hour | clicked | features          | label |
| -- | ------- | ---- | ------- | ----------------- | ----- |
| 7  | "US"    | 18   | 1.0     | \[0.0, 0.0, 18.0] | 1.0   |
| 8  | "CA"    | 12   | 0.0     | \[0.0, 1.0, 12.0] | 0.0   |
| 9  | "NZ"    | 15   | 0.0     | \[1.0, 0.0, 15.0] | 0.0   |

```scala
import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

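// Build features from country and hour; clicked becomes the label.
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

// fit learns the string index for country; transform appends the features and label columns.
val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()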

/*
Output:
+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0,18.0]|  1.0|
|[1.0,0.0,12.0]|  0.0|
|[0.0,1.0,15.0]|  0.0|
+--------------+-----+

*/
```


