SQLTransformer

SQLTransformer implements transformations that are defined by a SQL statement. Currently, only SQL syntax of the form "SELECT ... FROM __THIS__ ..." is supported, where "__THIS__" represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output, and can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on the selected columns. For example, SQLTransformer supports statements like:

SELECT a, a + b AS ab FROM __THIS__
SELECT a, SQRT(b) AS bsqrt FROM __THIS__ WHERE a > 5
SELECT a, b, SUM(c) AS csum FROM __THIS__ GROUP BY a, b
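
As a minimal sketch of how the filtering and aggregating statement forms above behave, the following runs two of them on a small DataFrame. The sample data, the variable names `data`, `filtered` and `grouped`, and the local SparkSession setup are assumptions made for illustration; in spark-shell a `spark` session already exists and the builder line can be skipped.

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption); spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SQLTransformerStatements")
  .getOrCreate()

// Hypothetical sample data with the columns a, b and c used in the statements above.
val data = spark.createDataFrame(
  Seq((1, 4.0, 10.0), (6, 9.0, 20.0), (6, 9.0, 30.0))).toDF("a", "b", "c")

// Row filter combined with a built-in function.
val filtered = new SQLTransformer().setStatement(
  "SELECT a, SQRT(b) AS bsqrt FROM __THIS__ WHERE a > 5").transform(data)

// Aggregation: one output row per distinct (a, b) pair.
val grouped = new SQLTransformer().setStatement(
  "SELECT a, b, SUM(c) AS csum FROM __THIS__ GROUP BY a, b").transform(data)

filtered.show()
grouped.show()
```

With this sample data, `filtered` keeps only the two rows where a is 6, and `grouped` contains one row for each of the two distinct (a, b) pairs.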

Examples

Assume that we have the following DataFrame with columns id, v1 and v2:

id | v1 | v2
----|-----|-----
0 | 1.0 | 3.0
2 | 2.0 | 5.0

import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()
/*
+---+---+---+---+----+
| id| v1| v2| v3|  v4|
+---+---+---+---+----+
|  0|1.0|3.0|4.0| 3.0|
|  2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+

*/
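
The description mentions that UDFs can be used in the statement as well. A minimal sketch of that, assuming the UDF is registered on the active session with `spark.udf.register` before the transformer runs, might look like the following. The UDF name `v_mean`, the output column `vmean`, and the local SparkSession setup are hypothetical choices for illustration, not part of the original example.

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption); spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SQLTransformerUdf")
  .getOrCreate()

// Same input data as in the example above.
val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

// Hypothetical UDF: the mean of two Double columns, registered by name
// so that it can be referenced from SQL text.
spark.udf.register("v_mean", (x: Double, y: Double) => (x + y) / 2.0)

val udfTrans = new SQLTransformer().setStatement(
  "SELECT id, v_mean(v1, v2) AS vmean FROM __THIS__")

val out = udfTrans.transform(df)
out.show()
```

Registering the function by name is what makes it visible to the SQL statement; a UDF that exists only as a Scala value cannot be referenced from the statement text.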
