Transformer/Estimator Parameters

All Transformers and Estimators share a common API for specifying parameters.
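For instance, in Spark parameters can be set through setters (e.g. lr.setMaxIter(10)) or supplied in a ParamMap when calling fit(). The shared contract behind that can be sketched with a hypothetical plain-Scala mini-API (Param, SimpleParams, and ToyLogReg are illustrative names, not Spark's classes):

```scala
// Hypothetical mini-version of a uniform parameter API (not Spark's classes).
final case class Param[T](name: String)

trait SimpleParams {
  private var settings = Map.empty[String, Any]
  // Returning this.type lets calls chain, in the style of Spark's setters.
  def set[T](p: Param[T], value: T): this.type = {
    settings += (p.name -> value)
    this
  }
  def get[T](p: Param[T], default: T): T =
    settings.getOrElse(p.name, default).asInstanceOf[T]
}

// Estimators and transformers alike would mix in the same trait, so every
// pipeline stage declares and sets its parameters the same way.
object ToyLogReg {
  val maxIter  = Param[Int]("maxIter")
  val regParam = Param[Double]("regParam")
}
class ToyLogReg extends SimpleParams

val lr = (new ToyLogReg)
  .set(ToyLogReg.maxIter, 10)
  .set(ToyLogReg.regParam, 0.001)
```

Because set() returns this.type, parameter configuration chains fluently, which is the same ergonomics the real setters provide.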
Transformers
A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:
A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.
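The transform() contract is easy to see in miniature. Below is a hypothetical plain-Scala sketch (not Spark's actual API) in which a "DataFrame" is modeled as a map from column names to column values, and a tokenizer-like feature transformer appends a words column:

```scala
// Hypothetical sketch of the Transformer contract (not Spark's actual API).
// A "DataFrame" is modeled as a map from column name to column values.
type Frame = Map[String, Seq[Any]]

trait SimpleTransformer {
  // transform() converts one frame into another, generally by appending a column.
  def transform(df: Frame): Frame
}

// A feature transformer: read the "text" column, map it to word sequences,
// and output a new frame with the "words" column appended.
object WhitespaceTokenizer extends SimpleTransformer {
  def transform(df: Frame): Frame = {
    val words = df("text").map(_.toString.split("\\s+").toSeq)
    df + ("words" -> words)
  }
}

val df: Frame = Map("id" -> Seq(0L, 1L), "text" -> Seq("a b c", "spark f g"))
val out = WhitespaceTokenizer.transform(df)  // out has columns id, text, words
```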
Estimators
An Estimator implements a method fit(), which accepts a DataFrame and produces a Model; a Model is itself a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
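In the same toy style (plain Scala, not Spark's API), an Estimator's fit() consumes a frame and returns a Model that is itself a Transformer; here the "learning algorithm" is a deliberately simple one that memorizes the majority label:

```scala
// Hypothetical sketch of the Estimator contract (not Spark's actual API).
type Frame = Map[String, Seq[Any]]

trait SimpleTransformer { def transform(df: Frame): Frame }

trait SimpleEstimator {
  // fit() accepts a frame and produces a Model, which is a Transformer.
  def fit(df: Frame): SimpleTransformer
}

// A toy learning algorithm: "training" finds the majority label, and the
// resulting model predicts that label for every input row.
object MajorityClassifier extends SimpleEstimator {
  def fit(df: Frame): SimpleTransformer = {
    val labels = df("label").map(_.asInstanceOf[Double])
    val majority = labels.groupBy(identity).maxBy(_._2.size)._1
    new SimpleTransformer {
      def transform(df: Frame): Frame =
        df + ("prediction" -> df("id").map(_ => majority))
    }
  }
}

val model  = MajorityClassifier.fit(Map("label" -> Seq(1.0, 0.0, 1.0)))
val scored = model.transform(Map("id" -> Seq(4L, 5L)))
```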
Pipeline
In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:
Split each document’s text into words.
Convert each document’s words into a numerical feature vector.
Learn a prediction model using the feature vectors and labels.
MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.
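Conceptually, Pipeline.fit() walks the stages in order: a Transformer stage transforms the data before passing it on, while an Estimator stage is fit on the current data and replaced by the Model it produces; the resulting sequence of transformers plays the role of the fitted PipelineModel. A hypothetical plain-Scala sketch of that loop (not Spark's implementation):

```scala
// Hypothetical sketch of how a pipeline is fit, stage by stage (not Spark's code).
type Frame = Map[String, Seq[Any]]

trait Stage
trait SimpleTransformer extends Stage { def transform(df: Frame): Frame }
trait SimpleEstimator   extends Stage { def fit(df: Frame): SimpleTransformer }

// Fit each stage in order; the result contains transformers only,
// analogous to a PipelineModel.
def fitPipeline(stages: Seq[Stage], training: Frame): Seq[SimpleTransformer] = {
  var current = training
  stages.map {
    case t: SimpleTransformer =>
      current = t.transform(current)   // feed transformed data downstream
      t
    case e: SimpleEstimator =>
      val model = e.fit(current)       // train, then keep the produced Model
      current = model.transform(current)
      model
  }
}

// Toy stages: a tokenizer (Transformer) followed by a word-counting "learner".
val tokenizer = new SimpleTransformer {
  def transform(df: Frame): Frame =
    df + ("words" -> df("text").map(_.toString.split("\\s+").toSeq))
}
val counter = new SimpleEstimator {
  def fit(df: Frame): SimpleTransformer = new SimpleTransformer {
    def transform(df: Frame): Frame =
      df + ("n" -> df("words").map(_.asInstanceOf[Seq[String]].size))
  }
}

val fitted = fitPipeline(Seq(tokenizer, counter), Map("text" -> Seq("a b", "c")))

// Applying the fitted stages in order mimics PipelineModel.transform().
val newData: Frame = Map("text" -> Seq("x y z"))
val out = fitted.foldLeft(newData)((d, t) => t.transform(d))
```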
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk.
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
sameModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

/*
Running the code above produces the following output:

(4, spark i j k) --> prob=[0.15964077387874118,0.8403592261212589], prediction=1.0
(5, l m n) --> prob=[0.8378325685476612,0.16216743145233875], prediction=0.0
(6, spark hadoop spark) --> prob=[0.06926633132976273,0.9307336686702373], prediction=1.0
(7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
*/