LogisticRegression
Logistic regression is a popular method for predicting a categorical response. It is a special case of generalized linear models that predicts the probability of the outcomes. In spark.ml, logistic regression can be used to predict a binary outcome via binomial logistic regression, or a multiclass outcome via multinomial logistic regression. Use the family parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant.
Multinomial logistic regression can be used for binary classification by setting the family param to "multinomial". It will produce two sets of coefficients and two intercepts.
When fitting a LogisticRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.

Binomial logistic regression

For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib.
Examples
The following example shows how to train binomial and multinomial logistic regression models for binary classification with elastic net regularization. elasticNetParam corresponds to α and regParam corresponds to λ.
```scala
import org.apache.spark.ml.classification.LogisticRegression

// Load training data
val training = spark.read.format("libsvm").load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

val mlrModel = mlr.fit(training)

// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")

/*
Output:
Coefficients: (692,[244,263,272,300,301,328,350,351,378,379,405,406,407,428,433,434,455,456,461,462,483,484,489,490,496,511,512,517,539,540,568],[-7.353983524188197E-5,-9.102738505589466E-5,-1.9467430546904298E-4,-2.0300642473486668E-4,-3.1476183314863995E-5,-6.842977602660743E-5,1.5883626898239883E-5,1.4023497091372047E-5,3.5432047524968605E-4,1.1443272898171087E-4,1.0016712383666666E-4,6.014109303795481E-4,2.840248179122762E-4,-1.1541084736508837E-4,3.85996886312906E-4,6.35019557424107E-4,-1.1506412384575676E-4,-1.5271865864986808E-4,2.804933808994214E-4,6.070117471191634E-4,-2.008459663247437E-4,-1.421075579290126E-4,2.739010341160883E-4,2.7730456244968115E-4,-9.838027027269332E-5,-3.808522443517704E-4,-2.5315198008555033E-4,2.7747714770754307E-4,-2.443619763919199E-4,-0.0015394744687597765,-2.3073328411331293E-4]) Intercept: 0.22456315961250325

Multinomial coefficients: 2 x 692 CSCMatrix
(0,244) 4.290365458958277E-5
(1,244) -4.290365458958294E-5
(0,263) 6.488313287833108E-5
(1,263) -6.488313287833092E-5
(0,272) 1.2140666790834663E-4
(1,272) -1.2140666790834657E-4
(0,300) 1.3231861518665612E-4
(1,300) -1.3231861518665607E-4
(0,350) -6.775444746760509E-7
(1,350) 6.775444746761932E-7
(0,351) -4.899237909429297E-7
(1,351) 4.899237909430322E-7
(0,378) -3.5812102770679596E-5
(1,378) 3.581210277067968E-5
(0,379) -2.3539704331222065E-5
(1,379) 2.353970433122204E-5
(0,405) -1.90295199030314E-5
(1,405) 1.90295199030314E-5
(0,406) -5.626696935778909E-4
(1,406) 5.626696935778912E-4
(0,407) -5.121519619099504E-5
(1,407) 5.1215196190995074E-5
(0,428) 8.080614545413342E-5
(1,428) -8.080614545413331E-5
(0,433) -4.256734915330487E-5
(1,433) 4.256734915330495E-5
(0,434) -7.080191510151425E-4
(1,434) 7.080191510151435E-4
(0,455) 8.094482475733589E-5
(1,455) -8.094482475733582E-5
(0,456) 1.0433687128309833E-4
(1,456) -1.0433687128309814E-4
(0,461) -5.4466605046259246E-5
(1,461) 5.4466605046259286E-5
(0,462) -5.667133061990392E-4
(1,462) 5.667133061990392E-4
(0,483) 1.2495896045528374E-4
(1,483) -1.249589604552838E-4
(0,484) 9.810519424784944E-5
(1,484) -9.810519424784941E-5
(0,489) -4.88440907254626E-5
(1,489) 4.8844090725462606E-5
(0,490) -4.324392733454803E-5
(1,490) 4.324392733454811E-5
(0,496) 6.903351855620161E-5
(1,496) -6.90335185562012E-5
(0,511) 3.946505594172827E-4
(1,511) -3.946505594172831E-4
(0,512) 2.621745995919226E-4
(1,512) -2.621745995919226E-4
(0,517) -4.459475951170906E-5
(1,517) 4.459475951170901E-5
(0,539) 2.5417562428184555E-4
(1,539) -2.5417562428184555E-4
(0,540) 5.271781246228031E-4
(1,540) -5.271781246228032E-4
(0,568) 1.860255150352447E-4
(1,568) -1.8602551503524485E-4
Multinomial intercepts: [-0.12065879445860686,0.12065879445860686]
*/
```
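The elasticNetParam (α) and regParam (λ) set above control the elastic-net penalty. As a toy illustration in plain Scala (standard elastic-net notation, not Spark's internal code; the weight vector and values below are made up):

```scala
object ElasticNetSketch {
  // Elastic-net penalty controlled by regParam (lambda) and elasticNetParam (alpha):
  //   lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  def penalty(weights: Array[Double], lambda: Double, alpha: Double): Double = {
    val l1 = weights.map(math.abs).sum
    val l2Squared = weights.map(w => w * w).sum
    lambda * (alpha * l1 + (1 - alpha) / 2.0 * l2Squared)
  }

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -0.25, 0.0)
    println(penalty(w, 0.3, 1.0)) // alpha = 1: pure L1 (lasso) -> 0.3 * 0.75 = 0.225
    println(penalty(w, 0.3, 0.0)) // alpha = 0: pure L2 (ridge) -> 0.3 * 0.3125 / 2 = 0.046875
  }
}
```

Setting elasticNetParam between 0 and 1 mixes the two penalties, which is what the examples above do with the value 0.8.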
The spark.ml implementation of logistic regression also supports extracting a summary of the model over the training set. Note that the predictions and metrics, which are stored as DataFrames in LogisticRegressionSummary, are annotated @transient and hence only available on the driver.
```scala
// Continuing from the code above: extract the binary training summary.
import org.apache.spark.sql.functions.max
import spark.implicits._

val trainingSummary = lrModel.binarySummary

// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))

// Obtain the receiver-operating characteristic as a DataFrame and areaUnderROC.
val roc = trainingSummary.roc
roc.show()
println(s"areaUnderROC: ${trainingSummary.areaUnderROC}")

// Set the model threshold to maximize F-Measure
val fMeasure = trainingSummary.fMeasureByThreshold
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
  .select("threshold").head().getDouble(0)
lrModel.setThreshold(bestThreshold)

/*
Output:
+---+--------------------+
|FPR|                 TPR|
+---+--------------------+
|0.0|                 0.0|
|0.0|0.017543859649122806|
|0.0| 0.03508771929824561|
|0.0| 0.05263157894736842|
|0.0| 0.07017543859649122|
|0.0| 0.08771929824561403|
|0.0| 0.10526315789473684|
|0.0| 0.12280701754385964|
|0.0| 0.14035087719298245|
|0.0| 0.15789473684210525|
|0.0| 0.17543859649122806|
|0.0| 0.19298245614035087|
|0.0| 0.21052631578947367|
|0.0| 0.22807017543859648|
|0.0| 0.24561403508771928|
|0.0|  0.2631578947368421|
|0.0|  0.2807017543859649|
|0.0|  0.2982456140350877|
|0.0|  0.3157894736842105|
|0.0|  0.3333333333333333|
+---+--------------------+
only showing top 20 rows

areaUnderROC: 1.0
*/
```

Multinomial logistic regression

Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J, where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term, then a length-K vector of intercepts is available.
Multinomial coefficients are available as coefficientMatrix and intercepts are available as interceptVector.
coefficients and intercept methods on a logistic regression model trained with multinomial family are not supported. Use coefficientMatrix and interceptVector instead.
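To make the shapes concrete, here is a plain-Scala sketch (toy coefficients, not Spark internals) of how a K×J coefficient matrix and a length-K intercept vector, in the roles of coefficientMatrix and interceptVector, turn a feature vector into class probabilities:

```scala
object SoftmaxSketch {
  // coef is K x J (K classes, J features); intercept has length K.
  def predictProbabilities(coef: Array[Array[Double]],
                           intercept: Array[Double],
                           x: Array[Double]): Array[Double] = {
    // One margin per class: dot(coef row, x) + intercept
    val margins = coef.zip(intercept).map { case (row, b) =>
      row.zip(x).map { case (w, xi) => w * xi }.sum + b
    }
    val m = margins.max                          // subtract max for numerical stability
    val exps = margins.map(v => math.exp(v - m))
    val total = exps.sum
    exps.map(_ / total)                          // softmax: normalize to probabilities
  }

  def main(args: Array[String]): Unit = {
    val coef = Array(                            // K = 3 classes, J = 2 features (made-up values)
      Array(0.1, -0.2),
      Array(0.0, 0.3),
      Array(-0.1, -0.1))
    val intercept = Array(0.05, -0.12, 0.07)
    val probs = predictProbabilities(coef, intercept, Array(1.0, 2.0))
    println(probs.mkString(", "))                // three probabilities summing to 1
  }
}
```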
The conditional probabilities of the outcome classes k∈1,2,…,K are modeled using the softmax function.
We minimize the weighted negative log-likelihood, using a multinomial response model, with elastic-net penalty to control for overfitting.
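Written out in standard notation (a sketch; see the spark.mllib documentation for Spark's exact formulation, with α = elasticNetParam and λ = regParam):

```latex
% Softmax model for the conditional class probabilities
P(Y = k \mid \mathbf{x}) =
  \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{0k}}}
       {\sum_{k'=1}^{K} e^{\boldsymbol{\beta}_{k'} \cdot \mathbf{x} + \beta_{0k'}}}

% Weighted negative log-likelihood with elastic-net penalty
\min_{\boldsymbol{\beta}} \;
  -\sum_{i=1}^{n} w_i \log P\!\left(Y = y_i \mid \mathbf{x}_i\right)
  \;+\; \lambda \left( \alpha \lVert \boldsymbol{\beta} \rVert_1
  + \frac{1 - \alpha}{2} \lVert \boldsymbol{\beta} \rVert_2^2 \right)
```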
For a detailed derivation, refer to the logistic regression documentation in spark.mllib.
Examples
The following example shows how to train a multiclass logistic regression model with elastic net regularization, as well as extract the multiclass training summary for evaluating the model.
```scala
import org.apache.spark.ml.classification.LogisticRegression

// Load training data
val training = spark
  .read
  .format("libsvm")
  .load("file:///opt/spark/data/mllib/sample_multiclass_classification_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: \n${lrModel.interceptVector}")

val trainingSummary = lrModel.summary

// Obtain the objective per iteration
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(println)

// For multiclass, we can inspect metrics on a per-label basis
println("False positive rate by label:")
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
  println(s"label $label: $rate")
}

println("True positive rate by label:")
trainingSummary.truePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
  println(s"label $label: $rate")
}

println("Precision by label:")
trainingSummary.precisionByLabel.zipWithIndex.foreach { case (prec, label) =>
  println(s"label $label: $prec")
}

println("Recall by label:")
trainingSummary.recallByLabel.zipWithIndex.foreach { case (rec, label) =>
  println(s"label $label: $rec")
}

println("F-measure by label:")
trainingSummary.fMeasureByLabel.zipWithIndex.foreach { case (f, label) =>
  println(s"label $label: $f")
}

val accuracy = trainingSummary.accuracy
val falsePositiveRate = trainingSummary.weightedFalsePositiveRate
val truePositiveRate = trainingSummary.weightedTruePositiveRate
val fMeasure = trainingSummary.weightedFMeasure
val precision = trainingSummary.weightedPrecision
val recall = trainingSummary.weightedRecall
println(s"Accuracy: $accuracy\nFPR: $falsePositiveRate\nTPR: $truePositiveRate\n" +
  s"F-measure: $fMeasure\nPrecision: $precision\nRecall: $recall")

/*
Output:
Coefficients:
3 x 4 CSCMatrix
(1,2) -0.7803943459681859
(0,3) 0.3176483191238039
(1,3) -0.3769611423403096
Intercepts:
[0.05165231659832854,-0.12391224990853622,0.07225993331020768]
objectiveHistory:
1.098612288668108
1.087602085441699
1.0341156572156232
1.0289859520256006
1.0300389657358995
1.0239965158223991
1.0236097451839508
1.0231082121970012
1.023022220302788
1.0230018151780262
1.0229963739557606
False positive rate by label:
label 0: 0.22
label 1: 0.05
label 2: 0.0
True positive rate by label:
label 0: 1.0
label 1: 1.0
label 2: 0.46
Precision by label:
label 0: 0.6944444444444444
label 1: 0.9090909090909091
label 2: 1.0
Recall by label:
label 0: 1.0
label 1: 1.0
label 2: 0.46
F-measure by label:
label 0: 0.819672131147541
label 1: 0.9523809523809523
label 2: 0.6301369863013699
Accuracy: 0.82
FPR: 0.09
TPR: 0.82
F-measure: 0.8007300232766211
Precision: 0.8678451178451179
Recall: 0.82
*/
```
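As a sanity check, the weighted metrics in the output above can be reproduced from the per-label values in plain Scala (the three classes in this dataset are equally sized, so the weighted averages reduce to unweighted means over the labels):

```scala
object MetricsCheck {
  // Per-label F-measure from precision and recall: 2PR / (P + R)
  def fMeasure(p: Double, r: Double): Double = 2 * p * r / (p + r)

  def main(args: Array[String]): Unit = {
    // Per-label precision and recall copied from the summary output above
    val precision = Array(0.6944444444444444, 0.9090909090909091, 1.0)
    val recall    = Array(1.0, 1.0, 0.46)

    val f = precision.zip(recall).map { case (p, r) => fMeasure(p, r) }
    f.foreach(println)             // F-measure by label, as in the output
    println(precision.sum / 3)     // weighted precision
    println(recall.sum / 3)        // weighted recall (equals accuracy here)
    println(f.sum / 3)             // weighted F-measure
  }
}
```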