Decision trees

A popular family of classification and regression methods. More information about the implementation can be found further in the section on decision trees.


The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the DataFrame which the Decision Tree algorithm can recognize..
import{IndexToString, StringIndexer, VectorIndexer}
// Load the data stored in LIBSVM format as a DataFrame.
val data ="libsvm").load("file:///opt/spark/data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
.setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
// Train model. This also runs the indexers.
val model =
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display."predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
|predictedLabel|label| features|
| 0.0| 0.0|(692,[121,122,123...|
| 0.0| 0.0|(692,[123,124,125...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[125,126,127...|
only showing top 5 rows
Test Error = 0.030303030303030276
Learned classification tree model:
DecisionTreeClassificationModel (uid=dtc_a286075ebc4c) of depth 2 with 5 nodes
If (feature 406 <= 22.0)
If (feature 99 in {2.0})
Predict: 0.0
Else (feature 99 not in {2.0})
Predict: 1.0
Else (feature 406 > 22.0)
Predict: 0.0