Data Visualization with Vegas-viz, Scala, and Spark ML
If you are a Python programmer working on data science, you are certainly familiar with Matplotlib for visualizing your classification or regression results, especially in a Jupyter notebook. Matplotlib.pyplot is the standard tool for data visualization, but it has one limitation: it is Python-only.
If you code in Scala, you need an alternative, and one of the great data visualization tools for Scala is Vegas.
Vegas works with the Jupyter-scala kernel, which is not tightly integrated with Apache Spark the way, for example, Spylon-kernel is. The Jupyter-scala kernel is lightweight and great for pure Scala coding without Apache Spark; I use it myself whenever I do not need Spark.
As I am typing this text, Vegas-viz is only published to the Maven repository for Scala versions up to 2.11. If you are running Scala 2.12, it is not yet available, meaning you cannot download the needed jar files from Maven.
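If you are building a standalone project rather than a notebook, the same version constraint applies in your build definition. A minimal build.sbt sketch (the Vegas coordinates match the `$ivy` imports used later in this article; the exact patch version of Scala 2.11 is an assumption):

```scala
// build.sbt -- Vegas is published only for Scala 2.11, so pin the Scala version
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.vegas-viz" %% "vegas"       % "0.3.11",
  "org.vegas-viz" %% "vegas-spark" % "0.3.11"
)
```

With `scalaVersion` set to 2.12, dependency resolution for these artifacts would fail, which is the practical meaning of "not yet available from Maven."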
The use case I would like to demonstrate integrates the lightweight Jupyter-scala kernel, now called Almond (almond, "A Scala kernel for Jupyter"), with Apache Spark.
To make Vegas-viz work, you need Scala 2.11 or earlier; you can set up Apache Spark 2.4.4 with Hadoop 2.7, which bundles Scala 2.11.
This assumes you already have Jupyter notebook. Then just follow the installation instructions on the Almond website to install the Jupyter-scala kernel.
Once it is installed, you are set to play with Jupyter-scala in Jupyter notebook and the Vegas-viz data visualization tool.
Here is simple Scala code for Linear Regression with the Apache Spark ML library, run under the Almond/Jupyter-scala kernel in Jupyter notebook.
```scala
//Start by pulling the Vegas-viz library jars from the Maven repository;
//your machine needs to be connected to the internet
import $ivy.`org.vegas-viz::vegas:0.3.11`
import $ivy.`org.vegas-viz::vegas-spark:0.3.11`
//You will see lots of downloads from running the above 2 lines
//Once done, you can import the Vegas library for plotting
import vegas._
import vegas.render.WindowRenderer._

//Almond/Jupyter-scala is not integrated with Spark, so you need to integrate it manually
//by downloading the necessary jars from Maven for the Spark libraries this demo needs
import $ivy.`org.jupyter-scala::spark:0.4.2`
import $ivy.`org.apache.spark::spark-sql:2.4.4`
//Again, lots of downloads from the above 2 lines; once done, you are ready to import Spark libs
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.util.LongAccumulator
import org.apache.log4j._
import org.apache.spark.sql._

//Create the Spark session
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
  .builder
  .appName("Vegas")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///tmp")
  .getOrCreate()
//and the SparkContext sc
val sc = spark.sparkContext

//Even a simple LinearRegression comes from the Spark ML library, which Jupyter-scala
//does not bundle, so download it from Maven for Spark 2.4.4
import $ivy.`org.apache.spark::spark-mllib:2.4.4`
//Now import the Vegas-viz library
import vegas._
import vegas.data.External._

//Generate a random dataset
import scala.math._
var X = Array[Double]()
var Y = Array[Double]()
//Create 100 random pairs of data points in the X and Y arrays (Double), ranging from 0 to 10
for (i <- 0 until 100)
{
  var x = random * 10
  var b = random * 10
  X :+= x
  Y :+= 3 * x + b
}
//Although the dataset is random, Y = 3X + b will fit nicely with LinearRegression.
//That is not the point, though; the point is to play with Vegas-viz data visualization.

//Create a features/label Spark dataframe, sorted by features
import spark.implicits._
val df = sc.parallelize(X zip Y).toDF("features", "label").sort("features")
df.show(3, false)
/*
Output:
+-------------------+------------------+
|features           |label             |
+-------------------+------------------+
|0.12428231703539572|2.6330542403970045|
|0.12875124825893702|9.111634435646987 |
|0.21021408668144725|10.037147132043941|
+-------------------+------------------+
only showing top 3 rows
*/

//Import the Vegas plotting libraries
import vegas._
import vegas.render.WindowRenderer._
import vegas.sparkExt._
//Plot the original Spark dataframe to see the 100 data points visually
val plot = Vegas("linear regression", width=800, height=600).
  withDataFrame(df).
  mark(Point).
  encodeX("features", Quantitative).
  encodeY("label", Quantitative).
  show
```
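Before handing the job to Spark ML, it is worth seeing what the model should recover. Since the data was generated as Y = 3X + b with b uniform on [0, 10), the best-fit slope should come out near 3 and the intercept near 5 (the mean of the noise). A minimal pure-Scala ordinary-least-squares sketch, no Spark required (the helper name `olsFit` and the fixed seed are mine, for illustration):

```scala
//Closed-form simple linear regression (ordinary least squares):
//slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
def olsFit(xs: Array[Double], ys: Array[Double]): (Double, Double) = {
  val n = xs.length
  val mx = xs.sum / n
  val my = ys.sum / n
  val sxy = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sxx = xs.map(x => (x - mx) * (x - mx)).sum
  val slope = sxy / sxx
  (slope, my - slope * mx)
}

//Data generated the same way as above, but with a fixed seed for reproducibility
val rng = new scala.util.Random(42)
val xs = Array.fill(100)(rng.nextDouble() * 10)
val ys = xs.map(x => 3 * x + rng.nextDouble() * 10)
val (slope, intercept) = olsFit(xs, ys)
//slope should be close to 3, intercept close to 5
println(f"slope = $slope%.2f, intercept = $intercept%.2f")
```

Spark's LinearRegression below will land near the same line, modulo the regularization it applies and the train/test split.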
```scala
//LinearRegression finds the best-fit straight line in the form
//  y_pred = coefficient*X + intercept
//In simple terms, the coefficient is the weight and the intercept is the bias.
//Below we call the Apache Spark ML LinearRegression API to do the job.
import org.apache.spark.ml.feature.VectorAssembler
//Vectorize the feature column into "features_new"
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("features"))
  .setOutputCol("features_new")
//Create a dataframe whose feature column is a vector and sorted, as required by the Spark ML API
var vector_df = vectorAssembler.transform(df)
vector_df = vector_df.select("features_new", "label").sort("features_new")
vector_df.show(3, false)
/*
Output:
+---------------------+------------------+
|features_new         |label             |
+---------------------+------------------+
|[0.12428231703539572]|2.6330542403970045|
|[0.12875124825893702]|9.111634435646987 |
|[0.21021408668144725]|10.037147132043941|
+---------------------+------------------+
only showing top 3 rows
*/

//Split the 100 data points into 70 for training and 30 for testing
val splits = vector_df.randomSplit(Array(0.7, 0.3))
val train_df = splits(0)
val test_df = splits(1)
print(test_df.count())

//Create a LinearRegression model with train_df
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
  .setFeaturesCol("features_new")
  .setLabelCol("label")
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setMaxIter(10)
val lr_model = lr.fit(train_df)
//Then test the trained model with test_df
val lr_predictions = lr_model.transform(test_df)
//Then display the testing results
lr_predictions.select("prediction", "label", "features_new").sort("features_new").show(3, false)
/*
Output:
+-----------------+------------------+---------------------+
|prediction       |label             |features_new         |
+-----------------+------------------+---------------------+
|6.184323007474771|9.111634435646987 |[0.12875124825893702]|
|6.877166443408621|6.914045632096169 |[0.3713831792261324] |
|8.485151210258538|7.8911343006901555|[0.9344951735265017] |
+-----------------+------------------+---------------------+
only showing top 3 rows
*/

//Evaluate the training quality with the R squared score
import org.apache.spark.ml.evaluation.RegressionEvaluator
val lr_evaluator = new RegressionEvaluator().setPredictionCol("prediction").
  setLabelCol("label").setMetricName("r2")
print("R Squared (R2) on test data =" + lr_evaluator.evaluate(lr_predictions))
/*
Output:
R Squared (R2) on test data =0.9248598537955983
*/

//Now, from the 100 original data points, generate 100 predicted values using the
//weight and bias that come from the trained model.
//The weight is the coefficient, which is lr_model.coefficients(0)
//The bias is the intercept, which is lr_model.intercept
//The best-fit linear equation is
//  y_pred = weight*x + bias
//which is
//  y_pred = lr_model.coefficients(0)*x + lr_model.intercept
var y_pred = Array[Double]()
for (i <- 0 until X.length)
{
  y_pred :+= (lr_model.coefficients(0)) * X(i) + lr_model.intercept
}
//Create the df_pred dataframe, sorted, from Array X zip y_pred
val df_pred = sc.parallelize(X zip y_pred).toDF("features", "prediction").sort("features")
//Plot it together with the original 100 data points
Vegas.layered("linear regression", width=800, height=600).
  withLayers(
    Layer().
      withDataFrame(df).
      mark(Point).
      encodeX("features", Quantitative).
      encodeY("label", Quantitative),
    Layer().
      withDataFrame(df_pred).
      mark(Line).
      encodeX("features", Quantitative).
      encodeY("prediction", Quantitative)
  ).show
```
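The R squared score reported by RegressionEvaluator is easy to reproduce by hand, which makes the metric less of a black box: R2 = 1 - SS_res/SS_tot, where SS_res is the sum of squared prediction errors and SS_tot the sum of squared deviations from the label mean. A minimal pure-Scala sketch (the helper name `r2` is mine):

```scala
//R squared: 1 - SS_res / SS_tot, the fraction of label variance explained by the predictions
def r2(labels: Array[Double], preds: Array[Double]): Double = {
  val mean  = labels.sum / labels.length
  val ssTot = labels.map(y => (y - mean) * (y - mean)).sum
  val ssRes = labels.zip(preds).map { case (y, p) => (y - p) * (y - p) }.sum
  1.0 - ssRes / ssTot
}

//A perfect prediction scores 1.0; always predicting the label mean scores 0.0
println(r2(Array(1.0, 2.0, 3.0), Array(1.0, 2.0, 3.0)))  // 1.0
println(r2(Array(1.0, 2.0, 3.0), Array(2.0, 2.0, 2.0)))  // 0.0
```

The 0.92 score above therefore says the fitted line explains about 92% of the label variance in the test split, with the remaining 8% being the uniform noise b.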
With Vegas-viz, you can visualize whatever you see fit from the datasets you have in Scala.
As always, the code used in this writing is on my GitHub site:
GitHub - geyungjen/jentekllc: Apache Spark Application Development -- George Jen, Jen Tek LLC