Data Science with Apache Spark
Type of Variable: Mutable or immutable
Immutable variable: its contents can only be assigned once, at initialization, and cannot be modified afterward. Declare immutable variables with the val keyword:
val x: String = "Jack"
val numbers: Int = 10
val pi: Float = 3.14f   // a Float literal needs the f suffix; plain 3.14 is a Double and will not compile as a Float
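A quick way to see the immutability is to try reassigning a val: the Scala compiler rejects it. A minimal sketch (the variable name "name" is illustrative), runnable in the spylon-kernel or any Scala REPL:

val name: String = "Jack"   // immutable binding
// name = "Jill"            // uncommenting this line fails to compile:
                            // error: reassignment to val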
Mutable variable: its contents can be reassigned after initialization. Declare mutable variables with the var keyword:
var x: String = "Dave"
x = "John"

var num: Int = 1
num = 2   // Scala is case-sensitive: writing Num = 2 would fail with "error: not found: value Num"

var pi = 3.14   // type Double is inferred from the literal
pi = 3.1416
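Note that a var is mutable only in its value, not its type: the type is fixed at declaration (whether annotated or inferred), and assigning a value of a different type is a compile error. A small illustration, assuming the same REPL session:

var num: Int = 1
num = 2          // fine: same type
// num = "two"   // does not compile: type mismatch; found String, required Int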