Contents

Click the links below to jump to each section.

Setting Up

Computer needed for this course

Spark Environment Setup

JDK setup

Download and install Anaconda Python and create a virtual environment with Python 3.6

Download and install Spark

Scala IDE

Install findspark, add spylon-kernel for Scala

Summary

Docker deployment of Spark Cluster

Create customized Apache Spark Docker container

Dockerfile

docker-compose and docker-compose.yml

Launch the custom-built Docker container with docker-compose

Set up Hadoop, Hive and Spark on Linux without Docker

Hadoop configuration

Hadoop setup

Configure $HADOOP_HOME/etc/hadoop

HDFS

Start Hadoop

Work with Hadoop and HDFS file system

Connect to the Hadoop web interface on port 50070

Install Hive

Hive home

Initialize the Hive schema

Start the Hive metastore service

Hive client

Set up Apache Spark

Spark Home

Python and Scala Prep

Basics

Iterables/Collections

Strings

List

Tuple

Dictionary

Set

Conditional statement

Loop statement: the for statement

Functions and methods

map and filter

map and filter take a function as input

lambda

Data structure

Input and if statement

Input from a file

Output to a file

Python coding exercise

Type of Variable: Mutable or Immutable

Scala Data Types

Array in Scala

Methods

Class

Objects

Trait

Scala if statement

Scala for loop

Scala While Loop

Scala Exceptions + try catch finally

Scala coding exercise

Run a program to estimate pi

Run Scala code with Apache Spark

Python with Apache Spark using Jupyter notebook

Spark Core

Spark and Scala Version

Basic Spark Package

Resilient Distributed Datasets (RDDs)

RDD Operations

Passing Functions to Spark

Printing elements of an RDD

Working with key-value pairs

RDD Transformation Functions

RDD Action Functions

Spark SQL

SQL

Datasets vs DataFrames

SparkSession

Creating DataFrames

Running SQL Queries Programmatically

Creating Datasets

Interoperating with RDDs

Untyped User-Defined Aggregate Functions

Generic Load/Save Functions

Manually specify file options

Run SQL on files directly

Save Mode

Saving to Persistent Tables

Bucketing, Sorting and Partitioning

Apache Arrow

Install Python Arrow Module PyArrow

Issues that might happen when importing PyArrow

Enabling for Conversion to/from Pandas in Python

Connect to any data source in the same consistent way

Spark SQL Implementation Example in Scala

Run Scala code in Eclipse IDE

Hive Integration: run SQL or HiveQL queries on existing warehouses

Example: Enrich JSON

Spark Streaming

map(func)

filter(func)

repartition(numPartitions)

union(otherStream)

reduce(func)

count()

countByValue()

reduceByKey(func, [numTasks])

join(otherStream, [numTasks])

cogroup(otherStream, [numTasks])

transform(func)

updateStateByKey(func)

repartition(numPartitions)

countByWindow(windowLength, slideInterval)

reduceByWindow(func, windowLength, slideInterval)

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

countByValueAndWindow(windowLength, slideInterval, [numTasks])

window(windowLength, slideInterval)

Window DStream Join Operations

Window DStream print(n)

saveAsTextFiles(prefix, [suffix])

saveAsObjectFiles(prefix, [suffix])

saveAsHadoopFiles(prefix, [suffix])

foreachRDD(func)

Spark Streaming with Twitter: get public tweets using the Twitter API

Spark streaming use case with Python

Spark GraphX

Edge Class

EdgeContext Class

EdgeDirection Class

EdgeRDD Class

EdgeTriplet Class

Graph Class

GraphLoader Object

GraphOps Class

GraphXUtils Object

PartitionStrategy Trait

Pregel Object

TripletFields Class

VertexRDD Class

AggregatingEdgeContext Class

EdgeRDDImpl Class

Class GraphImpl

Class VertexRDDImpl

Class ConnectedComponents

Class LabelPropagation

Class PageRank

Class ShortestPaths

Class StronglyConnectedComponents

Class SVDPlusPlus

Class SVDPlusPlus.Conf

Class TriangleCount

Class BytecodeUtils

Class GraphGenerators

GraphX Example 1

GraphX Example 2

GraphX Example 3

Spark GraphX Describes an Organization Chart Easily and Quickly

PageRank with Apache Spark GraphX

Bulk Synchronous Parallel with Google Pregel: GraphX Implementation Use Cases

Tree and Graph Traversal with and without Spark GraphX

Spark Machine Learning

Image Data Source

An ML DataFrame is a SQL DataFrame

ML Transformer

ML Estimator

ML Pipeline

Transformer/Estimator Parameters

Extracting, transforming and selecting features

TF-IDF

Word2Vec

FeatureHasher

Tokenizer

CountVectorizer

StopWordsRemover

n-gram

Binarizer

PCA

PolynomialExpansion

StringIndexer

Discrete Cosine Transform (DCT)

One-hot encoding

StandardScaler

IndexToString

VectorIndexer

Interaction

Normalizer

MinMaxScaler

MaxAbsScaler

Bucketizer

ElementwiseProduct

SQLTransformer

VectorAssembler

VectorSizeHint

QuantileDiscretizer

Imputer

VectorSlicer

RFormula

ChiSqSelector

Locality Sensitive Hashing

MinHash for Jaccard Distance

LogisticRegression

OneVsRest

Naive Bayes classifiers

Decision trees

Random forests

Gradient-boosted trees (GBTs)

Multilayer perceptron classifier

Linear Support Vector Machine

Linear Regression

Generalized linear regression

Isotonic regression

Decision Tree Regression

Random Forest Regression

Gradient-boosted tree regression

Survival regression

k-means

Latent Dirichlet allocation (LDA)

Bisecting k-means

Collaborative filtering

Frequent Pattern Mining

FP-Growth

PrefixSpan

ML Tuning: model selection and hyperparameter tuning

Model selection (a.k.a. hyperparameter tuning)

Train-Validation Split

Data Visualization with Vegas Viz and Scala with Spark ML

Apache Spark Machine Learning with Dremio Data Lake Engine

Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning

Neural Network with Apache Spark Machine Learning Multilayer Perceptron Classifier

Appendix
