Contents

Click the links below to jump to each section.

Setting Up

Computer needed for this course

Spark Environment Setup

JDK setup

Download and install Anaconda Python and create a virtual environment with Python 3.6

Download and install Spark

Scala IDE

Install findspark, add spylon-kernel for Scala

Summary

Docker deployment of Spark Cluster

Create customized Apache Spark Docker container

Dockerfile

docker-compose and docker-compose.yml

Launch the custom-built Docker container with docker-compose

Set up Hadoop, Hive and Spark on Linux without Docker

Hadoop configuration

Hadoop setup

Configure $HADOOP_HOME/etc/hadoop

HDFS

Start Hadoop

Work with Hadoop and HDFS file system

Connect to the Hadoop web interface on port 50070

Install Hive

Hive home

Initialize the Hive schema

Start the Hive metastore service

Hive client

Set up Apache Spark

Spark Home

Python and Scala Prep

Basics

Iterables/Collections

Strings

List

Tuple

Dictionary

Set

Conditional statement

Loop statement: the for statement

Functions and methods

map and filter

map and filter take a function as input

lambda

Data structure

Input and if statement

Input from a file

Output to a file

Python coding exercise

Type of Variable: Mutable or Immutable

Scala Data Types

Array in Scala

Methods

Class

Objects

Trait

Scala if statement

Scala for loop

Scala While Loop

Scala Exceptions + try catch finally

Scala coding exercise

Run a program to estimate pi

Run Scala code with Apache Spark

Python with Apache Spark using Jupyter notebook

Spark Core

Spark and Scala Version

Basic Spark Package

Resilient Distributed Datasets (RDDs)

RDD Operations

Passing Functions to Spark

Printing elements of an RDD

Working with key-value pairs

RDD Transformation Functions

RDD Action Functions

Spark SQL

SQL

Datasets vs DataFrames

SparkSession

Creating DataFrames

Running SQL Queries Programmatically

Creating Datasets

Interoperating with RDDs

Untyped User-Defined Aggregate Functions

Generic Load/Save Functions

Manually specify file options

Run SQL on files directly

Save Mode

Saving to Persistent Tables

Bucketing, Sorting and Partitioning

Apache Arrow

Install Python Arrow Module PyArrow

Issues that might happen when importing PyArrow

Enabling for Conversion to/from Pandas in Python

Connect to any data source in the same consistent way

Spark SQL Implementation Example in Scala

Run Scala code in Eclipse IDE

Hive Integration: run SQL or HiveQL queries on existing warehouses

Example: Enrich JSON

Spark Streaming

map(func)

filter(func)

repartition(numPartitions)

union(otherStream)

reduce(func)

count()

countByValue()

reduceByKey(func, [numTasks])

join(otherStream, [numTasks])

cogroup(otherStream, [numTasks])

transform(func)

updateStateByKey(func)

repartition(numPartitions)

countByWindow(windowLength, slideInterval)

reduceByWindow(func, windowLength, slideInterval)

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

countByValueAndWindow(windowLength, slideInterval, [numTasks])

window(windowLength, slideInterval)

Window DStream Join Operations

Window DStream print(n)

saveAsTextFiles(prefix, [suffix])

saveAsObjectFiles(prefix, [suffix])

saveAsHadoopFiles(prefix, [suffix])

foreachRDD(func)

Spark Streaming with Twitter: get public tweets using the Twitter API

Spark streaming use case with Python

Spark GraphX

Edge Class

EdgeContext Class

EdgeDirection Class

EdgeRDD Class

EdgeTriplet Class

Graph Class

GraphLoader Object

GraphOps Class

GraphXUtils Object

PartitionStrategy Trait

Pregel Object

TripletFields Class

VertexRDD Class

AggregatingEdgeContext Class

EdgeRDDImpl Class

Class GraphImpl

Class VertexRDDImpl

Class ConnectedComponents

Class LabelPropagation

Class PageRank

Class ShortestPaths

Class StronglyConnectedComponents

Class SVDPlusPlus

Class SVDPlusPlus.Conf

Class TriangleCount

Class BytecodeUtils

Class GraphGenerators

GraphX Example 1

GraphX Example 2

GraphX Example 3

Spark GraphX Describes an Organization Chart Easily and Quickly

PageRank with Apache Spark GraphX

Bulk Synchronous Parallel with Google Pregel: GraphX Implementation Use Cases

Tree and Graph Traversal with and without Spark GraphX

Spark Machine Learning

Image Data Source

An ML DataFrame is a SQL DataFrame

ML Transformer

ML Estimator

ML Pipeline

Transformer/Estimator Parameters

Extracting, transforming and selecting features

TF-IDF

Word2Vec

FeatureHasher

Tokenizer

CountVectorizer

StopWordsRemover

n-gram

Binarizer

PCA

PolynomialExpansion

StringIndexer

Discrete Cosine Transform (DCT)

One-hot encoding

StandardScaler

IndexToString

VectorIndexer

Interaction

Normalizer

MinMaxScaler

MaxAbsScaler

Bucketizer

ElementwiseProduct

SQLTransformer

VectorAssembler

VectorSizeHint

QuantileDiscretizer

Imputer

VectorSlicer

RFormula

ChiSqSelector

Locality Sensitive Hashing

MinHash for Jaccard Distance

LogisticRegression

OneVsRest

Naive Bayes classifiers

Decision trees

Random forests

Gradient-boosted trees (GBTs)

Multilayer perceptron classifier

Linear Support Vector Machine

Linear Regression

Generalized linear regression

Isotonic regression

Decision Tree Regression

Random Forest Regression

Gradient-boosted tree regression

Survival regression

k-means

Latent Dirichlet allocation (LDA)

Bisecting k-means

Collaborative filtering

Frequent Pattern Mining

FP-Growth

PrefixSpan

ML Tuning: model selection and hyperparameter tuning

Model selection (a.k.a. hyperparameter tuning)

Train-Validation Split

Data Visualization with Vegas Viz and Scala with Spark ML

Apache Spark Machine Learning with Dremio Data Lake Engine

Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning

Neural Network with Apache Spark Machine Learning Multilayer Perceptron Classifier

Appendix
