Contents
Click any link below to jump to that section.
Setting Up
Computer needed for this course
Download and install Anaconda Python and create a virtual environment with Python 3.6
Install findspark, add spylon-kernel for Scala
Docker deployment of Spark Cluster
Create customized Apache Spark Docker container
docker-compose and docker-compose.yml
Launch custom built Docker container with docker-compose
Setup Hadoop, Hive and Spark on Linux without docker
Configure $HADOOP_HOME/etc/hadoop
Work with Hadoop and HDFS file system
Connect to Hadoop web interface port 50070
Python and Scala Prep
Loop statement -- For statement
map and filter take a function as input
Type of Variable: Mutable or immutable
Scala Exceptions + try catch finally
Run Scala code with Apache Spark
Python with Apache Spark using Jupyter notebook
Spark Core
Resilient Distributed Datasets (RDDs)
Spark SQL
Running SQL Queries Programmatically
Untyped User-Defined Aggregate Functions
Bucketing, Sorting and Partitioning
Install Python Arrow Module PyArrow
Issues that might occur when importing PyArrow
Enabling for Conversion to/from Pandas in Python
Connect to any data source the same consistent way
Spark SQL Implementation Example in Scala
Hive Integration, run SQL or HiveQL queries on existing warehouses.
Spark Streaming
cogroup(otherStream, [numTasks])
countByWindow(windowLength, slideInterval)
reduceByWindow(func, windowLength, slideInterval)
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
countByValueAndWindow(windowLength, slideInterval, [numTasks])
window(windowLength, slideInterval)
Window DStream Join Operations
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
Spark Streaming with Twitter: get public tweets using the Twitter API
Spark Streaming use case with Python
Spark GraphX
Class StronglyConnectedComponents
Spark GraphX Describes an Organization Chart Easily and Quickly
Page Rank with Apache Spark GraphX
Bulk Synchronous Parallel with Google Pregel: GraphX Implementation Use Cases
Tree and Graph Traversal with and without Spark GraphX
Spark Machine Learning
Transformer/Estimator Parameters
Extracting, transforming and selecting features
Discrete Cosine Transform (DCT)
Multilayer perceptron classifier
Gradient-boosted tree regression
Latent Dirichlet allocation or LDA
ML Tuning: model selection and hyperparameter tuning
Model selection (a.k.a. hyperparameter tuning)
Data Visualization with Vegas Viz and Scala with Spark ML
Apache Spark Machine Learning with Dremio Data Lake Engine
Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning
Neural Network with Apache Spark Machine Learning Multilayer Perceptron Classifier
Appendix