# Contents

## Setting Up

[Computer needed for this course](https://george-jen.gitbook.io/data-science-and-apache-spark/computer_needed)

[Spark Environment Setup](https://george-jen.gitbook.io/data-science-and-apache-spark/spark_setup)

### [Dev environment setup, task list](https://george-jen.gitbook.io/data-science-and-apache-spark/dev_setup4)

[JDK setup](https://george-jen.gitbook.io/data-science-and-apache-spark/jdk_setup5)

[Download and install Anaconda Python and create virtual environment with Python 3.6](https://george-jen.gitbook.io/data-science-and-apache-spark/conda_setup6)

[Download and install Spark](https://george-jen.gitbook.io/data-science-and-apache-spark/download-and-install-spark)

[Scala IDE](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-ide)

[Install findspark, add spylon-kernel for Scala](https://george-jen.gitbook.io/data-science-and-apache-spark/install-findspark-add-spylon-kernel-for-scala)

[Summary](https://george-jen.gitbook.io/data-science-and-apache-spark/summary2)

### [Production Spark Environment Setup](https://george-jen.gitbook.io/data-science-and-apache-spark/production-spark-environment-setup)

[Docker deployment of Spark Cluster](https://george-jen.gitbook.io/data-science-and-apache-spark/docker-deployment-of-spark-cluster)

[Create customized Apache Spark Docker container](https://george-jen.gitbook.io/data-science-and-apache-spark/create-customized-apache-spark-docker-container)

[Dockerfile](https://george-jen.gitbook.io/data-science-and-apache-spark/untitled-12)

[docker-compose and docker-compose.yml](https://george-jen.gitbook.io/data-science-and-apache-spark/docker-compose-and-docker-compose.yml)

[Launch custom built Docker container with docker-compose](https://george-jen.gitbook.io/data-science-and-apache-spark/launch-custom-built-docker-container-with-docker-compose)

[Setup Hadoop, Hive and Spark on Linux without docker](https://george-jen.gitbook.io/data-science-and-apache-spark/setup-hadoop-hive-and-spark-on-linux-without-docker)

[Hadoop configuration](https://george-jen.gitbook.io/data-science-and-apache-spark/hadoop-configuration)

[Hadoop setup](https://george-jen.gitbook.io/data-science-and-apache-spark/hadoop-setup)

[Configure $HADOOP\_HOME/etc/hadoop](https://george-jen.gitbook.io/data-science-and-apache-spark/configure-usdhadoop_home-etc-hadoop)

[HDFS](https://george-jen.gitbook.io/data-science-and-apache-spark/hdfs)

[Start Hadoop](https://george-jen.gitbook.io/data-science-and-apache-spark/start-hadoop)

[Work with Hadoop and HDFS file system](https://george-jen.gitbook.io/data-science-and-apache-spark/work-with-hadoop-and-hdfs-file-system)

[Connect to Hadoop web interface port 50070](https://george-jen.gitbook.io/data-science-and-apache-spark/connect-to-hadoop-web-interface-port-50070)

[Install Hive](https://george-jen.gitbook.io/data-science-and-apache-spark/install-hive)

[Hive home](https://george-jen.gitbook.io/data-science-and-apache-spark/hive-home)

[Initialize Hive schema](https://george-jen.gitbook.io/data-science-and-apache-spark/initialize-hive-schema)

[Start Hive metastore service](https://george-jen.gitbook.io/data-science-and-apache-spark/start-hive-metastore-service.)

[Hive client](https://george-jen.gitbook.io/data-science-and-apache-spark/hive-client)

[Setup Apache Spark](https://george-jen.gitbook.io/data-science-and-apache-spark/setup-apache-spark)

[Spark Home](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-home)

## Python and Scala Prep

### [Python 3 Warm Up](https://george-jen.gitbook.io/data-science-and-apache-spark/python-3-warm-up)

[Basics](https://george-jen.gitbook.io/data-science-and-apache-spark/python-basics)

[Iterables/Collections](https://george-jen.gitbook.io/data-science-and-apache-spark/iterables-collections)

[Strings](https://george-jen.gitbook.io/data-science-and-apache-spark/python-strings)

[List](https://george-jen.gitbook.io/data-science-and-apache-spark/python-list)

[Tuple](https://george-jen.gitbook.io/data-science-and-apache-spark/python-tuple)

[Dictionary](https://george-jen.gitbook.io/data-science-and-apache-spark/python-dictionary)

[Set](https://george-jen.gitbook.io/data-science-and-apache-spark/python-set)

[Conditional statement](https://george-jen.gitbook.io/data-science-and-apache-spark/conditional-statement)

[Loop statement -- For statement](https://george-jen.gitbook.io/data-science-and-apache-spark/loop-statement-for-statement)

[Functions and methods](https://george-jen.gitbook.io/data-science-and-apache-spark/functions-and-methods)

[map and filter](https://george-jen.gitbook.io/data-science-and-apache-spark/map-and-filter)

[map and filter take a function as input](https://george-jen.gitbook.io/data-science-and-apache-spark/map-and-filter-takes-function-as-input)

[lambda](https://george-jen.gitbook.io/data-science-and-apache-spark/lambda)

[Data structure](https://george-jen.gitbook.io/data-science-and-apache-spark/data-structure)

[Input and if statement](https://george-jen.gitbook.io/data-science-and-apache-spark/input-and-if-statement)

[Input from a file](https://george-jen.gitbook.io/data-science-and-apache-spark/input-from-a-file)

[Output to a file](https://george-jen.gitbook.io/data-science-and-apache-spark/output-to-a-file)

[Python coding exercise](https://george-jen.gitbook.io/data-science-and-apache-spark/python-coding-excercise)

### [Scala Warm Up](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-warm-up)

[Type of Variable: Mutable or immutable](https://george-jen.gitbook.io/data-science-and-apache-spark/type-of-variable-mutable-or-immutable)

[Scala Data Type](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-data-type)

[Array in Scala](https://george-jen.gitbook.io/data-science-and-apache-spark/array-in-scala)

[Methods](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-methods)

[Class](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-class)

[Objects](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-objects)

[Trait](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-trait)

[Scala if statement](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-if-statement)

[Scala for loop](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-for-loop)

[Scala While Loop](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-while-loop)

[Scala Exceptions + try catch finally](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-exceptions-+-try-catch-finally)

[Scala coding exercise](https://george-jen.gitbook.io/data-science-and-apache-spark/scala-coding-excercise)

[Run a program to estimate pi](https://george-jen.gitbook.io/data-science-and-apache-spark/run-a-program-to-estimate-pi)

[Run Scala code with Apache Spark](https://george-jen.gitbook.io/data-science-and-apache-spark/run-scala-code-with-apache-spark)

[Python with Apache Spark using Jupyter notebook](https://george-jen.gitbook.io/data-science-and-apache-spark/python-with-apache-spark-using-jupyter-notebook)

## Spark Core

### [Spark Core Introduction](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-core-introduction)

[Spark and Scala Version](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-and-scala-version)

[Basic Spark Package](https://george-jen.gitbook.io/data-science-and-apache-spark/basic-spark-package)

[Resilient Distributed Datasets (RDDs)](https://george-jen.gitbook.io/data-science-and-apache-spark/resilient-distributed-datasets-rdds)

[RDD Operations](https://george-jen.gitbook.io/data-science-and-apache-spark/rdd-operations)

[Passing Functions to Spark](https://george-jen.gitbook.io/data-science-and-apache-spark/passing-function-to-spark)

[Printing elements of an RDD](https://george-jen.gitbook.io/data-science-and-apache-spark/printing-elements-of-an-rdd)

[Working with key-value pairs](https://george-jen.gitbook.io/data-science-and-apache-spark/working-with-key-value-pair)

[RDD Transformation Functions](https://george-jen.gitbook.io/data-science-and-apache-spark/rdd-transformation-funcitons)

[RDD Action Functions](https://george-jen.gitbook.io/data-science-and-apache-spark/rdd-action-functions)

## Spark SQL

### [Spark SQL Introduction](https://george-jen.gitbook.io/data-science-and-apache-spark/untitled-57)

[SQL](https://george-jen.gitbook.io/data-science-and-apache-spark/sql)

[Datasets vs DataFrames](https://george-jen.gitbook.io/data-science-and-apache-spark/datasets-and-dataframes)

[SparkSession](https://george-jen.gitbook.io/data-science-and-apache-spark/sparksession)

[Creating DataFrames](https://george-jen.gitbook.io/data-science-and-apache-spark/creating-dataframes)

[Running SQL Queries Programmatically](https://george-jen.gitbook.io/data-science-and-apache-spark/running-sql-queries-programmatically)

[Creating Datasets](https://george-jen.gitbook.io/data-science-and-apache-spark/creating-datasets)

[Interoperating with RDD](https://george-jen.gitbook.io/data-science-and-apache-spark/interoperating-with-rdd)

[Untyped User-Defined Aggregate Functions](https://george-jen.gitbook.io/data-science-and-apache-spark/untyped-user-defined-aggregate-functions)

[Generic Load/Save Functions](https://george-jen.gitbook.io/data-science-and-apache-spark/generic-load-save-functions)

[Manually specify file option](https://george-jen.gitbook.io/data-science-and-apache-spark/manually-specify-file-option)

[Run SQL on files directly](https://george-jen.gitbook.io/data-science-and-apache-spark/run-sql-on-files-directly)

[Save Mode](https://george-jen.gitbook.io/data-science-and-apache-spark/save-mode)

[Saving to Persistent Tables](https://george-jen.gitbook.io/data-science-and-apache-spark/saving-to-persistent-tables)

[Bucketing, Sorting and Partitioning](https://george-jen.gitbook.io/data-science-and-apache-spark/bucketing-sorting-and-partitioning)

[Apache Arrow](https://george-jen.gitbook.io/data-science-and-apache-spark/apache-arrow)

[Install Python Arrow Module PyArrow](https://george-jen.gitbook.io/data-science-and-apache-spark/install-python-arrow-module-pyarrow)

[Issues that might occur when importing PyArrow](https://george-jen.gitbook.io/data-science-and-apache-spark/issue-might-happen-import-pyarrow)

[Enabling for Conversion to/from Pandas in Python](https://george-jen.gitbook.io/data-science-and-apache-spark/enabling-for-conversion-to-from-pandas)

[Connect to any data source the same consistent way](https://george-jen.gitbook.io/data-science-and-apache-spark/connect-to-any-data-source-the-same-consistent-way)

[Spark SQL Implementation Example in Scala](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-sql-implementation-example-in-scala)

[Run Scala code in Eclipse IDE](https://george-jen.gitbook.io/data-science-and-apache-spark/run-scala-code-in-eclipse-ide)

[Hive Integration, run SQL or HiveQL queries on existing warehouses.](https://george-jen.gitbook.io/data-science-and-apache-spark/hive-integration-run-sql-or-hiveql-queries-on-existing-warehouses.)

[Example: Enrich JSON](https://george-jen.gitbook.io/data-science-and-apache-spark/enrich-json)

## Spark Streaming

### [Spark Streaming Introduction](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-streaming)

#### [Discretized Streams (DStreams)](https://george-jen.gitbook.io/data-science-and-apache-spark/discretized-streams-dstreams)

#### [Transformations on DStreams](https://george-jen.gitbook.io/data-science-and-apache-spark/transformations-on-dstreams)

[map(func)](https://george-jen.gitbook.io/data-science-and-apache-spark/map-func)

[filter(func)](https://george-jen.gitbook.io/data-science-and-apache-spark/filter-func)

[repartition(numPartitions)](https://george-jen.gitbook.io/data-science-and-apache-spark/repartition-numpartitions)

[union(otherStream)](https://george-jen.gitbook.io/data-science-and-apache-spark/union-otherstream)

[reduce(func)](https://george-jen.gitbook.io/data-science-and-apache-spark/reduce-func)

[count()](https://george-jen.gitbook.io/data-science-and-apache-spark/stream-count)

[countByValue()](https://george-jen.gitbook.io/data-science-and-apache-spark/countbyvalue)

[reduceByKey(func, \[numTasks\])](https://george-jen.gitbook.io/data-science-and-apache-spark/reducebykey-func-numtasks)

[join(otherStream, \[numTasks\])](https://george-jen.gitbook.io/data-science-and-apache-spark/join-otherstream-numtasks)

[cogroup(otherStream, \[numTasks\])](https://george-jen.gitbook.io/data-science-and-apache-spark/cogroup-otherstream-numtasks)

[transform(func)](https://george-jen.gitbook.io/data-science-and-apache-spark/transform-func)

[updateStateByKey(func)](https://george-jen.gitbook.io/data-science-and-apache-spark/updatestatebykey-func)

#### [DStream Window Operations](https://george-jen.gitbook.io/data-science-and-apache-spark/dstream-window-operations)

#### [DStream Window Transformation](https://george-jen.gitbook.io/data-science-and-apache-spark/dstream-window-transformation)

[countByWindow(windowLength, slideInterval)](https://george-jen.gitbook.io/data-science-and-apache-spark/countbywindow-windowlength-slideinterval)

[reduceByWindow(func, windowLength, slideInterval)](https://george-jen.gitbook.io/data-science-and-apache-spark/reducebywindow-func-windowlength-slideinterval)

[reduceByKeyAndWindow(func, windowLength, slideInterval, \[numTasks\])](https://george-jen.gitbook.io/data-science-and-apache-spark/reducebykeyandwindow-func-windowlength-slideinterval-numtasks)

[reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, \[numTasks\])](https://george-jen.gitbook.io/data-science-and-apache-spark/reducebykeyandwindow-func-invfunc-windowlength-slideinterval-numtasks)

[countByValueAndWindow(windowLength, slideInterval, \[numTasks\])](https://george-jen.gitbook.io/data-science-and-apache-spark/countbyvalueandwindow-windowlength-slideinterval-numtasks)

[window(windowLength, slideInterval)](https://george-jen.gitbook.io/data-science-and-apache-spark/window-windowlength-slideinterval)

[Window DStream Join Operations](https://george-jen.gitbook.io/data-science-and-apache-spark/broken-reference)

[Window DStream print(n)](https://george-jen.gitbook.io/data-science-and-apache-spark/window-dstream-print-n)

[saveAsTextFiles(prefix, \[suffix\])](https://george-jen.gitbook.io/data-science-and-apache-spark/untitled-92)

[saveAsObjectFiles(prefix, \[suffix\])](https://george-jen.gitbook.io/data-science-and-apache-spark/saveasobjectfiles-prefix-suffix)

[saveAsHadoopFiles(prefix, \[suffix\])](https://george-jen.gitbook.io/data-science-and-apache-spark/saveashadoopfiles-prefix-suffix)

[foreachRDD(func)](https://george-jen.gitbook.io/data-science-and-apache-spark/foreachrdd-func)

[Spark Streaming with Twitter, you can get public tweets by using Twitter API.](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-streaming-with-twitter-you-can-get-public-tweets-by-using-twitter-api.)

[Spark streaming use case with Python](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-streaming-use-case-with-python)

## Spark GraphX

### [Spark Graph Computing](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-graph-computing)

### [Spark Graph Computing, Continued](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-graph-computing-continue)

### [GraphX](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx-1)

#### [Package org.apache.spark.graphx](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx)

[Edge Class](https://george-jen.gitbook.io/data-science-and-apache-spark/edge-class)

[EdgeContext Class](https://george-jen.gitbook.io/data-science-and-apache-spark/edgecontext-class)

[EdgeDirection Class](https://george-jen.gitbook.io/data-science-and-apache-spark/edgedirection-class)

[EdgeRDD Class](https://george-jen.gitbook.io/data-science-and-apache-spark/edgerdd-class)

[EdgeTriplet Class](https://george-jen.gitbook.io/data-science-and-apache-spark/edgetriplet-class)

[Graph Class](https://george-jen.gitbook.io/data-science-and-apache-spark/graph-class)

[GraphLoader Object](https://george-jen.gitbook.io/data-science-and-apache-spark/graphloader-object)

[GraphOps Class](https://george-jen.gitbook.io/data-science-and-apache-spark/graphops-class)

[GraphXUtils Object](https://george-jen.gitbook.io/data-science-and-apache-spark/graphxutils-object)

[PartitionStrategy Trait](https://george-jen.gitbook.io/data-science-and-apache-spark/partitionstrategy-trait)

[Pregel Object](https://george-jen.gitbook.io/data-science-and-apache-spark/pregel-object)

[TripletFields Class](https://george-jen.gitbook.io/data-science-and-apache-spark/tripletfields-class)

[VertexRDD Class](https://george-jen.gitbook.io/data-science-and-apache-spark/vertexrdd-class)

#### [Package org.apache.spark.graphx.impl](https://george-jen.gitbook.io/data-science-and-apache-spark/untitled)

[AggregatingEdgeContext Class](https://george-jen.gitbook.io/data-science-and-apache-spark/aggregatingedgecontext-class)

[EdgeRDDImpl Class](https://george-jen.gitbook.io/data-science-and-apache-spark/edgerddimpl-class)

[Class GraphImpl](https://george-jen.gitbook.io/data-science-and-apache-spark/class-graphimpl-less-than-vd-ed-greater-than)

[Class VertexRDDImpl](https://george-jen.gitbook.io/data-science-and-apache-spark/class-vertexrddimpl-less-than-vd-greater-than)

#### [Package org.apache.spark.graphx.lib](https://george-jen.gitbook.io/data-science-and-apache-spark/package-org.apache.spark.graphx.lib-1)

[Class ConnectedComponents](https://george-jen.gitbook.io/data-science-and-apache-spark/class-connectedcomponents)

[Class LabelPropagation](https://george-jen.gitbook.io/data-science-and-apache-spark/class-labelpropagation)

[Class PageRank](https://george-jen.gitbook.io/data-science-and-apache-spark/class-pagerank)

[Class ShortestPaths](https://george-jen.gitbook.io/data-science-and-apache-spark/class-shortestpaths)

[Class StronglyConnectedComponents](https://george-jen.gitbook.io/data-science-and-apache-spark/class-stronglyconnectedcomponents)

[Class SVDPlusPlus](https://george-jen.gitbook.io/data-science-and-apache-spark/class-svdplusplus)

[Class SVDPlusPlus.Conf](https://george-jen.gitbook.io/data-science-and-apache-spark/class-svdplusplus.conf)

[Class TriangleCount](https://george-jen.gitbook.io/data-science-and-apache-spark/class-trianglecount)

#### [Package org.apache.spark.graphx.util](https://george-jen.gitbook.io/data-science-and-apache-spark/package-org.apache.spark.graphx.lib)

[Class BytecodeUtils](https://george-jen.gitbook.io/data-science-and-apache-spark/class-bytecodeutils)

[Class GraphGenerators](https://george-jen.gitbook.io/data-science-and-apache-spark/class-graphgenerators)

[GraphX Example 1](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx-examples)

[GraphX Example 2](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx-example-2)

[GraphX Example 3](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx-example-3)

[Describing an Organization Chart Quickly and Easily with Spark GraphX](https://george-jen.gitbook.io/data-science-and-apache-spark/untitled-98)

[Page Rank with Apache Spark GraphX](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx-application-case-2)

[Bulk Synchronous Parallel with Google Pregel: GraphX Implementation Use Cases](https://george-jen.gitbook.io/data-science-and-apache-spark/graphx-application-case-3)

[Tree and Graph Traversal with and without Spark GraphX](https://george-jen.gitbook.io/data-science-and-apache-spark/tree-and-graph-traversal)

## Spark Machine Learning

### [Spark Machine Learning Introduction](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-machine-learning)

#### [Binary Classification](https://george-jen.gitbook.io/data-science-and-apache-spark/binary-classification)

#### [Multiclass Classification](https://george-jen.gitbook.io/data-science-and-apache-spark/multiclass-classification)

#### [Regression](https://george-jen.gitbook.io/data-science-and-apache-spark/regression)

#### [Correlation](https://george-jen.gitbook.io/data-science-and-apache-spark/correlation)

[Image Data Source](https://george-jen.gitbook.io/data-science-and-apache-spark/image-data-source)

[ML DataFrame is SQL DataFrame](https://george-jen.gitbook.io/data-science-and-apache-spark/ml-dataframe)

[ML Transformer](https://george-jen.gitbook.io/data-science-and-apache-spark/ml-transformer)

[ML Estimator](https://george-jen.gitbook.io/data-science-and-apache-spark/ml-estimator)

[ML Pipeline](https://george-jen.gitbook.io/data-science-and-apache-spark/ml-pipeline)

[Transformer/Estimator Parameters](https://george-jen.gitbook.io/data-science-and-apache-spark/transformer-estimator-parameters)

[Extracting, transforming and selecting features](https://george-jen.gitbook.io/data-science-and-apache-spark/extracting-transforming-and-selecting-features)

[TF-IDF](https://george-jen.gitbook.io/data-science-and-apache-spark/tf-idf)

[Word2Vec](https://george-jen.gitbook.io/data-science-and-apache-spark/word2vec)

[FeatureHasher](https://george-jen.gitbook.io/data-science-and-apache-spark/featurehasher)

[Tokenizer](https://george-jen.gitbook.io/data-science-and-apache-spark/tokenizer)

[CountVectorizer](https://george-jen.gitbook.io/data-science-and-apache-spark/countvectorizer)

[StopWordsRemover](https://george-jen.gitbook.io/data-science-and-apache-spark/stopwordremover)

[n-gram](https://george-jen.gitbook.io/data-science-and-apache-spark/n-gram)

[Binarizer](https://george-jen.gitbook.io/data-science-and-apache-spark/binarizer)

[PCA](https://george-jen.gitbook.io/data-science-and-apache-spark/pca)

[PolynomialExpansion](https://george-jen.gitbook.io/data-science-and-apache-spark/polynomialexpansion)

[StringIndexer](https://george-jen.gitbook.io/data-science-and-apache-spark/stringindexer)

[Discrete Cosine Transform (DCT)](https://george-jen.gitbook.io/data-science-and-apache-spark/discrete-cosine-transform-dct)

[One-hot encoding](https://george-jen.gitbook.io/data-science-and-apache-spark/one-hot-encoding)

[StandardScaler](https://george-jen.gitbook.io/data-science-and-apache-spark/standardscaler)

[IndexToString](https://george-jen.gitbook.io/data-science-and-apache-spark/indextostring)

[VectorIndexer](https://george-jen.gitbook.io/data-science-and-apache-spark/vectorindexer)

[Interaction](https://george-jen.gitbook.io/data-science-and-apache-spark/interaction)

[Normalizer](https://george-jen.gitbook.io/data-science-and-apache-spark/normalizer)

[MinMaxScaler](https://george-jen.gitbook.io/data-science-and-apache-spark/minmaxscaler)

[MaxAbsScaler](https://george-jen.gitbook.io/data-science-and-apache-spark/maxabscaler)

[Bucketizer](https://george-jen.gitbook.io/data-science-and-apache-spark/bucketizer)

[ElementwiseProduct](https://george-jen.gitbook.io/data-science-and-apache-spark/elementwiseproduct)

[SQLTransformer](https://george-jen.gitbook.io/data-science-and-apache-spark/sqltransformer)

[VectorAssembler](https://george-jen.gitbook.io/data-science-and-apache-spark/vectorassembler)

[VectorSizeHint](https://george-jen.gitbook.io/data-science-and-apache-spark/vectorsizehint)

[QuantileDiscretizer](https://george-jen.gitbook.io/data-science-and-apache-spark/quantilediscretizer)

[Imputer](https://george-jen.gitbook.io/data-science-and-apache-spark/imputer)

[VectorSlicer](https://george-jen.gitbook.io/data-science-and-apache-spark/vectorslicer)

[RFormula](https://george-jen.gitbook.io/data-science-and-apache-spark/rformula)

[ChiSqSelector](https://george-jen.gitbook.io/data-science-and-apache-spark/chisqselector)

[Locality Sensitive Hashing](https://george-jen.gitbook.io/data-science-and-apache-spark/locality-sensitive-hashing)

[MinHash for Jaccard Distance](https://george-jen.gitbook.io/data-science-and-apache-spark/minhash-for-jaccard-distance)

#### [Classification and Regression](https://george-jen.gitbook.io/data-science-and-apache-spark/classification-and-regression)

[LogisticRegression](https://george-jen.gitbook.io/data-science-and-apache-spark/logisticregression)

[OneVsRest](https://george-jen.gitbook.io/data-science-and-apache-spark/onevsrest)

[Naive Bayes classifiers](https://george-jen.gitbook.io/data-science-and-apache-spark/naive-bayes-classifiers)

[Decision trees](https://george-jen.gitbook.io/data-science-and-apache-spark/decision-trees)

[Random forests](https://george-jen.gitbook.io/data-science-and-apache-spark/random-forests)

[Gradient-boosted trees (GBTs)](https://george-jen.gitbook.io/data-science-and-apache-spark/gradient-boosted-trees-gbts)

[Multilayer perceptron classifier](https://george-jen.gitbook.io/data-science-and-apache-spark/multilayer-perceptron-classifier)

[Linear Support Vector Machine](https://george-jen.gitbook.io/data-science-and-apache-spark/linear-support-vector-machine)

[Linear Regression](https://george-jen.gitbook.io/data-science-and-apache-spark/linear-regression)

[Generalized linear regression](https://george-jen.gitbook.io/data-science-and-apache-spark/generalized-linear-regression)

[Isotonic regression](https://george-jen.gitbook.io/data-science-and-apache-spark/isotonic-regression)

[Decision Tree Regression](https://george-jen.gitbook.io/data-science-and-apache-spark/decision-tree-regression)

[Random Forest Regression](https://george-jen.gitbook.io/data-science-and-apache-spark/random-forest-regression)

[Gradient-boosted tree regression](https://george-jen.gitbook.io/data-science-and-apache-spark/gradient-boosted-tree-regression)

[Survival regression](https://george-jen.gitbook.io/data-science-and-apache-spark/survival-regression)

#### [Clustering](https://george-jen.gitbook.io/data-science-and-apache-spark/clustering)

[k-means](https://george-jen.gitbook.io/data-science-and-apache-spark/k-means)

[Latent Dirichlet allocation or LDA](https://george-jen.gitbook.io/data-science-and-apache-spark/latent-dirichlet-allocation-or-lda)

[Bisecting k-means](https://george-jen.gitbook.io/data-science-and-apache-spark/bisecting-k-means)

#### [A Gaussian Mixture Model](https://george-jen.gitbook.io/data-science-and-apache-spark/a-gaussian-mixture-model)

[Collaborative filtering](https://george-jen.gitbook.io/data-science-and-apache-spark/collaborative-filtering)

[Frequent Pattern Mining](https://george-jen.gitbook.io/data-science-and-apache-spark/frequent-pattern-mining)

[FP-Growth](https://george-jen.gitbook.io/data-science-and-apache-spark/fp-growth)

[PrefixSpan](https://george-jen.gitbook.io/data-science-and-apache-spark/prefixspan)

[ML Tuning: model selection and hyperparameter tuning](https://george-jen.gitbook.io/data-science-and-apache-spark/ml-tuning-model-selection-and-hyperparameter-tuning)

[Model selection (a.k.a. hyperparameter tuning)](https://george-jen.gitbook.io/data-science-and-apache-spark/model-selection-a.k.a.-hyperparameter-tuning)

#### [Cross-Validation](https://george-jen.gitbook.io/data-science-and-apache-spark/cross-validation)

[Train-Validation Split](https://george-jen.gitbook.io/data-science-and-apache-spark/train-validation-split)

#### [Spark Machine Learning Applications](https://george-jen.gitbook.io/data-science-and-apache-spark/spark-machine-learning-applications)

[Data Visualization with Vegas Viz and Scala with Spark ML](https://george-jen.gitbook.io/data-science-and-apache-spark/data-visualization-with-vegas-viz-and-scala-with-spark-ml)

[Apache Spark Machine Learning with Dremio Data Lake Engine](https://george-jen.gitbook.io/data-science-and-apache-spark/apache-spark-machine-learning-with-dremio-data-lake-engine)

[Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning](https://george-jen.gitbook.io/data-science-and-apache-spark/dremio-data-lake-engine-apache-arrow-flight-connector-with-spark-machine-learning)

[Neural Network with Apache Spark Machine Learning Multilayer Perceptron Classifier](https://george-jen.gitbook.io/data-science-and-apache-spark/neural-network-with-apache-spark-machine-learning-multilayer-perceptron-classifier)

## Appendix

### [Video presentation](https://george-jen.gitbook.io/data-science-and-apache-spark/appendix-video-presentations)

## [References](https://george-jen.gitbook.io/data-science-and-apache-spark/references)
