📔 Data Science with Apache Spark
  • Preface
  • Contents
  • Basic Prerequisite Skills
  • Computer needed for this course
  • Spark Environment Setup
  • Dev environment setup, task list
  • JDK setup
  • Download and install Anaconda Python and create virtual environment with Python 3.6
  • Download and install Spark
  • Eclipse, the Scala IDE
  • Install findspark, add spylon-kernel for scala
  • ssh and scp client
  • Summary
  • Development environment on MacOS
  • Production Spark Environment Setup
  • VirtualBox VM
  • VirtualBox only shows 32bit on AMD CPU
  • Configure VirtualBox NAT as Network Adapter on Guest VM and Allow putty ssh Through Port Forwarding
  • Docker deployment of Spark Cluster
  • Create customized Apache Spark Docker container
  • Dockerfile
  • docker-compose and docker-compose.yml
  • Launch custom built Docker container with docker-compose
  • Entering Docker Container
  • Setup Hadoop, Hive and Spark on Linux without docker
  • Hadoop Preparation
  • Hadoop setup
  • Configure $HADOOP_HOME/etc/hadoop
  • HDFS
  • Start and stop Hadoop
  • Work with Hadoop and HDFS file system
  • Connect to Hadoop web interface port 50070 and 8088
  • Install Hive
  • hive home
  • Initialize hive schema
  • Start hive metastore service.
  • hive-site.xml
  • Hive client
  • Setup Apache Spark
  • Spark Home
  • Jupyter-notebook server
  • Python 3 Warm Up
  • Basics
  • Iterables/Collections
  • Strings
  • List
  • Tuple
  • Dictionary
  • Set
  • Conditional statement
  • for loop
  • while loop
  • Functions and methods
  • map and filter
  • map and filter take a function as input
  • lambda
  • Python Class
  • Input and if statement
  • Input from a file
  • Output to a file
  • try except
  • Python coding exercise
  • Scala Warm Up
  • Start Spylon-kernel on Jupyter-notebook
  • Type of Variable: Mutable or immutable
  • Block statement
  • Scala Data Type
  • Array in Scala
  • Methods
  • Functions
  • Anonymous function
  • Scala map and filter methods
  • Class
  • Objects
  • Trait
  • Tuple in Scala
  • List/Seq
  • Set in Scala
  • Scala Map
  • Scala if statement
  • Scala for loop
  • Scala While Loop
  • Scala Exceptions + try catch finally
  • Scala coding exercise
  • Run a program to estimate pi
  • Common Spark command line
  • Run Scala code with spark-submit
  • Python with Apache Spark using Jupyter notebook
  • Spark Core Introduction
  • Spark and Scala Version
  • Basic Spark Package
  • Resilient Distributed Datasets (RDDs)
  • RDD Operations
  • Passing Function to Spark
  • Printing elements of an RDD
  • Working with key value pair
  • RDD Transformation Functions
  • RDD Action Functions
  • SPARK SQL
  • SQL
  • Datasets and DataFrames
  • SparkSession
  • Creating DataFrames
  • Running SQL Queries Programmatically
  • Issue from running Cartesian Join Query
  • Creating Datasets
  • Interoperating with RDD
  • Untyped User-Defined Aggregate Functions
  • Generic Load/Save Functions
  • Manually specify file option
  • Run SQL on files directly
  • Save Mode
  • Saving to Persistent Tables
  • Bucketing, Sorting and Partitioning
  • Apache Arrow
  • Install Python Arrow Module PyArrow
  • Issue might happen import PyArrow
  • Enabling for Conversion to/from Pandas in Python
  • Connect to any data source the same consistent way
  • Spark SQL Implementation Example in Scala
  • Run scala code in Eclipse IDE
  • Hive Integration, run SQL or HiveQL queries on existing warehouses.
  • Example: Enrich JSON
  • Integrate Tableau Data Visualization with Hive Data Warehouse and Apache Spark SQL
  • Connect Tableau to Spark SQL running in VM with VirtualBox with NAT
  • Issues with connecting from Tableau to Spark SQL
  • SPARK Streaming
  • Discretized Streams (DStreams)
  • Transformations on DStreams
  • map(func)
  • filter(func)
  • repartition(numPartitions)
  • union(otherStream)
  • reduce(func)
  • count()
  • countByValue()
  • reduceByKey(func, [numTasks])
  • join(otherStream, [numTasks])
  • cogroup(otherStream, [numTasks])
  • transform(func)
  • updateStateByKey(func)
  • Scala Tips for updateStateByKey
  • repartition(numPartitions)
  • DStream Window Operations
  • DStream Window Transformation
  • countByWindow(windowLength, slideInterval)
  • reduceByWindow(func, windowLength, slideInterval)
  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
  • countByValueAndWindow(windowLength, slideInterval, [numTasks])
  • window(windowLength, slideInterval)
  • Window DStream print(n)
  • saveAsTextFiles(prefix, [suffix])
  • saveAsObjectFiles(prefix, [suffix])
  • saveAsHadoopFiles(prefix, [suffix])
  • foreachRDD(func)
  • Build Twitter Scala API Library for Spark Streaming using sbt
  • Spark Streaming with Twitter, you can get public tweets by using Twitter API.
  • Spark streaming use case with Python
  • Spark Graph Computing
  • Spark Graph Computing Continue
  • Graphx
  • Package org.apache.spark.graphx
  • Edge Class
  • EdgeContext Class
  • EdgeDirection Class
  • EdgeRDD Class
  • EdgeTriplet Class
  • Graph Class
  • GraphLoader Object
  • GraphOps Class
  • GraphXUtils Object
  • PartitionStrategy Trait
  • Pregel Object
  • TripletFields Class
  • VertexRDD Class
  • Package org.apache.spark.graphx.impl
  • AggregatingEdgeContext Class
  • EdgeRDDImpl Class
  • Class GraphImpl<VD,ED>
  • Class VertexRDDImpl<VD>
  • Package org.apache.spark.graphx.lib
  • Class ConnectedComponents
  • Class LabelPropagation
  • Class PageRank
  • Class ShortestPaths
  • Class StronglyConnectedComponents
  • Class SVDPlusPlus
  • Class SVDPlusPlus.Conf
  • Class TriangleCount
  • Package org.apache.spark.graphx.util
  • Class BytecodeUtils
  • Class GraphGenerators
  • Graphx Example 1
  • Graphx Example 2
  • Graphx Example 3
  • Spark Graphx Describes Organization Chart Easy and Fast
  • Page Rank with Apache Spark Graphx
  • bulk synchronous parallel with Google Pregel Graphx Implementation Use Cases
  • Tree and Graph Traversal with and without Spark Graphx
  • Graphx Graph Traversal with Pregel Explained
  • Spark Machine Learning
  • Binary Classification
  • Multiclass Classification
  • Regression
  • Correlation
  • Image Data Source
  • ML DataFrame is SQL DataFrame
  • ML Transformer
  • ML Estimator
  • ML Pipeline
  • Transformer/Estimator Parameters
  • Extracting, transforming and selecting features
  • TF-IDF
  • Word2Vec
  • FeatureHasher
  • Tokenizer
  • CountVectorizer
  • StopWordsRemover
  • n-gram
  • Binarizer
  • PCA
  • PolynomialExpansion
  • StringIndexer
  • Discrete Cosine Transform (DCT)
  • One-hot encoding
  • StandardScaler
  • IndexToString
  • VectorIndexer
  • Interaction
  • Normalizer
  • MinMaxScaler
  • MaxAbsScaler
  • Bucketizer
  • ElementwiseProduct
  • SQLTransformer
  • VectorAssembler
  • VectorSizeHint
  • QuantileDiscretizer
  • Imputer
  • VectorSlicer
  • RFormula
  • ChiSqSelector
  • Locality Sensitive Hashing
  • MinHash for Jaccard Distance
  • Classification and Regression
  • LogisticRegression
  • OneVsRest
  • Naive Bayes classifiers
  • Decision trees
  • Random forests
  • Gradient-boosted trees (GBTs)
  • Multilayer perceptron classifier
  • Linear Support Vector Machine
  • Linear Regression
  • Generalized linear regression
  • Isotonic regression
  • Decision Tree Regression
  • Random Forest Regression
  • Gradient-boosted tree regression
  • Survival regression
  • Clustering
  • k-means
  • Latent Dirichlet allocation or LDA
  • Bisecting k-means
  • A Gaussian Mixture Model
  • Collaborative filtering
  • Frequent Pattern Mining
  • FP-Growth
  • PrefixSpan
  • ML Tuning: model selection and hyperparameter tuning
  • Model selection (a.k.a. hyperparameter tuning)
  • Cross-Validation
  • Train-Validation Split
  • Spark Machine Learning Applications
  • Apache Spark SQL & Machine Learning on Genetic Variant Classifications
  • Data Visualization with Vegas Viz and Scala with Spark ML
  • Apache Spark Machine Learning with Dremio Data Lake Engine
  • Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning
  • Neural Network with Apache Spark Machine Learning Multilayer Perceptron Classifier
  • Setup TensorFlow, Keras, Theano, Pytorch/torchvision on the CentOS VM
  • Virus Xray Image Classification with Tensorflow Keras Python and Apache Spark Scala
  • Appendix -- Video Presentations
  • References

GraphOps Class

class GraphOps[VD, ED] extends Serializable

Instance constructor:

new GraphOps(graph: Graph[VD, ED])(implicit arg0: ClassTag[VD], arg1: ClassTag[ED])

def collectEdges(edgeDirection: EdgeDirection): VertexRDD[Array[Edge[ED]]]

Returns an RDD that contains for each vertex v its local edges, i.e., the edges that are incident on v, in the user-specified direction.
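
For example, a minimal sketch (assuming the imports and the example graph constructed at the end of this page) that gathers every vertex's incident edges in both directions:

// Collect, for each vertex, the edges incident on it in either direction
val edgesPerVertex: VertexRDD[Array[Edge[String]]] =
  graph.collectEdges(EdgeDirection.Either)

// Render each edge array as a readable string
edgesPerVertex
  .mapValues(edges => edges.map(e => s"${e.srcId}->${e.dstId}:${e.attr}").mkString(", "))
  .collect()
  .foreach(println)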

def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]

Collect the neighbor vertex ids for each vertex.

def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]

Collect the neighbor vertex attributes for each vertex.
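
A small sketch, again assuming the example graph defined later on this page, where each vertex attribute is a (name, role) pair:

// Collect each vertex's out-neighbors together with their attributes
val neighbors: VertexRDD[Array[(VertexId, (String, String))]] =
  graph.collectNeighbors(EdgeDirection.Out)

neighbors
  .mapValues(ns => ns.map { case (id, (name, _)) => s"$id:$name" }.mkString(", "))
  .collect()
  .foreach(println)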

def connectedComponents(maxIterations: Int): Graph[VertexId, ED]

Compute the connected component membership of each vertex and return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.

def connectedComponents(): Graph[VertexId, ED]

Compute the connected component membership of each vertex and return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.
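
A minimal sketch using the parameterless overload on the example graph from the end of this page:

// Label every vertex with the smallest vertex id in its connected component
val cc: Graph[VertexId, String] = graph.connectedComponents()

// Group vertex ids by component label
cc.vertices
  .map { case (id, component) => (component, id) }
  .groupByKey()
  .collect()
  .foreach { case (component, members) =>
    println(s"component $component: ${members.mkString(", ")}")
  }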

def convertToCanonicalEdges(mergeFunc: (ED, ED) ⇒ ED = (e1, e2) => e1): Graph[VD, ED]

Convert bi-directional edges into uni-directional ones.
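
A sketch of the call, assuming the example graph below; the merge function decides which attribute survives when two opposite-direction edges collapse into one:

// Canonicalize edges so that srcId < dstId, keeping the first attribute on a merge
val canonical: Graph[(String, String), String] =
  graph.convertToCanonicalEdges((a, b) => a)

canonical.edges.collect().foreach(println)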

lazy val degrees: VertexRDD[Int]

The degree of each vertex in the graph.
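
inDegrees and outDegrees below work the same way; a one-line sketch on the example graph:

// Total degree (in + out) of every vertex
graph.degrees.collect().foreach { case (id, deg) => println(s"vertex $id has degree $deg") }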

def filter[VD2, ED2](preprocess: (Graph[VD, ED]) ⇒ Graph[VD2, ED2], epred: (EdgeTriplet[VD2, ED2]) ⇒ Boolean = (x: EdgeTriplet[VD2, ED2]) => true, vpred: (VertexId, VD2) ⇒ Boolean = (v: VertexId, d: VD2) => true)(implicit arg0: ClassTag[VD2], arg1: ClassTag[ED2]): Graph[VD, ED]

Filter the graph by computing some values to filter on, and applying the predicates.
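
For instance, the following sketch (patterned after the Spark API documentation, using the example graph below) keeps only vertices that have at least one outgoing edge:

// Derive a graph whose vertex attribute is the out-degree, then filter on it;
// the returned graph keeps the original (name, role) vertex attributes
val activeOnly: Graph[(String, String), String] = graph.filter(
  g => {
    val outDeg: VertexRDD[Int] = g.outDegrees
    g.outerJoinVertices(outDeg) { (vid, attr, deg) => deg.getOrElse(0) }
  },
  vpred = (vid: VertexId, deg: Int) => deg > 0
)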

lazy val inDegrees: VertexRDD[Int]

The in-degree of each vertex in the graph.

def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) ⇒ VD)(implicit arg0: ClassTag[U]): Graph[VD, ED]

Join the vertices with an RDD and then apply a function from the vertex and RDD entry to a new vertex value.
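
A sketch with a hypothetical lookup table of office locations, joined onto the example graph below; vertices without a match keep their original attribute:

val offices: RDD[(VertexId, String)] =
  sc.parallelize(Array((3L, "Bldg A"), (7L, "Bldg B")))

// Append the office to the role; the vertex attribute type must stay (String, String)
val withOffice: Graph[(String, String), String] =
  graph.joinVertices(offices)((id, attr, office) => (attr._1, attr._2 + " @ " + office))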

lazy val numEdges: Long

The number of edges in the graph.

lazy val numVertices: Long

The number of vertices in the graph.

lazy val outDegrees: VertexRDD[Int]

The out-degree of each vertex in the graph.

def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]

Run a dynamic version of PageRank returning a graph with vertex attributes containing the PageRank and edge attributes containing the normalized edge weight.
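
A short sketch on the example graph, iterating until the per-vertex change drops below the tolerance:

// Dynamic PageRank with tolerance 0.0001 and the default reset probability 0.15
val ranks: VertexRDD[Double] = graph.pageRank(0.0001).vertices

ranks.sortBy(_._2, ascending = false).collect().foreach(println)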

def personalizedPageRank(src: VertexId, tol: Double, resetProb: Double = 0.15): Graph[Double, Double]

Run personalized PageRank for a given vertex, such that all random walks are started relative to the source node.
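
A sketch that personalizes the ranking to vertex 5L of the example graph, so random walks restart at that source:

val personalized: Graph[Double, Double] = graph.personalizedPageRank(5L, 0.0001)
personalized.vertices.collect().foreach(println)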

def pickRandomVertex(): VertexId

Picks a random vertex from the graph and returns its ID.

def pregel[A](initialMsg: A, maxIterations: Int = Int.MaxValue, activeDirection: EdgeDirection = EdgeDirection.Either)(vprog: (VertexId, VD, A) ⇒ VD, sendMsg: (EdgeTriplet[VD, ED]) ⇒ Iterator[(VertexId, A)], mergeMsg: (A, A) ⇒ A)(implicit arg0: ClassTag[A]): Graph[VD, ED]

Execute a Pregel-like iterative vertex-parallel abstraction.
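
As an illustration, a hop-count single-source shortest paths computation, adapted from the Spark programming guide and assuming the example graph below:

// Initialize distances: 0 at the source, infinity elsewhere
val source: VertexId = 5L
val distances = graph
  .mapVertices((id, _) => if (id == source) 0.0 else Double.PositiveInfinity)
  .pregel(Double.PositiveInfinity)(
    (id, dist, newDist) => math.min(dist, newDist),   // vertex program
    triplet =>                                         // send message along each edge
      if (triplet.srcAttr + 1.0 < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + 1.0))
      else
        Iterator.empty,
    (a, b) => math.min(a, b)                           // merge messages
  )

distances.vertices.collect().foreach(println)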

def removeSelfEdges(): Graph[VD, ED]

Remove self edges.

def staticPageRank(numIter: Int, resetProb: Double = 0.15): Graph[Double, Double]

Run PageRank for a fixed number of iterations returning a graph with vertex attributes containing the PageRank and edge attributes the normalized edge weight.

def staticParallelPersonalizedPageRank(sources: Array[VertexId], numIter: Int, resetProb: Double = 0.15): Graph[Vector, Double]

Run parallel personalized PageRank for a given array of source vertices, such that all random walks are started relative to the source vertices.

def staticPersonalizedPageRank(src: VertexId, numIter: Int, resetProb: Double = 0.15): Graph[Double, Double]

Run personalized PageRank for a fixed number of iterations, with all iterations originating at the source node, returning a graph with vertex attributes containing the PageRank and edge attributes the normalized edge weight.

def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]

Compute the strongly connected component (SCC) of each vertex and return a graph with the vertex value containing the lowest vertex id in the SCC containing that vertex.

def triangleCount(): Graph[Int, ED]

Compute the number of triangles passing through each vertex.
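
A one-line sketch on the example graph, which should contain a single triangle among vertices 3, 5, and 7:

graph.triangleCount().vertices.collect().foreach { case (id, n) => println(s"vertex $id is in $n triangle(s)") }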

Example:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))
// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)

// Wrap the graph in GraphOps explicitly; the same methods are also available
// directly on Graph through an implicit conversion
val graph_operation = new GraphOps(graph)
val direction: EdgeDirection = EdgeDirection.Out
graph_operation.collectNeighborIds(direction).take(5).foreach(println)

/*
Output:
(4,[J@646df199)
(0,[J@4641c01)
(2,[J@1f267828)
(3,[J@4a55a32)
(7,[J@fcf9206)
graph_operation: org.apache.spark.graphx.GraphOps[(String, String),String] = org.apache.spark.graphx.GraphOps@2bef03ea
direction: org.apache.spark.graphx.EdgeDirection = EdgeDirection.Out

*/
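
The [J@... entries are just the default toString of the underlying Array[VertexId]. A small variation (not part of the original listing) prints the actual neighbor ids:

graph_operation.collectNeighborIds(direction)
  .mapValues(_.mkString("[", ",", "]"))
  .take(5)
  .foreach(println)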

Reference:

https://github.com/apache/spark/blob/v2.4.5/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala