Tree and Graph Traversal with and without Spark Graphx

Tree data structure

Tree and linked list are basic data structure concept taught in computer science class. Tree traversal (also known as walking the tree) is a form of graph traversal and refers to the process of visiting (checking and/or updating) each node in a tree data structure, exactly once per node. Such traversals are classified by the order in which the nodes are visited.

There are 2 different strategies to traverse a tree:

· Depth first traversal: In simple term, walk the tree vertically

· Breadth first traversal: In simple term, walk the tree horizontally

Depth first traversal, there are following strategies:

· Pre-order

o Access the data part of the current node.

o Traverse the left subtree by recursively calling the pre-order function.

o Traverse the right subtree by recursively calling the pre-order function.

· In-order

o Traverse the left subtree by recursively calling the in-order function.

o Access the data part of the current node.

o Traverse the right subtree by recursively calling the in-order function.

· Post-order

o Traverse the left subtree by recursively calling the post-order function.

o Traverse the right subtree by recursively calling the post-order function.

o Access the data part of the current node.

Breadth first traversal:

· Traverse the tree in level-order, visit every node on a level before going to a lower level. This search is referred to as breadth-first search (BFS), as the search tree is broadened as much as possible on each depth before going to the next depth.

Tree traversal Implementation

Tree traversal can be implemented in any programming language of choices. Python appears to be easiest language when coming to implement, because its dynamic typing, or in simple term, you do not need to worry about the data type, because a node in a tree normally is an instance of a class. To start, following is the tree traversal implementation in Python

Python implementation

Dynamic typing vs static typing

Programming language with dynamic typing means the compiler or interpreter takes care the data type of a variable, programmer does not need to specify datatype. Python is such language. Partly because in Python, everything is object or subclass of object class.

Programming language with static typing means compiler or interpreter only enforce the data type and will give you an error if data type mismatches. Programmer is responsible for making sure proper data type is used to declare a variable. C++, Java and Scala are such language

It is easier to write tree traversal program in Python because of, in my view, the advantage of no data type needed to define tree node class. If you do it in other program language, you will need to take care the typing before you can build a tree.

Scala Implementation

Following is my attempt to write the same tree traversal in Scala. Why Scala? Because it is the language natively supported by Apache Spark Graphx, it is essentially Java code after compiling to java byte code.

Since I already have code written in Python, I just port it to Scala

Scalability

In theory, the above tree traversal codes in Python or Scala do what they were supposed to for a tree that has 7 nodes including root and leaves. Are they scalable? For a tree that has millions or billions of nodes? Of cause not. Unless the codes are written to cut the tree into pieces and let multiple computer (worker nodes) to process each part of the tree and produce the final traversal result.

How to write tree traversal code that is scalable, it is actually easy when writing tree traversal code to run on Apache Spark Graphx

Graph traversal with Apache Spark Graphx Pregel

What is the difference between a tree and a graph?

Tree is a graph, except tree does not have loops while regular graph does.

To construct the same tree in Apache Spark Graphx:

The above graph is a tree, because there is no loop. The earlier standalone Python and Scala programs can only traverse tree, i.e., a graph that does not have loop. If a graph has loop, for example, in social media, “Jack is a friend of Join, Join is a friend of Robert, Robert is a friend of Jack”, then this would be a loop. If a graph has a loop, the earlier traversal strategy will be end up recurse endlessly and will eventually fail due to out of memory in the stack.

The code that runs on Apache Spark Graphx has taken care of the case if the Graph has loop and will not loop indefinitely, by using marking strategy, meaning it will skip the node that has been visited earlier.

Let’s modify the tree to make it a graph with loop, by adding an edge Edge(7L,1L,8)

Summary:

Tree and Graph traversal on Spark Graphx is scalable. Because the traversal is running on the Spark cluster with multiple worker nodes. In fact, the traversal code is running under Google’s Pregel framework, which is designed to run large graph computing, for Pregel, you can see my prior writing:

https://www.linkedin.com/posts/dr-george-jen-257a7626_bulk-synchronous-parallel-google-pregel-spark-activity-6654510643561013248-G3Zi

As always, codes used in this writing are in my GitHub repo.

https://github.com/geyungjen/jentekllc

Thank you for your time viewing this writing.

Last updated

Was this helpful?