Working with Key-Value Pairs

While most Spark operations work on RDDs containing any type of object, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by a key.

In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.
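As a rough illustration of the grouping case, the sketch below builds an RDD of tuples and calls groupByKey and mapValues on it, both of which come from PairRDDFunctions. The object name PairOpsSketch, the helper groupByInitial, and the sample words are hypothetical, and the helper assumes every word is non-empty.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object PairOpsSketch {
  // Groups words by their first character.
  // Assumes every word is non-empty (w.head fails on an empty string).
  def groupByInitial(sc: SparkContext, words: Seq[String]): Map[Char, List[String]] = {
    // Writing (w.head, w) creates a Tuple2, so this is an RDD of key-value pairs.
    val pairs: RDD[(Char, String)] = sc.parallelize(words).map(w => (w.head, w))

    // groupByKey and mapValues are not defined on RDD itself; they become
    // available because an RDD of tuples is wrapped in PairRDDFunctions.
    pairs
      .groupByKey()               // RDD[(Char, Iterable[String])]
      .mapValues(_.toList.sorted) // sort each group so the result is stable
      .collectAsMap()             // bring the result back to the driver
      .toMap
  }
}
```

Called with an existing SparkContext, PairOpsSketch.groupByInitial(sc, Seq("apple", "banana", "avocado")) would group "apple" and "avocado" under 'a' and "banana" under 'b'.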

For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each element occurs in a collection:

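// sc is an existing SparkContext
val rdd = sc.parallelize(Seq("a", "b", "x", "a"))
// Pair each element with an initial count of 1
val pairs = rdd.map(s => (s, 1))
// Sum the counts for each key
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect().foreach(println)
/*
Output (order may vary):
(x,1)
(b,1)
(a,2)
*/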
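The pairs come back from collect in no particular order. If a stable order is wanted, one option is to call sortByKey before collecting. The following is a minimal sketch of that idea; the object name SortedCounts and the helper sortedCounts are hypothetical and not part of Spark.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper object, for illustration only.
object SortedCounts {
  // Counts occurrences of each element and returns the pairs sorted by key.
  def sortedCounts(sc: SparkContext, items: Seq[String]): Array[(String, Int)] = {
    sc.parallelize(items)
      .map(s => (s, 1))   // pair each element with a count of 1
      .reduceByKey(_ + _) // sum the counts per key
      .sortByKey()        // sort pairs by key, ascending by default
      .collect()          // bring the sorted pairs back to the driver
  }
}
```

With the same input as above, SortedCounts.sortedCounts(sc, Seq("a", "b", "x", "a")) would return the pairs in key order: (a,2), (b,1), (x,1).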
