Working with Key-Value Pairs
While most Spark operations work on RDDs containing any type of object, a few special operations are only available on RDDs of key-value pairs. The most common ones group or aggregate the elements by key.
In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each element occurs in a small in-memory collection:
val rdd = sc.parallelize(Seq("a", "b", "x", "a"))
val pairs = rdd.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect.foreach(println)
/*
(x,1)
(b,1)
(a,2)
*/
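The snippet above relies on an `sc` that already exists, as it does in spark-shell. As a rough sketch of how the same idea might look in a standalone program, the block below creates its own local SparkContext and shows both aggregating (reduceByKey) and grouping (groupByKey) by key. The object name `PairRddSketch`, the helper names `countOccurrences` and `groupOnes`, and the `local[*]` master are assumptions made for illustration and are not part of the original text. It also assumes a Spark version (1.3 or later) where the pair-RDD implicits apply without an extra import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names throughout; a sketch, not a reference implementation.
object PairRddSketch {

  // Aggregating by key: count how often each element occurs.
  // sortByKey() is added only to make the collected order deterministic.
  def countOccurrences(sc: SparkContext, items: Seq[String]): Array[(String, Int)] =
    sc.parallelize(items)
      .map(s => (s, 1))        // RDD[(String, Int)], so pair operations apply
      .reduceByKey(_ + _)      // sum the 1s for each key
      .sortByKey()
      .collect()

  // Grouping by key: collect every value that shares a key.
  def groupOnes(sc: SparkContext, items: Seq[String]): Array[(String, List[Int])] =
    sc.parallelize(items)
      .map(s => (s, 1))
      .groupByKey()            // RDD[(String, Iterable[Int])]
      .mapValues(_.toList)
      .sortByKey()
      .collect()

  def main(args: Array[String]): Unit = {
    // Assumed local master for experimentation; a cluster would use its own master URL.
    val conf = new SparkConf().setAppName("PairRddSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val items = Seq("a", "b", "x", "a")
      countOccurrences(sc, items).foreach(println)
      // (a,2)
      // (b,1)
      // (x,1)
      groupOnes(sc, items).foreach(println)
      // (a,List(1, 1))
      // (b,List(1))
      // (x,List(1))
    } finally {
      sc.stop()
    }
  }
}
```

For a plain count, reduceByKey is usually the better fit, since it combines values within each partition before shuffling, whereas groupByKey moves every value across the network first.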