Example: Enrich JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.
One of the core JSON structures is a collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
JSON is one of the major data representations for both humans and machines. Social media companies send you streaming data when you request it through their APIs; for example, Twitter can stream tweets to you in JSON.
After all, though, JSON is just a text string, one that has to follow a given syntax for a machine to parse it. We humans can read a JSON document even when it is syntactically incorrect.
The use case I would like to demonstrate is adding information to a given JSON string and producing a new JSON string that contains the additional information, i.e., enriching a JSON record.
Following is record 1:
{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}
Following is record 2:
{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}
Following is the dimension definition:
"i1” is "Nike Shoes"
"i2" is "Umbrella“
"i3" is "Jeans“
Now define the enrichment task: using the dimension defined earlier, add a "name" attribute to each product, turning record 1:
{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}
Into:
{
"visitorId": "v1",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.68
}, {
"id": "i2",
"name": "Umbrella",
"interest": 0.42
}]
}
Also, enrich record 2 the same way:
{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}
Into:
{
"visitorId": "v2",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.78
}, {
"id": "i3",
"name": "Jeans",
"interest": 0.11
}]
}
I am going to use Scala for the JSON enrichment task. Why Scala? Two reasons:
Scala has a wealth of JSON libraries that can parse and extract JSON strings easily.
Apache Spark has its own library and methods to read and parse JSON through Spark SQL.
I will use two approaches to accomplish the JSON enrichment task defined earlier:
An approach that does not use Apache Spark
An approach that uses Apache Spark SQL
Here is the non-Spark approach, using the json4s library.
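Below is a minimal sketch of what this approach could look like, assuming the json4s-native library is on the classpath and representing the dimension as a plain Scala Map; the object and method names here are illustrative rather than a definitive implementation:

import org.json4s._
import org.json4s.native.JsonMethods._

object EnrichJson {

  // Dimension lookup from the definition above: product id -> product name
  val dimension = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")

  // Parse the record, add a "name" field next to each product's "id",
  // and serialize the result back to a compact JSON string
  def enrich(record: String): String = {
    val enriched = parse(record) transformField {
      case ("products", JArray(products)) =>
        ("products", JArray(products.map {
          case JObject(fields) =>
            val name = fields.collectFirst { case ("id", JString(id)) => id }
              .flatMap(dimension.get)
              .map(n => JField("name", JString(n)))
            // insert the looked-up name right after the existing "id" field
            JObject(fields.flatMap {
              case f @ ("id", _) => f :: name.toList
              case f             => List(f)
            })
          case other => other
        }))
    }
    compact(render(enriched))
  }

  def main(args: Array[String]): Unit = {
    val record1 = """{"visitorId":"v1","products":[{"id":"i1","interest":0.68},{"id":"i2","interest":0.42}]}"""
    val record2 = """{"visitorId":"v2","products":[{"id":"i1","interest":0.78},{"id":"i3","interest":0.11}]}"""
    println(enrich(record1))
    println(enrich(record2))
  }
}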
Running the prior Scala code will produce the output below:
{"visitorId":"v1","products":[{"id":"i1","name":"Nike Shoes","interest":0.68},{"id":"i2","name":"Umbrella","interest":0.42}]}
{"visitorId":"v2","products":[{"id":"i1","name":"Nike Shoes","interest":0.78},{"id":"i3","name":"Jeans","interest":0.11}]}
Next, I will demonstrate how to do the same enrichment on the same JSON records, and produce the same output, using Apache Spark SQL.
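Below is a minimal sketch of what such a Spark SQL enrichment could look like, assuming a local SparkSession and representing the dimension as a small DataFrame; the column and object names are illustrative. The idea is to explode the products array, join each product against the dimension, rebuild the array per visitor, and serialize each row back to a JSON string:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EnrichJsonSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EnrichJson")
      .master("local[*]") // local mode for illustration
      .getOrCreate()
    import spark.implicits._

    // The two JSON records and the dimension defined earlier
    val records = Seq(
      """{"visitorId":"v1","products":[{"id":"i1","interest":0.68},{"id":"i2","interest":0.42}]}""",
      """{"visitorId":"v2","products":[{"id":"i1","interest":0.78},{"id":"i3","interest":0.11}]}"""
    )
    val dimension = Seq(("i1", "Nike Shoes"), ("i2", "Umbrella"), ("i3", "Jeans"))
      .toDF("id", "name")

    // Let Spark SQL infer the schema of the JSON records
    val df = spark.read.json(records.toDS())

    // Explode the products array, look up each product's name in the dimension,
    // rebuild the array per visitor, and serialize each row back to JSON
    val enriched = df
      .select($"visitorId", explode($"products").as("product"))
      .join(dimension, $"product.id" === dimension("id"), "left")
      .groupBy($"visitorId")
      .agg(collect_list(
        struct($"product.id".as("id"), $"name", $"product.interest".as("interest"))
      ).as("products"))
      .select(to_json(struct($"visitorId", $"products")).as("enriched"))

    enriched.collect().foreach(row => println(row.getString(0)))

    spark.stop()
  }
}

Note that the output below contains name-only entries such as {"name":"Jeans"}, which suggests the original run kept unmatched dimension rows (an outer join); the left join in this sketch would keep only the products that actually appear in each record.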
Running the Spark code produces the output below:
Enriched Json String is:
{"visitorId":"v1","products":[{"name":"Jeans"},{"id":"i1","name":"Nike Shoes","interest":0.68},{"id":"i2","name":"Umbrella","interest":0.42}]}
Original Json String is:
{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}
Enriched Json String is:
{"visitorId":"v2","products":[{"id":"i3","name":"Jeans","interest":0.11},{"id":"i1","name":"Nike Shoes","interest":0.78},{"name":"Umbrella"}]}
Both approaches, Scala only without Spark and Scala with Spark, produce the same result. Which method would I recommend? In a production environment with a large number of JSON records, think millions or billions of records to be processed, Apache Spark is the way to go.
A Scala program by itself is no different from a Java program; in fact, they are effectively the same, because both are compiled into Java bytecode and run on the JVM. A plain Scala program is a monolithic program without parallelism unless you code it that way. Writing a Scala program with Apache Spark takes advantage of Spark's distributed computing framework and the rich library of Spark SQL, which handles a task such as JSON enrichment with a few SQL queries executed by Spark in parallel across its worker nodes.