Interoperating with RDDs
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
// Define a case class; Spark infers the schema from its fields via reflection
case class Person(name: String, age: Long)
// Create an RDD of Person objects from a text file, then convert it to a Dataset
val people = sc
  .textFile("file:///home/dv6/spark/spark/examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toLong))
  .toDS()
people.show()
/*
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
*/

The second method for creating Datasets is a programmatic interface that lets you construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to build Datasets when the columns and their types are not known until runtime, for example when the schema is read from a configuration file or supplied by the user.
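As a minimal sketch of that programmatic approach, assuming the same people.txt file and an active SparkSession named spark, the schema is assembled from StructField objects at runtime and applied to an RDD of Row values with createDataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Build the schema at runtime from a string of column names
val schemaString = "name age"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert the raw lines to an RDD of Row objects matching the schema
val rowRDD = sc
  .textFile("file:///home/dv6/spark/spark/examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))
// Apply the schema to the RDD of Rows
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()

Because every column here is declared as StringType, the age values remain strings; with the reflection approach above, the case class gives age a numeric type automatically.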