You can also manually specify the data source that will be used, along with any extra options you would like to pass to it. Data sources are specified by their fully qualified name (e.g., org.apache.spark.sql.parquet), but for built-in sources you can also use their short names (json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data source type can be converted into other types using this syntax.
val peopleDF = spark.read.format("json")
.load("file:///home/dv6/spark/spark/examples/src/main/resources/people.json")
peopleDF.select("name", "age")
.write.format("parquet").save("file:///tmp/namesAndAges.parquet")
The extra options are also used during write operations. For example, you can control bloom filters and dictionary encodings for ORC data sources. The following ORC example will create a bloom filter on favorite_color and use dictionary encoding for name and favorite_color. For Parquet, a comparable option, parquet.enable.dictionary, exists as well. For more detailed information about the extra ORC/Parquet options, visit the official Apache ORC/Parquet websites.
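Such a write might look like the following sketch. Here usersDF and the output path are hypothetical, and the option keys (orc.bloom.filter.columns, orc.dictionary.key.threshold) are Apache ORC configuration properties, so check the ORC documentation for the exact names and values supported by your version.

// Hedged sketch: usersDF and the output path are hypothetical placeholders.
// The option keys are Apache ORC writer configuration properties passed through
// as extra options on the write.
usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")  // build a bloom filter on favorite_color
  .option("orc.dictionary.key.threshold", "1.0")         // allow dictionary encoding for eligible columns
  .save("file:///tmp/users_with_options.orc")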
On each line, the first field is the label (target); the subsequent fields are column position:value pairs, where each value is non-zero. Columns whose value is zero are simply omitted from the line.
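For illustration, a line such as 1 3:4.5 10:2.0 would mean the label is 1, column 3 holds 4.5, column 10 holds 2.0, and every other column is zero. This sparse layout is what the built-in libsvm source listed above expects; a minimal read might look like the following sketch, where the file path is hypothetical.

// Hedged sketch: the input path is a hypothetical placeholder.
// The libsvm source produces a DataFrame with "label" and "features" columns,
// where "features" is a sparse vector assembled from the index:value pairs.
val libsvmDF = spark.read.format("libsvm")
  .load("file:///tmp/sample_libsvm_data.txt")
libsvmDF.show(5)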