Image Data Source

Parquet, CSV, JSON, JDBC and images (JPG and PNG)

The image data source loads image files from a directory. It decodes compressed images (JPEG, PNG, etc.) into a raw image representation using ImageIO from the Java library. The loaded DataFrame has a single StructType column, “image”, which holds the image data in the image schema.
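To make the decoding step concrete, here is a minimal sketch of turning compressed bytes into raw row-wise BGR bytes with ImageIO. It is not Spark's implementation: `RawImage`, `DecodeSketch`, `decode`, and `samplePng` are hypothetical names introduced for illustration, and the sketch assumes every image is emitted as 3 channels, ignoring grayscale and alpha handling.

```scala
import java.awt.image.BufferedImage
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import javax.imageio.ImageIO

// Hypothetical container for illustration; not a Spark class.
final case class RawImage(height: Int, width: Int, nChannels: Int, data: Array[Byte])

object DecodeSketch {
  // Decode compressed bytes (PNG, JPEG, ...) into row-wise BGR bytes.
  // Returns None when ImageIO finds no reader for the input.
  // Assumption: output is always 3 channels (no grayscale or alpha path).
  def decode(bytes: Array[Byte]): Option[RawImage] = {
    val img = ImageIO.read(new ByteArrayInputStream(bytes)) // null if undecodable
    Option(img).map { im =>
      val h = im.getHeight
      val w = im.getWidth
      val data = new Array[Byte](h * w * 3)
      var offset = 0
      for (y <- 0 until h; x <- 0 until w) {
        val rgb = im.getRGB(x, y)                      // packed 0xAARRGGBB
        data(offset) = (rgb & 0xFF).toByte             // blue
        data(offset + 1) = ((rgb >> 8) & 0xFF).toByte  // green
        data(offset + 2) = ((rgb >> 16) & 0xFF).toByte // red
        offset += 3
      }
      RawImage(h, w, 3, data)
    }
  }

  // Build a tiny 2x1 PNG in memory (one red pixel, one blue pixel) to try decode on.
  def samplePng(): Array[Byte] = {
    val img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB)
    img.setRGB(0, 0, 0xFF0000) // red
    img.setRGB(1, 0, 0x0000FF) // blue
    val out = new ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    out.toByteArray
  }
}
```

Calling `DecodeSketch.decode(DecodeSketch.samplePng())` yields a 1 x 2 image whose bytes start with the blue channel of the first pixel. Input that ImageIO cannot decode yields `None`, which loosely mirrors the idea of treating such files as invalid images.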

The schema of the image column is:

origin: StringType (the file path of the image)
height: IntegerType (height of the image)
width: IntegerType (width of the image)
nChannels: IntegerType (number of image channels)
mode: IntegerType (OpenCV-compatible type)
data: BinaryType (image bytes in OpenCV-compatible order: row-wise BGR in most cases)

// Load the images in the directory; dropInvalid drops files that are not valid images.
val df = spark.read.format("image").option("dropInvalid", true).load("/home/dv6/spark/spark/data/mllib/images/origin/kittens")

// Show each image's source path and dimensions without truncating long paths.
df.select("image.origin", "image.width", "image.height").show(truncate = false)

/*
Output:

+-------------------------------------------------------------------------------------+-----+------+
|origin                                                                               |width|height|
+-------------------------------------------------------------------------------------+-----+------+
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/54893.jpg              |300  |311   |
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/DP802813.jpg           |199  |313   |
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg|300  |200   |
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/DP153539.jpg           |300  |296   |
+-------------------------------------------------------------------------------------+-----+------+
*/
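
Once an image is loaded, the data field can be indexed using the width and nChannels fields. The following is a minimal sketch of that arithmetic in plain Scala, assuming the bytes are tightly packed row by row with interleaved channels and no row padding. `PixelAccess`, `offset`, and `pixel` are hypothetical helpers for illustration, not part of Spark.

```scala
object PixelAccess {
  // Index of one channel value for the pixel at (row, col).
  // Assumption: row-wise, interleaved channels, no padding between rows.
  def offset(row: Int, col: Int, channel: Int, width: Int, nChannels: Int): Int =
    (row * width + col) * nChannels + channel

  // Read one pixel as unsigned values in stored channel order
  // (B, G, R for a typical 3-channel image).
  def pixel(data: Array[Byte], row: Int, col: Int, width: Int, nChannels: Int): Seq[Int] =
    (0 until nChannels).map(c => data(offset(row, col, c, width, nChannels)) & 0xFF)
}

// Usage: a 2 x 2 image with 3 channels whose bytes are 0, 1, ..., 11.
val sample: Array[Byte] = Array.tabulate(12)(_.toByte)
val bgr = PixelAccess.pixel(sample, row = 1, col = 0, width = 2, nChannels = 3)
// bgr holds the three channel values stored at indices 6, 7 and 8
```

The `& 0xFF` mask matters because JVM bytes are signed; without it, channel values above 127 would appear negative.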
