Image Data Source

The image data source loads image files from a directory. It decodes compressed images (JPEG, PNG, etc.) into a raw image representation using ImageIO from the Java library. The loaded DataFrame has a single StructType column, "image", which holds the image data in the image schema.
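
To make the decoding step concrete, here is a minimal sketch of turning compressed bytes into a raw pixel array with `javax.imageio.ImageIO`. It is not Spark's internal implementation: the names `DecodeSketch`, `RawImage`, `decodeToBgr`, and `samplePng` are hypothetical, and the sketch assumes the simple 3-channel case only.

```scala
import java.awt.image.BufferedImage
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import javax.imageio.ImageIO

object DecodeSketch {
  // Hypothetical container for this sketch; not a Spark class.
  final case class RawImage(height: Int, width: Int, nChannels: Int, data: Array[Byte])

  // Decode compressed bytes (PNG, JPEG, ...) into row-wise BGR bytes.
  // Returns None when ImageIO finds no reader that can decode the input.
  def decodeToBgr(compressed: Array[Byte]): Option[RawImage] = {
    val img = ImageIO.read(new ByteArrayInputStream(compressed))
    if (img == null) None
    else {
      val h = img.getHeight
      val w = img.getWidth
      val data = new Array[Byte](h * w * 3)
      var offset = 0
      for (y <- 0 until h; x <- 0 until w) {
        val rgb = img.getRGB(x, y)                     // packed as 0xAARRGGBB
        data(offset) = (rgb & 0xFF).toByte             // blue
        data(offset + 1) = ((rgb >> 8) & 0xFF).toByte  // green
        data(offset + 2) = ((rgb >> 16) & 0xFF).toByte // red
        offset += 3
      }
      Some(RawImage(h, w, 3, data))
    }
  }

  // Build a tiny 2x1 PNG in memory so the sketch needs no files on disk.
  def samplePng(): Array[Byte] = {
    val img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB)
    img.setRGB(0, 0, 0xFF0000) // red pixel
    img.setRGB(1, 0, 0x0000FF) // blue pixel
    val out = new ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    out.toByteArray
  }
}
```

Calling `DecodeSketch.decodeToBgr(DecodeSketch.samplePng())` yields a 1x2 image whose bytes list blue first for every pixel, which is the ordering the image schema describes.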

The schema of the image column is:

origin: StringType (the file path of the image)
height: IntegerType (height of the image)
width: IntegerType (width of the image)
nChannels: IntegerType (number of image channels)
mode: IntegerType (OpenCV-compatible type)
data: BinaryType (image bytes in OpenCV-compatible order: row-wise BGR in most cases)
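
Given these fields, locating a pixel inside `data` is simple arithmetic. The sketch below assumes 8-bit channels stored row by row with the channels of each pixel interleaved. `ImageLayout`, `channelsForMode`, `pixelOffset`, and `pixelAt` are hypothetical helpers, not part of the Spark API; the mode constants are the standard OpenCV type codes for 8-bit unsigned images.

```scala
object ImageLayout {
  // OpenCV type codes for 8-bit unsigned images.
  final val CV_8UC1 = 0  // 1 channel (grayscale)
  final val CV_8UC3 = 16 // 3 channels (BGR)
  final val CV_8UC4 = 24 // 4 channels

  // Map a mode value to its channel count; None for codes this sketch does not know.
  def channelsForMode(mode: Int): Option[Int] = mode match {
    case CV_8UC1 => Some(1)
    case CV_8UC3 => Some(3)
    case CV_8UC4 => Some(4)
    case _       => None
  }

  // Index of the first byte of the pixel at (row, col) in a row-wise, interleaved layout.
  def pixelOffset(row: Int, col: Int, width: Int, nChannels: Int): Int =
    (row * width + col) * nChannels

  // Channel values of one pixel as unsigned integers in 0..255 (B, G, R for a BGR image).
  def pixelAt(data: Array[Byte], row: Int, col: Int, width: Int, nChannels: Int): Seq[Int] = {
    val start = pixelOffset(row, col, width, nChannels)
    data.slice(start, start + nChannels).map(_ & 0xFF).toSeq
  }
}
```

For a 3-channel image of width 2, the pixel at row 1, column 0 starts at byte (1 * 2 + 0) * 3 = 6, so `pixelAt` returns bytes 6, 7, and 8 of `data`.
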
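// Load every image under the directory; dropInvalid skips files that are not valid images.
val df = spark.read.format("image").option("dropInvalid", true).load("/home/dv6/spark/spark/data/mllib/images/origin/kittens")
// Show each image's source path and dimensions without truncating long paths.
df.select("image.origin", "image.width", "image.height").show(truncate=false)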
/*
Output:
+-------------------------------------------------------------------------------------+-----+------+
|origin                                                                               |width|height|
+-------------------------------------------------------------------------------------+-----+------+
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/54893.jpg              |300  |311   |
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/DP802813.jpg           |199  |313   |
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg|300  |200   |
|file:///home/dv6/spark/spark/data/mllib/images/origin/kittens/DP153539.jpg           |300  |296   |
+-------------------------------------------------------------------------------------+-----+------+
*/