Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning

The following components are covered in this writing:

Apache Arrow

Apache Arrow Flight

Dremio server

Dremio Flight Connector

Apache Spark Machine Learning

Let’s go through each component in turn:

Apache Arrow:

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

Apache Arrow is a columnar in-memory data structure that allows applications to avoid unnecessary I/O and accelerate analytical processing performance on modern CPUs and GPUs.

Conventionally, this is how data is shared across different systems:

With obvious disadvantages:

Each system has its own internal memory format

70–80% of computation is wasted on serialization and deserialization

Similar functionality implemented in multiple projects

With Apache Arrow as common data structures:

With obvious advantages when it comes to data transport amongst different systems that use Arrow memory data structure framework:

All systems utilize the same memory format

No overhead for cross-system communication

Projects can share functionality (e.g., a Parquet-to-Arrow reader)
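As a small illustration of the shared memory format, below is a minimal pyarrow sketch (the column names and data are made up for the example): a Pandas dataframe is converted to an Arrow table and back, with no bespoke serialization format in between.

```python
import pandas as pd
import pyarrow as pa

# A small Pandas dataframe to play with (example data)
df = pd.DataFrame({"id": [1, 2, 3], "label": ["ham", "spam", "ham"]})

# Convert to an Arrow table: this columnar representation is the same
# structure that any Arrow-enabled system can consume directly
table = pa.Table.from_pandas(df)
print(table.schema)

# Convert back to Pandas; no cross-format serialization step is involved
df_roundtrip = table.to_pandas()
print(df_roundtrip)
```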

Apache Arrow Flight

A Framework for Fast Data Transport

Flight is initially focused on optimized transport of the Arrow columnar format (i.e. “Arrow record batches”) over gRPC, Google’s popular HTTP/2-based general-purpose RPC library and framework. While the initial focus is on integration with gRPC, Flight as a development framework is not intended to be exclusive to gRPC.

One of the biggest features that sets apart Flight from other data transport frameworks is parallel transfers, allowing data to be streamed to or from a cluster of servers simultaneously. This enables developers to more easily create scalable data services that can serve a growing client base.

A simple Flight setup might consist of a single server to which clients connect and make DoGet requests.

Flight Basics

The Arrow Flight libraries provide a development framework for implementing a service that can send and receive data streams. A Flight server supports several basic kinds of requests:

Handshake: a simple request to determine whether the client is authorized

ListFlights: return a list of available data streams

GetSchema: return the schema for a data stream

GetFlightInfo: return an “access plan” for a dataset of interest, possibly requiring consuming multiple data streams

DoGet: send a data stream to a client

DoPut: receive a data stream from a client

DoAction: perform an implementation-specific action and return any results

ListActions: return a list of available action types
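To make these request types concrete, here is a minimal client-side sketch using pyarrow’s flight module; the server location and query are placeholders for illustration, not from the original writing.

```python
import pyarrow.flight as flight

# Connect to a Flight server (hypothetical local endpoint)
client = flight.FlightClient("grpc+tcp://localhost:47470")

# For SQL-speaking Flight servers, the descriptor is commonly the query text
descriptor = flight.FlightDescriptor.for_command("SELECT 1")

# GetFlightInfo returns the endpoints (and tickets) where the data lives
info = client.get_flight_info(descriptor)

# DoGet streams Arrow record batches for each endpoint's ticket;
# with multiple endpoints, these streams can be consumed in parallel
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    table = reader.read_all()  # collect the stream into an Arrow table
    print(table.num_rows)
```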

Dremio

Dremio is the data lake engine

www.dremio.com

Dremio’s query engine is built on Apache Arrow, an in-memory columnar data structure. Its SQL engine allows you to use SQL to query structured data, such as relational database tables, as well as semi-structured data, such as key-value entities in JSON. It is a distributed, clustered, in-memory columnar query engine that can run on one node or many nodes.

Dremio Flight Connector

Dremio Flight Connector is an implementation of the Apache Arrow Flight framework that allows a client, such as a Java program or a Python script, to request data from a Dremio server using the Apache Arrow Flight protocol, which transports data in the Apache Arrow data structure.

This means you can get data from Dremio, or any system that uses Apache Arrow as its in-memory data format, without ODBC or JDBC, and without the serialization and deserialization overhead that comes with both.

Below is a demonstration of integrating the Dremio Data Lake Engine server, the Dremio Flight Connector, and Apache Spark machine learning for spam detection.

Setup:

In my last writing, I built the open source version of the Dremio server. I am going to leverage that already-built Dremio server for this writing.

Apache Spark Machine Learning with Dremio Data Lake Engine

The first task is to build the Dremio Flight Connector, which allows getting data from Dremio via Apache Arrow Flight, the Arrow in-memory data transport protocol.

(1). To start, git clone the Dremio Flight Connector from GitHub:
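For example (assuming the connector lives in the dremio-hub organization on GitHub):

```shell
git clone https://github.com/dremio-hub/dremio-flight-connector.git
cd dremio-flight-connector
```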

(2). Then follow the build instructions in the readme. Make sure your mvn version is up to date and simply run: mvn clean install (or, if you do not have a current or recent version of mvn: mvnw clean install).

(3). Once the build completes, get the jar file from the target folder inside the directory created by git, such as:

dremio-flight-connector-0.11.0-SNAPSHOT-shaded.jar

and copy it to <dremio home dir>/jars/

(4). Then modify <dremio home dir>/conf/dremio-env:
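The change is to pass the flight-enable flag to the Dremio server JVM. A sketch of the edit, assuming the flag goes through the server’s extra JVM options variable:

```shell
# In <dremio home dir>/conf/dremio-env
# Assumption: the flag is passed via the server's extra JVM options
DREMIO_JAVA_SERVER_EXTRA_OPTS='-Ddremio.flight.enabled=true'
```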

(5). Restart Dremio, assuming you are in <dremio home dir>:
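Assuming the standard Dremio service script:

```shell
cd <dremio home dir>
bin/dremio restart
```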

(6). To make sure the Dremio Flight connector is up and running:

(i) Check the java process to make sure “-Ddremio.flight.enabled=true” is in the Dremio command line

(ii) Check the Dremio server.log file, <dremio home dir>/log/server.log, and look for the following:

(iii) Lastly, make sure port 47470 (the flight connector) is being listened on by Dremio, for example by telnet:
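For example, from the Dremio host itself:

```shell
$ telnet localhost 47470
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
```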

Apache Spark as a client of the Dremio server:

When all of the above is done, you are in business to continue your data science work.

The dataset used in this writing is SMSSpamCollection, a text messaging dataset in which each SMS message is labeled as either spam or ham (not spam).

Following is the description from the provider of this dataset:

The SMS Spam Collection v.1 (hereafter the corpus) is a set of SMS tagged messages that have been collected for SMS Spam research. It contains one set of 5,574 SMS messages in English, tagged according to being ham (legitimate) or spam.

First, I placed the data file SMSSpamCollection into the HDFS folder /dremio.

The file is tab (\t) delimited.
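Each line is a label, a tab, then the message text; the rows below are made-up placeholders, not actual records from the corpus:

```
ham	Ok, see you at the library at five
spam	WINNER!! You have been selected for a cash prize, reply now
```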

In the Dremio UI, load and catalog this file as a data source so that it can be queried by SQL, as if it were a table.

Once the dataset file can be loaded by Dremio, it can be supplied to a client, such as a Python script, through the Dremio Flight Connector, without ODBC. The Python code does the following:

Imports pyarrow and the flight library

Connects to Dremio via the Arrow Flight connector, including authentication

Fetches data from the SMSSpamCollection file via SQL and loads it into a Pandas dataframe

Starts Apache Spark and converts the Pandas dataframe to a Spark SQL dataframe

Conducts machine learning model training and testing with Apache Spark ML: LogisticRegression and NaiveBayes

Following is the Python code that connects to Dremio with the Dremio Flight Connector and does machine learning with Apache Spark:
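The full script is in the GitHub repo; below is a minimal sketch of the same flow. The host, port, credentials, and the Dremio table path are assumptions for illustration, and the auth handler follows the generic pyarrow ClientAuthHandler pattern.

```python
import pyarrow.flight as flight
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer


class DremioClientAuthHandler(flight.ClientAuthHandler):
    """Basic username/password handshake against the Dremio Flight port."""

    def __init__(self, username, password):
        super().__init__()
        self.basic_auth = flight.BasicAuth(username, password)
        self.token = None

    def authenticate(self, outgoing, incoming):
        outgoing.write(self.basic_auth.serialize())
        self.token = incoming.read()

    def get_token(self):
        return self.token


# 1. Connect to the Dremio Flight Connector and authenticate
#    (host and credentials are placeholders)
client = flight.FlightClient("grpc+tcp://localhost:47470")
client.authenticate(DremioClientAuthHandler("username", "password"))

# 2. Run SQL against the cataloged file and load the result into Pandas
#    (the table path is hypothetical; use the path you cataloged in the UI)
sql = 'SELECT * FROM hdfs."dremio"."SMSSpamCollection"'
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql))
pdf = client.do_get(info.endpoints[0].ticket).read_pandas()
pdf.columns = ["label", "text"]  # assumption: two tab-delimited columns

# 3. Start Spark and convert the Pandas dataframe to a Spark dataframe
spark = SparkSession.builder.appName("dremio-flight-spam").getOrCreate()
sdf = spark.createDataFrame(pdf)
train, test = sdf.randomSplit([0.8, 0.2], seed=42)

# 4. Featurize the text: index the label, tokenize, hash to term
#    frequencies, then weight by inverse document frequency
indexer = StringIndexer(inputCol="label", outputCol="target")
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")

evaluator = MulticlassClassificationEvaluator(
    labelCol="target", predictionCol="prediction", metricName="accuracy")

# 5. Train and evaluate both classifiers on the held-out split
for clf in (LogisticRegression(labelCol="target", featuresCol="features"),
            NaiveBayes(labelCol="target", featuresCol="features")):
    model = Pipeline(stages=[indexer, tokenizer, tf, idf, clf]).fit(train)
    accuracy = evaluator.evaluate(model.transform(test))
    print(type(clf).__name__, "accuracy:", accuracy)
```

HashingTF plus IDF is one common Spark ML featurization for text; the original writing may have used a different feature pipeline, but the Flight fetch and the Pandas-to-Spark handoff are the parts this demonstration is about.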

Summary:

Following are the model metrics:

In this exercise, LogisticRegression performs similarly to NaiveBayes. The objective of this writing is to demonstrate how to use the Dremio Flight Connector to get data from the Dremio server with the Apache Arrow Flight protocol, without ODBC or JDBC, and then run an application with Apache Spark, such as machine learning with Spark ML.

As always, code used in this writing is in my GitHub repo.
