Neural Network with Apache Spark Machine Learning Multilayer Perceptron Classifier

Biological Neuron vs. Digital Perceptron

Neuron

The perceptron is a mathematical model of a biological neuron. In an actual neuron, the dendrites receive electrical signals from the axons of other neurons.

In the perceptron, this is modeled by multiplying each input value by a coefficient called a weight, and sometimes adding another value called a bias. An actual neuron fires an output signal only when the total strength of the input signals exceeds a certain threshold.

Perceptron

A perceptron imitates this: the weighted sum of the inputs is calculated to represent the total strength of the input signals, and then a step function (also called an activation function) is applied to the sum to determine the output.
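As a minimal sketch in plain Scala (illustrative names and a threshold of 0, not part of Spark ML):

```scala
// Step (activation) function: fire 1.0 when the weighted sum reaches the threshold
def step(z: Double): Double = if (z >= 0.0) 1.0 else 0.0

// A single perceptron: weighted sum of inputs plus bias, passed through the step function
def perceptron(inputs: Array[Double], weights: Array[Double], bias: Double): Double = {
  val weightedSum = inputs.zip(weights).map { case (x, w) => x * w }.sum + bias
  step(weightedSum)
}

// Example: 1.0 * 0.5 + 0.0 * 0.5 - 0.3 = 0.2 >= 0, so the perceptron fires 1.0
perceptron(Array(1.0, 0.0), Array(0.5, 0.5), bias = -0.3)
```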

Multilayer perceptron classifier from Apache Spark Machine Learning

Multilayer perceptron classifier (MLPC) from Apache Spark ML is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network.

Note:

Each circle in the figure above represents a perceptron.

"Each layer is fully connected" means each neuron (perceptron) in one layer is connected to all neurons in the next layer.

"Feedforward artificial neural network" means signals move only forward through the network; there are no loops back.

Math representation of the neural network

Neurons in the input layer represent the input data. All other neurons map inputs to outputs by taking a linear combination of the inputs with the neuron's weights w and bias b and applying an activation (step) function. This can be written in matrix form for MLPC with K+1 layers as follows:
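Following the Spark MLlib documentation (here w_k and b_k are the weights and bias of layer k, and f_k its activation function):

$$
y(\mathbf{x}) = f_K(\ldots f_2(\mathbf{w}_2^T f_1(\mathbf{w}_1^T \mathbf{x} + b_1) + b_2) \ldots + b_K)
$$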

Activation/step functions:

Neurons in intermediate layers (hidden layers) use the sigmoid (logistic) function:
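$$
f(z_i) = \frac{1}{1 + e^{-z_i}}
$$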

Neurons in the output layer use the softmax function:
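$$
f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}
$$

where N is the number of classes (output neurons).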

Notes on the number of neurons/perceptrons

The number of neurons N in the output layer corresponds to the number of classes to be classified. The number of neurons in the first (input) layer must equal the number of features (columns).
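For a dataset with 784 features and 10 classes (the MNIST case used later in this article), a minimal sketch of a layer specification; the hidden layer sizes 128 and 64 are arbitrary illustrations:

```scala
// input layer: 784 neurons (one per feature); output layer: 10 neurons (one per class);
// the two hidden layers in between are free design choices
val layers = Array[Int](784, 128, 64, 10)
```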

Build a neural network with the Apache Spark Multilayer perceptron classifier (MLPC)

As part of the effort, I need a dataset to train and test the MLPC. The obvious choice is MNIST, a dataset of images of handwritten digits, and I am going to code in Scala.

Download the MNIST dataset

Download the 4 data files from

http://yann.lecun.com/exdb/mnist/

However, they are binary files in ubyte format; they cannot readily be loaded into an Apache Spark DataFrame and must be preprocessed.

Data Preprocessing

Ubyte to CSV to libsvm

Ubyte file format is formally known as IDX format.

http://www.fon.hum.uva.nl/praat/manual/IDX_file_format.html

The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.

The basic format according to http://yann.lecun.com/exdb/mnist/ is:

magic number

size in dimension 1

size in dimension 2

size in dimension 3

….

size in dimension N

data

The magic number is four bytes long.

The first 2 bytes are always 0.

The third byte codes the type of the data:

0x08: unsigned byte

0x09: signed byte

0x0B: short (2 bytes)

0x0C: int (4 bytes)

0x0D: float (4 bytes)

0x0E: double (8 bytes)

The fourth byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices….

The sizes in each dimension are 4-byte integers (big endian, like in most non-Intel processors).

The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
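A minimal Scala sketch that decodes an IDX header per this spec (the file name is an assumption; DataInputStream reads big-endian integers, matching the format):

```scala
import java.io.{DataInputStream, FileInputStream}

// Read and decode the IDX header fields described above
val in = new DataInputStream(new FileInputStream("train-images-idx3-ubyte"))
val magic    = in.readInt()            // 4-byte magic number (big endian)
val dataType = (magic >> 8) & 0xff     // third byte: data type code (0x08 = unsigned byte)
val numDims  = magic & 0xff            // fourth byte: number of dimensions
val dims     = (0 until numDims).map(_ => in.readInt()) // one 4-byte size per dimension
println(f"type=0x$dataType%02x, dimensions: " + dims.mkString(" x "))
in.close()
```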

Convert the 4 MNIST files

These 2 files are for training (1 is feature data, 1 is label data)

train-images-idx3-ubyte

train-labels-idx1-ubyte

These 2 files are for testing (1 is feature data, 1 is label data)

t10k-images-idx3-ubyte

t10k-labels-idx1-ubyte

File exploration

To understand the data layout of the feature data, hex dump the file and show the first 20 lines:
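Any hexdump tool works, or, staying in Scala, a small sketch like this prints the leading bytes (the file name is an assumption):

```scala
import java.io.{DataInputStream, FileInputStream}

// Print the first 32 bytes of the file in hex, 16 bytes per line
val in = new DataInputStream(new FileInputStream("train-images-idx3-ubyte"))
val header = new Array[Byte](32)
in.readFully(header)
in.close()
header.map(b => f"${b & 0xff}%02x").grouped(16).foreach(line => println(line.mkString(" ")))
```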

First 2 bytes are 00 00 (the first 2 bytes are always 0)

Third byte is 08 (0x08: unsigned byte)

Fourth byte is 03 (the fourth byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices, …), so this is a 3-dimensional dataset.

Since this is a 3-dimensional dataset:

Next 4-byte integer is 0000ea60, which is the size of the 1st dimension: 0x0000ea60 = 60000

Next 4-byte integer is 0000001c, which is the size of the 2nd dimension: 0x0000001c = 28

Next 4-byte integer is 0000001c, which is the size of the 3rd dimension: 0x0000001c = 28

Actual feature data follows after that.

This means the training feature data has 60000 rows, and each row is a flattened 28*28 matrix of 784 (feature) columns. It also means the first 16 bytes of the training image file are not actual data but metadata. These 16 bytes will be thrown away when extracting the actual data from the file in the code later on.

Now do the same on the label data file:

First 2 bytes are 00 00

Third byte is 08 (0x08: unsigned byte)

Fourth byte is 01 (the fourth byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices, …), so this is a 1-dimensional dataset.

Since this is a 1-dimensional dataset:

Next 4-byte integer is 0000ea60, which is the size of the 1st dimension: 0x0000ea60 = 60000

Actual label data follows after that.

This means the training label data has 60000 rows and 1 column. It also means the first 8 bytes of the training label file are not actual data but metadata. These 8 bytes will be thrown away when extracting the actual data from the file in the code later on.

The test data files have the same format, so there is no need to analyze them separately, except to note that the test data has 10000 rows while the training data has 60000 rows.

Helper Scala utility function to convert MNIST ubyte files to CSV files

The following Scala code reads the image and label data files in ubyte/IDX format and converts them into comma-delimited text files, i.e., CSV files.
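A minimal sketch of such a converter (the output CSV file names are assumptions; the expected magic values 2051 = 0x00000803 and 2049 = 0x00000801 follow from the format analysis above):

```scala
import java.io.{DataInputStream, FileInputStream, PrintWriter}

// Minimal sketch: convert an MNIST image/label ubyte file pair into one CSV
// file with lines "label,pixel1,...,pixel784" (output file names are assumptions)
def ubyteToCsv(imageFile: String, labelFile: String, csvFile: String): Unit = {
  val images = new DataInputStream(new FileInputStream(imageFile))
  val labels = new DataInputStream(new FileInputStream(labelFile))
  val out = new PrintWriter(csvFile)
  try {
    // image header: magic, count, rows, cols -- the 16 metadata bytes
    val imageMagic = images.readInt()
    val count      = images.readInt()
    val rows       = images.readInt()
    val cols       = images.readInt()
    // label header: magic, count -- the 8 metadata bytes
    val labelMagic = labels.readInt()
    val labelCount = labels.readInt()
    require(imageMagic == 2051 && labelMagic == 2049 && count == labelCount)

    val pixels = new Array[Byte](rows * cols)
    for (_ <- 0 until count) {
      val label = labels.readUnsignedByte()   // one unsigned byte per label
      images.readFully(pixels)                // 28*28 = 784 unsigned bytes per image
      out.println(s"$label," + pixels.map(p => p & 0xff).mkString(","))
    }
  } finally {
    out.close(); images.close(); labels.close()
  }
}

ubyteToCsv("train-images-idx3-ubyte", "train-labels-idx1-ubyte", "mnist_train.csv")
ubyteToCsv("t10k-images-idx3-ubyte", "t10k-labels-idx1-ubyte", "mnist_test.csv")
```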

About the libsvm file format

It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:

label index1:value1 index2:value2 …

where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.

By the way, the LIBSVM format does not have to use sparse data storage; the storage can also be dense. Sparse storage means fields with zero values are not stored, only fields with non-zero values. Dense storage means all values are stored, including zeros. For example, a row with label 1 and feature values (0, 0, 5) is stored as "1 3:5" with sparse storage and as "1 1:0 2:0 3:5" with dense storage. Consequently, a file with dense storage will be larger than one with sparse storage, but conceivably it takes less work to process a libsvm file with dense storage than with sparse storage.

I also wanted to create a utility to convert a CSV file into a libsvm file; for simplicity, it only converts into a libsvm file with dense storage.

For illustrative purposes, convert the following CSV format:

Label, value1, value2, value3, … valueN

into libsvm format:

Label index1:value1 index2:value2 … indexN:valueN

Scala helper utility function to convert a CSV file to a libsvm file

The following is the Scala code:
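A minimal sketch of such a converter with Spark (the paths and the SparkSession setup are assumptions; indices are one-based, dense storage):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvToLibsvm").getOrCreate()

// Minimal sketch: convert "label,v1,...,vN" CSV lines into dense libsvm lines
// "label 1:v1 2:v2 ... N:vN" (paths are assumptions)
def csvToLibsvm(csvPath: String, libsvmPath: String): Unit = {
  val lines = spark.sparkContext.textFile(csvPath)
  val libsvm = lines.map { line =>
    val fields = line.split(",")
    // libsvm indices are one-based
    val features = fields.tail.zipWithIndex.map { case (v, i) => s"${i + 1}:$v" }
    (fields.head +: features).mkString(" ")
  }
  // writes a directory of part files, one per partition
  libsvm.saveAsTextFile(libsvmPath)
}

csvToLibsvm("mnist_train.csv", "mnist_train_libsvm")
csvToLibsvm("mnist_test.csv", "mnist_test_libsvm")
```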

The RDD method saveAsTextFile() is likely to create multiple part files; you will need to come up with a way to automatically merge these parts into one file, or you can do it manually. That is the nature of a Spark application that runs on a cluster of multiple worker nodes.

I have merged the resulting part files (outside this writing) into mnist_train.libsvm and mnist_test.libsvm.

Caveats

If you really want the Scala code to save the RDD into a single file, you will have to copy the RDD, which is spread across multiple worker nodes, onto the driver node where you launch your Spark application, using code like the following (not recommended, very slow, do not run it):
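For example, a sketch along these lines (assuming rdd holds the libsvm lines built above):

```scala
import java.io.PrintWriter

// Not recommended: collect() pulls the entire distributed RDD back to the
// driver, which is very slow and can exhaust driver memory on large datasets
val pw = new PrintWriter("mnist_train.libsvm")
rdd.collect().foreach(pw.println)
pw.close()
```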

Create a neural network to train and test on MNIST images of handwritten digits

Now create a neural network to train and test on the MNIST handwritten digit images, which have already been converted to libsvm format from the original ubyte raw format that Spark cannot load directly.
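A minimal sketch of the training and evaluation, assuming a SparkSession named spark and the merged libsvm files from earlier; the hidden layer sizes, block size, seed, and iteration count are all illustrative choices:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the merged libsvm files produced earlier (paths are assumptions)
val train = spark.read.format("libsvm").load("mnist_train.libsvm")
val test  = spark.read.format("libsvm").load("mnist_test.libsvm")

// 784 input neurons (28x28 pixels), 10 output neurons (digits 0-9);
// the hidden layer sizes are illustrative design choices
val layers = Array[Int](784, 128, 64, 10)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)   // illustrative
  .setSeed(1234L)      // illustrative
  .setMaxIter(100)     // illustrative

val model = trainer.fit(train)

// Evaluate classification accuracy on the held-out test set
val predictions = model.transform(test)
val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(predictions)}")
```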

Summary

While the Apache Spark Multilayer perceptron classifier is no replacement for TensorFlow (in fact, Apache has its own deep learning library, MXNet, which is more comparable to TensorFlow), building a neural network with the Multilayer perceptron classifier makes good sense for specific use cases, especially when the data already lives in a Spark distributed computing cluster and can be used in concert with Spark SQL and Spark Streaming.
