Apache Spark SQL & Machine Learning on Genetic Variant Classifications
Introduction
Context

Objective
Class Label
ML Project Steps
Data Acquisition
Load the CSV file into a Spark SQL DataFrame
Data Exploration:
Evaluate schema:
Data Preprocessing
NULL replacement
Encode string values as integer indices using StringIndexer
Cast all integer columns to double
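To make the encoding step concrete, here is a minimal plain-Python sketch of what a frequency-ordered string indexer does. It is not Spark code. It assumes the default Spark behaviour of giving the most frequent label index 0 and breaking ties alphabetically. The function name fit_string_indexer and the sample column values are hypothetical, not taken from the dataset.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to an index; the most frequent label gets 0.0."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # Indices are stored as floats, mirroring the double-typed output column.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical column values, used only for illustration.
column = ["SNV", "deletion", "SNV", "insertion", "SNV", "deletion"]
mapping = fit_string_indexer(column)
encoded = [mapping[v] for v in column]
print(mapping)
print(encoded)
```

The indices carry no numeric meaning beyond identity, which is worth keeping in mind when they are later fed to a model as plain numbers.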
Feature Vectorization
Normalize the values in the features vector column using MinMaxScaler
Train/Test Split: randomly assign 70% of rows to training and 30% to test
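The rescaling applied to each feature can be sketched in plain Python as follows. This is an illustration of the min-max formula for a single column, not the Spark implementation. The function name min_max_scale and the sample numbers are hypothetical, and the handling of a constant column (mapping it to the midpoint of the target range) is assumed to match what Spark documents for MinMaxScaler.

```python
def min_max_scale(column, lo=0.0, hi=1.0):
    """Rescale one feature column linearly into the range [lo, hi]."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # A constant column has no spread; map it to the middle of the range.
        return [0.5 * (hi + lo)] * len(column)
    span = e_max - e_min
    return [(x - e_min) / span * (hi - lo) + lo for x in column]


# Hypothetical feature values, used only for illustration.
print(min_max_scale([2.0, 4.0, 10.0]))
print(min_max_scale([3.0, 3.0]))
```

Because the minimum and maximum are taken from the data, a single extreme value compresses every other value toward one end of the range.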
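A random 70/30 split can be sketched in plain Python as below. This is an illustration only, assuming each row is assigned independently by one uniform draw, so the resulting sizes are close to 70% and 30% rather than exact. The function name random_split and the seed value are hypothetical choices.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test using one uniform draw per row."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical rows: integers stand in for DataFrame records.
rows = list(range(10000))
train, test = random_split(rows)
print(len(train), len(test))
```

Fixing the seed matters when comparing classifiers, since both models should be evaluated on the same held-out rows.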
Training and Prediction
Algorithm Selection
Multilayer Perceptron Classifier: with 45 feature columns and a binary class label, I construct a four-layer neural network as below:
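The hidden-layer widths of the network are not given in this text, so the sketch below uses hypothetical widths of 20 and 10 purely for illustration. It assumes the four layers include the input and output layers, which fixes the first size at 45 (one per feature) and the last at 2 (one per class), and it counts the weights and biases such a fully connected network would have. The function name mlp_parameter_count is also hypothetical.

```python
def mlp_parameter_count(layers):
    """Count weights plus biases of a fully connected feed-forward network."""
    # Each layer of n_out units takes n_in inputs plus one bias term.
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


# 45 inputs and 2 outputs come from the text; 20 and 10 are HYPOTHETICAL
# hidden widths, not the values used in the original network.
layers = [45, 20, 10, 2]
print(mlp_parameter_count(layers))
```

A list of this shape, with the input size first and the number of classes last, is the form that the layers parameter of Spark's MultilayerPerceptronClassifier expects.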
Train and test the neural network
Logistic Regression Classifier
Takeaway: