Preface

Apache Spark is one of the leading open source, all-in-one engines for enterprise big data and analytics. It combines distributed cluster computing, high availability, resilience to disruption, and fault tolerance with in-memory computation. It provides sophisticated SQL query capability over structured data, such as relational database tables, and non-structured data, such as NoSQL key-value pairs, along with robust streaming, rich machine learning and statistics features, and a graph computing engine for applications such as social networks and search engines driven by internet advertising revenue.

Spark is powerful. However, it is non-trivial, especially for developers who are new to it: many of the API functions lack runnable, end-to-end examples of how to invoke them. Because of the nature of functional programming, invoking many API functions successfully takes effort and involves a learning curve.

The audience for this book is developers who are new to Spark.

I wrote this book from my teaching notes on Apache Spark. It aims to cover all Spark API functions and methods with working, executable example code, coupled with concise input data and output results. The goal is to provide a quick reference from which developers can extract working command lines, with correct input arguments to the API calls, for use in their own code, saving them the trial and error I went through personally while writing this book.

As of now, this ebook includes 10 projects in the areas of Spark SQL, Spark Streaming, Spark Machine Learning, and Spark GraphX as demos for conducting data science and data engineering projects with Apache Spark.

As always, the code I wrote for this book is in my GitHub repo:

https://github.com/geyungjen/jentekllc

George Jen

Jen Tek LLC

Draft, work in progress

