# Preface

Apache Spark is one of the leading open source, all-in-one analytics engines for enterprise big data. It combines distributed cluster computing with high availability, resilience to disruption, and fault tolerance, and it computes in memory. It offers sophisticated SQL querying over structured data, such as relational database tables, and over unstructured data, such as NoSQL key-value pairs. It also provides robust streaming, rich machine learning and statistics features, and a graph computing engine for applications such as social networks and search engines driven by internet advertising revenue.

Spark is powerful. However, it is nontrivial, especially for developers who are new to it. Many of the API functions lack runnable, end-to-end examples of how to invoke them. Because of the nature of functional programming, invoking many API functions successfully takes effort and involves a learning curve.

The audience for this book is developers who are new to Spark.

I wrote this book from my teaching notes on Apache Spark. It aims to cover all Spark API functions and methods with working, executable example code, coupled with concise input data and output results. The goal is to provide a quick reference from which developers can extract working command lines, with correct input arguments to the API calls, for use in their own code, saving them the trial and error that I went through personally while writing this book.

As of now, this ebook includes 10 projects in the areas of Spark SQL, Spark Streaming, Spark Machine Learning, and Spark GraphX as demos of conducting data science and data engineering projects with Apache Spark.

As always, the code I wrote for this book is in my GitHub repo:

<https://github.com/geyungjen/jentekllc>

George Jen

Jen Tek LLC

*Draft, work in progress*

