Spark Core Introduction

Driver program -- the process that runs the application code on Spark

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster.
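As a rough illustration, a minimal driver program might look like the sketch below. The application name and the `local[*]` master are assumptions for a local run; in a real deployment the master URL would point at a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal driver program: it builds a SparkContext and runs one parallel operation.
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .setMaster("local[*]") // local mode on all cores; use a cluster URL in production
    val sc = new SparkContext(conf)

    // A parallel operation: distribute a range of numbers and sum it across tasks
    val total = sc.parallelize(1 to 1000).sum()
    println(s"Sum of 1..1000 = $total")

    sc.stop()
  }
}
```

The driver holds the SparkContext and coordinates the job; the parallel work (here, the sum) is split into tasks that run on the executors.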

RDD

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.
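Both ways of creating an RDD can be sketched as follows. The HDFS path `hdfs:///data/input.txt` is a placeholder, and local mode is assumed for brevity.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddCreation").setMaster("local[*]"))

    // 1) From an existing Scala collection in the driver program
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    val squares = numbers.map(n => n * n)         // transformation: lazily defines a new RDD
    println(squares.collect().mkString(", "))     // action: brings results back to the driver

    // 2) From a file in HDFS (or any other Hadoop-supported file system)
    val lines = sc.textFile("hdfs:///data/input.txt") // placeholder path
    val longLines = lines.filter(_.length > 80)   // transformation
    println(s"Long lines: ${longLines.count()}")  // action

    sc.stop()
  }
}
```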

RDDs automatically recover from node failures.

Variables

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.

Spark supports two types of shared variables:

Broadcast variables, which can be used to cache a value in memory on all nodes.

Accumulators, which are variables that are only “added” to, such as counters and sums.
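A minimal sketch of both kinds of shared variables, assuming local mode and a made-up country-code lookup: a broadcast map is read by every task, while a long accumulator counts codes that are missing from it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariables {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SharedVariables").setMaster("local[*]"))

    // Broadcast variable: a read-only lookup table cached once per node
    val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

    // Accumulator: tasks only add to it; the driver reads the final value
    val unknownCodes = sc.longAccumulator("unknown country codes")

    val codes = sc.parallelize(Seq("DE", "FR", "XX", "DE"))
    val names = codes.map { code =>
      countryNames.value.getOrElse(code, {
        unknownCodes.add(1)        // count codes missing from the broadcast map
        "unknown"
      })
    }

    println(names.collect().mkString(", "))            // Germany, France, unknown, Germany
    println(s"Unknown codes seen: ${unknownCodes.value}")

    sc.stop()
  }
}
```

Note that accumulator updates made inside transformations can be applied more than once if a task is re-executed; Spark guarantees exactly-once updates only for accumulators used inside actions.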
