Spark Streaming
Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also define your own custom data sources.
Spark Streaming runs in Spark's standalone cluster mode or on other supported cluster resource managers.
It also includes a local run mode for development. In production, Spark Streaming uses ZooKeeper and HDFS for high availability.

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
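As a rough illustration of that flow, here is a minimal, non-authoritative sketch. It assumes a local socket source on port 9999, a 10-second batch interval, and hypothetical checkpoint and output paths; it builds a DStream, counts words over a sliding window, and pushes each batch of results out to the filesystem:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedCounts")
ssc = StreamingContext(sc, 10)            # every 10 seconds of input becomes one micro-batch
ssc.checkpoint("/tmp/spark-checkpoint")   # required by the windowed reduce below (hypothetical path)

# DStream whose micro-batches are built from lines received on the socket.
events = ssc.socketTextStream("localhost", 9999)

# Word counts over a sliding window covering the last 60 seconds, recomputed every 20 seconds.
windowed = (events.flatMap(lambda line: line.split(" "))
                  .map(lambda word: (word, 1))
                  .reduceByKeyAndWindow(lambda a, b: a + b,   # add counts entering the window
                                        lambda a, b: a - b,   # subtract counts leaving the window
                                        60, 20))

# Push each processed batch of results out to the filesystem (hypothetical prefix).
windowed.saveAsTextFiles("/tmp/word-counts")

ssc.start()
ssc.awaitTermination()
```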

A simple streaming test:
On a Linux or macOS machine that has netcat (nc or ncat) installed:
Start a netcat listener at the OS command line, as shown below.
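A minimal listener invocation, assuming port 9999 is free (any port works, as long as the streaming application below connects to the same host and port):

```
nc -lk 9999
```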
Then type a few lines of text into the netcat session (for example, hello spark hello streaming); each line is sent to the socket when you press Enter.
Assume that, in parallel, you already have the streaming application running, for example a Python network word count along the lines of the sketch below:
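A minimal sketch of such a Python application, assuming pyspark is installed and the netcat listener above is running on localhost port 9999 (adjust the host and port to match your listener):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local run mode: at least two threads, one to receive data and one to process it.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)  # 5-second batch interval

# DStream of text lines arriving on the socket opened by netcat.
lines = ssc.socketTextStream("localhost", 9999)

# Classic word count computed over each micro-batch.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()         # print a sample of each batch's results to the console

ssc.start()             # start receiving and processing data
ssc.awaitTermination()  # run until the job is stopped
```

With both sides running, every line you type into netcat should appear as word counts in the Spark console within a batch interval or two.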

I have created a detailed video presentation on Spark Streaming; the links to the video presentation are in the Appendix.