Python with Apache Spark using Jupyter notebook

Now let’s run the Python version of the pi program. Start Anaconda Navigator and select the virtual environment spark.

Then click Jupyter Notebook to launch it.

In the Jupyter notebook, first import findspark and call findspark.init(). This locates the Spark installation that SPARK_HOME points to and makes pyspark importable.
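
If findspark.init() fails, a quick check of the SPARK_HOME variable from inside the notebook can help. The following is a minimal sketch using only the standard library. The helper name describe_spark_home is hypothetical and is not part of findspark or Spark.

```python
import os


def describe_spark_home(env=None):
    """Return a short message about SPARK_HOME in the given environment mapping.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return "SPARK_HOME is set to %s, but that directory does not exist" % spark_home
    return "SPARK_HOME points to %s" % spark_home


# Run this in a notebook cell before calling findspark.init().
print(describe_spark_home())
```

If the message reports that SPARK_HOME is not set, set the variable in the environment that launches Jupyter before retrying.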

The Python script pi.py is listed below. You can also save it to a file and run it from a terminal:

python pi.py

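#!/usr/bin/env python
# coding: utf-8
from __future__ import print_function

# Locate the Spark installation (via SPARK_HOME) and make pyspark importable.
import findspark
findspark.init()

from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 1
n = 100000 * partitions  # total number of random points to sample

def f(_):
    # Draw a random point in the square spanning -1 to 1 on each axis;
    # return 1 if it falls inside the unit circle, else 0.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# Distribute the n trials across the partitions and sum the hits.
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
# The circle covers pi/4 of the square, so pi is roughly 4 * count / n.
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()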
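
To see why the script works, it can help to run the same Monte Carlo estimate without Spark. The sketch below is an illustration only, not part of the original example. It samples points in the square from -1 to 1 on each axis, counts those that land inside the unit circle, and multiplies the hit fraction by 4, because the circle covers pi/4 of the square. The names inside_unit_circle and estimate_pi, and the seed parameter, are assumptions introduced here.

```python
import math
from random import Random


def inside_unit_circle(rng):
    """Sample one point in the square and return 1 if it lies in the unit circle."""
    x = rng.random() * 2 - 1
    y = rng.random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0


def estimate_pi(n, seed=None):
    """Estimate pi from n random points; seed makes the run repeatable."""
    rng = Random(seed)
    count = sum(inside_unit_circle(rng) for _ in range(n))
    return 4.0 * count / n


estimate = estimate_pi(100000, seed=42)
print("Pi is roughly %f (error %f)" % (estimate, abs(estimate - math.pi)))
```

The Spark version performs the same computation, with map(f) playing the role of the sampling loop and reduce(add) playing the role of sum. Raising partitions spreads the trials across more tasks and, because n scales with it, also increases the sample size.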


Last updated 5 years ago
