Enabling Arrow for Conversion to/from Pandas in Python

Arrow is available as an optimization when converting a Spark DataFrame to a pandas DataFrame with toPandas() and when creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these calls, first set the Spark configuration spark.sql.execution.arrow.enabled to true; it is disabled by default.

In addition, optimizations enabled by spark.sql.execution.arrow.enabled can fall back automatically to the non-Arrow implementation if an error occurs before the actual computation within Spark. This fallback behavior is controlled by spark.sql.execution.arrow.fallback.enabled.
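As a sketch, assuming an existing SparkSession named spark, both settings can be applied at runtime through spark.conf (they can equally be passed as --conf options to spark-submit):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers (disabled by default)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Allow automatic fallback to the non-Arrow path if an error
# occurs before the actual computation within Spark
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")
```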

Example Python code

import findspark
findspark.init()  # locate the local Spark installation before importing pyspark
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64

# Create a Spark DataFrame from a pandas DataFrame
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute the function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()

'''
0    1
1    4
2    9
dtype: int64
+-------------------+
|multiply_func(x, x)|
+-------------------+
|                  1|
|                  4|
|                  9|
+-------------------+
'''

Known issue

An error can occur when running createDataFrame from a pandas DataFrame while spark.sql.execution.arrow.enabled is true.

Workaround: set an OS environment variable, then run the Python code.
