Enabling Conversion to/from Pandas in Python
Arrow is available as an optimization when converting a Spark DataFrame to a pandas DataFrame with toPandas() and when creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these calls, first set the Spark configuration spark.sql.execution.arrow.enabled to true; it is disabled by default.
In addition, optimizations enabled by spark.sql.execution.arrow.enabled can fall back automatically to the non-Arrow implementation if an error occurs before the actual computation within Spark. This behavior is controlled by spark.sql.execution.arrow.fallback.enabled.
Example Python code
import findspark
findspark.init()
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b
multiply = pandas_udf(multiply_func, returnType=LongType())
# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# 0 1
# 1 4
# 2 9
# dtype: int64
# Create a Spark DataFrame, 'spark' is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
# Execute function as a Spark vectorized UDF
df.select(col("x")*col("x")).show()
'''
0 1
1 4
2 9
dtype: int64
+-------+
|(x * x)|
+-------+
| 1|
| 4|
| 9|
+-------+
'''
Some issues:
An error can occur when running createDataFrame from a pandas DataFrame while spark.sql.execution.arrow.enabled is true.
Workaround: set an OS environment variable, then run the Python code again.
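The exact environment variable is not named above. One common cause of createDataFrame failures with Arrow enabled is an IPC-format mismatch between pyarrow 0.15+ and Spark 2.x, which is typically worked around with ARROW_PRE_0_15_IPC_FORMAT=1. Assuming that is the issue here, the variable can be set from Python, before the SparkSession is created:

```python
import os

# Assumption: the failure is the pyarrow>=0.15 vs Spark 2.x IPC-format
# mismatch. This must be set BEFORE the SparkSession (and its executors)
# starts, e.g. at the very top of the script.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```

With the variable in place, initialize Spark as in the example above (findspark.init(), then SparkSession.builder.getOrCreate()) and rerun the createDataFrame call.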