# Enabling Arrow for Conversion to/from Pandas in Python

Arrow is available as an optimization when converting a Spark DataFrame to a Pandas DataFrame with `toPandas()` and when creating a Spark DataFrame from a Pandas DataFrame with `createDataFrame(pandas_df)`. To use Arrow for these calls, first set the Spark configuration `spark.sql.execution.arrow.enabled` to `true`. It is disabled by default.
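The configuration can also be supplied when the session is built, rather than toggled afterwards. A minimal sketch, assuming PySpark 2.3+ is installed and on the path (the app name is hypothetical):

```python
from pyspark.sql import SparkSession

# Enable Arrow-based columnar transfers at session creation time.
spark = (
    SparkSession.builder
    .appName("arrow-demo")  # hypothetical app name
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)
```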

In addition, the optimizations enabled by `spark.sql.execution.arrow.enabled` can fall back automatically to the non-Arrow implementation if an error occurs before the actual computation within Spark. This fallback behavior is controlled by `spark.sql.execution.arrow.fallback.enabled`.
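When debugging, it can help to turn the fallback off so that Arrow failures raise immediately instead of silently degrading to the slower path. A sketch, assuming a live SparkSession bound to `spark`:

```python
# Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Fail loudly instead of falling back to the non-Arrow implementation:
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
```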

Example Python code:

```python
import findspark
findspark.init()
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64

# Create a Spark DataFrame, 'spark' is an existing SparkSession

df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute function as a Spark vectorized UDF
df.select(col("x")*col("x")).show()

'''
0    1
1    4
2    9
dtype: int64
+-------+
|(x * x)|
+-------+
|      1|
|      4|
|      9|
+-------+
'''


```

One known issue: running the snippet below with a newer PyArrow (0.15.0+) against Spark 2.x can fail.

```python
import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a Pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
```

Here is the error when running `createDataFrame` from a Pandas DataFrame while `spark.sql.execution.arrow.enabled` is `true`:

```
/home/dv6/spark/spark/python/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile.
: java.lang.IllegalArgumentException
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.<init>(ArrowConverters.scala:229)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$.getBatchesFromStream(ArrowConverters.scala:228)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:216)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:214)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$.readArrowStreamFromFile(ArrowConverters.scala:214)
	at org.apache.spark.sql.api.python.PythonSQLUtils$.readArrowStreamFromFile(PythonSQLUtils.scala:46)
	at org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile(PythonSQLUtils.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)
```
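The likely cause is the Arrow IPC format change in PyArrow 0.15.0: Spark 2.x was built against the pre-0.15 stream format, so the `ByteBuffer.allocate` failure above appears when a newer PyArrow writes batches that Spark cannot read. A small helper (hypothetical, pure Python) to check whether an installed PyArrow version is affected:

```python
def needs_pre_015_workaround(pyarrow_version: str) -> bool:
    """Return True if this PyArrow writes the post-0.15 IPC format,
    which Spark 2.x cannot read without ARROW_PRE_0_15_IPC_FORMAT=1."""
    major, minor = (int(p) for p in pyarrow_version.split(".")[:2])
    return (major, minor) >= (0, 15)

print(needs_pre_015_workaround("0.14.1"))  # False
print(needs_pre_015_workaround("0.15.1"))  # True
```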

Workaround: set the following OS environment variable, which makes PyArrow 0.15+ emit the legacy pre-0.15 IPC stream format that Spark 2.x expects:

```shell
export ARROW_PRE_0_15_IPC_FORMAT=1
```
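The variable can also be set from Python, as long as it happens before the SparkSession (and its Python workers) starts. Note that on a real cluster it must reach the executors too, for example via `spark-env.sh` or `spark.executorEnv.*`. A minimal sketch:

```python
import os

# Must be set before the SparkSession and its Python workers start.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# On a cluster the executors need it as well, e.g. (hypothetical):
# SparkSession.builder.config(
#     "spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
```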

Then run the Python code again:

```
(spark) dv6@dv6:~$ export ARROW_PRE_0_15_IPC_FORMAT=1
(spark) dv6@dv6:~$ python
Python 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 21:14:29)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import findspark
>>> findspark.init()
>>> import pandas as pd
>>>
>>> from pyspark.sql.functions import col, pandas_udf
>>> from pyspark.sql.types import LongType
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
20/04/12 12:29:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/04/12 12:29:20 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/04/12 12:29:20 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/04/12 12:29:20 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/04/12 12:29:20 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Enable Arrow-based columnar data transfers
... spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>>
>>> # Generate a Pandas DataFrame
... pdf = pd.DataFrame(np.random.rand(100, 3))
>>> pdf
           0         1         2
0   0.937892  0.387147  0.590136
1   0.007276  0.961907  0.156945
2   0.212474  0.048127  0.936995
3   0.074513  0.579899  0.803862
4   0.324786  0.352669  0.602877
..       ...       ...       ...
95  0.164290  0.376453  0.388663
96  0.014815  0.709746  0.615609
97  0.797867  0.563372  0.132668
98  0.755495  0.589192  0.793425
99  0.505420  0.672960  0.452064

[100 rows x 3 columns]
>>> df = spark.createDataFrame(pdf)
>>> # Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
... result_pdf = df.select("*").toPandas()
>>>

```
