Pandas API on Spark
- Options and settings
- From/to pandas and PySpark DataFrames
- Transform and apply a function
- Type Support in Pandas API on Spark
- Type Hints in Pandas API on Spark
- From/to other DBMSes
- Best Practices
  - Leverage PySpark APIs
  - Check execution plans
  - Use checkpoint
  - Avoid shuffling
  - Avoid computation on single partition
  - Avoid reserved column names
  - Do not use duplicated column names
  - Specify the index column in conversion from Spark DataFrame to pandas-on-Spark DataFrame
  - Use distributed or distributed-sequence default index
  - Reduce the operations on different DataFrame/Series
  - Use pandas API on Spark directly whenever possible
- FAQ