Building upon the solid foundation of previous versions, Spark 3 brings many exciting features and optimizations that have the potential to revolutionize how you work with data.
Whether processing massive datasets, training machine learning models, or performing real-time analytics, Spark 3 equips you with the necessary tools.
This blog post discusses some of the key changes Spark 3 introduces.
Performance enhancements
Spark 3 has made significant performance improvements in areas such as query execution and data shuffling, including:
- A Vectorized Execution Engine, which can significantly improve the performance of certain types of queries.
- The Adaptive Query Execution (AQE) framework, which optimizes query execution dynamically at runtime using the following three features:
  - Dynamically coalescing shuffle partitions, which merges contiguous small shuffle partitions into fewer, larger ones.
  - Dynamically switching join strategies, which changes the join strategy (for example, to a broadcast join) based on runtime statistics.
  - Dynamically optimizing skew joins, which splits heavily skewed partitions so that joins over unbalanced data are not bottlenecked by a few long-running tasks.
For further details about AQE, see Adaptive Query Execution: Speeding Up Spark SQL At Runtime; a minimal configuration sketch follows below.
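As an illustration, the sketch below shows how AQE and its sub-features can be enabled explicitly when building a SparkSession. The configuration keys are the standard Spark 3 AQE settings; note that AQE is off by default in Spark 3.0 and 3.1 and on by default from 3.2, so check the behaviour of the version you deploy.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enabling AQE and its sub-features explicitly.
val spark = SparkSession.builder()
  .appName("aqe-example")
  .config("spark.sql.adaptive.enabled", "true")                    // turn AQE on
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // merge small shuffle partitions
  .config("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed partitions in joins
  .getOrCreate()

// With AQE enabled, Spark can also switch a sort-merge join to a broadcast
// join at runtime when one side turns out to be small enough.
```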
New features and functionality
In addition to the performance enhancements, Spark 3 delivers numerous new features and functionality, including:
- Improved support for modern data sources such as Delta Lake, enhanced support for established sources such as Parquet and CSV, and many connector improvements.
- Extended SQL support, including better handling of null values, improved support for complex data types, and ANSI SQL compliance.
- Significant improvements in the Pandas APIs, including Python type hints and additional Pandas User-Defined Functions (UDFs).
- Better Python error handling, simplifying PySpark exceptions.
- A new UI for Structured Streaming.
- Up to 40x speedups for calling R UDFs.
- Support for high-performance S3A committers.
The exact performance benefits depend on the workload, but the gains are significant if a large portion of your Spark job is spent writing data to S3. This can also have a positive impact on AWS deployments running data-intensive jobs such as Batch Resolver. A configuration sketch is shown below.
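The sketch below shows one way to opt in to the S3A "magic" committer for Parquet output. The keys follow the Spark/Hadoop cloud-integration documentation and assume the spark-hadoop-cloud module is on the classpath; the exact settings and the choice of committer (directory, partitioned, or magic) depend on your Hadoop version and workload, so treat this as a starting point rather than a definitive configuration. The bucket path is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: routing Parquet writes through the S3A "magic" committer
// instead of the rename-based FileOutputCommitter. Verify the keys against
// the cloud-integration docs for your Spark/Hadoop versions.
val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// spark.range(1000).write.parquet("s3a://my-bucket/example-output")  // hypothetical bucket
```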
Bug fixes, deprecations, and stability improvements
Spark 3 also benefits from numerous bug fixes, deprecations, and stability improvements:
- Spark 3 deprecates a number of legacy APIs, which can simplify development by reducing the number of overlapping options. The following are examples that affect functionality commonly used on a Quantexa deployment:
  - Deprecation of `UserDefinedAggregateFunction`, commonly called UDAF (a migration sketch follows this list).
  - Change in behavior of `spark.emptyDataFrame`, where the empty DataFrame is now optimized to use a `LocalRelation` instead of an empty Resilient Distributed Dataset (RDD). You can find out more about this change in this commit.
  - Deprecation of the untyped `UserDefinedFunction`, also known as `udf(AnyRef, DataType)`.
  - Schema change when grouping on a nested column using `groupBy`.
  - Schema change when using `groupByKey`.
- Deprecated Python 2 support.
- Deprecated R < 3.4 support.
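For the `UserDefinedAggregateFunction` deprecation above, the usual migration path is the typed `Aggregator` registered through `functions.udaf`, which Spark 3 recommends as the replacement. The sketch below is illustrative only: the `GeometricMean` aggregator and the `geo_mean` name are hypothetical examples, not part of any Quantexa codebase.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Hypothetical aggregator replacing a Spark 2 UDAF: computes a geometric mean.
// The buffer is (running product, count).
object GeometricMean extends Aggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (1.0, 0L)
  def reduce(b: (Double, Long), a: Double): (Double, Long) = (b._1 * a, b._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (b1._1 * b2._1, b1._2 + b2._2)
  def finish(r: (Double, Long)): Double =
    if (r._2 == 0) Double.NaN else math.pow(r._1, 1.0 / r._2)
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().appName("udaf-migration").getOrCreate()

// Register the typed Aggregator as an untyped UDAF usable from SQL and the DataFrame API.
spark.udf.register("geo_mean", functions.udaf(GeometricMean))
// spark.sql("SELECT geo_mean(value) FROM my_table")  // hypothetical table
```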
Major cloud provider support
Major cloud providers, such as Google Cloud Platform, Amazon Web Services, and Microsoft Azure, have stopped supporting Apache Spark 2.x and now provide support only for Apache Spark 3.x.
Spark 3 impact on Quantexa deployments
No significant code changes are required to adopt this version of Apache Spark. For more information, see the Spark migration guide and Upgrading a Quantexa project from Spark 2.4 to 3.0.
However, upgrading can also come with challenges, such as compatibility issues with existing code, the need to retrain models, and the need for thorough testing before deploying to production.
References