This is a public Topic for those getting started with the Community and Quantexa Platform - content posted here will be visible to all.

Spark Settings for Success

Clare_Jones Posts: 5 QUANTEXA TEAM
edited November 2023 in Getting Started

This page introduces spark-submit job settings and the considerations to keep in mind when setting them. Please check out the Spark documentation for all the possible ways to configure your Spark job: https://spark.apache.org/docs/latest/configuration.html

Spark Settings

Spark settings are highly dependent on data source type and size, and each data source should be tuned separately.

Initial ElasticLoad Spark settings for a 40-data-node cluster could look like this:

spark.driver.memory=20g
spark.executor.instances=20
spark.executor.memory=20g
spark.executor.cores=2
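
For context, these values would typically be supplied on the spark-submit command line. A minimal sketch is below; the main class and jar names are placeholders, not the actual ElasticLoad entry point:

spark-submit \
  --class com.example.ElasticLoadMain \
  --conf spark.driver.memory=20g \
  --conf spark.executor.instances=20 \
  --conf spark.executor.memory=20g \
  --conf spark.executor.cores=2 \
  my-project-assembly.jar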

Description of Key Spark Settings

The spark.driver.memory configuration in Apache Spark determines the amount of memory allocated to the Spark driver, which is the central control program for a Spark application. The driver is responsible for coordinating tasks, managing the overall execution of the application, and collecting results.

The spark.executor.instances configuration in Apache Spark specifies the initial number of executor instances to allocate for a Spark application when it starts. Each executor instance represents a separate process that can run tasks in parallel.

The spark.executor.memory setting determines how much memory each Spark executor has available for storing data, caching, and performing computations.

The spark.executor.cores configuration in Apache Spark specifies the number of CPU cores to be allocated to each executor in a Spark application. It plays a crucial role in determining the degree of parallelism for your Spark tasks and impacts how your application utilizes the available CPU resources.
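
Putting the four settings together gives a quick sanity check of what the example job above asks of the cluster:

  • Total executor memory: 20 instances * 20g = 400g across the cluster (plus 20g for the driver).
  • Concurrent task slots: 20 instances * 2 cores = 40 tasks in parallel, which matches the 40 data nodes 1:1, as discussed in the tips below.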

Tips for Spark Settings

It is advisable to maintain…

a 1:1 ratio of Number of Data Nodes : NumberOfExecutors * ExecutorCores.

  • Some performance gains can be seen when the ratio shifts to 1:3 (Number of Data Nodes : NumberOfExecutors * ExecutorCores); see the sketch after this list.
    • Example: 40 data nodes with spark.executor.instances=30 and spark.executor.cores=4 gives 40:120, i.e. 1:3.
  • Increasing the number of data nodes brings down total loading time and reduces bulk rejection errors.
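
As a sketch, the 1:3 example above translates into settings like these (values illustrative, to be tuned per data source):

spark.executor.instances=30
spark.executor.cores=4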

Update Default Parallelism…

Using spark.default.parallelism is a powerful way of tuning your Spark job. This configuration defines the default number of partitions to be used for distributed data processing operations when the number of partitions is not explicitly specified. It plays a crucial role in determining the degree of parallelism for your Spark application; a short sketch follows the list below.

  • It should ideally be a multiple of the number of CPU cores in your cluster.
  • If you enable dynamic allocation (see below), Spark can adjust the number of executors dynamically based on workload. In such cases, you may set a conservative initial value for parallelism.
  • Operations that involve shuffling data between partitions (e.g., join, groupByKey) often benefit from more partitions, as this can reduce the amount of data movement and improve performance.
  • Consider the distribution of your data. Uneven data distribution can lead to load imbalance among partitions, affecting overall job performance. Adjusting the number of partitions can help address such issues.
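
As a rough sketch, with the 1:3 configuration above (30 executors * 4 cores = 120 cores), a common starting point of two to three tasks per core would look like the following. The multiple of 3 is an assumption to tune for your workload, and spark.sql.shuffle.partitions is the analogous setting for DataFrame / Spark SQL shuffles:

spark.default.parallelism=360
spark.sql.shuffle.partitions=360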

Enable Dynamic Allocation…

spark.dynamicAllocation.enabled is a boolean configuration option that determines whether dynamic allocation of executors is enabled or disabled for a Spark application. Dynamic allocation allows Spark to adjust the number of executor instances based on the workload and resource demands of the application.

  • When set to true, dynamic allocation is enabled, allowing Spark to add or remove executor instances dynamically as needed during the course of the application's execution.
  • When set to false, dynamic allocation is disabled, and the number of executor instances remains fixed throughout the application's lifetime, as determined by the initial configuration.

Note: This will scale to take all available resources, so be careful when using it if your project is very cost-conscious or if you have to share limited resources with other teams outside your own.
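
Given the note above, dynamic allocation is usually paired with explicit bounds so it cannot consume the whole cluster. A minimal sketch, assuming Spark 3.x; the bound values are illustrative, and on older Spark versions an external shuffle service is required instead of shuffle tracking:

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=40
spark.dynamicAllocation.shuffleTracking.enabled=true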


Comments

Dan_Pryer Posts: 1,663 QUANTEXA TEAM

    Thanks for the useful post @Clare_Jones !

    @Vicky_222 this might be a useful read off the back of our call for your Assessment 2 on Monday 😊

    Dan Pryer - Senior Data Engineer

    R&D - Decision Systems / Detection Packs

Did my reply answer your question? Then why not mark it as answered in the bottom right corner of my post! 😁
