This is a public Topic for those getting started with the Community and Quantexa Platform - content posted here will be visible to all.
Check out our Community Topic Guide to find what you're looking for.
Please note: for any questions related to Quantexa Academy click here
Or for any developer support please visit the Quantexa Platform Support Topic
Spark Settings for Success
This page is set up to introduce spark-submit job settings and what considerations you should take when setting them. Please check out the Spark Documentation for all the possible ways to set your spark job:
Spark Settings
Spark setting are highly dependent on datasource type and size and should be tuned separately.
Initial ElasticLoad spark settings for 40 datanode cluster
could look like this with the following settings:
spark.drive.memory=20g
spark.executor.instances=20
spark.executor.memory=20g
spark.executor.cores=2
Description of Key Spark Settings
The spark.driver.memory
configuration in Apache Spark determines the amount of memory allocated to the Spark driver, which is the central control program for a Spark application. The driver is responsible for coordinating tasks, managing the overall execution of the application, and collecting results.
The spark.executor.instances
configuration in Apache Spark specifies the initial number of executor instances to allocate for a Spark application when it starts. Each executor instance represents a separate process that can run tasks in parallel.
The spark.executor.memory
setting determines how much memory each Spark executor has available for storing data, caching, and performing computations.
The spark.executor.cores
configuration in Apache Spark specifies the number of CPU cores to be allocated to each executor in a Spark application. It plays a crucial role in determining the degree of parallelism for your Spark tasks and impacts how your application utilizes the available CPU resources.
Tips for Spark Settings
It is advisable to maintain…
Ratio 1:1
Number of Datanodes : NumberOfExecutors*ExecutorCores
- Some performance gains can be seen when ratio changes to
Ratio 1:3
Number of Datanodes : NumberOfExecutors*ExecutorCores
- Example:
40 Datanode/spark.executor.instances=30*spark.executor.cores=4 (1:3)
- Example:
- Increasing number of data nodes bring down total loading time and reduces bulk rejection error.
Update Default Parallelism…
Usingspark.default.parallelism
is a powerful way of tuning your spark job. This configuration defines the default number of partitions to be used for distributed data processing operations when the number of partitions is not explicitly specified. It plays a crucial role in determining the degree of parallelism for your Spark application.
- It should ideally be a multiple of the number of CPU cores in your cluster.
- If you enable dynamic allocation (below, Spark can adjust the number of partitions dynamically based on workload. In such cases, you may set a conservative initial value for parallelism.
- Operations that involve shuffling data between partitions (e.g.,
join
,groupByKey
) often benefit from more partitions, as this can reduce the amount of data movement and improve performance. - Consider the distribution of your data. Uneven data distribution can lead to load imbalance among partitions, affecting overall job performance. Adjusting the number of partitions can help address such issues.
Enable Dynamic Allocation…
The spark.dynamicAllocation.enabled
is a boolean configuration option that determines whether dynamic allocation of executor resources is enabled or disabled for a Spark application. Dynamic allocation allows Spark to adjust the number of executor instances and their resources (CPU cores and memory) based on the workload and resource demands of the application.
- When set to
true
, dynamic allocation is enabled, allowing Spark to add or remove executor instances dynamically as needed during the course of the application's execution. - When set to
false
, dynamic allocation is disabled, and the number of executor instances remains fixed throughout the application's lifetime, as determined by the initial configuration.
Note: This will scale to take all available resources, so be careful when using if your project is very cost-conscious or if you have to share limited resources with other teams outside your own.
Comments
-
Thanks for the useful post @Clare_Jones !
@Vicky_222 this might be a useful read off the back of our call for your Assessment 2 on Monday 😊
Dan Pryer - Senior Engineer
R&D - Decision Systems / Detection Packs
Did my reply answer your question? Then why not mark it as having answered in the bottom right corner of my post! 😁
0