This is a public Topic for those getting started with the Community and Quantexa Platform - content posted here will be visible to all.

Spark Settings for Success

Clare_Jones Posts: 5 QUANTEXA TEAM
edited November 2023 in Getting Started

This page introduces spark-submit job settings and the considerations to keep in mind when setting them. Please check out the Spark documentation for all the possible ways to configure your Spark job: https://spark.apache.org/docs/latest/configuration.html

Spark Settings

Spark settings are highly dependent on data source type and size, and each data source should be tuned separately.

Initial ElasticLoad Spark settings for a 40-data-node cluster could look like this:

spark.driver.memory=20g
spark.executor.instances=20
spark.executor.memory=20g
spark.executor.cores=2
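
For context, these values would typically be supplied on the spark-submit command line. A minimal sketch is below; the main class and jar names are placeholders, not the actual ElasticLoad entry point:

spark-submit \
  --class com.example.ElasticLoadMain \
  --conf spark.driver.memory=20g \
  --conf spark.executor.instances=20 \
  --conf spark.executor.memory=20g \
  --conf spark.executor.cores=2 \
  my-project-assembly.jar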

Description of Key Spark Settings

The spark.driver.memory configuration in Apache Spark determines the amount of memory allocated to the Spark driver, which is the central control program for a Spark application. The driver is responsible for coordinating tasks, managing the overall execution of the application, and collecting results.

The spark.executor.instances configuration in Apache Spark specifies the initial number of executor instances to allocate for a Spark application when it starts. Each executor instance represents a separate process that can run tasks in parallel.

The spark.executor.memory setting determines how much memory each Spark executor has available for storing data, caching, and performing computations.

The spark.executor.cores configuration in Apache Spark specifies the number of CPU cores to be allocated to each executor in a Spark application. It plays a crucial role in determining the degree of parallelism for your Spark tasks and impacts how your application utilizes the available CPU resources.
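
Putting the four settings together gives a quick sanity check of what the example job above asks of the cluster:

  • Total executor memory: 20 instances * 20g = 400g across the cluster (plus 20g for the driver).
  • Concurrent task slots: 20 instances * 2 cores = 40 tasks in parallel, which matches the 40 data nodes 1:1, as discussed in the tips below.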

Tips for Spark Settings

It is advisable to maintain…

a 1:1 ratio of Number of Data Nodes : NumberOfExecutors * ExecutorCores.

  • Some performance gains can be seen when the ratio shifts to 1:3 (Number of Data Nodes : NumberOfExecutors * ExecutorCores); see the sketch after this list.
    • Example: 40 data nodes with spark.executor.instances=30 and spark.executor.cores=4 gives 40:120, i.e. 1:3.
  • Increasing the number of data nodes brings down total loading time and reduces bulk rejection errors.
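
As a sketch, the 1:3 example above translates into settings like these (values illustrative, to be tuned per data source):

spark.executor.instances=30
spark.executor.cores=4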

Update Default Parallelism…

Using spark.default.parallelism is a powerful way of tuning your Spark job. This configuration defines the default number of partitions to be used for distributed data processing operations when the number of partitions is not explicitly specified. It plays a crucial role in determining the degree of parallelism for your Spark application; a short sketch follows the list below.

  • It should ideally be a multiple of the number of CPU cores in your cluster.
  • If you enable dynamic allocation (see below), Spark can adjust the number of executors dynamically based on workload. In such cases, you may set a conservative initial value for parallelism.
  • Operations that involve shuffling data between partitions (e.g., join, groupByKey) often benefit from more partitions, as this can reduce the amount of data movement and improve performance.
  • Consider the distribution of your data. Uneven data distribution can lead to load imbalance among partitions, affecting overall job performance. Adjusting the number of partitions can help address such issues.
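
As a rough sketch, with the 1:3 configuration above (30 executors * 4 cores = 120 cores), a common starting point of two to three tasks per core would look like the following. The multiple of 3 is an assumption to tune for your workload, and spark.sql.shuffle.partitions is the analogous setting for DataFrame / Spark SQL shuffles:

spark.default.parallelism=360
spark.sql.shuffle.partitions=360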

Enable Dynamic Allocation…

spark.dynamicAllocation.enabled is a boolean configuration option that determines whether dynamic allocation of executors is enabled or disabled for a Spark application. Dynamic allocation allows Spark to adjust the number of executor instances based on the workload and resource demands of the application.

  • When set to true, dynamic allocation is enabled, allowing Spark to add or remove executor instances dynamically as needed during the course of the application's execution.
  • When set to false, dynamic allocation is disabled, and the number of executor instances remains fixed throughout the application's lifetime, as determined by the initial configuration.

Note: This will scale to take all available resources, so be careful when using it if your project is very cost-conscious or if you have to share limited resources with other teams outside your own.
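
Given the note above, dynamic allocation is usually paired with explicit bounds so it cannot consume the whole cluster. A minimal sketch, assuming Spark 3.x; the bound values are illustrative, and on older Spark versions an external shuffle service is required instead of shuffle tracking:

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=40
spark.dynamicAllocation.shuffleTracking.enabled=true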


Comments

Dan_Pryer Posts: 1,663 QUANTEXA TEAM

    Thanks for the useful post @Clare_Jones !

    @Vicky_222 this might be a useful read off the back of our call for your Assessment 2 on Monday 😊

    Dan Pryer - Senior Data Engineer

    R&D - Decision Systems / Detection Packs

Did my reply answer your question? Then why not mark it as answered in the bottom right corner of my post! 😁
