New guide 📖 Scoring Concepts: Write-Once Steps
Discover how to efficiently use Write-Once Steps in the Assess framework for data transformation. This detailed guide complements the Write-Once Steps documentation and helps you determine when to apply Write-Once Steps effectively in both Batch and Dynamic Scoring contexts.

Key topics covered:
- The potential cost of using the wrong method
- When to use Write-Once Steps vs. Logical Sources
- Strategies for scoring networks in Batch (SparkScoringContext) and Dynamic (DynamicScoringContext) environments

Gain a deeper understanding of how to avoid duplicating logic across contexts and streamline your data engineering workflows. Read the full article (login required) to explore practical scenarios and best practices for scoring networks with Write-Once Steps: 5. Scoring Concepts: Write-Once Steps - Quantexa Community
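The central idea is that one transformation is written once and executed in both contexts. The sketch below is a minimal, hypothetical illustration in plain Spark Scala and assumes nothing about the actual Quantexa Write-Once Step API; it simply shows a single function applied unchanged to a full batch source and to a single dynamic record.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object WriteOnceSketch {
  // Hypothetical transformation defined once. In the real framework this
  // logic would live inside a Write-Once Step; here it is a plain function.
  def normaliseNames(df: DataFrame): DataFrame =
    df.withColumn("name", trim(lower(col("name"))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-once-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Batch context: the function runs over a full source table.
    val batchInput = Seq("  ALICE  ", "Bob ").toDF("name")
    normaliseNames(batchInput).show()

    // Dynamic context: the same function, unchanged, runs over a single
    // incoming record, so no logic is duplicated across contexts.
    val dynamicInput = Seq("  Carol").toDF("name")
    normaliseNames(dynamicInput).show()

    spark.stop()
  }
}
```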
Spark Settings for Success

This page introduces spark-submit job settings and the considerations you should take into account when choosing them. Please check the Spark documentation for all the possible ways to configure your Spark job: https://spark.apache.org/docs/latest/configuration.html

Spark Settings

Spark settings are highly dependent on data source type and size and should be tuned separately per source. Initial ElasticLoad Spark settings for a 40-datanode cluster could look like this:

spark.driver.memory=20g
spark.executor.instances=20
spark.executor.memory=20g
spark.executor.cores=2

Description of Key Spark Settings

- spark.driver.memory determines the amount of memory allocated to the Spark driver, the central control program for a Spark application. The driver coordinates tasks, manages the overall execution of the application, and collects results.
- spark.executor.instances specifies the initial number of executor instances to allocate for a Spark application when it starts. Each executor instance is a separate process that can run tasks in parallel.
- spark.executor.memory determines how much memory each Spark executor has available for storing data, caching, and performing computations.
- spark.executor.cores specifies the number of CPU cores allocated to each executor. It plays a crucial role in determining the degree of parallelism for your Spark tasks and affects how your application utilizes the available CPU resources.

Tips for Spark Settings

It is advisable to maintain a 1:1 ratio of the number of datanodes to NumberOfExecutors * ExecutorCores. Some performance gains can be seen when the ratio changes to 1:3. Example: 40 datanodes with spark.executor.instances=30 and spark.executor.cores=4 gives 120 executor cores, a 1:3 ratio. Increasing the number of datanodes brings down the total loading time and reduces bulk rejection errors.

Update Default Parallelism

spark.default.parallelism is a powerful lever for tuning your Spark job. This configuration defines the default number of partitions used for distributed data processing operations when the number of partitions is not explicitly specified, so it plays a crucial role in determining the degree of parallelism for your Spark application. It should ideally be a multiple of the number of CPU cores in your cluster. If you enable dynamic allocation (see below), Spark can adjust the number of executors dynamically based on workload; in such cases, you may set a conservative initial value for parallelism. Operations that shuffle data between partitions (e.g., join, groupByKey) often benefit from more partitions, as this can reduce the amount of data movement and improve performance. Also consider the distribution of your data: uneven data distribution can lead to load imbalance among partitions, affecting overall job performance, and adjusting the number of partitions can help address such issues. The sketch below puts these settings together.
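To make the ratio arithmetic and the parallelism guidance concrete, here is a minimal sketch assuming the 40-datanode example above; the helper and application names are illustrative, not part of any Quantexa or Spark API. Note that spark.driver.memory must be set before the driver JVM starts (e.g. via spark-submit --driver-memory 20g), so it is deliberately omitted from the in-application configuration.

```scala
import org.apache.spark.sql.SparkSession

object SparkSettingsSketch {
  // Illustrative helper for the ratio guidance above: given the datanode
  // count, the target cores-to-datanodes ratio, and the cores per executor,
  // return how many executor instances to request.
  def executorInstances(dataNodes: Int, ratio: Int, coresPerExecutor: Int): Int =
    (dataNodes * ratio) / coresPerExecutor

  def main(args: Array[String]): Unit = {
    val dataNodes        = 40
    val coresPerExecutor = 4
    val instances        = executorInstances(dataNodes, 3, coresPerExecutor) // 1:3 ratio -> 30

    // Default parallelism as a multiple (here 2x) of total executor cores.
    val defaultParallelism = instances * coresPerExecutor * 2 // 240

    val spark = SparkSession.builder()
      .appName("elastic-load-sketch")
      // spark.driver.memory is omitted on purpose: it must be set before
      // the driver JVM starts, e.g. spark-submit --driver-memory 20g.
      .config("spark.executor.instances", instances.toString)
      .config("spark.executor.memory", "20g")
      .config("spark.executor.cores", coresPerExecutor.toString)
      .config("spark.default.parallelism", defaultParallelism.toString)
      .getOrCreate()

    println(s"Requested $instances executors, default parallelism $defaultParallelism")
    spark.stop()
  }
}
```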
Enable Dynamic Allocation

spark.dynamicAllocation.enabled is a boolean configuration option that determines whether dynamic allocation of executor resources is enabled for a Spark application. Dynamic allocation allows Spark to adjust the number of executor instances and their resources (CPU cores and memory) based on the workload and resource demands of the application. When set to true, Spark can add or remove executor instances dynamically as needed during the application's execution. When set to false, the number of executor instances remains fixed throughout the application's lifetime, as determined by the initial configuration.

Note: Dynamic allocation will scale to take all available resources, so be careful when using it if your project is very cost-conscious or if you have to share limited resources with other teams outside your own.
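One common way to keep that risk in check is to bound dynamic allocation explicitly. Below is a minimal sketch, assuming Spark 3.x; the executor counts are illustrative values, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    // Minimal sketch: dynamic allocation with an explicit upper bound so
    // the job cannot silently consume the whole cluster.
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.initialExecutors", "5")
      .config("spark.dynamicAllocation.maxExecutors", "30")
      // Dynamic allocation needs shuffle data to survive executor removal:
      // either the external shuffle service (spark.shuffle.service.enabled=true)
      // or, on Spark 3.x, shuffle tracking as used here.
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .getOrCreate()

    spark.stop()
  }
}
```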
Read now: What it's Like to Perform an Upgrade

Read all about our experience and key takeaways from upgrading a repository from version 2.1 to 2.3. The upgrade was performed and released in early 2023, shortly after version 2.3 became available. The main motivation behind this particular upgrade was to bring the batch-tier software up to date, including the versions of EMR and Spark that Quantexa 2.3 supported. The flexibility of the new Data Viewer and other newly released features were also important value-adds. Read the full article: What's it Like to Perform an Upgrade? - Quantexa Community