
Need help making the LoadElasticScript job load data faster

agarwalab Posts: 46 Enthusiast

We are loading data into Elasticsearch hosted on Kubernetes.

Elasticsearch cluster data nodes = 10
Data volume = 500 million records
Data processing speed = 1.9 million records processed in 6.5 hours (roughly 81 records per second; at that rate the full 500 million would take more than 70 days)

It seems my job is not fully utilizing the 10 data nodes. Could you please suggest what I can do to make this job run faster?
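
One way to check whether the load is actually spread across all 10 data nodes is to look at where the index's shards live and how busy each node is while the job runs. A minimal sketch using the cat APIs, assuming the endpoint, credentials, and index name (dnb-r-2-5-0) from the config below:

# Where do the shards of the target index live? One line per shard, with the hosting node.
curl -s -u elastic 'https://elasticsearch.stg.p2d.prod.gcpdnb.net:443/_cat/shards/dnb-r-2-5-0?v&h=index,shard,prirep,node'

# Are all 10 data nodes busy during the load?
curl -s -u elastic 'https://elasticsearch.stg.p2d.prod.gcpdnb.net:443/_cat/nodes?v&h=name,node.role,cpu,load_1m'

If only a few nodes show high CPU, the bottleneck is shard placement or client routing rather than cluster size.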

Please see my configuration below.

etlConfig configuration

elastic {
  dnb {
    documentType: "dnb"
    runId: 149
    metadataPath: "gs://dnb-p2d-s-sto-g-inbound/resolver-output-data/metadata.parquet"
    jobSettings {
      dataPath {
        hdfsRoot: "/dnb-p2d-s-sto-g-inbound/resolver-output-data/20230620/149/dnb"
      }
      indexSettings {
        name: "dnb-r-2-5-0"
        indexCreationOptions {
          search {
            additionalSettings {
              "index.mapping.nested_objects.limit": "1000000"
              "index.mapping.total_fields.limit": "100000"
              "index.mapping.depth.limit": "100"
              "index.mapping.nested_fields.limit": "100"
              "number_of_shards": "100"
            }
          }
        }
      }
      metricsOptions {
        collectLoadSizes: true
        collectIndexSizes: true
      }
    }
    elasticSettings {
      elasticNodes: {
        searchNodes = ["elasticsearch.stg.p2d.prod.gcpdnb.net:443"]
        resolverNodes = ["elasticsearch.stg.p2d.prod.gcpdnb.net:443"]
      }
      auth {
        user = "elastic"
        password = "p2d@esk@admin"
      }
      https {
        enabled: true
      }
      clientRetrySettings {
        timeoutInSeconds = 180
        retries = 10
        retryWait = 10
      }
      bulk {
        sizeInMb = 1000
        entries = 50
        retries = 10
        retryWait = 10
      }
    }
    useSynonyms = ${synonyms.useSynonyms}
    incrementalMode = false
    deleteSettings {
      batchSize = 1000
      indexTypes: ["doc2rec", "address", "business", "individual", "telephone", "email"]
    }
    updateMode = ${incrementalMode.enabled}
  }
}
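
Two observations on the settings above, hedged since the exact semantics of the bulk block are Quantexa-internal. First, entries = 50 caps each bulk request at 50 documents, which is very small for a 500-million-record load, while sizeInMb = 1000 is far above the 5-15 MB bulk size Elastic generally recommends; the effective batch is whichever limit is hit first, here 50 documents. Second, for a one-off load it is common to pause refreshes and drop replicas on the target index, then restore them afterwards. A sketch, assuming replicas can be rebuilt once the load finishes:

# Before the load: stop refreshes and write to primaries only
curl -s -u elastic -X PUT 'https://elasticsearch.stg.p2d.prod.gcpdnb.net:443/dnb-r-2-5-0/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}'

# After the load: restore refresh and replication
curl -s -u elastic -X PUT 'https://elasticsearch.stg.p2d.prod.gcpdnb.net:443/dnb-r-2-5-0/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "1s", "number_of_replicas": 1}}'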

Spark-submit configuration

spark-submit \
  --class com.quantexa.scriptrunner.QuantexaSparkScriptRunner \
  --master yarn \
  --executor-cores 8 \
  --num-executors 36 \
  --executor-memory 8G \
  --driver-memory 200G \
  --conf spark.executor.memoryOverhead=$E_OVERHEAD \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf "spark.es.nodes.wan.only=true" \
  --conf "spark.yarn.dist.archives=$LIBPOSTALHOME/joint.tar.gz,$LIBPOSTALHOME/libpostal_datadir.tar.gz" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -DlibpostalDataDir=./libpostal_datadir.tar.gz" \
  --conf "spark.executor.extraLibraryPath=./joint.tar.gz" \
  --conf spark.task.maxFailures=10 \
  --conf "spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED" \
  --conf "spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED" \
  --conf "spark.driver.extraClassPath=/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/*" \
  --conf spark.sql.shuffle.partitions=2005 \
  --conf spark.default.parallelism=2005 \
  --conf spark.sql.debug.maxToStringFields=1000 \
  --jars /home/hadoop/p2d/jars/allds/data-source-all-shadow-dependency-$VERSION.jar \
  /home/hadoop/p2d/jars/allds/data-source-all-shadow-projects-$VERSION.jar
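
With 36 executors × 8 cores, up to 288 tasks can write to Elasticsearch concurrently. Whether the cluster keeps up shows in the per-node write thread pool: a growing rejected count means Elasticsearch, not Spark, is the bottleneck, while idle threads point back at the job. A quick check while the load runs, against the same assumed endpoint as above:

# Per data node: active bulk-write threads, queued requests, and rejections
curl -s -u elastic 'https://elasticsearch.stg.p2d.prod.gcpdnb.net:443/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'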
