This article discusses the key considerations for solution architects and data engineers when deploying Elasticsearch (sometimes referred to as Elastic) for Quantexa.
It's useful, though not essential, to have a high-level understanding of Quantexa in general, such as how we use Documents, Entities and Compounds.
💡Notes / Recommendations
Notes and recommendations, marked with a lightbulb symbol, relate to standard implementations only and may not apply to all deployments.
Quantexa uses Elasticsearch as the primary application datastore for cleansed, parsed, and resolved data (documents and transactions), for compound keys used in dynamic entity resolution, and for a resolved entity cache.
The performance of Elasticsearch is critical to the performance of the Quantexa application (UI and mid-tier application services and APIs). Optimizing the Elasticsearch deployment for Quantexa can deliver significant performance benefits and potential cost savings.
Elasticsearch stores data primarily in Inverted Indexes (Inverted index - Wikipedia). Indexes are broken down into a configurable set of index shards, and these shards are distributed over Elasticsearch (Data) Nodes. A Node can be thought of as an instance of Elasticsearch deployed to a server. Nodes operate together as part of a cluster.
Additional copies of each index shard (called replica shards) can be created to deliver redundancy / high availability. Replica shards are written to a Data Node which is different from the original (and other replica shard(s)). Having one set of replica shards means that if one Data Node is lost, a replica of any shards on that Data Node will exist on another Data Node, so no data is lost. Having two sets of replica shards means two Data Nodes can be lost without losing any data.
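As a toy illustration of the replica rule described above, the round-robin sketch below places primary and replica shards so that no replica shares a node with its primary. This is not Elasticsearch's real allocation algorithm, just a minimal model of the behaviour:

```python
def place_shards(num_shards, num_replica_sets, nodes):
    """Toy round-robin shard placement: each copy of a shard goes to a
    different node, so losing any one node still leaves a full copy of
    every shard elsewhere. Illustration only, not the real allocator."""
    placement = {node: [] for node in nodes}
    for shard in range(num_shards):
        for copy in range(1 + num_replica_sets):
            node = nodes[(shard + copy) % len(nodes)]
            label = "PS" if copy == 0 else "RS"  # primary vs replica shard
            placement[node].append(f"{label}:{shard}")
    return placement

# 3 shards, 1 replica set, 3 nodes: every node holds one primary and
# one replica of a *different* shard.
print(place_shards(3, 1, ["A", "B", "C"]))
```

With one replica set, any single node can be lost and every shard still has a surviving copy; with two replica sets (`num_replica_sets=2`), two nodes can be lost.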
Elasticsearch Node Types
Alongside Data Nodes, an Elasticsearch cluster also needs Master Nodes and (optionally) Coordinator Nodes - on a small cluster, a node can perform multiple roles, e.g. a node can be both a Master Node and a Data Node.
The following sections describe each node in more detail and give recommendations for the configuration of each node type. The recommendations are well aligned with those made by Elasticsearch.
Master Nodes
These are responsible for cluster-wide actions such as tracking which nodes are part of the cluster, creating or deleting an index, and allocating index shards to Data Nodes. To provide redundancy, it's important to have a cluster of Master Nodes (minimum 3 - for the reasoning see Quorum-based decision making | Elasticsearch Guide [8.9] | Elastic). To save on resources / reduce costs, one Master Node can be made a voting-only node, leaving a minimum of 2 Master Eligible Nodes.
Quantexa recommends a minimum cluster of 2 Master Eligible Nodes and 1 Voting Only Master Node. Each Elasticsearch Master Node should be allocated:
▪️ Master Eligible Node: 2vCPU, 8GB of memory (10GB boot disk).
▪️ Voting Only Node: 1vCPU, 2GB of memory (10GB boot disk).
Data Nodes
These are responsible for storing index shards and responding to queries.
Quantexa recommends each Elasticsearch Data Node be allocated:
▪️ 8vCPU, 64GB of memory, 1TB* index disk storage (SSD) (10GB boot disk).
* The exact size of SSD will depend on the type of data being indexed (see below).
The vCPUs are used far more intensively when loading data into indexes. For queries, it's unusual to see vCPU usage (with 8vCPU) over 25%. As discussed below, memory is heavily used (as a disk cache) for queries.
Coordinator Nodes (Optional)
All Elasticsearch Nodes are Coordinator nodes and can accept a query request.
Queries can be broken into 2 phases:
▪️ Phase 1: Query is routed to each node containing index shards for the index being queried,
▪️ Phase 2: The results from each index shard are combined into a single response (a reduce).
Coordinator (only) nodes do not store data and do not run queries over index shards.
Coordinator Nodes route the query to Data Nodes and combine the results. In small to medium clusters, queries can be distributed evenly amongst all Data Nodes (using a load balancer) and therefore the coordination and reduce effort is shared across all Data Nodes. For large Clusters a smaller set of Coordinator nodes can be created and queries can be distributed evenly amongst these nodes.
Quantexa has seen some performance improvement on a Cluster with 100+ Data Nodes when using a cluster of 8 - 12 Coordinator (only) Nodes. However, as confirmed by Elasticsearch, the benefits of Coordinator (only) Nodes should not be "overstated".
Quantexa recommends each Elasticsearch Coordinator Node be allocated:
▪️ 8vCPU, 64GB of memory (10GB boot disk).
Elasticsearch Disks and Disk Cache
Elasticsearch will cache (in memory) the results of a query and reuse this if the same query, or in certain cases a similar query, is run again. Quantexa does not typically make much use of the Elasticsearch cache as queries (or suitably similar queries) are unlikely to be run multiple times.
Outside of Elasticsearch, the server onto which the Elasticsearch node is deployed will also cache blocks of data read from the disk (in memory) and these will get used if the same blocks of data are needed again - this is commonly referred to as the disk cache.
The disk cache will normally give a significant performance increase to Elasticsearch as used by Quantexa.
Disk read speed and the disk cache are critical to performance. For this reason, Quantexa recommends solid state disks (SSDs) are used, and that each Elasticsearch node is allocated 64GB of memory with the majority of this left unallocated and therefore available for use as a disk cache.
The amount of memory that needs to be allocated to an Elasticsearch node service depends on a number of factors, such as the number of index shards, and therefore differs between deployments, but should not normally be more than 8GB.
Quantexa Index Types (and their Elasticsearch cluster requirements)
There are, broadly speaking, 4 index types created by Quantexa. Each index type has a different usage profile and therefore different storage and cache requirements. The performance of queries run against each index type is driven, to a large extent, by the proportion of the index data disk blocks available in the (in-memory) disk cache - we call this the index-to-memory ratio. So if an index is 80GB in size, and we suggest an 8:1 index-to-memory ratio, we would need 10GB of memory available for that index. If the index is split into 5 shards, each shard would need 2GB of memory. As Elasticsearch distributes index shards evenly across the available Nodes, the total memory required for each node is:
[index_size / index_to_memory_ratio] / number_of_nodes
The index-to-memory ratio is driven by a number of factors including the number of queries run, the fraction of the index data read, the distribution of the index data read, and the size of the query result.
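The calculation above can be sketched numerically (the function name is illustrative, not part of any Quantexa tooling):

```python
def cache_memory_per_node_gb(index_size_gb, index_to_memory_ratio, num_data_nodes):
    """Disk-cache memory each Data Node needs for one index, assuming
    Elasticsearch spreads the index shards evenly across the nodes."""
    return index_size_gb / index_to_memory_ratio / num_data_nodes

# The 80GB example with an 8:1 ratio needs 10GB of cache in total;
# spread over 5 nodes, that is 2GB per node.
print(cache_memory_per_node_gb(80, 8, 5))  # → 2.0
```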
The different index types are described below.
1 - Document Indexes
These are used for typical Elasticsearch rich-text queries, in particular:
(a) rich-text and other searches over cleansed and parsed source data documents, and
(b) retrieving full document details for use in the UI and other APIs.
The recommended index-to-memory ratio for document indexes is
2 - Transaction Indexes
These are used for:
(a) for retrieving full transaction details for use in the UI and other APIs, and
(b) for building aggregated transaction totals across a range of transactions.
The recommended index-to-memory ratio for transaction indexes is
3 - Entity Store Indexes
These are used for typical Elasticsearch rich-text queries, in particular:
(a) rich-text and other searches over entities cached in the entity store, and
(b) retrieving full entity details for use in the UI and other APIs.
The recommended index-to-memory ratio for entity store indexes is
4 - Resolver Indexes
These are used for dynamically resolving entities in the UI and APIs. Entity resolution across a network of connected entities can result in a large number of queries. These are internally generated queries, searching for a set of compound key values.
The recommended index-to-memory ratio for resolver indexes is
Number of Shards per Index
The number of shards per index is configurable within Quantexa. If the number of shards is not specified, Quantexa will estimate the number of shards required (see Quantexa ETL Elasticsearch Tools).
Elasticsearch recommends a maximum shard size of 50GB. The number of shards in an index should be set to ensure that each shard stays within this limit.
When setting the number of shards for a large index, it is important to ensure that these are distributed evenly across the Elasticsearch Data Nodes.
For example, for an Elasticsearch cluster with 10 Data Nodes, if an index is created with 11 shards, one node would need to host two shards whereas the rest would host only one. This means one node would effectively have double the workload in a query against this index and could become a performance bottleneck.
As most Quantexa deployments have many indexes, and it's common for multiple indexes to be queried simultaneously, this may not always be as significant an issue as it would be for a single index query, but it could have an impact so is worth consideration.
When loading data into the index, the Node hosting two shards would certainly become a bottleneck and slow the Elasticsearch load.
However, be careful to stay within the 50GB per shard recommendation set by Elasticsearch.
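One possible heuristic combining the two rules above (the 50GB shard cap and even distribution across Data Nodes) can be sketched as follows. This is an illustration only; Quantexa's own estimator (see Quantexa ETL Elasticsearch Tools) may use different logic:

```python
import math

def suggested_shard_count(index_size_gb, num_data_nodes, max_shard_gb=50):
    """Smallest shard count that (a) keeps every shard at or below the
    recommended maximum shard size and (b) is a multiple of the Data
    Node count, so shards distribute evenly across nodes."""
    min_shards = math.ceil(index_size_gb / max_shard_gb)
    return math.ceil(min_shards / num_data_nodes) * num_data_nodes

# 540GB on 10 Data Nodes: at least 11 shards are needed to stay under
# 50GB each, so round up to 20 shards (27GB each) for even distribution.
print(suggested_shard_count(540, 10))  # → 20
```

Rounding up to a multiple of the node count avoids the 11-shards-on-10-nodes imbalance described above, at the cost of more, smaller shards.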
Replica Shards / Redundancy
As described above, replica shards can be created for an index and these support high availability. With a single set of replica shards, a Data Node can be lost from the cluster without losing any index data as the replica shards will be on a different node to the original shards.
In most deployments, Quantexa recommends having a set of replica shards for each index in a Production environment. In Pre-Production/UAT and development environments, replica shards are not required; here the available disk space is better used to allow multiple instances of indexes (for use in testing and/or to allow multiple parallel deployments).
A single set of replica shards will double the size of the index and it is important that the index size to memory ratio (discussed above) is maintained or else this will impact on performance.
Let's walk through an example. Assume we have an Elasticsearch cluster with 3 nodes, one index of size 30GB split into three (10GB) shards, and an index-to-memory ratio requirement of 10:1 (1GB of memory available to the disk cache is required for each 10GB of index).
Key: PS - Primary Shard, RS - Replica Shard
Single set of shards (no replica shards)
Server 1 (1GB disk cache memory): Node A: PS:1 (10GB)
Server 2 (1GB disk cache memory): Node B: PS:3 (10GB)
Server 3 (1GB disk cache memory): Node C: PS:2 (10GB)
Note: Each server/node stores 10GB of index data so has an index-to-memory ratio of 10:1.
Two sets of shards (one set of replica shards)
Server 1 (1GB disk cache memory): Node A: PS:1 (10GB) & RS:3 (10GB)
Server 2 (1GB disk cache memory): Node B: PS:3 (10GB) & RS:2 (10GB)
Server 3 (1GB disk cache memory): Node C: PS:2 (10GB) & RS:1 (10GB)
Note: As Elasticsearch will use both the primary and replica shards in queries, a shard and its replica end up getting cached on two nodes, using memory on both. Each node now stores 20GB of index data, so the index-to-memory ratio is now 20:1. Elasticsearch has no knowledge of which disk blocks are in the cache on a node, so it cannot route queries accordingly; this means the same index data disk blocks will get cached on multiple nodes and, on average, only 5% of the index disk blocks are cached in memory.
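The effect of replicas on the ratio can be expressed simply (function names are illustrative):

```python
def effective_index_to_memory_ratio(base_ratio, replica_sets):
    """Each replica set doubles (then triples, ...) the index data every
    node must store and cache, so the effective ratio scales with the
    total number of shard copies."""
    return base_ratio * (1 + replica_sets)

ratio = effective_index_to_memory_ratio(10, 1)
print(ratio)  # a 10:1 ratio becomes 20:1 with one replica set
print(f"{1 / ratio:.0%} of index disk blocks cached on average")  # → 5%
```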
Query Performance / Capacity Increase
Replica shards can be used to improve query performance and/or increase the number of queries/users the cluster can support. This is because two (or more) nodes can respond to queries against a particular shard.
Replica shards increase the size of the index and it is important, for the reasons described above, to ensure that the index-to-memory ratios are maintained. This means that, unless there is spare capacity on the cluster, if replica shards are used to improve performance, the number of nodes will need to be increased to host these.
Better performance / increased capacity almost always needs an increase in hardware.
Index Loads with Replica Shards
When loading indexes with replica shards, Elasticsearch builds the primary and replica shards independently, replicating the data loaded rather than the index disk files. This means a single set of replica shards will double the ingest effort.
When possible, to reduce the index load runtime, create new indexes without replica shards, load the index, and then add replica shards. Elasticsearch will then asynchronously create the replica shards. Note that until the replica shards have been created, the high availability / data redundancy characteristic for the index is not available.
These options are all available as Quantexa configuration.
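The load-then-replicate pattern described above uses the standard Elasticsearch index settings API; the sketch below just builds the request bodies (the index name in the comments is illustrative):

```python
import json

def replica_settings_body(replicas: int) -> str:
    """Request body for PUT /<index>/_settings to change the number of
    replica shard sets on an existing index."""
    return json.dumps({"index": {"number_of_replicas": replicas}})

# 1. Load the index with replicas disabled:
#    PUT /my-index/_settings
print(replica_settings_body(0))
# 2. Once the load completes, add one replica set; Elasticsearch then
#    builds the replica shards asynchronously in the background:
#    PUT /my-index/_settings
print(replica_settings_body(1))
```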
Index Segments and Force Merge
Each time data is loaded to an index in Elasticsearch, it is written into a new index segment - each index shard can be made up of multiple segments. Whilst a new segment is being created or written to, it cannot be queried, so during a long-running index load, Elasticsearch will switch to a new segment (called an index refresh) at regular intervals - the interval is configurable and can be set for each index with the refresh_interval setting. This refresh makes new data available for queries at each interval.
To improve index ingest/load performance, the refresh_interval can be set to -1 for the index. This stops the automatic/periodic index refreshes, which means new segments are not created for the purpose of a refresh. Once the index load is complete, the refresh_interval can be set back to a suitable value (in seconds) and/or the index can be manually refreshed.
These options are also available as Quantexa configuration settings.
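The same settings API controls the refresh interval; the values below are illustrative:

```python
import json

def refresh_settings_body(interval: str) -> str:
    """Request body for PUT /<index>/_settings; "-1" disables the
    periodic refresh, while a duration such as "30s" re-enables it."""
    return json.dumps({"index": {"refresh_interval": interval}})

# Before a bulk load: stop automatic refreshes.
print(refresh_settings_body("-1"))
# After the load: restore a suitable interval (then optionally call
# POST /<index>/_refresh to make the new data searchable immediately).
print(refresh_settings_body("30s"))
```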
Each index segment within an index shard can be thought of as a sub-shard in its own right. When running a query against an index, every index segment within an index shard has to be queried; this can have a significant impact on performance if there are more than a few segments.
Elasticsearch will automatically merge index segments; however, this is a resource-intensive process, and Elasticsearch will not normally merge all index segments into a single segment due to the cost. The index segment merge process, in effect, re-indexes the data, creating a single merged segment to replace the existing segments, so a merge down to one segment means, in effect, re-indexing all the data in the shard.
For indexes which are infrequently loaded, it can be very beneficial to force merge the index's segments. A force merge can be run for an index, and the maximum number of index segments per shard can be set. Where possible, indexes should be merged to a single segment per shard, but this means re-indexing the entire shard so can take a while to complete. Whilst the merge is running, however, the index's existing segments can still respond to queries.
Force merge operations can be run as part of the Quantexa index load process; however, this will elongate the index load process as it waits until the force merge operation completes. For this reason, it is usually better to run the force merge on indexes outside the Quantexa index load process, perhaps as a scheduled job.
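A scheduled job would issue the standard Elasticsearch force-merge call; the sketch below just builds the request line (the index name is illustrative):

```python
def force_merge_request(index: str, max_num_segments: int = 1) -> str:
    """The Elasticsearch force-merge endpoint. Merging down to one
    segment per shard re-indexes the whole shard, so this is best run
    off-peak, outside the Quantexa load process."""
    return f"POST /{index}/_forcemerge?max_num_segments={max_num_segments}"

print(force_merge_request("my-document-index"))
```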
Hopefully, having read this article, you will have a better understanding of how Quantexa uses Elasticsearch, how best to configure Elasticsearch for Quantexa, and the reasons why.
For next steps, read our list of Useful Elasticsearch API Calls.
There are lots of details on configuring Elasticsearch, and on Quantexa's use of Elasticsearch, in the Quantexa Documentation.