Introduction
This page describes the key considerations for architects when designing and setting up the underlying platforms for a Quantexa deployment in an on-premise environment.
The diagram above shows the high-level architecture of a Quantexa deployment, including the necessary underlying platforms. These are:
- Spark - used for all batch data processing
- Elasticsearch/OpenSearch - used as the data/search backend for the application tier
- Container orchestration platform - used for deploying the Quantexa application tier
- Relational database - used for storing the state for the application tier including user investigations and audit logs of user actions
- Kafka - required for Quantexa deployments that perform streaming data processing. As it is only needed for some deployments, I will not cover it here.
To have a performant and cost-efficient deployment, it is important to set each of these underlying platforms up correctly. Most importantly, deploy them following their respective vendors' best practices; beyond that, from our experience there are some recommendations that will help you optimize their setup for use with the Quantexa platform.
Note that this page only covers the underlying platforms rather than all required software. A list of required software can be found here.
Spark
Distributed Spark platforms
The vast majority of Quantexa deployments process data volumes too large to be run on a single node (>25m documents) and hence require a distributed Spark cluster. There are two common deployment approaches for Spark clusters:
- Hadoop - many of our customers already have multi-tenanted Hadoop clusters that they use for the Quantexa batch processing. Where this does not exist, Hadoop is still a well-trodden approach for running distributed Spark. Most Hadoop distributions are, by default, installed with a large number of optional services that are not required for Quantexa deployments; these can add to the complexity of managing the cluster and consume a significant proportion of the cluster resources.
- Spark on K8S - this is growing in popularity among our customers. One key consideration is that you will need a distributed storage platform such as MinIO. The Spark executor containers will need to be allocated at least 4 CPUs and 32 GB RAM, which is much higher than typical K8S application workloads; a configuration sketch follows this list.
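If you are running Spark on K8S, the sketch below shows roughly how the executor resources above might be expressed. It is a minimal illustration only: the Kubernetes API server URL, container image, executor count, and MinIO endpoint are placeholders, not Quantexa-recommended values.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a Spark session configured for Spark on Kubernetes with
// executor resources in line with the guidance above (>= 4 CPUs, ~32 GB per executor pod).
// The master URL, container image, and s3a endpoint are placeholders for whatever
// your cluster and distributed storage platform (e.g. MinIO) actually use.
object SparkOnK8sExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quantexa-batch-sketch")
      .master("k8s://https://kubernetes.example.internal:6443")            // placeholder API server
      .config("spark.kubernetes.container.image", "example/spark:3.5.0")   // placeholder image
      .config("spark.executor.instances", "10")                            // placeholder count
      .config("spark.executor.cores", "4")       // at least 4 CPUs per executor
      .config("spark.executor.memory", "28g")    // heap, plus overhead below, gives ~32 GB per pod
      .config("spark.executor.memoryOverhead", "4g")
      .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000") // placeholder MinIO endpoint
      .getOrCreate()

    // ... batch processing would run here ...

    spark.stop()
  }
}
```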
Optimizing Spark cluster size
A key factor in the sizing of the Spark cluster is the target run time for the batch data processing. If you are deploying a dedicated Spark cluster for the Quantexa deployment, then by spreading the batch processing throughout the day/week/month rather than completing it in a short window, you can reduce the cluster size required and hence the cost.
Because Spark processing scales linearly, using more resources for less time is generally preferable when utilizing a shared Spark cluster. This results in the same overall resource usage while reducing the batch processing window. However, it does require planning and coordination between the various applications using the Spark cluster.
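To make that trade-off concrete, here is a back-of-the-envelope sketch. The core-hour figure is entirely hypothetical and is only there to show that the same total work can be spread over more cores for a shorter window.

```scala
// Back-of-the-envelope sketch of the linear-scalability trade-off described above.
// The total core-hour figure is a made-up illustration, not a Quantexa sizing number.
object SparkSizingSketch extends App {
  val totalCoreHours = 2000.0 // hypothetical total work for one batch run

  // Same overall resource usage, different batch windows:
  for (cores <- Seq(100, 200, 400)) {
    val runtimeHours = totalCoreHours / cores
    println(f"$cores%4d cores -> ~$runtimeHours%.1f hours ($totalCoreHours%.0f core-hours in total)")
  }
}
```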
Storage
Generally, our customers provision 100 GB-1 TB of disk per core. When sizing the distributed storage platform, there are several essential factors to consider (a worked example follows this list), including:
- Data replication (generally a factor of 2 or 3)
- It will often be a requirement to keep copies of historical runs; from a purely operational perspective, we would usually recommend keeping all data from at least the last three runs
- Spark requires a significant amount of temp disk space
- Some of the intermediary datasets, especially for ego graph building, can be very large (up to 10-30 TB in total for a single run with a global corporate registry dataset)
- Parquet compresses very well, meaning that file sizes are generally smaller than you would expect
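The sketch below pulls these factors together into a rough estimate. All of the input figures are hypothetical placeholders; substitute the sizes from your own data volumes.

```scala
// Illustrative storage estimate combining the factors above.
// Every input figure here is a hypothetical placeholder.
object StorageSizingSketch extends App {
  val compressedRunSizeTb = 10.0 // Parquet output of one full batch run (hypothetical)
  val retainedRuns        = 3    // keep data from at least the last three runs
  val replicationFactor   = 3    // e.g. HDFS-style replication

  val persistentTb = compressedRunSizeTb * retainedRuns * replicationFactor
  println(f"Replicated storage for retained runs: ~$persistentTb%.0f TB")
  println("Plus significant local temp/shuffle disk on each worker, on top of this.")
}
```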
Elasticsearch/OpenSearch
The Quantexa platform supports both Elasticsearch and OpenSearch as a data/search back end for the Quantexa application tier. We have found that both work equally well for both small and large Quantexa deployments. For simplicity, I will refer to Elasticsearch, but all recommendations apply equally to both.
Optimizing infrastructure and sizing
There are some key considerations when sizing and setting up an Elasticsearch cluster for use by a Quantexa deployment:
- Heap size - To support search operations, Elasticsearch performs a very large number of random (non-contiguous) reads and makes heavy use of the page cache. Therefore it is recommended to set a low heap size for the Elasticsearch process (8-16GB) to leave as much memory as possible for the page cache.
- SSD storage - Due to the large number of random reads, Disk IOPS is also a significant driver of the performance of an Elasticsearch cluster; therefore choosing fast SSDs will generally allow you to have a smaller overall size for your cluster while maintaining good performance.
- Memory:index ratios - Quantexa deployments often require higher memory:index size ratios than typical Elasticsearch deployments. This is because the resolver indexes are queried heavily as part of dynamic Entity Resolution, and hence very low latency for these requests is essential (see the sizing sketch after this list).
- Offline indexer - One driver of Elasticsearch cluster size is the time taken to perform indexing. Where indexing run times for full data loads are a challenge, we recommend using the offline indexer (which performs the indexing in Spark) rather than increasing the number of CPUs in the Elasticsearch cluster.
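As a rough illustration of the heap and memory:index points above, the sketch below estimates a data-node count. The 1:10 ratio, node RAM, and index size are hypothetical placeholders rather than Quantexa-recommended figures.

```scala
// Rough illustration of the memory:index reasoning above.
// The ratio and node spec below are hypothetical placeholders.
object ElasticsearchSizingSketch extends App {
  val totalIndexSizeGb   = 3000.0     // primaries + replicas (hypothetical)
  val memoryToIndexRatio = 1.0 / 10.0 // hypothetical target ratio
  val nodeRamGb          = 64.0
  val heapGb             = 16.0       // keep heap low (8-16 GB); the rest feeds the page cache
  val pageCachePerNodeGb = nodeRamGb - heapGb

  val ramNeededGb = totalIndexSizeGb * memoryToIndexRatio
  val dataNodes   = math.ceil(ramNeededGb / pageCachePerNodeGb).toInt
  println(s"~$dataNodes data nodes to hold ${ramNeededGb.toInt} GB of index in the page cache")
}
```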
Resilience
Using replicas on all Elasticsearch indexes is essential to be resilient to node, disk, and Elasticsearch process failure. Queries use both primary and replica shards, so having 1 replica will double the volume of index being served and hence the size of the cluster required.
For Quantexa deployments that contain small volumes of data (<25 million documents), Elasticsearch clusters require at least 2 data nodes and at least 3 master-eligible nodes to be resilient to node failure. More guidance can be found here.
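For reference, replicas can be enabled through the standard Elasticsearch index settings API; the sketch below shows one way to do this. The cluster URL and index name are placeholders, and any authentication/TLS your cluster requires is omitted for brevity.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch of enabling one replica on an index via the standard Elasticsearch
// _settings API. The cluster URL and index name are placeholders.
object SetReplicasSketch extends App {
  val client  = HttpClient.newHttpClient()
  val body    = """{ "index": { "number_of_replicas": 1 } }"""
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://elasticsearch.example.internal:9200/example-index/_settings"))
    .header("Content-Type", "application/json")
    .PUT(HttpRequest.BodyPublishers.ofString(body))
    .build()

  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()}: ${response.body()}")
}
```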
More detailed guidance on Elasticsearch considerations can be found here.
Container orchestration platform
We strongly recommend deploying the Quantexa application tier into a container orchestration platform. This provides easy management, scaling, HA, and self-healing without the need to implement complex scripts.
When using a container orchestration platform, we recommend using the Quantexa helm charts to deploy the Quantexa application tier, as they make deployment and upgrading much easier.
The application tier resource requirements can be found here. These resources are suitable for most deployments and generally serve up to 100 concurrent users. Going below these resource settings is not typically recommended, as it can result in slow application performance or instability.
Database
The database stores the state for the application tier and optionally audit logs of user actions. It is generally small, with 4 CPUs and 16 GB RAM typically being sufficient. The amount of storage required scales with the number of users and the system's lifetime. The database contains data that is not recreatable, so it must be robustly backed up.
Exactly 1 database is required for each Quantexa application tier. This means that investigations and tasks cannot be shared across multiple application tier instances, and a single application tier cannot duplicate data to multiple databases, e.g. for data localization purposes. However, a single database platform can support multiple application tier instance databases through different database schemas.
Monitoring
We recommend monitoring all of the underlying platforms to ensure that any health issues are identified early and to support debugging of platform issues. The Quantexa application tier uses Prometheus to provide monitoring metrics that can be integrated with your monitoring tooling or visualized through Quantexa Grafana dashboards.
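As a simple illustration of consuming Prometheus-format metrics, the sketch below fetches an endpoint and prints a few metric lines. The host and metrics path are placeholders for whatever endpoints your deployment actually exposes, and in practice your monitoring tooling would scrape these for you.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch of pulling Prometheus text-format metrics from an application endpoint
// and printing the first few non-comment lines. Host and path are placeholders.
object MetricsScrapeSketch extends App {
  val client  = HttpClient.newHttpClient()
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://app.example.internal:8080/metrics")) // placeholder endpoint
    .GET()
    .build()

  val body = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  body.linesIterator
    .filterNot(_.startsWith("#")) // drop HELP/TYPE comment lines
    .take(20)
    .foreach(println)
}
```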
Supported software versions
The details of which versions of these platforms are supported for each Quantexa platform version are available on the Quantexa doc site https://docs.quantexa.com/reference-component/latest/reference/architecture/deployment/supported-software.html.