Introduction
This page describes the key considerations when setting up the underlying platforms for a Quantexa deployment in a cloud environment.
The diagram above shows the high-level architecture of a Quantexa deployment, including the necessary underlying platforms; these are:
- Spark - used for all batch data processing
- Elasticsearch/OpenSearch - used as the data/search backend for the application tier
- Container orchestration platform - used for deploying the Quantexa application tier
- Relational database - used for storing the state for the application tier, including user Investigations and audit logs of user actions
- Kafka - used for streaming data processing; as this is only required for some deployments, it is not covered here.
Each of these has PaaS offerings available on each of the major cloud platforms, which makes it easier to deploy them following the respective vendors' best practices. However, from our experience, the following additional considerations will help you optimize their setup for use with the Quantexa platform.
Note that this page only covers the underlying platforms rather than all required software. You can find a list of required software here.
Cloud quick starts
Quantexa Cloud Quick Start is an easy-to-run deployment of the infrastructure required for a cloud-based deployment of The Quantexa Platform. It is designed to help infrastructure and DevOps teams get started quickly with Quantexa deployments in Azure, AWS or GCP cloud environments. Cloud Quick Start is provided to users as an accelerator only, and it is expected that most users will need to extend its functionality to meet their requirements.
Spark
Each of the cloud providers has multiple PaaS offerings for Spark; these broadly fit into the following categories:
- Cloud vendor Hadoop distributions (Dataproc, EMR, and HDInsight) - very similar to on-prem Hadoop clusters but available on-demand
- Databricks - distributed Spark with an ecosystem of tools to help build, deploy, and maintain data processing pipelines
- Serverless Spark services (e.g. Synapse, Glue) - these vary from vendor to vendor but tend to abstract away the levers of control required to run Quantexa batch processes optimally (see the sketch below). This can lead to problems running Libpostal for address parsing, and to instability due to the inability to tune Spark settings. In particular, we strongly recommend against using AWS Glue, as we have found it is not a good fit for Quantexa batch processing workloads.
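As an illustration of the kind of control that matters here, the sketch below shows a PySpark session that pins down the executor settings a batch job typically needs to control explicitly. The values are placeholders, not Quantexa-recommended settings.

```python
from pyspark.sql import SparkSession

# Illustrative only: placeholder values showing the levers (executor sizing,
# native-memory overhead, shuffle parallelism) that serverless offerings often
# hide. These are not Quantexa-recommended settings.
spark = (
    SparkSession.builder
    .appName("quantexa-batch-example")
    .config("spark.executor.memory", "24g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "6g")   # off-heap headroom, e.g. for native libraries such as Libpostal
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)
```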
On-demand clusters
The most significant difference between running the batch data processing workloads in the cloud and on-premises is that Spark clusters can be created for each batch run and shut down when finished, enabling significant infrastructure savings. Because Spark jobs scale roughly linearly, it is generally preferable to provision larger Spark clusters so that processing completes earlier for little additional cost.
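To make that trade-off concrete, here is a back-of-the-envelope sketch. It assumes near-linear scaling and a made-up per-node-hour price; real jobs rarely scale perfectly linearly, so treat the numbers as indicative only.

```python
# Back-of-the-envelope sizing, assuming near-linear scaling and a
# hypothetical on-demand price per node-hour.
PRICE_PER_NODE_HOUR = 1.50  # placeholder price, not a real quote

def run_estimate(nodes: int, baseline_nodes: int = 10, baseline_hours: float = 8.0):
    """Return (elapsed_hours, cost) if the job scales ~linearly with node count."""
    hours = baseline_hours * baseline_nodes / nodes
    return hours, nodes * hours * PRICE_PER_NODE_HOUR

for nodes in (10, 20, 40):
    hours, cost = run_estimate(nodes)
    print(f"{nodes:>3} nodes -> {hours:4.1f} h elapsed, ~${cost:,.2f}")

# Under this assumption the total cost stays roughly flat while the elapsed
# time falls, which is why larger ephemeral clusters are usually preferable.
```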
Elasticsearch
The Quantexa platform supports both Elasticsearch and OpenSearch as the data/search backend for the Quantexa application tier, and we have found that both work equally well for small and large Quantexa deployments. For simplicity, this page refers to Elasticsearch, but all recommendations apply equally to both.
PaaS or K8S
There are two commonly used deployment patterns for Elasticsearch in cloud environments:
- PaaS offerings (e.g. Amazon OpenSearch Service, or Elastic Cloud on Azure and Google Cloud) - this option is generally more expensive but requires less management and deployment effort.
- Deploying Elasticsearch into Kubernetes - this is an excellent option for teams with strong DevOps capability due to its lower cost.
Elasticsearch heap size
To support search operations, Elasticsearch performs a very large number of random (non-contiguous) reads and makes heavy use of the page cache. We therefore recommend setting a low heap size for the Elasticsearch process (8-16 GB) to leave as much memory as possible for the page cache.
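As a rough illustration of that split (our own heuristic, not an official Quantexa or Elastic formula):

```python
def suggested_heap_gb(node_ram_gb: int) -> int:
    """Illustrative heuristic: keep the JVM heap small (8-16 GB) so the
    remaining RAM on the node stays available for the OS page cache."""
    return max(8, min(16, node_ram_gb // 2))

for ram in (16, 32, 64):
    heap = suggested_heap_gb(ram)
    print(f"{ram} GB node -> {heap} GB heap, ~{ram - heap} GB for the page cache")
```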
Choosing node/disk types
For most Quantexa deployments, nodes with 6-8 GB RAM per CPU should be used for Elasticsearch, as these offer a good balance between search performance, indexing performance, and cost.
Due to the large number of random reads, disk IOPS is a significant driver of Elasticsearch cluster performance; choosing fast SSDs will therefore generally allow you to run a smaller cluster while maintaining good performance. The NVMe SSDs available from the cloud vendors support extremely high IOPS and are excellent choices for Elasticsearch storage (a quick way to inspect disk statistics on a running cluster is sketched after this list):
- GCP - local SSDs
- AWS - NVMe SSDs on storage-optimized nodes (e.g. the i4g series); these require a little more setup effort but are worth it for the performance improvement
- Azure - VMs with NVMe enabled (e.g. Lsv3 series)
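If you want to verify how a running cluster's disks are behaving, a quick check with the official Python client looks roughly like this. The endpoint is a placeholder, and io_stats is only reported on Linux nodes.

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint; add authentication as required for your cluster.
es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Filesystem stats per node; on Linux these include io_stats, a quick way to
# sanity-check that the disks are keeping up under indexing/search load.
stats = es.nodes.stats(metric="fs")

for node in stats["nodes"].values():
    io = node["fs"].get("io_stats", {}).get("total", {})
    print(
        node["name"],
        "read_ops:", io.get("read_operations", "n/a"),
        "write_ops:", io.get("write_operations", "n/a"),
    )
```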
For the same reason, multi-tiered Elasticsearch clusters, e.g. using UltraWarm storage on AWS, are unsuitable for Quantexa deployments, as all queries run over all of the data.
Offline indexer
One driver of Elasticsearch cluster size is the time taken to perform indexing. Where indexing run times for full data loads are challenging, we recommend using the offline indexer, which performs the indexing in Spark, rather than increasing the number of CPUs in the Elasticsearch cluster.
Resilience
Using replicas on all Elasticsearch indexes is essential for resilience to node, disk, and Elasticsearch process failure. Queries use both primary and replica shards, and having 1 replica doubles the index volume being served and hence the size of the cluster required.
Even for Quantexa deployments containing small volumes of data (<25 million documents), being resilient to node failure requires Elasticsearch clusters to have at least 2 data nodes and at least 3 master-eligible nodes. More guidance can be found here.
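As a hedged example of enforcing the replica setting with the official Python client (the endpoint and index pattern are placeholders, and an 8.x client is assumed):

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint; add authentication as required.
es = Elasticsearch("https://elasticsearch.example.internal:9200")

# Keep one replica on every index so the cluster tolerates the loss of a
# single data node. Note this roughly doubles the index volume being served.
es.indices.put_settings(
    index="*",
    settings={"index": {"number_of_replicas": 1}},
)

# Expect "green" once all primary and replica shards are assigned.
print(es.cluster.health()["status"])
```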
Application tier
We strongly recommend deploying the Quantexa application tier into a container orchestration platform (e.g. AKS, EKS, or GKE). This provides easy management, scaling, high availability, and self-healing without the need to implement complex scripts. When using a container orchestration platform, we recommend deploying the Quantexa application tier with the Quantexa Helm charts, as they make deployment and upgrades much more straightforward.
You can find the application tier resource requirements here. These resources are suitable for most deployments and generally serve up to 100 concurrent users. Going below these resource settings is not typically recommended, as it can result in slow application performance or instability.
Auto-scaling
Generally, we would not recommend configuring auto-scaling for the Quantexa application tier for the following reasons:
- The minimum application tier resource requirements are suitable for most deployments
- User-driven load on the Quantexa application tier is very spiky and the auto-scaling will not be able to react sufficiently quickly
- Elasticsearch is often the bottleneck during periods of high throughput and hence having more replicas of the application tier services will not improve performance
Database
The database stores the state for the application tier and optionally audit logs of user actions. It is generally small, with 4 CPUs and 16 GB RAM typically sufficient. The amount of storage required scales with the number of users and the system's lifetime.
The inbuilt HA capabilities for the cloud providers' database PaaS offerings are trivial to set up and provide excellent availability. However, the database contains data that is not recreatable, so it must be robustly backed up, as there is still a risk of data loss/corruption due to administrator error or incorrectly provisioned data.
Exactly one database is required for each Quantexa application tier instance. This means that Investigations and Tasks cannot be shared across multiple application tier instances, and a single application tier cannot duplicate data to multiple databases, e.g. for data localization purposes. However, a single database platform can support multiple application tier instance databases through different database schemas.
Monitoring
We recommend you monitor all underlying platforms to identify health issues early and support debugging platform issues. The Quantexa application tier uses Prometheus to provide monitoring metrics that can be integrated with your monitoring tooling or visualized through Quantexa Grafana dashboards.
We recommend using the cloud-native monitoring agents provided by your cloud vendor, e.g. Amazon CloudWatch. These integrate seamlessly with the existing services within the environment and can scrape metrics from the Quantexa endpoints.
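For a quick manual check that a service's metrics endpoint is reachable before wiring up an agent, a small scraper might look like this. The URL and metrics path are hypothetical; the actual paths depend on your deployment.

```python
import requests

# Hypothetical service URL and metrics path; adjust for your deployment.
METRICS_URL = "http://app-service.example.internal:8080/metrics"

def scrape(url: str) -> dict[str, float]:
    """Fetch a Prometheus text-format endpoint and return samples keyed by
    metric name (including labels), skipping HELP/TYPE comment lines."""
    samples = {}
    for line in requests.get(url, timeout=10).text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # skip anything that is not a simple numeric sample
    return samples

metrics = scrape(METRICS_URL)
print(f"scraped {len(metrics)} samples from {METRICS_URL}")
```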
Supported software versions
Details about which versions of these platforms are supported for each Quantexa platform version are available on the Quantexa doc site: https://docs.quantexa.com/reference-component/latest/reference/architecture/deployment/supported-software.html.