This article discusses the key considerations for Solution Architects and Data Engineers when sizing infrastructure for Quantexa.
Quantexa makes use of four underlying platforms:
Batch Tier
- Apache Spark and the associated data storage platform (HDFS or Cloud object store).
Dynamic / Application Tier
- Elasticsearch (or OpenSearch).
- Kubernetes (or OpenShift).
- Relational Database (Postgres, MySQL or Oracle).
Sizing infrastructure for Quantexa means sizing for each of these four platforms.
Environments
In addition to a full-scale production environment, Quantexa recommends a pre-production environment with a complete copy of real data for developer-led testing and entity resolution/score tuning.
During initial deployment or major updates, pre-production environments often require more frequent batch runs, so consider increasing resource allocation in flexible environments like cloud deployments.
For development and integration testing, a third environment with approximately 10% of the production infrastructure is usually sufficient.
Key Drivers
Data Volumes:
- Larger and more complex datasets require larger Apache Spark and Elasticsearch clusters, as well as more storage capacity.
- For streamed data, the Kubernetes cluster hosting dynamic ingest services will also be affected.
Update Frequency:
- Frequent updates increase the load on the Spark cluster and associated storage.
- Streamed updates impact the Kubernetes cluster hosting dynamic ingest services.
User Activity and API Calls:
- Higher user counts and API call volumes necessitate a larger Kubernetes cluster for application services and a larger relational database for storing investigations and audit logs.
- With a high volume of concurrent users (200+) or API calls, the Elasticsearch cluster may also require increased capacity.
Batch Tier
Apache Spark
Quantexa utilizes Apache Spark for its batch processing jobs, including Fusion data ingest, Batch Resolver, Network Generation, and Scoring. Spark's distributed architecture allows these jobs to scale linearly. This means:
- Doubling the Spark cluster size generally halves the processing time.
- For cloud deployments, stopping the cluster between runs reduces cost.
- Horizontal scaling is recommended by adding additional worker nodes (servers) to the cluster.
For optimal performance, Quantexa recommends provisioning worker nodes with 8GB of memory per vCPU (e.g., 16 vCPUs with 128GB of memory).
Quantexa can provide estimated runtimes for each batch process based on your specific data volumes and complexity.
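As a rough illustration of the linear scaling described above, runtimes can be extrapolated from an observed baseline run. The figures below are hypothetical placeholders, not Quantexa-provided estimates:

```python
def estimate_runtime_hours(baseline_hours: float, baseline_workers: int,
                           workers: int) -> float:
    """Estimate batch runtime assuming near-linear horizontal scaling.

    baseline_hours/baseline_workers come from an observed reference run;
    real jobs scale slightly sub-linearly due to shuffle and driver overhead,
    so treat the result as a lower bound.
    """
    return baseline_hours * baseline_workers / workers

# Hypothetical example: a job that takes 8 hours on 10 workers
# should take roughly 4 hours on 20 workers.
print(estimate_runtime_hours(8.0, 10, 20))  # -> 4.0
```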
Note: The final steps of Fusion ingest (loading to Elastic) and the Entity Store Elastic loader are almost always less Spark-intensive and may be limited by Elasticsearch performance.
Distributed File System / Object Data Store
Quantexa's batch jobs utilize Apache Spark to process and store data in Parquet format. This data is stored in a distributed file system (DFS), such as:
- Hadoop Distributed File System (HDFS) for on-premises deployments.
- Cloud object storage (e.g., AWS S3, Azure Blob Storage, Google Cloud Storage) for cloud deployments.
The amount of storage required depends on:
- Number and complexity of data sources: More sources and more complex data lead to higher storage needs.
- Volume of entities and networks: The number and complexity of entities and networks generated by Quantexa also add significantly to storage requirements.
DFS often includes built-in redundancy, with data replicated across multiple storage nodes to prevent data loss. This increases the storage required. Cloud providers manage data redundancy and backups within their object storage services.
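As an illustration, the raw Parquet footprint multiplied by the replication factor and the number of retained run copies gives the physical storage needed. HDFS defaults to a replication factor of 3; the data volume below is a hypothetical placeholder:

```python
def physical_storage_tb(raw_data_tb: float, replication_factor: int = 3,
                        retained_copies: int = 2) -> float:
    """Estimate physical DFS storage needed for Quantexa batch output.

    replication_factor: HDFS defaults to 3; cloud object stores handle
    redundancy internally, so use 1 there.
    retained_copies: e.g. the current run plus the previous run kept
    for rollback.
    """
    return raw_data_tb * replication_factor * retained_copies

# Hypothetical: 5 TB of Parquet output on HDFS, keeping the previous run too.
print(physical_storage_tb(5.0))  # -> 30.0
```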
For Production environments, it is recommended to retain a copy of the previous batch run to enable rollback in case of issues. For Pre-Production environments, consider keeping additional copies for testing and comparison purposes.
Quantexa can provide estimated storage requirements based on your specific data volumes and complexity.
Dynamic / Application Tier
Elasticsearch (or OpenSearch)
Elasticsearch is a key component of Quantexa, storing:
- Cleansed and parsed source data for search.
- Compound keys used in dynamic entity resolution.
- Event records (transactions) used in the Event / Transaction Explorer or Data Viewers.
- Entities (when using the Entity Store).
Quantexa recommends deploying Elasticsearch as a horizontally scalable cluster with the following specifications for each data node:
- vCPUs: 8
- Memory: 64GB
- Storage: 1-2TB locally attached SSD
SSD performance and the in-memory disk cache have the biggest impact on Elasticsearch (and therefore, application) performance.
Index Size and Scaling
Elasticsearch splits indexes into shards and these are distributed across data nodes for scalability. Replica shards provide redundancy and can improve performance.
Elasticsearch cluster size is driven by index-to-memory ratios, which vary by index type:
- Search indexes (including Entity Store): 12:1
- Resolver (compound key) indexes: 8:1
- Event (Transaction) indexes: 16:1
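These ratios can be turned into a rough node-count estimate. The index sizes in the example are hypothetical placeholders:

```python
import math

# Index-to-memory ratios from the guidance above (GB of index per GB of node memory).
RATIOS = {"search": 12, "resolver": 8, "event": 16}

def required_data_nodes(index_sizes_gb: dict, node_memory_gb: int = 64) -> int:
    """Estimate the number of Elasticsearch data nodes needed.

    index_sizes_gb maps index type ("search", "resolver", "event")
    to total primary index size in GB; node_memory_gb matches the
    recommended 64GB data-node specification.
    """
    memory_needed = sum(size / RATIOS[kind] for kind, size in index_sizes_gb.items())
    return math.ceil(memory_needed / node_memory_gb)

# Hypothetical sizes: 1.2 TB search, 400 GB resolver, 2 TB event indexes.
print(required_data_nodes({"search": 1200, "resolver": 400, "event": 2000}))  # -> 5
```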
For high numbers of concurrent users of the Quantexa UI (200+) or API usage, add more data nodes to maintain performance. If using replica shards for performance, add extra nodes to accommodate the additional data.
In a production deployment, where data sources are refreshed in full, it is recommended to load the new iteration into a new index and then switch the application over to this index. This means it is often worth sizing Elasticsearch to hold two copies of these indexes by increasing the storage. However, it is not necessary to increase CPU and memory as only one instance of the index will be in use.
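One common way to implement this switch-over (an assumption here, not a documented Quantexa mechanism) is an Elasticsearch index alias: the application queries the alias, and a single atomic `_aliases` update repoints it from the old index to the new one. A minimal sketch of the request body, with hypothetical index names:

```python
def alias_switch_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that atomically repoints an alias.

    Elasticsearch applies the remove/add actions as one atomic step, so
    searches against the alias never see an intermediate state.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names for a refreshed customer data source.
body = alias_switch_actions("customer", "customer_v1", "customer_v2")
```

Once the alias points at the new index, the old index can be kept for rollback and deleted on the next refresh.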
An undersized Elasticsearch cluster is the most common cause of poor application performance; it is also the most costly component in infrastructure terms, so it is important to size it correctly.
Quantexa can help you estimate your Elasticsearch cluster needs based on your data and usage patterns.
Kubernetes (or OpenShift)
Quantexa leverages Kubernetes to orchestrate its containerized application services. This allows for:
- High Availability: Multiple instances of each container can be deployed to ensure continuous operation even if some instances fail.
- Scalability: Easily scale to handle increased user activity, API usage, and streamed data processing by adding more container instances.
Kubernetes Cluster Configuration
Kubernetes typically runs on a cluster of nodes that can be scaled horizontally by adding more nodes. Each Quantexa application service container is allocated a specific amount of vCPU and memory. While increasing individual node size is possible, adding more nodes to the cluster is the recommended approach for scaling and redundancy.
Refer to Quantexa's documentation for recommended vCPU and memory allocations for each application container. Typically, a High Availability deployment for 25 concurrent users would need 26-32 vCPUs and 80GB of memory. Quantexa can assist in determining the appropriate number of containers based on your usage requirements.
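The per-container allocations can be totalled into an overall cluster requirement. The container names and per-instance figures below are hypothetical placeholders, not Quantexa's documented values:

```python
def cluster_requirements(containers: dict, replicas_for_ha: int = 2):
    """Sum vCPU and memory across application containers.

    containers maps service name -> (vcpu, memory_gb) per instance;
    each service runs `replicas_for_ha` instances for high availability.
    Returns (total_vcpu, total_memory_gb).
    """
    vcpu = sum(spec[0] for spec in containers.values()) * replicas_for_ha
    mem = sum(spec[1] for spec in containers.values()) * replicas_for_ha
    return vcpu, mem

# Hypothetical service containers: (vCPU, memory GB) per instance.
services = {"gateway": (2, 4), "search": (4, 16), "resolve": (4, 12), "app-ui": (2, 8)}
print(cluster_requirements(services))  # -> (24, 80)
```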
Relational Database (Postgres, MySQL, or Oracle)
Quantexa utilizes a relational database to store:
- Metadata: User privileges, permissions, and other system-level information.
- Investigations: Including those generated for alerts and tasks.
- Audit Logs (Optional): Records of all Quantexa UI and external API requests.
The database is, by far, the smallest consumer of infrastructure. Database sizing is driven by the number of API requests from Quantexa UI users or external services, and by the number of Investigations, including those linked to alerts and tasks.
Database resource requirements:
- vCPU and Memory are driven by the rate of API requests and investigation interactions. Higher activity requires more processing power.
- Storage is determined by the volume of API requests and investigation interactions. More data requires more storage.
- Metadata has minimal impact on infrastructure requirements.
The relational database is critical to Quantexa's operation and contains data (user interactions and audit) that cannot be recreated, so regular backups are essential for data protection and recovery. Quantexa also recommends deploying with at least two instances (e.g., primary and replica) for high availability. Consider using managed cloud database services such as GCP Cloud SQL, AWS RDS, or Azure SQL Database for simplified backup and high-availability configuration.
Quantexa can assist in determining the appropriate vCPU, memory, and storage requirements for your relational database.