Introduction
Entity Resolution (ER) and graph expansion are fundamental components of high-performance streaming solutions, particularly in near real-time detection use cases. These processes help identify Entities, relationships, and connections that support risk detection, fraud prevention, and other analytical objectives. However, streaming environments require a careful balance between analytical accuracy and throughput. Unlike batch processing, which allows deeper computational analysis, streaming systems must optimize resource efficiency while maintaining effective Entity Resolution and network expansion.
This article provides best practices for configuring ER and graph expansion in streaming architectures. It focuses on balancing computational efficiency with analytical effectiveness, helping Architects and Engineers design scalable, high-performance solutions that meet both functional and throughput requirements.
Balancing throughput and coverage
A core challenge in ER and graph expansion is balancing system throughput against detection risk coverage. Reducing Document and Entity volumes, as well as network complexity, improves processing speed but may fail to capture some risks, especially indirect ones. Unlike batch processing, which allows for broader graph expansion and deeper Entity Resolution, streaming architectures require a careful trade-off between computational workload and detection coverage.
To achieve an optimal balance, the following aspects must be carefully considered:
Entity Resolution
Entity Resolution in a streaming environment needs to be optimized to avoid unnecessary computational overhead while preserving quality. Key factors to consider include:
- Number and complexity of compounds: The more compounds an Entity contains, the greater the data volume processed by Elasticsearch, which directly impacts query response times.
- Compound exclusions and resolution templates: Removing unused and redundant compounds helps streamline processing without sacrificing effectiveness.
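As a rough illustration of why compound count matters, a toy compound generator shows how trimming unused elements reduces the number of search clauses an Entity produces. All names and shapes here are hypothetical and do not reflect Quantexa configuration:

```python
# Sketch: each compound on an Entity typically translates into search clauses,
# so compound count drives query cost. Record shapes are purely illustrative.

def compounds_for(record):
    """Generate simple illustrative compounds from a record's elements."""
    out = []
    if record.get("name") and record.get("dob"):
        out.append(("name_dob", record["name"], record["dob"]))
    if record.get("phone"):
        out.append(("phone", record["phone"]))
    if record.get("email"):
        out.append(("email", record["email"]))
    return out

full = {"name": "a smith", "dob": "1980-01-01", "phone": "5551234", "email": "a@x.com"}
trimmed = {"name": "a smith", "dob": "1980-01-01"}
print(len(compounds_for(full)), len(compounds_for(trimmed)))  # 3 1
```

Excluding the phone and email elements here drops the per-record compound count from three to one, and with it the volume of data Elasticsearch must match against.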
Graph expansion
Graph expansion dynamically generates networks that help identify relevant connections and risk factors. To optimize efficiency:
- Types of Entities and Documents to expand: Prioritizing essential Document types ensures meaningful expansions while minimizing computational load.
- Number and complexity of expansions: Limiting the number of expansion hops and controlling expansion complexity significantly improves performance.
Note that this is not an exhaustive list; additional considerations are referenced at the end of this article.
A typical streaming detection dataflow involves ingesting Documents, building Entities, generating networks, and scoring them to support near real-time detection use cases.
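That dataflow can be sketched end to end with toy stages. This is a minimal illustration under assumed record shapes; the function names are hypothetical and do not reflect the platform's actual APIs:

```python
# Minimal sketch of a streaming detection dataflow: ingest -> resolve -> expand -> score.
# All record shapes and function names are hypothetical illustrations.

def ingest(raw_event):
    """Normalise a raw event into a Document."""
    return {"doc_id": raw_event["id"], "name": raw_event["name"].strip().lower()}

def resolve_entity(document, entity_index):
    """Attach the Document to an Entity keyed on a simple compound, or create one."""
    compound = document["name"]  # a real system would use multi-element compounds
    entity_index.setdefault(compound, []).append(document["doc_id"])
    return compound

def expand_network(entity_key, entity_index):
    """Return the directly connected Documents (a 1-hop 'network')."""
    return {"entity": entity_key, "documents": entity_index[entity_key]}

def score(network):
    """Toy score: more linked Documents -> higher risk."""
    return min(1.0, 0.2 * len(network["documents"]))

index = {}
for event in [{"id": "d1", "name": "ACME Ltd "}, {"id": "d2", "name": "acme ltd"}]:
    doc = ingest(event)
    key = resolve_entity(doc, index)
    risk = score(expand_network(key, index))

print(risk)  # both Documents resolve to one Entity, so the final score is 0.4
```

Each stage is a potential bottleneck: the resolve and expand steps are where the tuning choices discussed below apply.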
Entity Resolution
Optimizing ER is crucial for ensuring efficient streaming system performance. The complexity of ER directly impacts Elasticsearch query execution and Resolver service load. The following best practices help streamline Entity Resolution in high-throughput environments:
Managing overlinked Entities
Overlinked Entities (those that connect an excessive number of unrelated records) are a primary source of performance degradation. They increase query times in both Elasticsearch and the Resolver service because false compound matches link large, unrelated data sets.
When identifying and addressing linking issues, note that while under-linked Entities may lead to false negatives, they generally do not impact system throughput as severely as overlinked Entities. Performance tuning should therefore prioritize resolving overlinking.
Applying exclusions for Entity Resolution optimization
Exclusions help reduce unnecessary data processing, improving throughput without compromising analytical outcomes.
- Element exclusions & compound filters: These should be applied at the Data Fusion stage rather than during ER to minimize Elasticsearch query load.
- Autocoldlist exclusions: The Resolver service supports automated exclusions that filter out redundant compounds, though excessive filtering can itself increase processing time. A well-optimized exclusion strategy is essential for balancing efficiency.
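As an illustration, a compound exclusion step can be modelled as a simple filter applied before any search is issued. The names below are invented, and the exclusion list stands in for autocoldlist-style output; in practice these exclusions are configured in the platform rather than hand-coded:

```python
# Sketch: drop excluded or overly common compounds before they generate search
# queries. Exclusion list and frequency data here are hypothetical stand-ins.

EXCLUDED_COMPOUNDS = {"name:unknown", "phone:0000000000"}  # hypothetical coldlist

def filter_compounds(compounds, excluded=EXCLUDED_COMPOUNDS,
                     max_frequency=1000, frequencies=None):
    """Keep compounds that are neither excluded nor too common to discriminate."""
    frequencies = frequencies or {}
    return [
        c for c in compounds
        if c not in excluded and frequencies.get(c, 0) <= max_frequency
    ]

observed = {"phone:5551234": 12, "address:1 high st": 50000}
kept = filter_compounds(
    ["name:unknown", "phone:5551234", "address:1 high st"],
    frequencies=observed,
)
print(kept)  # only 'phone:5551234' survives: one query clause instead of three
```

The frequency cap captures the intuition behind coldlisting: a compound shared by tens of thousands of records links far more than it discriminates, so querying it mostly adds load.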
By strategically managing compounds and ER exclusions, ER matching performance can be significantly improved while maintaining analytical integrity.
Graph expansion configuration
Graph expansion plays a pivotal role in network-based detection and analytical use cases. Optimizing this process ensures that systems remain efficient while still capturing necessary risk relationships.
Key considerations
- Number of network hops and types of Entities/Documents expanded per hop: Limiting expansion depth helps prevent exponential network growth.
- Balancing near real-time and batch processing: A hybrid approach allows non-essential expansions to be processed in batch, reducing near real-time computation load.
- Attribute-based scoring for network analysis: Replacing direct expansions with attribute-based scores can achieve similar analytical outcomes while significantly reducing processing complexity.
Managing expansion complexity
Graph size can grow exponentially with each additional expansion hop. To optimize performance:
- Restrict the number of hops to only those essential for the use case.
- Optimize Document and Entity types to avoid excessive computational overhead. For example, Customer and Transaction data sources often contain high volumes, making them computationally expensive to expand.
- Prioritize lightweight data sources where possible (e.g., metadata attributes over full Document expansions).
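The hop and type restrictions above can be sketched as a bounded breadth-first expansion over a toy adjacency map. This is purely illustrative; in practice expansions are executed as Elasticsearch queries via the Resolver service, not in application code like this:

```python
from collections import deque

# Toy graph: node ids mapped to (neighbour_id, doc_type) edges. Illustrative only.
GRAPH = {
    "cust:1": [("txn:9", "Transaction"), ("app:7", "Application")],
    "app:7": [("cust:2", "Customer")],
    "cust:2": [("txn:8", "Transaction")],
}

def expand(start, max_hops, allowed_types):
    """Breadth-first expansion limited by hop count and Document type."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted on this branch
        for neighbour, doc_type in GRAPH.get(node, []):
            if doc_type in allowed_types and neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

# Expanding 2 hops, but only via lightweight Application/Customer Documents:
network = expand("cust:1", max_hops=2, allowed_types={"Application", "Customer"})
print(sorted(network))  # ['app:7', 'cust:1', 'cust:2'] -- Transactions never expanded
```

Both levers appear here: `max_hops` caps depth, and `allowed_types` keeps high-volume sources such as Transactions out of the frontier entirely, so they never multiply the next hop.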
Balancing near real-time and batch processing
In high-throughput streaming environments, it is often advantageous to limit dynamic expansion to direct connections while deferring deeper network expansion to batch processing.
For example, in a Lending Fraud streaming detection use case, two types of detection pipelines can be employed:
- Near real-time pipeline – Detects fraudulent applications as they arrive, aligning with instant decision-making processes.
- Batch pipeline – Performs more complex and deeper graph expansions on a scheduled basis (e.g., weekly or monthly) to uncover indirectly linked fraud indicators, intricate relationships, and high-risk Entities. By analyzing these expanded networks, the batch pipeline can identify persons or businesses of interest, giving organizations a more comprehensive risk assessment.
This hybrid approach enables the batch layer to extend the scope of streaming-based detections without overloading near real-time processing pipelines, ensuring both efficiency and depth in fraud detection.
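One way to picture the split is a router that makes a shallow near real-time decision immediately and queues the same event for deeper scheduled analysis. The pipeline stubs and rule below are wholly hypothetical:

```python
import queue

# Sketch of the hybrid pattern: instant shallow decision, deferred deep expansion.
batch_queue = queue.Queue()  # drained on a schedule (e.g. weekly) by the batch pipeline

def near_real_time_check(application):
    """Shallow, fast check used for instant decisioning (hypothetical rule)."""
    return "decline" if application.get("linked_fraud_flags", 0) > 0 else "approve"

def handle(application):
    decision = near_real_time_check(application)  # 1-hop style check, low latency
    batch_queue.put(application)                  # deeper expansion deferred to batch
    return decision

decisions = [handle(a) for a in [
    {"id": "app-1", "linked_fraud_flags": 0},
    {"id": "app-2", "linked_fraud_flags": 2},
]]
print(decisions)            # ['approve', 'decline']
print(batch_queue.qsize())  # 2 -- every application still gets the deeper batch analysis
```

The key property is that the expensive work never sits on the decision path: every application is answered immediately, and the batch layer later revisits the full backlog with wider expansions.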
Attribute-based scores
Expansions are inherently resource-intensive, as they require Elasticsearch queries to resolve Entities or locate Documents for network expansion. Instead of expanding further to connect to Documents or Entities, consider refactoring scores to leverage Entity attributes, reducing computational overhead.
Before implementing an expansion, assess whether an attribute-based score can achieve the same outcome more efficiently. For example, rather than expanding to a Watchlist Document, an Entity attribute check may provide sufficient insight.
A commonly used attribute is `isOnWatchlist`, which indicates whether an Entity has been resolved from a Watchlist. More complex attributes can also be introduced using aggregation techniques, such as `ValueAtMaxByValue` to create a `mostRecentWatchlistEntry` attribute, which determines whether an Entity is linked to a recent Watchlist Document. By using these attributes, the need for expansion can be eliminated while still capturing the required risk.
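A sketch of the idea, using a plain max-by-date computation as a stand-in for a ValueAtMaxByValue-style aggregation. The attribute names follow the text above; everything else (record shapes, scoring rule, cutoff) is an invented illustration:

```python
from datetime import date

# Toy resolved Entity: attributes are precomputed at build time, so scoring
# needs no expansion to Watchlist Documents at detection time.

watchlist_docs = [
    {"doc_id": "wl-1", "entry_date": date(2021, 3, 1)},
    {"doc_id": "wl-2", "entry_date": date(2024, 6, 15)},
]

entity = {
    "isOnWatchlist": bool(watchlist_docs),
    # plain max-by-date as a stand-in for a ValueAtMaxByValue-style aggregation:
    "mostRecentWatchlistEntry": max(d["entry_date"] for d in watchlist_docs),
}

def watchlist_score(entity, recency_cutoff=date(2023, 1, 1)):
    """Score from attributes alone -- no expansion queries issued."""
    if not entity["isOnWatchlist"]:
        return 0.0
    return 1.0 if entity["mostRecentWatchlistEntry"] >= recency_cutoff else 0.5

print(watchlist_score(entity))  # 1.0: on the watchlist, with a recent entry
```

The cost trade-off is that the aggregation is paid once when the Entity is built, while an expansion-based score would pay an Elasticsearch round trip on every detection.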
The set of supported attribute aggregations is available on the documentation site; however, they may not cover all scoring needs.
During the score design phase, it is crucial to evaluate the performance impact of Document, Entity, and Network scores to ensure efficiency and scalability.
Access management for optimized performance
ER and graph expansion in Quantexa leverage REST APIs to search Documents and resolve Entities. As a result, access permissions directly impact how the Resolver service queries Elasticsearch, affecting query volume, the Document types searched, and overall system load.
For example, if the service account used for these APIs has access to Customer, Orbis, and Transaction data, it will incorporate these Documents into Entity Resolution during graph expansion, even if those Document types are not explicitly required for the expansion. The Resolver service formulates search queries based on the access policies applied to the authenticated user.
Performance optimization recommendations:
- Restrict service account access to only the necessary data sources.
- Consider Entity Resolution dependencies: Excluding a Document type from graph expansion does not exclude it from ER. Ensure access policies align with the intended data usage.
- Apply appropriate access policies to minimize unnecessary Document queries, reducing overall system load and improving query processing times.
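The effect can be illustrated with a toy query builder in which the Document types actually searched are the intersection of what an expansion requests and what the service account may see. All names here are hypothetical; the real behaviour is driven by the platform's access policies, not application code:

```python
# Sketch: a narrower service account yields fewer Document types per search,
# and therefore less Elasticsearch load. Names are illustrative only.

SERVICE_ACCOUNT_ACCESS = {"Customer", "Application"}  # deliberately excludes Transaction

def build_search(requested_types, account_access=SERVICE_ACCOUNT_ACCESS):
    """Only Document types visible to the authenticated account are queried."""
    effective = sorted(requested_types & account_access)
    return {"doc_types": effective, "query_count": len(effective)}

search = build_search({"Customer", "Transaction", "Orbis"})
print(search)  # {'doc_types': ['Customer'], 'query_count': 1}
```

Read in reverse, this is also the caution in the recommendations above: granting the account Transaction access would silently widen every resolution query, even for expansions that never asked for Transactions.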
Implementing these best practices will enhance performance, optimize resource usage, and improve system efficiency.
Conclusion
To ensure optimal performance in streaming solutions, the following best practices should be adopted:
- Continuously monitor Entity health using automated tools.
- Follow ER tuning guidelines, focusing on resolving overlinked Entities.
- Optimize graph expansions by limiting hops and prioritizing lightweight data sources.
- Adopt a near real-time and batch hybrid approach to balance scalability and performance.
- Leverage attribute-based scores where applicable to reduce expansion complexity.
- Restrict service account access to only essential data sources.
Additionally, Application tier deployment configurations and infrastructure play a crucial role in driving overall system performance. Below are key references for tuning and optimization:
Important links and references