Introduction
This blog shares the challenges encountered, the solutions implemented, and the key lessons learned during a complex real-world streaming project for Lending Fraud detection. The project had two primary objectives:
- Decommissioning a legacy system – retiring an outdated platform that could no longer support the client's fraud detection needs.
- Implementing a near-real-time decision engine – integrating Quantexa’s Entity Resolution (ER) and network capabilities to enhance fraud detection and decision-making.
The project faced significant functional and non-functional challenges, requiring meticulous performance optimization. Initial performance metrics did not meet expectations, but a series of refinements facilitated the Go-Live. Further optimizations are ongoing to fully meet stringent performance requirements.
Project background
Replacing the legacy system
The client aimed to retire their existing system, which lacked advanced capabilities such as:
- 4-hop network generation
- Complex Entity Resolution (ER) algorithms
Quantexa’s platform was selected to address these deficiencies, offering superior fraud detection through enhanced network analysis.
Building a real-time decision system
The client developed a centralized decision engine, functioning as a hub that integrates assessments from multiple spoke systems. Quantexa’s solution serves as one of these key spokes, responsible for:
- Advanced Entity Resolution (ER)
- Network generation and scoring
- Providing fraud investigators with enriched data for decision-making
The streaming data processing layer for the solution relies on a client-managed enterprise Kafka platform.
Performance requirements
A key requirement was to complete ingestion, network expansion, and scoring within a 10-second SLA. While task loading is not in the critical SLA path, it plays a crucial role in fraud investigations by providing investigators with relevant case links.
Additional data sources
The project introduced two new watchlist data sources alongside the existing sources (customers, Orbis, SMR, etc.):
- External Watchlist – provided by the Australian Financial Crimes Exchange (AFCX), a non-profit combating financial crime.
- Internal Watchlist – a proprietary list maintained by the client to flag entities linked to financial crime.
These watchlists are ingested using a Spark-based hourly micro-batch, consuming events from the same enterprise Kafka platform.
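As a rough illustration of this micro-batch pattern, the Scala sketch below uses Spark Structured Streaming with an hourly trigger. The broker address, topic names, and sink paths are placeholders for illustration, not the project's actual configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object WatchlistMicroBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("watchlist-ingest")
      .getOrCreate()

    // Read watchlist events from the enterprise Kafka platform.
    // Broker address and topic names are illustrative placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "afcx-watchlist,internal-watchlist")
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Trigger one micro-batch per hour, matching the hourly ingestion
    // cadence described above; the sink path is also a placeholder.
    events.writeStream
      .format("parquet")
      .option("path", "/data/watchlists/raw")
      .option("checkpointLocation", "/data/watchlists/_checkpoints")
      .trigger(Trigger.ProcessingTime("1 hour"))
      .start()
      .awaitTermination()
  }
}
```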
Quantexa streaming pipeline overview
The Quantexa streaming fraud detection pipeline processes the bank's lending applications. The end-to-end pipeline includes:
- Ingestion
- Network Expansion
- Scoring
- Task Loading
- Persisting scorecards in Elasticsearch for future retrieval
Project key characteristics and solutions to address challenges
1. Application tier
Challenge
In non-Kafka deployments, the startup sequence of mid-tier applications is generally non-critical. However, Kafka applications function as "clients," meaning that applications such as expand-score and task-load depend on core services being available beforehand. Additionally, without service discovery, Kafka clients are unaware of core service availability, leading to failed API requests.
Solution
- Sequential startup – Core services start first, followed by Kafka services, with a brief delay in between.
- API health-check – Kafka services check core service availability before starting. If unavailable, they shut down and must be restarted.
- Retry mechanism – API calls retry up to three times with exponential backoff.
These measures reduce failures and ensure a stable startup sequence. In a containerized environment (e.g., Kubernetes), service dependencies could be managed through orchestration tools.
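A minimal Scala sketch of the health-check and retry behaviour described above follows. The core service's health endpoint and the exact backoff parameters are assumptions for illustration, not the project's actual code.

```scala
import scala.annotation.tailrec

object StartupGuard {
  // Hypothetical health probe: returns true once the core service responds.
  // The endpoint URL is a placeholder for the real health-check endpoint.
  def coreServiceHealthy(): Boolean =
    try {
      val conn = new java.net.URL("http://core-service:8080/health")
        .openConnection().asInstanceOf[java.net.HttpURLConnection]
      conn.setConnectTimeout(2000)
      conn.setReadTimeout(2000)
      conn.getResponseCode == 200
    } catch { case _: java.io.IOException => false }

  // Retry up to `maxAttempts` times with exponential backoff, mirroring the
  // three-attempt policy described above.
  @tailrec
  def awaitCoreService(attempt: Int = 1, maxAttempts: Int = 3, delayMs: Long = 1000): Boolean =
    if (coreServiceHealthy()) true
    else if (attempt >= maxAttempts) false
    else {
      Thread.sleep(delayMs)
      awaitCoreService(attempt + 1, maxAttempts, delayMs * 2)
    }

  def main(args: Array[String]): Unit =
    if (!awaitCoreService()) {
      // Fail fast so the service can be restarted once core services are up.
      sys.error("Core services unavailable; shutting down Kafka service")
    }
}
```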
2. Managing high task volumes
Challenge
Each Lending Application required a corresponding task, even if its risk score was zero. With 5,000+ applications daily, this resulted in over 1 million tasks annually, exacerbated by multiple update and outcome events per application.
Solution
- Initial task creation – A task is created only for the first application submission.
- Task updates instead of new tasks – Subsequent application events update the existing task.
- APIs used:
  - Investigation Client: Refresh Graph & Check Refresh State
  - Investigation Client: Expand
This approach reduces unnecessary task creation while maintaining a timeline view of application updates.
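The sketch below illustrates the create-or-update decision in Scala. The task-service interface and its methods are hypothetical placeholders, not the actual Quantexa task-loader API.

```scala
// Illustrative create-or-update flow; the task-service interface below is a
// hypothetical placeholder, not the actual Quantexa task-loader API.
object TaskLoadSketch {
  case class ApplicationEvent(applicationId: String, eventType: String, payload: String)

  trait TaskService {
    def findTaskByApplicationId(applicationId: String): Option[String] // existing task id, if any
    def createTask(event: ApplicationEvent): String                    // returns new task id
    def updateTask(taskId: String, event: ApplicationEvent): Unit
  }

  def loadTask(service: TaskService, event: ApplicationEvent): String =
    service.findTaskByApplicationId(event.applicationId) match {
      // First submission for this application: create a new task.
      case None => service.createTask(event)
      // Later update/outcome events: append to the existing task, keeping a
      // single timeline view per application.
      case Some(taskId) =>
        service.updateTask(taskId, event)
        taskId
    }
}
```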
3. Propagating Event Metadata
Challenge
After upgrading from Quantexa 2.1 to 2.6, metadata propagation was disrupted due to the new record extraction and ingestion schema definitions. This affected scoring decisions and task creation.
Solution
Initially, metadata was embedded in the document model, but this led to overwrites on subsequent events. The final solution customized the document-ingest service to override the schema of the document-ingest success topic so that it carries the full application event, preserving metadata integrity across updates.
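As a rough sketch of the idea (all field names here are assumptions, not the actual Quantexa schema), the enriched success-topic payload can be modelled as an envelope that carries the full application event alongside the document reference:

```scala
// Sketch of an enriched document-ingest success message; all field names
// are illustrative assumptions, not the actual Quantexa schema.
case class EventMetadata(eventId: String, eventType: String, eventTimestamp: Long)

case class IngestSuccess(
  documentId: String,      // id of the ingested document
  documentType: String,    // e.g. "lending-application"
  originalEvent: String,   // full application event, carried verbatim
  metadata: EventMetadata  // metadata preserved across updates
)
```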
4. Meeting Performance Requirements
Challenge
The decision engine required a 5 TPS throughput, with a <10-second response time. Initial performance tests showed:
- 0.4 TPS throughput
- 40th percentile: <10s, but 90th percentile: >30s
Solution
Infrastructure optimization
- Increased heap space to reduce GC overhead; some services spent 30+ seconds per minute on garbage collection (the sketch after this list shows one way to measure this).
- Configured the JVM's ActiveProcessorCount (`-XX:ActiveProcessorCount`) to optimize thread allocation on the single-node environment.
- Adjusted service instance counts:
  - Reduced `app-investigate` instances
  - Increased `app-resolve` and `app-scoring` instances
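GC overhead of this kind can be confirmed from inside the JVM via the standard management beans. Below is a minimal Scala sketch that samples cumulative GC time over one minute; it is illustrative only and not part of the project's codebase.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

object GcMonitor {
  def main(args: Array[String]): Unit = {
    // Cumulative GC time (ms) across all collectors since JVM start.
    def totalGcMillis(): Long =
      ManagementFactory.getGarbageCollectorMXBeans.asScala
        .map(_.getCollectionTime)
        .filter(_ >= 0) // some collectors report -1 when unsupported
        .sum

    val before = totalGcMillis()
    Thread.sleep(60000)
    val after = totalGcMillis()

    // A healthy service should spend far less than 30s/min in GC.
    println(s"GC time in the last minute: ${after - before} ms")
  }
}
```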
OpenSearch Index Optimization
- Identified Orbis indices as bottlenecks using slow logs.
- Implemented (the underlying OpenSearch operations are sketched after this list):
  - Segment merging after each Elastic Load (`jobSettings.indexAdmin.sendMergeSegmentMessage`)
  - Optimized shard counts (`indexShards` settings)
  - Added replicas for high availability
Results
- TPS increased to 1.2
- Response times <10s for 60% of applications
- Further improvements raised this to 80% of applications scoring within 10s
Conclusion
This project highlighted key challenges in implementing a streaming Lending Fraud detection solution and demonstrated how Kafka, Quantexa’s ER, and optimized infrastructure can meet complex real-time processing requirements. Lessons learned include:
- The importance of service startup sequencing in Kafka-based architectures.
- Task update strategies to manage high event volumes.
- Ensuring metadata integrity after product upgrades.
- Performance tuning across infrastructure, application, and indexing layers.
While the project has improved significantly since its initial deployment, ongoing refinements are necessary to achieve full SLA compliance. Future enhancements will continue to leverage improvements in newer versions of the Quantexa platform, further optimizing processing efficiency and resilience.
Further reading:
https://community.quantexa.com/kb/categories/209-platform-architecture-kafka-streaming