Business Problem
Within the transaction systems we will often see multiple versions of the same money movement. This can lead to duplicate transactions in the aggregated documents, inflating transaction counts and values, which in turn can cause false positive scoring hits.
These duplicates can occur for the following reasons:
- Both the originator and the beneficiary are customers of the bank, so we see the money leaving the originator's account and "landing" in the beneficiary's account.
- Certain transactions are sent via two types of SWIFT message (e.g. we see the same transaction as both an MT103 and an MT202 COV message).
It is incredibly difficult to de-duplicate these transactions because:
- there is rarely (if ever) a true unique identifier;
- the transactions are often not exact copies (for example, the dates and amounts can differ slightly due to clearing times and changing currency conversion rates);
- parties often make multiple transactions for the same amount on the same day, so de-duplicating on parties, amount and date alone could remove genuine transactions (see the sketch after this list).
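To make that last point concrete, here is a minimal sketch of the pitfall. The `Txn` type and its fields are hypothetical, not the production transaction model:

```scala
// Hypothetical transaction type, for illustration only.
case class Txn(id: String, originator: String, beneficiary: String,
               date: String, amount: BigDecimal)

val txns = Seq(
  // Two genuine, separate $100 payments between the same parties on the same day.
  Txn("t1", "John Smith", "Barry White", "2022-02-10", BigDecimal(100)),
  Txn("t2", "John Smith", "Barry White", "2022-02-10", BigDecimal(100))
)

// Naive de-duplication keeps one record per (parties, date, amount) key,
// silently dropping a real payment.
val deduped = txns
  .groupBy(t => (t.originator, t.beneficiary, t.date, t.amount))
  .values.map(_.head)
  .toSeq

assert(deduped.size == 1) // one true transaction has been lost
```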
Example
In this example, John Smith sends one payment of $100 to Barry White on 10/02/2022, and both parties are customers of our client bank ABC. Within ABC's systems we are likely to see the following:
- a debit record showing $100 leaving John Smith's account (the originator-side view), and
- a credit record showing $100 arriving in Barry White's account (the beneficiary-side view).
When we fit this to our standard transaction document model with originator and beneficiary, we end up with an aggregated pairing containing 2 transactions totalling $200, double the true value.
Best Practice Approach
As it is effectively impossible to de-duplicate the transactions safely and reliably, Quantexa's recommendation is to keep both versions of the transaction within the same aggregation (aggregated by the originator/beneficiary pairing). However, within the thin transactions, add a new Boolean flag to the case class model indicating that we believe the transaction is a duplicate. This can be done by grouping on amount and date; then, for any scores or aggregation statistics, only count the transaction and its amount where the Boolean flag is false.
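A minimal sketch of this approach, assuming a hypothetical thin-transaction case class (the names below are illustrative, not Quantexa's actual document model):

```scala
// Hypothetical thin-transaction model carrying a duplicate flag.
case class ThinTxn(id: String, originator: String, beneficiary: String,
                   date: String, amount: BigDecimal,
                   isSuspectedDuplicate: Boolean = false)

// Within an originator/beneficiary aggregation, group on amount and date,
// keep the first record in each group and flag the rest as suspected duplicates.
// Note: exact matching will miss duplicates whose amount or date has shifted
// (e.g. through currency conversion); see Limitations below.
def flagSuspectedDuplicates(txns: Seq[ThinTxn]): Seq[ThinTxn] =
  txns
    .groupBy(t => (t.amount, t.date))
    .values
    .flatMap(group => group.head +: group.tail.map(_.copy(isSuspectedDuplicate = true)))
    .toSeq

// Scores and aggregation statistics only count records where the flag is false.
def countedStats(txns: Seq[ThinTxn]): (Int, BigDecimal) = {
  val counted = txns.filterNot(_.isSuspectedDuplicate)
  (counted.size, counted.map(_.amount).sum)
}
```

Both versions of the payment remain visible to investigators within the aggregation, but only one contributes to the counts and values used for scoring.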
Limitations
This process will not be perfect, so some user training will be required, and the possibility of duplicated amounts will have to be considered when setting scoring thresholds.
The approach is also weakened when the two parties are in different countries and the data is processed separately; in this case the duplicate transactions will sit in two different aggregated documents and cannot be grouped and flagged against each other.