This piece is about how to clean up bad data to improve Entity Resolution.
When performing Entity Resolution (ER), we want our produced Entities to be as accurate as possible. Having bad Entities provides an inaccurate customer view and makes it much harder to spot risks.
In our ER, data is commonly brought in from a number of different sources. These sources often contain undesirable default values: placeholders used when the real data is not available for a row in a database. These values can cause huge problems when resolving Entities.
Imagine that within a portion of the data, Incorporation Date was not available and has been set to 01/01/1900. Using this value during ER may resolve together two or more different businesses that share the same name and happen to carry this default date. In many cases, this will cause the resulting Entities to be incorrect.
Figure 1: An example of how a default value can potentially over-link our Entities
We want to find these values and prevent them from being used to link our Records, but doing so can be challenging! This post will show you how you can use your first iteration of ER and AI to root out these default values and dramatically improve your next iteration.
Finding Default Values
The most intuitive way for developers to find these values is to look at the most common values in a data table using Spark. This naïve approach will find extremely common default values, but it can easily miss some that are less obvious.
Imagine we have a data table where each row is a Quantexa Record about a business. If we looked at the most frequent values for the names, we’d likely see some of the largest companies in the world, such as Apple, BlackRock and Amazon. This feeds through to the other elements as well: the most common incorporation dates will be the dates these companies were founded. This noise makes it difficult for a developer to spot the default values we want to exclude.
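As a rough illustration, this naive frequency count takes only a few lines of PySpark. The DataFrame, path and column name below (records_df, /data/business_records, business_name) are hypothetical placeholders rather than a real project schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("default-value-search").getOrCreate()

# Hypothetical Record-level table: one row per source Record about a business.
records_df = spark.read.parquet("/data/business_records")  # assumed path

# Naive approach: count how often each business name appears.
# Genuine large companies (Apple, Amazon, ...) dominate this list,
# drowning out the default values we actually want to find.
name_counts = (
    records_df
    .groupBy("business_name")          # assumed column name
    .count()
    .orderBy(F.desc("count"))
)
name_counts.show(20, truncate=False)
```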
If we have run Quantexa Entity Resolution, we can add a new column to this data table containing the Entity ID. We can then take only one row for each Entity ID so that each of these large companies is counted only once. This process of taking only one row for a particular value is known as Data Normalization.
This removes a lot of noise, but we can still go further. There are certain companies or individuals known as “Formation Agents” that set up companies on behalf of others. Companies formed in this way are often registered at the same address and have other elements in common, such as directors. We can normalize on further elements such as postcode to remove these examples and make default values even easier to spot.
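A minimal sketch of that normalization in PySpark, continuing from the frequency-count example above, might look like this. It assumes an entity_id column appended after the first ER run, plus postcode and incorporation_date columns; this is one possible interpretation of the normalization, not the exact project logic:

```python
# Continuing from records_df above. "entity_id" is assumed to be the Entity ID
# column added after the first iteration of Entity Resolution.

# Keep one Record per Entity, so each large company contributes a single row.
one_per_entity = records_df.dropDuplicates(["entity_id"])

# Normalize further on postcode, so the many companies a Formation Agent
# registers at a single address contribute only one row for that address.
normalized = one_per_entity.dropDuplicates(["postcode"])

# Re-run the frequency count on the normalized table; the remaining spikes
# are much more likely to be genuine default values.
(
    normalized
    .groupBy("incorporation_date")   # assumed column name
    .count()
    .orderBy(F.desc("count"))
    .show(20, truncate=False)
)
```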
Automating the Search
If we have many different elements, it would be much easier if we could fully automate the identification of these default values.
A simple approach would be to select a fixed threshold: if a value occurs more than X times, mark it as a default and exclude it. In practice, this doesn't work very well.
Consider the element type “City”. The number of rows that contain the likes of New York, Sydney, Vancouver and London will be very large but these are not default values. The number of rows that have 01/01/1900 as an incorporation date may be small, perhaps only 0.1% of the data. How do we highlight that 01/01/1900 is unusual but New York is not?
To solve this problem, we consider the statistical significance of values within each element individually.
Figure 2: This distribution shows the counts for each element value. The critical value marks the point beyond which a value appears too commonly and is likely a default
To calculate the critical value (our threshold), we take the counts for each value of an element, e.g. cities: (London: 990, New York: 780, …), and calculate their mean and standard deviation. The threshold is the mean plus six times the standard deviation, so that only the most extreme values exceed it. There is a different threshold for each element, allowing us to highlight 01/01/1900 but not New York, even if New York appears more frequently.
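A sketch of this per-element threshold in PySpark might look as follows, reusing the normalized DataFrame from the earlier sketch; the helper function name and column names are illustrative assumptions:

```python
from pyspark.sql import DataFrame, functions as F

def find_default_values(df: DataFrame, element_col: str, num_std: float = 6.0) -> DataFrame:
    """Return values of one element whose Record count exceeds
    mean + num_std * standard deviation of that element's value counts."""
    counts = df.groupBy(element_col).count()

    # Mean and standard deviation of the counts for this element only.
    stats = counts.agg(
        F.mean("count").alias("mean_count"),
        F.stddev("count").alias("std_count"),
    ).first()
    critical_value = stats["mean_count"] + num_std * stats["std_count"]

    # Anything above the critical value is a candidate default value.
    return counts.filter(F.col("count") > critical_value)

# A separate threshold per element: 01/01/1900 can be flagged for
# incorporation_date while New York is not flagged for city, even though
# New York appears on far more rows.
for element in ["incorporation_date", "city", "street_name"]:   # assumed columns
    find_default_values(normalized, element).show(truncate=False)
```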
This methodology works well in practice but does still produce some false positives. For example, “Main Street” and “High Street” appear unusually frequently compared to other road names. Although these are not default values, it may still make sense to exclude them, as they are weakening compounds.
Two businesses happening to be on “High Street” means a lot less than two businesses sharing a much less common road name, and linking on it is likely to result in separate businesses that have nothing to do with one another being resolved together.
Conclusion
Having accurately resolved Entities is extremely important: it is crucial for correctly identifying risk and maintaining a reliable view of customers. This task becomes difficult when the underlying data contains default and common values. Following this approach after your first iteration of ER will help identify weaknesses in your data. You can mitigate these risks by removing these values, which will lead to higher-quality Entities.