This blog details the implementation of the Entity quality overlinking (EQO) tool developed by the AI team.
What is EQO?
Figure 1. Overlinking example in Entity lab
EQO is a tool that examines the shape of your Entities and identifies whether or not they are overlinked. It is similar to the existing Entity Lab in the UI, where a graph displays all the Compounds that construct the selected Entity, and it displays Compounds in a similar way. If the Compound graph is composed of disjoint clusters that are connected only through one or two Compounds, you can assume that the Entity is overlinked. On the other hand, if the Compound graph is densely connected, it is likely that the Entity is well-formed.
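To make this intuition concrete, the following is a minimal Python sketch. It is not the EQO implementation, which is not described in this post. It assumes an Entity can be represented as a mapping from hypothetical Compound keys to the Record ids each Compound links, and it reports the Compounds whose removal would split the Records into more clusters. The function names, Compound keys, and Record ids are illustrative assumptions.

```python
from collections import defaultdict


def record_clusters(compounds, records):
    """Group Records into clusters connected by shared Compounds (union-find)."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for linked in compounds.values():
        linked = list(linked)
        for other in linked[1:]:
            parent[find(other)] = find(linked[0])

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r)].append(r)
    return sorted(sorted(c) for c in clusters.values())


def bridging_compounds(compounds):
    """Return Compounds whose removal increases the number of Record clusters."""
    records = {r for linked in compounds.values() for r in linked}
    baseline = len(record_clusters(compounds, records))
    bridges = []
    for key in compounds:
        remaining = {k: v for k, v in compounds.items() if k != key}
        if len(record_clusters(remaining, records)) > baseline:
            bridges.append(key)
    return bridges


# Hypothetical toy Entity: two dense clusters joined by a single Compound.
entity = {
    "name:acme ltd": ["r1", "r2", "r3"],
    "address:1 high st": ["r1", "r2", "r3"],
    "name:apex plc": ["r4", "r5"],
    "address:9 low rd": ["r4", "r5"],
    "phone:0000000000": ["r3", "r4"],  # for example, a default value
}

print(bridging_compounds(entity))  # ['phone:0000000000']
```

In this toy Entity, only the phone Compound holds the two clusters together, which is the overlinked shape described above. A densely connected Entity would return an empty list.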
Why is the feature useful?
This feature enables you to:
- Identify overlinked Entities and the root causes of overlinking.
- Quantify the overall level of overlinking and track how it changes over time, which supports Regression Testing.
- Attach overlinking probabilities for downstream processes to use, such as reports and filtering logic.
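As a rough illustration of the second point, the sketch below compares the share of overlinked Entities between two runs. It assumes each run's output has been loaded into a dictionary that maps entity_id to an overlinking probability. The function names, the 0.01 tolerance, and the sample values are hypothetical. The 0.7 threshold follows the 70% recommendation given later in this post.

```python
def overlinked_rate(probabilities, threshold=0.7):
    """Share of Entities whose overlinking probability meets the threshold."""
    if not probabilities:
        return 0.0
    flagged = sum(1 for p in probabilities.values() if p >= threshold)
    return flagged / len(probabilities)


def check_regression(baseline, candidate, threshold=0.7, tolerance=0.01):
    """Compare two runs and flag a rise in the overlinked rate above tolerance."""
    before = overlinked_rate(baseline, threshold)
    after = overlinked_rate(candidate, threshold)
    return {
        "before": before,
        "after": after,
        "regressed": (after - before) > tolerance,
    }


# Hypothetical probabilities from two runs of the pipeline.
baseline_run = {"e1": 0.10, "e2": 0.20, "e3": 0.90, "e4": 0.30}
candidate_run = {"e1": 0.10, "e2": 0.80, "e3": 0.95, "e4": 0.30}

print(check_regression(baseline_run, candidate_run))
# {'before': 0.25, 'after': 0.5, 'regressed': True}
```

Comparing aggregate rates rather than individual Entities avoids relying on entity_id values staying stable between runs, which this post does not confirm.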
What did you discover during implementation?
The implementation identified the following features and observations:
- Setup instructions and prerequisites for the feature are minimal.
- No Python or Scala knowledge is required. This makes the tool more accessible for users with limited or no coding experience.
- The process is efficient. It takes less than ten minutes to process the entire UK DNB and BVD business Entity data, using ten executors with 25 GB of memory each.
- EQO finds non-trivial overlinking where other tools struggle. Most of the Entities it flags are truly overlinked, and it does not focus only on large Entities: the flagged Entities often contain only 10-20 Records.
- The tool currently lacks an explainability feature. It does not highlight the bad Compounds in the same way that the original Bad Entity Analyzer did.
- This feature is only supported in Batch Resolver mode.
Implementation and design steps
No specific design is required to implement EQO. The process is as follows:
- Reach out to the AI team to transfer the JARs.
- Add the JARs into Nexus and pull them using Gradle.
- Adapt your runQSS to point to the pulled JARs.
- Add configuration keys to your reference.conf, mostly pointing to an ENG output.
The EQO feature only returns the entity_id and the probabilities. You must either check the given Entities in the UI, or add a post-processing job to understand the causes of overlinking. For more information, see the following Implementation tips.
If you plan on using the UI, ensure that you can access information available in the entity_id to perform searches, such as business/idNumber[GB0561921]/businessName. Verify that you can search by
Implementation tips
The following are additional tips for implementing EQO:
- Applying a 70% threshold on the probabilities is optimal for identifying truly overlinked Entities. If you use a lower threshold, you may encounter more false positives.
- Make subsets with the tool. Take only the most overlinked Entities and their associated Documents to filter your ETLs. Focus only on the most complicated examples that are still of a manageable size. This avoids complicating the process.
- Analyzing Entities in the UI is challenging if you have many fuzzy Compounds. The tool is more useful when you have strict Compounds as it is able to find overlinked Entities that other tools may not. For this reason, use it first on strict Compounds to find issues such as bugs, parsing errors, and default values. Then add fuzzier Compounds one at a time and monitor the evolution through time.
- You can join the output to Entity Attributes on entity_id. To identify specific areas of improvement, aggregate the probabilities by an Attribute, such as country.
- The EQO feature is useful for Regression Testing and ensuring new changes do not negatively affect Entity Resolution.
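The threshold and join tips above could be sketched as a small post-processing step. This is a minimal example using only the Python standard library, not part of EQO. It assumes the output has been loaded as rows with entity_id and probability fields, and that Entity Attributes are available as a lookup keyed by entity_id. The field name probability, the function names, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 0.7  # the 70% threshold recommended above

# Hypothetical EQO output rows and Entity Attributes.
eqo_output = [
    {"entity_id": "e1", "probability": 0.92},
    {"entity_id": "e2", "probability": 0.15},
    {"entity_id": "e3", "probability": 0.75},
    {"entity_id": "e4", "probability": 0.40},
]
entity_attributes = {
    "e1": {"country": "GB"},
    "e2": {"country": "GB"},
    "e3": {"country": "FR"},
    "e4": {"country": "FR"},
}


def flag_overlinked(rows, threshold=THRESHOLD):
    """Return the entity_id of every row at or above the threshold."""
    return [row["entity_id"] for row in rows if row["probability"] >= threshold]


def mean_probability_by(rows, attributes, attribute):
    """Join rows to Attributes on entity_id and average the probability per value."""
    grouped = defaultdict(list)
    for row in rows:
        attrs = attributes.get(row["entity_id"])
        if attrs is None or attribute not in attrs:
            continue  # skip Entities with no matching Attribute
        grouped[attrs[attribute]].append(row["probability"])
    return {value: mean(probs) for value, probs in grouped.items()}


print(flag_overlinked(eqo_output))  # ['e1', 'e3']
print(mean_probability_by(eqo_output, entity_attributes, "country"))
```

The flagged list can then drive the subset of Entities you inspect in the UI, while the per-country averages point to the areas where overlinking is concentrated.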
Note: At the time of writing, the EQO tool is not available in general release.