Before you begin this ER in Action Tutorial, please make sure you have completed the Product Tour, which covers the details of how to use the Unify product.
This Tutorial guides you through using the Unify product to resolve the sample Data Sources and analyze the output in Quantexa Unify, Power BI, and Spark Notebooks.
Getting started
- First, navigate to Quantexa Unify and create a new project. Call it UnifyTutorial.
- Select the Add Data Source action:
- Then select Contoso:
Once bootstrapping is complete, you will be presented with the default data mapping. We are going to leave the mapping "as is" for our first iteration.
- Create the first Iteration and name it contoso, then click Run.
Once the Iteration is complete you will be presented with the run results:
- Here you can see that Quantexa Unify has resolved 127 individual Entities from a total of 281 input records. Keep this number in mind for later.
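The relationship between input records and resolved Entities can be pictured as a grouping: many input records share one Entity ID, and the resolved-Entity count is the number of distinct IDs. The following is an illustrative plain-Python sketch only (the record and Entity IDs are invented, and this is not Quantexa's internal representation):

```python
# Illustrative only: entity resolution assigns many input records to one
# entity ID, so the resolved-entity count is the number of distinct IDs.
# The customerIDs and entityIds below are made up for this sketch.
records = [
    {"customerID": "C1", "entityId": "E1"},
    {"customerID": "C2", "entityId": "E1"},  # same person as C1
    {"customerID": "C3", "entityId": "E2"},
]

distinct_entities = {r["entityId"] for r in records}
print(len(records), "records resolved to", len(distinct_entities), "entities")
```

In the tutorial run, the same counting gives 281 input records resolving to 127 Entities.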
Configuring the NationalID within the Entity Resolution
As described in the Product Tour, the next step is to add the nationalID to the configuration. Please map this into the Individual → nationalId field, then create and run a new Iteration called contosoIDs:
Once complete we can analyze the results.
Analyzing the results in Quantexa Unify
You can now compare the output of each Iteration by selecting them in turn.
The chart and the data tables show the outputs and you can see the total number of resolved Entities in each case:
On the left we can see the first Iteration with 127 Entities. On the right we can see the updated chart for the nationalID iteration, where we now have 125 resolved Entities. This means our configuration change has merged a small number of Entities within the system.
Let's analyze this in more detail.
Analyzing the results in Power BI
Let's open up the results of our second Iteration within Power BI so we can visualize the output.
Browse to the OneLake catalog and open the Semantic Model associated with the "contoso only" build. If you used the names in this tutorial, it will be called UnifyTutorial - contoso.
Within this semantic model you will find a Power BI report has been created automatically.
- Open the Power BI Report, then select Individual in the Pages panel on the left to view the resolved Individual Entities.
- Scroll to the data viewer at the bottom, and click anywhere within that table.
- To the right, you will see a Filters panel. Type the name DEBORA (uppercase), then select the two results. This will update the table to show three Entities with a name similar to Debora Mayer.
However, if we now repeat the same process with the NationalID iteration, we can see that these have resolved down to a single Entity with 6 records:
Accessing the raw data in OneLake
The sample data provided by Quantexa has been loaded into the same output Lakehouse where the Iterations are written to. Let’s open the Contoso file and take a look at the input data.
Browse to OneLake and open the Lakehouse you selected for the Iteration destination.
On the left Explorer panel, you will see a set of Tables. Click the contoso table to load it in the right-hand panel.
- On the Forename column header, select the “…” icon to open the menu.
- In the menu, uncheck the Select All toggle, then search for Debora.
You will get two results. Tick them and click Apply to filter the table.
These are the raw records used in the Entity Resolution process:
- Sort by CustomerAddress column:
You can see here that there are six records. The bottom four connect together because they share the same, or a very similar, address, and two of those records also match on date of birth.
If you ignore the national identifiers column, the top two records have ONLY a name and a country, which is clearly not enough to be certain they are the same person.
However, with the introduction of the NationalIDs, which do match in this instance, the top two records combine with the bottom Entity, producing a single resolved Entity.
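The merging behaviour described above can be sketched as a small connected-components exercise: records link when they share a strong attribute, and linked records resolve into one Entity. This is a toy illustration only, not Quantexa's actual matching logic, and all attribute values are invented:

```python
# Toy sketch of record linkage (NOT Quantexa's real algorithm): records
# link when they share an address, date of birth, or national ID, and
# linked records resolve into one entity. All values are invented.
from itertools import combinations

records = {
    1: {"address": None,     "dob": None,         "nid": "AB123"},
    2: {"address": None,     "dob": None,         "nid": "AB123"},
    3: {"address": "1 Main", "dob": "1980-01-01", "nid": "AB123"},
    4: {"address": "1 Main", "dob": "1980-01-01", "nid": None},
}

def resolve(records, keys):
    """Union-find over records; two records merge if any key value matches."""
    parent = {i: i for i in records}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for a, b in combinations(records, 2):
        if any(records[a][k] and records[a][k] == records[b][k] for k in keys):
            parent[find(a)] = find(b)
    return {find(i) for i in records}

# Without the national ID, records 1 and 2 stay isolated; with it, all merge.
print(len(resolve(records, ["address", "dob"])))
print(len(resolve(records, ["address", "dob", "nid"])))
```

This mirrors what we saw in the table: without national IDs the name-and-country-only records cannot merge, but once the matching IDs are considered, everything collapses into one Entity.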
Analyzing the results in a notebook
The above flow assumed that we knew which Entity had changed between the first and the second iteration.
However, what if we wanted to programmatically compare the results of the two iterations? We can do this by leveraging the Notebook functionality within Fabric.
- Within the Lakehouse where the output data is stored, select New Notebook:
- Now paste the following code. Configure the name of the Lakehouse you selected for the outputs as well as the Project and Iteration names, if you changed them:
# Configure the Quantexa Unify Project and iteration settings
lakehouse="EntityOutputs" #name of the lakehouse containing the output of the iterations
projectName="UnifyTutorial" #name of the Quantexa unify project
contosoOnlyIterationName="contoso" #Name of the iteration with contoso data and default configuration
contosoWithIDsIterationName="contosoIDs" #Name of the iteration with contoso data including the nationalID
#raw input data
contosoRecords = spark.sql(f"SELECT * FROM {lakehouse}.contoso")
#resolved entity output: Contoso Only
individualsContosoOnly = spark.sql(f"SELECT * FROM {lakehouse}.quantexa_{projectName}_{contosoOnlyIterationName}_individual_records")
#resolved entity output: Contoso with national IDs
individualsContosoWithIDs = spark.sql(f"SELECT * FROM {lakehouse}.quantexa_{projectName}_{contosoWithIDsIterationName}_individual_records")
#find any contoso records which have changed the entity they are associated with
changedDocuments=individualsContosoWithIDs.exceptAll(individualsContosoOnly)
display(changedDocuments)
#find all the entities in the "contosoWithIDs" build that contain a changed record
changedEntities=individualsContosoWithIDs.join(changedDocuments.select("entityId"), "entityId")
display(changedEntities)
#join on the raw data
entitiesWithRawData=changedEntities.join(contosoRecords, changedEntities["documentId"]==contosoRecords["customerID"], "INNER")
display(entitiesWithRawData.drop("entityType","documentType","documentId"))
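The key step above is exceptAll, which is a multiset difference: it keeps each row of the first DataFrame minus each matching occurrence in the second, so only records whose entity assignment changed survive. A plain-Python equivalent, with invented (documentId, entityId) pairs, looks like this:

```python
# exceptAll is a multiset difference: rows of the first input, with each
# matching occurrence in the second removed. The pairs below are invented.
from collections import Counter

with_ids = [("C1", "E1"), ("C2", "E1"), ("C3", "E1")]  # second iteration
only     = [("C1", "E1"), ("C2", "E2"), ("C3", "E2")]  # first iteration

# Rows whose entity assignment differs between the two iterations
changed = list((Counter(with_ids) - Counter(only)).elements())
print(changed)
```

Here C1 kept the same assignment in both runs and drops out, while C2 and C3 moved to a different Entity and are reported as changed.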
The following table shows the results:
We now have our original contoso table filtered to the records that were impacted by the NationalID change, which helps us understand why the Entity Resolution has changed between each build.
Enriching the raw data with the resolved Entity ID
We can also extend the above Notebook to produce a single table with the raw contoso information enriched with the Entity ID:
#join on the raw data
contosoWithRawData=individualsContosoWithIDs.join(contosoRecords, individualsContosoWithIDs["documentId"]==contosoRecords["customerID"], "INNER")
display(contosoWithRawData.drop("entityType","documentType","documentId"))
This produces the following results:
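The enrichment is a standard inner join: each resolved record (documentId) picks up its raw source row via the matching customerID. A minimal plain-Python version with made-up rows shows the shape of the result:

```python
# Minimal inner-join sketch: resolved records pick up their raw source
# attributes via documentId == customerID. All rows are invented.
resolved = [{"documentId": "C1", "entityId": "E1"},
            {"documentId": "C2", "entityId": "E1"}]
raw = {"C1": {"Forename": "Debora",  "Country": "US"},
       "C2": {"Forename": "Deborah", "Country": "US"}}

enriched = [{**r, **raw[r["documentId"]]}
            for r in resolved if r["documentId"] in raw]
print(enriched)
```

Each output row carries both the resolved entityId and the raw attributes, which is exactly the shape of the table above.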
We can also combine this with the aggregated resolved Entity information, that is, the data shown in the above Power BI dashboard:
#join on the entity data
resolvedIndividuals = spark.sql(f"SELECT * FROM {lakehouse}.quantexa_{projectName}_{contosoWithIDsIterationName}_individual_entities")
#join entity data to the raw data:
entityDetailsWithRaw=resolvedIndividuals.join(contosoWithRawData, "entityId")
display(entityDetailsWithRaw)
We now have a table that we can use to identify potential data quality issues, such as Entities with multiple National IDs or dates of birth. For example, if we filter by DateOfBirthCount=2, we can see Entities where the dates of birth vary within the Entity. In this case, it highlights some data entry issues which need to be resolved in the upstream data.
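The DateOfBirthCount check amounts to grouping records by Entity and flagging Entities whose records carry more than one distinct date of birth. Here is a plain-Python sketch of that logic, with invented entity IDs and dates:

```python
# Sketch of the DateOfBirthCount check: group rows by entity and flag
# entities with more than one distinct date of birth. Values are invented.
from collections import defaultdict

rows = [("E1", "1980-01-01"), ("E1", "1980-01-10"), ("E2", "1975-05-05")]

dobs = defaultdict(set)
for entity_id, dob in rows:
    dobs[entity_id].add(dob)

# Entities with conflicting dates of birth, i.e. candidate data-entry issues
suspect = [e for e, d in dobs.items() if len(d) > 1]
print(suspect)
```

Any Entity flagged this way is a candidate for an upstream data-entry fix, just as in the Power BI filter above.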
This concludes the ER in Action tutorial.
💡 So, what did we learn?
We have seen how to change the configuration of the Entity Resolution and how to analyze the results in a number of tools within Fabric, including Power BI, OneLake, and Spark Notebooks. All of this data is persisted in OneLake, enabling you to perform advanced Entity-level analytics.