This blog covers setting up and using Data Fusion, a low-code feature that simplifies data ingestion. It is designed to allow richer data sets to be ingested more quickly, speeding up ROI and providing a basis for the most valuable insights possible.
So what’s Data Fusion all about?
Figure 1. Data Fusion
Data Fusion is a framework that handles the data ingestion part of the Entity Resolution and Network Generation process. Once the framework is implemented, users do not need to understand complex code.
Fusion was introduced in The Quantexa Platform (TQP) version 1.6.0, so if your project is already on 1.6.0 (or higher) you should be able to use this feature. Otherwise, your team will have to upgrade.
To get everything set up, you’ll need a good understanding of the traditional data ingestion process, including Extract, Transform and Load (ETL), which will help you to direct and support others. The documentation site has enough information to familiarize yourself with this concept.
If you need any help, the Quantexa Academy team has developed a module on Data Fusion, which can be accessed on request by emailing your learning manager.
Your project can still have some data sources implemented the old way and new data sources implemented using Data Fusion…
— Shipra Bhatt, Data Engineer
Why is the feature useful?
Data Fusion provides a simpler process for getting data into the system, reduces reliance on Scala skills, and offers cleaner, more intuitive interfaces.
Features include:
- Generate code to cleanse, parse, and standardize raw fields.
- Specify which Entities exist within the data.
- Configure extraction of linking, Scoring, and display data for Entities.
- Generate scripts to run your compiled ETL code.
- Generate configuration for The Quantexa Platform.
- Declare which fields exist in your data.
What did you find out during implementation?
The feature is designed so that the system can handle both versions of the code (legacy ETL and Fusion) at once. This means your project can still have some data sources implemented the old way and new data sources implemented using Data Fusion. If you are planning to add a new data source to the system and the project is already on v1.6.0 (or higher), Data Fusion is recommended.
Design considerations
Estimation
As this is a new feature, make sure your estimates account for the learning curve the team will have to go through.
Custom functions
The Data Fusion core libraries contain useful functions for the cleansing stage. If a data source needs custom cleansing rules, you will need to write additional functions in Scala, which requires some Scala expertise.
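To give a feel for the level of Scala involved, here is a minimal sketch of a custom cleansing rule. It is plain Scala and deliberately does not use any Data Fusion API: the object name `CustomCleansers`, the function `cleanseName`, and the rule itself are hypothetical, and how such a function is wired into Fusion depends on the framework and your project setup.

```scala
import java.util.Locale

// Hypothetical custom cleansing rule. The names here are illustrative only
// and are not part of Data Fusion or The Quantexa Platform.
object CustomCleansers {

  /** Trims a raw name field, collapses runs of whitespace into a single
    * space, and upper-cases the result.
    * Returns None for null, empty, or whitespace-only input.
    */
  def cleanseName(raw: String): Option[String] =
    Option(raw)                                  // null becomes None
      .map(_.trim.replaceAll("\\s+", " "))       // normalize whitespace
      .map(_.toUpperCase(Locale.ROOT))           // locale-independent casing
      .filter(_.nonEmpty)                        // drop empty results
}
```

Returning an `Option` rather than an empty string keeps "no usable value" explicit, which tends to make downstream handling of missing fields easier to reason about.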
Aggregation Mode
Data Fusion supports Aggregation Mode, but the configuration is slightly more complex.
Implementation tips
- Quantexa supports projects having a mixture of Fusion and non-Fusion data sources, but project teams have found it easier to upgrade their legacy data sources to Fusion. Some things, like compounds, need to be consistent across sources, and having everything in Fusion made maintenance easier. Only one “mindset” was needed to, for example, add an attribute to every source.
- Moving to Fusion made upgrades easier. The project team did an upgrade with some sources in Fusion and some in legacy, and the upgrade (specifically of Parsers) was quicker on the Fusion sources.
For a step-by-step guide on using Fusion, check out our tutorial, which provides an introduction to the fundamentals of Data Fusion.
Build information
Version 1.6.0 TQP, April 2022
Additional Resources
Did you know that you can log in (or sign up) to the Community to unlock further resources there and on our Documentation site?