Data generation can be an invaluable and often crucial part of a Quantexa deployment. This post explores the benefits of adding data generators, provides guidance on best practices for implementation, and walks through some custom examples.
Why Generate Data?
Either out of necessity or to improve the development workflow, there are many benefits and reasons to add data generators:
- Development Without Real Data. Some teams start developing data integration processes without real or test data; they are only provided with data schemas from which to create the case classes for the raw data. Data generation can use these case classes to produce sets of test data, preventing possible delays when ingesting a data source and allowing developers to start building and testing the Extract, Transform, and Load (ETL) pipelines before any real data can be accessed.
- Development in Test Environments. Development teams may have to work in an environment that does not allow hosting real data. In this case, they can use data generators to create synthetic datasets. This can also allow development on local workstations rather than in a secure environment, meaning developers can deliver code faster, with the focus on development rather than data handling.
- Reduce Cost and Improve Efficiency. You often need to run a pipeline end-to-end, or individual stages of it, during development - either to ensure a full ETL pipeline works or to test a logic change. Real data can be substantial in volume and costly to process, and depending on the environment it is hosted in, the ETL code may need to be packaged and moved in order to test it, leading to significant time and resource consumption on each test iteration over the full set. Data generators can create smaller datasets that let developers work in their local environment, enabling efficient ETL testing while saving resources, time, and computing costs.
- Continuous Integration Testing. A deployment may require continuous integration testing, in which the full codebase is built, the ETL pipelines are run end-to-end, and the apps are deployed on a regular basis. Using real data here may not be feasible due to policies limiting its usage and processing, or for the reasons above. In some cases the content of the data is unimportant, as the tests are based purely on ensuring the deployment runs through end-to-end. Generating data at the start of the pipeline allows control over the size and content of the data processed in the test.
Best Practice Advice
Project Example
Project Example is a crucial resource when implementing data generators, and we highly recommend studying the example found there before starting work. It demonstrates best practices for folder structure and configuration, and provides guidance on more complex logic.
Seeding
Deployments often need static generated data:
- To ensure each developer on a Quantexa deployment uses the same test data for consistency.
- To process the same test data when running continuous integration tests, especially when regression testing.
For this we need seeding. A seed is a value passed in at run time to initialise a generator; using the same seed in an individual generator over many runs will produce the same outcome each time. Establish a uniform seeding approach across generators by ensuring the following (a sketch follows the list below):
- The `seed` values in the config remain the same.
- The number of Documents and Entities defined within the config remains the same.
- For versions before 2.7, ensure the same seed value from the config is used: pass it into the `generateEntities` function and into each individual generator's `generateWithIds` function. This ensures each generator in the script is initialised with the same seed and uses the same pool of shared Entities.
- For versions 2.7 and above, ensure the same Document and Entity seed is used: pass the `entitySeed` into `generateDefaultEntities` and the `documentSeed` into each data source's `DocumentGeneratorRunner` class definition. This ensures each Document generator in the script is initialised with the same seed and uses the same pool of shared Entities.
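To illustrate the principle, here is a minimal sketch of how a fixed seed makes a generator deterministic. The case class and generator are hypothetical stand-ins, not the Quantexa generator API:

```scala
import scala.util.Random

// Illustrative case class standing in for a generated raw record.
case class Customer(id: String, forename: String)

// A generator initialised with a seed: the same seed always
// yields the same sequence of Customers across runs.
class CustomerGenerator(seed: Long) {
  private val rng       = new Random(seed)
  private val forenames = Seq("Alice", "Bob", "Carol", "Dave")

  def generate(n: Int): Seq[Customer] =
    (1 to n).map(i => Customer(f"CUST-$i%04d", forenames(rng.nextInt(forenames.length))))
}

object SeedingDemo extends App {
  val runA = new CustomerGenerator(seed = 42L).generate(3)
  val runB = new CustomerGenerator(seed = 42L).generate(3)
  assert(runA == runB) // identical output: safe for regression tests
  println(runA)
}
```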
Enhancing With Real Data Insight
If real data is accessible, understanding this data will help to generate representative test data.
- Extract Statistical Data. Analyze the real data to determine the frequency and proportions of different field populations (a profiling sketch follows at the end of this section).
- Ensure Specific Fields Are Generated. Make sure that the specific fields required for your tests are generated as expected.
- Establish Field Dependencies. When fields typically share specific values, make them dependent on each other in the data generator. For example, a `businessName` field is only populated if the `isBusiness` field is true (a minimal sketch follows this list). See the `ImportedCustomerGenerator.scala` script in Project Example for an in-depth example.
- Identify Linking Frequencies. Find how often datasets or models link together. This helps maintain realistic Network structures.
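A minimal sketch of such a field dependency, assuming illustrative field names and proportions rather than the actual generator API:

```scala
import scala.util.Random

// Illustrative raw record: businessName should only exist for businesses.
case class RawCustomer(isBusiness: Boolean, businessName: Option[String])

class RawCustomerGenerator(seed: Long) {
  private val rng   = new Random(seed)
  private val names = Seq("Acme Ltd", "Globex PLC", "Initech GmbH")

  def generate(): RawCustomer = {
    // Assumed proportion: ~30% of customers are businesses,
    // a figure that would come from analysis of the real data.
    val isBusiness = rng.nextDouble() < 0.3
    // businessName is populated only when isBusiness is true,
    // keeping the two fields consistent with each other.
    val businessName = if (isBusiness) Some(names(rng.nextInt(names.length))) else None
    RawCustomer(isBusiness, businessName)
  }
}
```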
For further Network and volume tuning please refer to the documentation here.
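Returning to the first point above, extracting statistical data can be scripted. The hedged Spark sketch below computes the population rate of each column in a real dataset; the input path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FieldPopulationStats extends App {
  val spark = SparkSession.builder().appName("field-stats").master("local[*]").getOrCreate()

  // Placeholder path: point this at the real dataset being profiled.
  val real = spark.read.parquet("/path/to/real/customers")

  // Fraction of non-null values per column, to be mirrored by the generators.
  val populationRates = real.select(
    real.columns.map(c => (count(col(c)) / count(lit(1))).alias(c)): _*
  )
  populationRates.show()
}
```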
Effort and Purpose
This section outlines the various stages of implementing a data generator module and configuring a generator for a specific Document. The initial stage only needs to be completed once, when adding a new data generator module; the remaining stages are repeated for each new Document generator you configure. Apart from the initial stage, all stages are optional and should only be factored into planning based on the requirements and the fidelity of generated data needed. The stages are as follows:
| Stage Name | Time Guidance per Document | Tasks |
| --- | --- | --- |
| Initial Implementation | 0 - 0.5 Day | Data Generation module added to the repository using the Repository Tool, if not added on the initial creation of the repo, as shown here. |
| Document Configuration | 0.5 - 1 Day | Document generator script added (from step 2 onwards in the documentation) and population logic enhanced by: gaining insight from real data (see Enhancing With Real Data Insight in the Best Practice section; note this depends on DQ reports from real data); combining these insights with generator utility functions to implement advanced population logic, for example using `genCategory` to mimic the frequency of particular values (a sketch follows the table); and, if needed, applying Document seeding (see Seeding in the Best Practice section). |
| Entity Configuration | 0.5 - 2 Days | Entities can be enhanced in various ways: setting up Entity seeding and a shared pool of Entities for each Document generator to populate from; using real data statistics to generate Entities that are more representative of those found in the data (note this depends on DQ reports from real data); and ensuring Entity data stays consistent when populating Documents (see Enriched Generated Entities under Technical Posts below). |
| Network Configuration | Ongoing throughout development | An open-ended and iterative stage: refinements and adjustments to the preceding stages will be needed to obtain the desired Network patterns. Currently, the process for Network tuning differs for each deployment. Guidance and previous examples can be found in the documentation and the technical posts linked below. |
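The table mentions generator utilities such as `genCategory` for mimicking value frequencies. As a hedged illustration of the idea (the actual Quantexa utility may differ in name and signature), a weighted-category picker could look like this:

```scala
import scala.util.Random

object WeightedCategory {
  // Pick a value with probability proportional to its weight, e.g. to
  // mimic value frequencies observed in a DQ report on the real data.
  def pick[T](rng: Random, weighted: Seq[(T, Double)]): T = {
    var roll = rng.nextDouble() * weighted.map(_._2).sum
    for ((value, weight) <- weighted) {
      if (roll < weight) return value
      roll -= weight
    }
    weighted.last._1 // fall-through guard for floating-point rounding
  }
}

// Example: account statuses with frequencies taken from real-data analysis.
object WeightedCategoryDemo extends App {
  val rng      = new Random(7L)
  val statuses = Seq("ACTIVE" -> 0.7, "DORMANT" -> 0.2, "CLOSED" -> 0.1)
  println(WeightedCategory.pick(rng, statuses))
}
```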
Using the guidance below, understand the purpose of the data you wish to generate and identify which of the stages are required. This ensures the level of effort planned into scope aligns with a deployment's needs, providing better value in terms of both setup time and cost and the quality of the resulting generated data.
Simple Document Configuration
This stage requires minimal configuration on top of those mentioned from step 2 onwards in the documentation - only that fields in a case class are assigned generation logic using the default functions available, with little to no information from the real dataset. This is useful when a small set of test data is required in a short amount of time. It has the benefits of a quick implementation and does not need any real data to aid in configuration - only the schema. The downsides are that the data itself may not be representative of the real data, and the Networks and Entities resolved from it may not be well formed.
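As an illustration of this minimal level, the sketch below assigns default random logic to every field of a hypothetical case class, using no knowledge of the real data beyond the schema:

```scala
import java.time.LocalDate
import scala.util.Random

// Hypothetical raw schema: only the case class shape is known.
case class RawAccount(accountId: String, openedDate: LocalDate, balance: Double)

class RawAccountGenerator(seed: Long) {
  private val rng = new Random(seed)

  // Default, data-agnostic generation functions for each field type.
  private def randomId(prefix: String): String = f"$prefix-${rng.nextInt(1000000)}%06d"
  private def randomDate(): LocalDate = LocalDate.of(2000, 1, 1).plusDays(rng.nextInt(9000))

  def generate(n: Int): Seq[RawAccount] =
    (1 to n).map(_ => RawAccount(randomId("ACC"), randomDate(), rng.nextDouble() * 10000))
}
```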
Example Usages
- Delayed Access to Real Data. When access to the real data is delayed, this allows a developer to quickly obtain a set of test data and start developing the ETL pipelines, provided a data schema is available.
- ETL Pipeline Testing. When regular integration tests are required to verify the build of the code and ensure ETL pipelines run through without errors. This assumes no qualitative checks.
Document Configuration
Further configuration of the Document model may be required when the generated data will be used throughout development, especially where the contents of a Document matter but Networks and Entities do not. It requires the real data, or at the very least a Data Quality (DQ) report, to inform the developer how to populate the model. Its downside is that the Networks and Entities resolved from the data may not be well formed.
Example Usages
- Partial Development in Test Environments. For development of ETL and Document UI components when developers work on local workstations or in a test environment where access to the real data is prohibited. The improved population of the fields in a Document model allows for stronger cohesion of the work between environments.
- ETL Logic Testing. Some ETL pipelines contain custom logic that requires data representative of the real data in order to test. For example, some Documents are formed from multiple raw files that are linked by custom logic in the Create Case Class stage of ETL; see Customer and Accounts in Project Example. Ensuring that linking fields are appropriately populated allows the custom logic to be tested (a sketch follows below).
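A hedged sketch of generating two raw files with a consistent linking key; the names are illustrative, though Project Example's Customer and Accounts sources follow the same idea:

```scala
import scala.util.Random

// Two raw files that the Create Case Class stage joins on customerId.
case class RawCustomer(customerId: String, name: String)
case class RawAccount(accountId: String, customerId: String)

class LinkedSourceGenerator(seed: Long) {
  private val rng = new Random(seed)

  def generate(nCustomers: Int): (Seq[RawCustomer], Seq[RawAccount]) = {
    val customers = (1 to nCustomers).map(i => RawCustomer(f"CUST-$i%04d", s"Customer $i"))
    // Each customer owns one to three accounts referencing its customerId,
    // so the join logic in the ETL can be exercised by the test data.
    val accounts = customers.flatMap { c =>
      (1 to 1 + rng.nextInt(3)).map(j => RawAccount(s"${c.customerId}-A$j", c.customerId))
    }
    (customers, accounts)
  }
}
```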
Entity Configuration
This stage ensures Entity data is generated realistically and distributed consistently across data sources, resulting in more realistic resolved Entities and standard Network patterns. This allows the full development of a standard deployment (ETL, Batch Resolver, Scoring, and UI). Due to this level of configuration, a developer needs to have had access to the real data and a basic Batch Resolver output produced from it in order to obtain all the relevant information to configure the generators; this may not always be possible. It also increases the time needed to set up a data generator.
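As a hedged illustration of the shared-pool idea (not the `generateDefaultEntities` API itself), the two Document generators below draw individuals from the same seeded pool, so the same people appear across sources and can later be resolved into Entities:

```scala
import scala.util.Random

case class Individual(forename: String, surname: String)

// A pool built once from the entitySeed and shared by every Document generator.
object EntityPool {
  def build(entitySeed: Long, size: Int): IndexedSeq[Individual] = {
    val rng       = new Random(entitySeed)
    val forenames = Vector("Alice", "Bob", "Carol", "Dave")
    val surnames  = Vector("Smith", "Jones", "Taylor", "Khan")
    Vector.fill(size)(
      Individual(forenames(rng.nextInt(forenames.length)), surnames(rng.nextInt(surnames.length)))
    )
  }
}

// Each source samples from the shared pool with its own documentSeed,
// so the same Individuals recur across data sources.
class DocumentGenerator(documentSeed: Long, pool: IndexedSeq[Individual]) {
  private val rng = new Random(documentSeed)
  def generate(n: Int): Seq[Individual] = Seq.fill(n)(pool(rng.nextInt(pool.length)))
}

object SharedPoolDemo extends App {
  val pool    = EntityPool.build(entitySeed = 1L, size = 100)
  val sourceA = new DocumentGenerator(documentSeed = 2L, pool).generate(10)
  val sourceB = new DocumentGenerator(documentSeed = 3L, pool).generate(10)
  println(sourceA.toSet intersect sourceB.toSet) // overlapping Individuals resolve across sources
}
```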
Example Usages
- Full Development in Test Environments. Generated data with configured Entities allows full development on a standard deployment, meaning that restrictions on accessing and processing real data will not delay work. This enables testing of the Batch Resolver, Scoring, and Entity/basic Network based UI components.
- Reduce Cost and Improve Efficiency. Full pipeline testing with qualitative checks may be required - i.e. Scoring rates and Batch Resolver analysis - when the real data is large and costly to process. Being able to generate a smaller, representative set of data helps keep costs down.
Network Configuration
The previous stage will produce basic Networks that are suitable for most deployments' needs. However, some deployments require analysis of specific or complex Network patterns, particularly in the development and testing of Graph Scripting and Scoring. When these are required, it is recommended to undertake the Network Configuration stage. Note that, as a framework for this level of configuration and tuning is not yet defined, this can be a time-consuming process.
Technical Posts
Generator Deep Dive
Generating Data for the Quantexa Platform
This post provides in-depth technical advice on creating a case class generator. It covers common techniques and the helper tools used for field population, as well as how to use shared Entities to create Networks. An approach called 'Data Injection' is also discussed, which is useful for ensuring a set of data will trigger Scores for testing.
Enriched Generated Entities
Data Generators: Enhancing Data Generation with Seeding
A deployment has been able to enhance how Entities are generated and shared between sources, focusing on the consistency of data shared between Entities across sources - for example, a business having the same address on different Documents. This led to more linkages and more realistic Networks for testing. This work can aid similar implementations when working on the Entity and Network Configuration stages explained above.
Useful Links