This article covers how a deployment used the JavaFaker library and seeding to enhance Entity Generation from data generators. Specifically, it shows how to ensure that consistent information is generated for each Entity and across Document types to create realistic Networks.
Prerequisite Reading
To ensure you have a solid background on this topic, check out the Data Generators page on the documentation site and this community article on best practices before proceeding.
Why is this needed?
In a standard implementation of data generators, you might create the same business names across data sources. However, due to the randomness of data generation, there is no guarantee that the incorporation date, for example, would be the same. Also, a generated address in one Document is unlikely to match the address linked to the business in other documents. This complicates validating and testing Entity resolution on generated data because the two businesses lack enough shared information to be identified as the same Entity.
How can you generate consistent data across documents?
Note: The seed value referred to here is the one passed into the `generateWithIds` function. It ensures that test data generated by a generator remains the same over multiple iterations. However, using the same seed across different generators will not affect cross-Entity or cross-Document linking.
You can define `createEntity` functions for consistent JavaFaker calls when creating entities
A Faker object created with a seed returns the same sequence of values only if its functions are called in the same order. If the order of function calls changes between iterations, the results may vary significantly.
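To illustrate the effect of call order, here is a minimal sketch that uses `java.util.Random` directly as a stand-in for a seeded Faker object. The `pick` helper, the `SketchIndividual` class, and the sample value lists are hypothetical and exist only for this illustration.

```scala
import java.util.Random

// Hypothetical sample values; a real generator would call Faker instead.
val names = Vector("Alice Smith", "Bob Jones", "Carla Diaz", "Dev Patel")
val jobs  = Vector("Engineer", "Analyst", "Designer", "Auditor")

// Hypothetical helper: picks one element using the generator's next draw.
def pick(values: Vector[String], random: Random): String =
  values(random.nextInt(values.size))

case class SketchIndividual(name: String, jobTitle: String)

// Draws the name first, then the job title.
def nameThenJob(seed: Int): SketchIndividual = {
  val random = new Random(seed)
  val name = pick(names, random)
  val jobTitle = pick(jobs, random)
  SketchIndividual(name, jobTitle)
}

// Same seed, but the job title is drawn first.
def jobThenName(seed: Int): SketchIndividual = {
  val random = new Random(seed)
  val jobTitle = pick(jobs, random)
  val name = pick(names, random)
  SketchIndividual(name, jobTitle)
}

val a = nameThenJob(42)
val b = jobThenName(42)
// The first draw lands on the same list position in both functions, so the
// position of a.name in `names` equals the position of b.jobTitle in `jobs`.
// Each value follows its position in the call order, not the field it fills.
```

Calling `nameThenJob(42)` repeatedly always returns the same result, which is the property a fixed call order is meant to preserve.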
Defining `createEntity` functions ensures deterministic outcomes. These functions consistently call Faker functions in a specific order. For example, consider the following `createIndividualEntity` function:
    import java.util.Random
    import net.datafaker.Faker

    // Fields to keep consistent for a generated individual.
    case class GenIndividual(
      name: String,
      nationality: String,
      jobTitle: String
    )

    // Builds an individual from a seed. The Faker calls always run in the
    // same order, so the same seed produces the same individual.
    def createIndividualEntity(seed: Int): GenIndividual = {
      val individualFaker = new Faker(new Random(seed))
      GenIndividual(
        name = individualFaker.name().fullName(),
        nationality = individualFaker.nation().nationality(),
        jobTitle = individualFaker.job().title()
      )
    }
Note: In versions of the Quantexa Platform prior to 2.7.0, the library `com.github.javafaker.Faker` is used instead of `net.datafaker.Faker`. These two libraries contain some different Faker objects for populating fields, so be mindful of these variations. Refer to Project Example for specific examples relevant to your Quantexa Platform version.
[Figure: visual representation of the createIndividualEntity function.]
You can extend these functions to include any fields you want to keep consistent across data generators.
You can use defined seed ranges to force associations across Document and Entity types
Align Seed Values
Ensure you use the same seed values across different data generators to align the generated entities.
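As a rough illustration of seed alignment, the sketch below builds two different Document types from the same business seed. `createBusinessEntity` here is a hypothetical stand-in that uses `java.util.Random` instead of Faker, and the document classes, field names, and seed value are assumptions made for the example.

```scala
import java.util.Random

// Hypothetical stand-in for a Faker-backed createBusinessEntity function.
case class GenBusiness(name: String, incorporationYear: Int)

def createBusinessEntity(seed: Int): GenBusiness = {
  val random = new Random(seed)
  GenBusiness(
    name = s"Business-${random.nextInt(100000)}",
    incorporationYear = 1980 + random.nextInt(40)
  )
}

// Two hypothetical Document types produced by different data generators.
case class CustomerDocument(customerId: Int, business: GenBusiness)
case class TransactionDocument(transactionId: Int, counterparty: GenBusiness)

// Both generators use business seed 17, so both documents carry the same
// business name and the same incorporation year.
val customer = CustomerDocument(1, createBusinessEntity(17))
val transaction = TransactionDocument(900, createBusinessEntity(17))
```

Because the business details match field for field, Entity resolution has enough shared information to treat the two records as the same business.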
Configure Pool Sizes
In the config, define the size of the pools for business, individual, and address entities that data generators choose from. Align these pool sizes with your data.
Reducing these pool sizes increases overlap both within and across data sources, creating tightly linked networks. For example, if you create 100 business documents that each pick one individual from a pool of 10, each individual will appear about 10 times on average. This results in a highly interconnected network.
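The arithmetic behind the pool size example can be checked with a small simulation. This is a sketch that assumes each document picks exactly one individual seed uniformly at random; the variable names and the simulation seed are illustrative only.

```scala
import java.util.Random

val documentCount = 100      // number of business documents
val individualPoolSize = 10  // number of individual seeds to choose from

// Arbitrary seed so the simulation itself is repeatable.
val random = new Random(1)

// One individual seed chosen per document.
val chosenSeeds = Vector.fill(documentCount)(random.nextInt(individualPoolSize))

// Number of documents in which each individual seed appears.
val appearances: Map[Int, Int] =
  chosenSeeds.groupBy(identity).map { case (seed, hits) => seed -> hits.size }

// Expected appearances per individual: 100 / 10 = 10.0
val averageAppearances = documentCount.toDouble / individualPoolSize
```

Individual counts in `appearances` vary around that average, and shrinking `individualPoolSize` raises it, which is what produces the denser network.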
Creating Real-Life Associations
Use these seeds to create associations between Entity types. Passing the same seed into all `createEntity` functions generates a “context” representing real-life associations between individuals, addresses, and businesses.
This ensures more realistic Entity resolution in the generated data. Compounds created from associated businesses, addresses, or individuals will be consistent across documents.
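The idea of a seed-driven context can be sketched as follows. The three `create...` functions are hypothetical stand-ins built on `java.util.Random` rather than Faker, and `GenContext` and `createContext` are names introduced only for this illustration.

```scala
import java.util.Random

// Hypothetical stand-ins for Faker-backed createEntity functions.
def createIndividualName(seed: Int): String =
  s"Individual-${new Random(seed).nextInt(100000)}"

def createAddressLine(seed: Int): String = {
  val random = new Random(seed)
  s"${1 + random.nextInt(200)} Street-${random.nextInt(1000)}"
}

def createBusinessName(seed: Int): String =
  s"Business-${new Random(seed).nextInt(5000)}"

// One seed yields one context: an individual, an address, and a business
// that belong together.
case class GenContext(individual: String, address: String, business: String)

def createContext(seed: Int): GenContext =
  GenContext(
    individual = createIndividualName(seed),
    address = createAddressLine(seed),
    business = createBusinessName(seed)
  )

// The same seed used in two different documents gives the same associations.
val contextInDocumentA = createContext(12)
val contextInDocumentB = createContext(12)
```

Any compound built from these values, such as an individual paired with an address, is then identical wherever seed 12 is used.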
In the following example, you can see there are two approaches to generating individuals and addresses. To best reflect the behaviors seen in real data, you must combine both approaches. Usually, pick the associated address for each individual, but occasionally choose a different address to introduce variability in the generated data.
First Approach:
- Choose an individual randomly from the individual range. Then select an address randomly from the address range.
- Both the individual and address may appear elsewhere in the data but are unlikely to have associations with each other elsewhere.
Second Approach:
- Pick a seed randomly from the individual range and use it to get both the individual and address details.
- This method increases the probability that these two entities appear together in other documents.
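Combining the two approaches could look like the sketch below. The seed ranges, the 0.9 probability, and the `pickSeeds` helper are assumptions chosen for illustration, not values taken from the deployment.

```scala
import java.util.Random

val individualSeeds = 0 until 20  // assumed individual seed range
val addressSeeds = 0 until 20     // assumed address seed range

// Assumed share of documents that should use the associated address.
val associatedAddressProbability = 0.9

// Returns (individualSeed, addressSeed) for one document.
def pickSeeds(random: Random): (Int, Int) = {
  val individualSeed = individualSeeds(random.nextInt(individualSeeds.size))
  val addressSeed =
    if (random.nextDouble() < associatedAddressProbability)
      individualSeed // second approach: reuse the individual's seed
    else
      addressSeeds(random.nextInt(addressSeeds.size)) // first approach: independent pick
  (individualSeed, addressSeed)
}

// Arbitrary seed so the selection is repeatable.
val random = new Random(7)
val pairs = Vector.fill(1000)(pickSeeds(random))

// Share of documents where the individual and address come from the same seed.
val associatedShare =
  pairs.count { case (individualSeed, addressSeed) => individualSeed == addressSeed }.toDouble / pairs.size
```

Each pair of seeds would then be passed to the relevant `createEntity` functions. Lowering `associatedAddressProbability` introduces more variability at the cost of weaker individual-to-address links.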