It is standard practice for Quantexa engineers to develop in non-production environments with no access to real data. These environments are used to validate code changes and assess regression impacts prior to production releases. Generated data is used as a substitute for real data in these cases, enabling ETL to run and outputs to be produced for the UI. When done well, the outputs from generated data will lead to realistic-looking data points and networks in the UI.
General Approach
A generator class is created for each of the raw tables an ETL pipeline requires. This class contains logic to produce a case class object aligned to the schema of a given raw table. A data generator script is then created that executes the logic defined for every raw table.
The logic used to generate a given field can vary based on the field's type, its intended values, and whether it is used for entity resolution. A combination of functions from the org.scalacheck.Gen library and internal Quantexa libraries is used to serve these needs.
The general structure of a raw table generator is a function that uses a for comprehension to generate a value for each field, yielding a single case class object whose type aligns to the raw table. Configuration is provided to this class to set the desired number of output rows. See below for a simple example of a class used to generate records for the CustomerRiskRating table:
class CustomerRiskRatingGenerator(generatorConfig: GeneratorLoaderOptions) {

  def customerRiskRatingGen(id: Long): Gen[CustomerRiskRating] = {
    for {
      customer_id <- id.option(1)
      customer_risk_rating <- Gen.choose(0, 5).option(0.95)
    } yield {
      CustomerRiskRating(customer_id, customer_risk_rating)
    }
  }

  def generateRecords: Seq[CustomerRiskRating] =
    generateWithIds(customerRiskRatingGen, generatorConfig.numberOfCustomers)
}
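The generateWithIds helper used above comes from Quantexa's internal generator libraries, so its exact signature is an assumption here. Conceptually, it samples the row generator once per sequential id. A minimal plain-Scala sketch, with the ScalaCheck Gen represented by an Option-returning sampler (since Gen#sample returns an Option), might look like:

```scala
// Hypothetical sketch of a generateWithIds-style helper; the real helper is
// internal to Quantexa. Each row generator is sampled once per sequential id,
// and rows that fail to sample (None) are simply dropped.
def generateWithIds[T](rowGen: Long => Option[T], numberOfRows: Long): Seq[T] =
  (1L to numberOfRows).flatMap(id => rowGen(id))
```

With ScalaCheck in scope, the sampler for the example above would be id => customerRiskRatingGen(id).sample.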
This class could then be utilised as below (with Spark implicit conversions in scope for toDS) to create a dataset of the CustomerRiskRating type:
val customerRiskRatingRecords: Dataset[CustomerRiskRating] = new CustomerRiskRatingGenerator(config).generateRecords.toDS
The same approach would then be followed for all other tables used to construct the Customer model along with any raw tables for the other data sources a project has.
Common Use Cases
Dates, Timestamps and Numerical Data
The below example will generate a value for customer_start_dt between "2000-01-01" and "2023-01-01", and a value for customer_end_dt between the generated customer_start_dt and "2023-01-01". The field customer_end_dt will be populated approximately 30% of the time (and be None otherwise) due to the application of the option(0.3) method. A similar approach can be taken for the field load_date, which is of type Date rather than Timestamp:
import java.sql.Date
import java.sql.Date.valueOf
import java.sql.Timestamp

val customerStartDateRange = DateRange(valueOf("2000-01-01"), Some(valueOf("2023-01-01")))

for {
  customer_id <- id.option(1)
  customerStartRange <- genDateRangeBetween(customerStartDateRange)
  customer_start_dt <- new Timestamp(customerStartRange.from.getTime).option(1)
  customer_end_dt <- new Timestamp(customerStartRange.to.get.getTime).option(0.3)
  load_date <- Gen.choose(customerStartDateRange.from.getTime,
    customerStartDateRange.to.get.getTime).map(new Date(_))
} yield {
  CustomerDates(
    customer_id = customer_id,
    customer_start_dt = customer_start_dt,
    customer_end_dt = customer_end_dt,
    load_date = load_date)
}
The following statements can be included to generate fields of type Integer, Double and Long respectively:
Gen.choose(1, 100000)            // Integer
Gen.choose(0.0, 1.0)             // Double
Gen.choose(10000000L, 99999999L) // Long
Sampling from a Set of Values
It is common for fields to contain only certain specific values, each with a different likelihood of occurring. This situation is handled through use of the genCategory method, where inputs can be provided via configuration. The below example config file shows the expected format of the input string: a list of tuples, where the first element of each tuple is a probability and the second is the associated value:
config {
accountStates = "(0.95,Active)(0.05,Closed)"
}
This config parameter could then be referenced within a generator function via:
genCategory(generatorConfig.accountStates)
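genCategory is an internal Quantexa method, so the sketch below is an assumption about how such a helper might behave: parse the "(probability,value)" pairs out of the config string, then draw weighted samples. With ScalaCheck this sampling would typically be expressed via Gen.frequency; plain Scala is used here so the sketch is self-contained:

```scala
import scala.util.Random

// Hypothetical sketch of genCategory-style parsing; the real method is
// internal to Quantexa. Turns "(0.95,Active)(0.05,Closed)" into weighted pairs.
def parseCategories(spec: String): Seq[(Double, String)] =
  """\(([^,]+),([^)]+)\)""".r
    .findAllMatchIn(spec)
    .map(m => (m.group(1).toDouble, m.group(2)))
    .toSeq

// Draws one value, with each category's chance proportional to its weight.
def sampleCategory(categories: Seq[(Double, String)], rng: Random = new Random): String = {
  val target = rng.nextDouble() * categories.map { case (p, _) => p }.sum
  val cumulative = categories.scanLeft(0.0) { case (acc, (p, _)) => acc + p }.tail
  categories.zip(cumulative)
    .collectFirst { case ((_, value), cum) if cum >= target => value }
    .getOrElse(categories.last._2)
}
```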
The probabilities to use for each distinct value can be estimated by looking at the frequency of each value from a sample of the full table. Another benefit of storing the probabilities of specific values in configuration is removing the need to re-build the code between ETL runs during development.
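As a sketch of that estimation step, the relative frequency of each value in a sample can be computed and rendered in the config format shown above. An in-memory Seq is used here for illustration; on a project the sample would typically be drawn from the raw table via Spark, and the probabilities are rounded to two decimal places for readability:

```scala
// Sketch: estimate per-value probabilities from a sample of a real column and
// render them in the "(probability,value)" config format, most frequent first.
def toCategoryConfig(sample: Seq[String]): String = {
  val total = sample.size.toDouble
  sample.groupBy(identity).toSeq
    .map { case (value, rows) => (math.round(rows.size / total * 100) / 100.0, value) }
    .sortBy { case (probability, _) => -probability }
    .map { case (probability, value) => s"($probability,$value)" }
    .mkString
}
```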
Creating Connected Networks
Generating data that leads to entities linked across data sources is required to create realistic-looking networks in non-production environments. To ensure linkage of entities between data sets, values for the raw fields used to create compound elements can be sampled from a GeneratedEntities object. The following will create a single pool of entities which can be provided to the generators for separate data sources:
val generatedEntities: GeneratedEntities = generateCommonEntities(
  numberOfPeople = 100,
  numberOfAddresses = 100,
  numberOfBusinesses = 100,
  numberOfPhoneNumbers = 100,
  numberOfEmails = 100)

val customerRecords: Dataset[Customer] = new CustomerGenerator(config, generatedEntities).generateRecords.toDS
val watchlistRecords: Dataset[Watchlist] = new WatchlistGenerator(config, generatedEntities).generateRecords.toDS
The generators for each of these data sources can then obtain raw values from these entities via:
for {
  customer_id <- id.option(1)
  genPerson <- Gen.oneOf(generatedEntities.names)
  full_name <- genPerson.fullName.option(1)
  forename <- genPerson.forename.option(1)
  family_name <- genPerson.familyName.option(1)
} yield CustomerName(customer_id, full_name, forename, family_name)
This will ensure overlap in the values used to build individual names in the Customer and Watchlist documents.
Data Injection
Projects with detection use cases will commonly implement network scores that only trigger if certain connections and conditions occur. Such networks are difficult to produce using the standard approaches above but can be created using the "data injection" method. This works by creating sequences of data in the format of the data source models, with fields specifically populated to ensure entity resolution. This data can then be appended to the output of the create case class step for the given data source. The below minimal example could be used to create a network in which a customer document is connected to a watchlist document via a linked individual entity:
import com.quantexa.scoring.test.ZeroElement

val customer1 = ZeroElement[Customer].copy(
  individualName = "John Smith",
  customerStartDate = "2020-01-01",
  dateOfBirth = "1990-01-01",
  nationality = "Australian")

val watchlist1 = ZeroElement[Watchlist].copy(
  individualName = "John Smith",
  watchlistCategory = "Crime",
  dateOfBirth = "1990-01-01",
  nationality = "Australian")

val customerInjection: Dataset[Customer] = Seq(customer1).toDS.as[Customer]
val watchlistInjection: Dataset[Watchlist] = Seq(watchlist1).toDS.as[Watchlist]
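The injected records are then appended to the generated output for the same source before downstream ETL runs. On Spark Datasets this append is a union; the sketch below shows the same operation on plain sequences, with a simplified and hypothetical Customer case class, so it runs without a SparkSession:

```scala
// Simplified, hypothetical Customer shape for illustration only.
case class Customer(individualName: String, nationality: String)

val generatedCustomers: Seq[Customer] = Seq(Customer("Jane Doe", "British"))
val injectedCustomers: Seq[Customer] = Seq(Customer("John Smith", "Australian"))

// Equivalent to customerRecords.union(customerInjection) on Spark Datasets:
// the injected rows flow through the rest of the pipeline alongside the
// generated rows.
val allCustomers: Seq[Customer] = generatedCustomers ++ injectedCustomers
```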
Additional information can be found in the following documentation site pages: