It is standard practice for Quantexa engineers to develop in non-production environments with no access to real data. These environments are used to validate code changes and assess regression impacts prior to production releases. Generated data is used as a substitute for real data in these cases, enabling ETL to run and outputs to be produced for the UI. When done well, the outputs from generated data will lead to realistic-looking data points and networks in the UI.
General Approach
A generator class is created for each of the raw tables an ETL pipeline requires. This class contains logic to produce a case class object aligned to the schema of a given raw table. A data generator script is then created that executes the logic defined for every raw table.
The logic used to generate a given field can vary based on the field's type, its intended values, and whether it is used for entity resolution. A combination of functions from the org.scalacheck.Gen library and internal Quantexa libraries is used to serve these needs.
The general structure of a raw table generator is a function that uses a for comprehension to generate a value for each field, yielding a single case class object whose type aligns to the raw table. Configuration is provided to this class to set the desired number of output rows. See below for a simple example of a class used to generate records for the CustomerRiskRating table:
class CustomerRiskRatingGenerator(generatorConfig: GeneratorLoaderOptions) {

  def customerRiskRatingGen(id: Long): Gen[CustomerRiskRating] = {
    for {
      customer_id <- id.option(1)
      customer_risk_rating <- Gen.choose(0, 5).option(0.95)
    } yield {
      CustomerRiskRating(customer_id, customer_risk_rating)
    }
  }

  def generateRecords: Seq[CustomerRiskRating] =
    generateWithIds(customerRiskRatingGen, generatorConfig.numberOfCustomers)
}
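The generateWithIds helper used above comes from Quantexa's internal generator libraries, so its exact signature is an assumption here. Conceptually, it samples the row generator once per sequential id. A minimal plain-Scala sketch, with the ScalaCheck Gen represented by an Option-returning sampler (since Gen#sample returns an Option), might look like:

```scala
// Hypothetical sketch of a generateWithIds-style helper; the real helper is
// internal to Quantexa. Each row generator is sampled once per sequential id,
// and rows that fail to sample (None) are simply dropped.
def generateWithIds[T](rowGen: Long => Option[T], numberOfRows: Long): Seq[T] =
  (1L to numberOfRows).flatMap(id => rowGen(id))
```

With ScalaCheck in scope, the sampler for the example above would be id => customerRiskRatingGen(id).sample.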
This class could then be utilised as below (with Spark implicit conversions in scope for toDS) to create a dataset of the CustomerRiskRating type:
val customerRiskRatingRecords: Dataset[CustomerRiskRating] = new CustomerRiskRatingGenerator(config).generateRecords.toDS
The same approach would then be followed for all other tables used to construct the Customer model along with any raw tables for the other data sources a project has.
Common Use Cases
Dates, Timestamps and Numerical Data
The below example will generate a value for customer_start_dt between "2000-01-01" and "2023-01-01", and a value for customer_end_dt between the generated customer_start_dt and "2023-01-01". The field customer_end_dt will be populated approximately 30% of the time (and be None otherwise) due to the application of the option(0.3) method. A similar approach can be taken for the field load_date, which is of type Date rather than Timestamp:
import java.sql.Date
import java.sql.Date.valueOf
import java.sql.Timestamp

val customerStartDateRange = DateRange(valueOf("2000-01-01"), Some(valueOf("2023-01-01")))

for {
  customer_id <- id.option(1)
  customerStartRange <- genDateRangeBetween(customerStartDateRange)
  customer_start_dt <- new Timestamp(customerStartRange.from.getTime).option(1)
  customer_end_dt <- new Timestamp(customerStartRange.to.get.getTime).option(0.3)
  load_date <- Gen.choose(customerStartDateRange.from.getTime,
    customerStartDateRange.to.get.getTime).map(new Date(_))
} yield {
  CustomerDates(
    customer_id = customer_id,
    customer_start_dt = customer_start_dt,
    customer_end_dt = customer_end_dt,
    load_date = load_date)
}
The following statements can be included to generate fields of type Integer, Double and Long respectively:
Gen.choose(1, 100000)            // Integer
Gen.choose(0.0, 1.0)             // Double
Gen.choose(10000000L, 99999999L) // Long
Sampling from a Set of Values
It is common for fields to contain only certain specific values, each with a different likelihood of occurring. This situation is handled through use of the genCategory method, where inputs can be provided via configuration. The below example config file shows the expected format of the input string: a list of tuples, where the first element of each tuple is a probability and the second is the associated value:
config {
accountStates = "(0.95,Active)(0.05,Closed)"
}
This config parameter could then be referenced within a generator function via:
genCategory(generatorConfig.accountStates)
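genCategory is an internal Quantexa method, so the sketch below is an assumption about how such a helper might behave: parse the "(probability,value)" pairs out of the config string, then draw weighted samples. With ScalaCheck this sampling would typically be expressed via Gen.frequency; plain Scala is used here so the sketch is self-contained:

```scala
import scala.util.Random

// Hypothetical sketch of genCategory-style parsing; the real method is
// internal to Quantexa. Turns "(0.95,Active)(0.05,Closed)" into weighted pairs.
def parseCategories(spec: String): Seq[(Double, String)] =
  """\(([^,]+),([^)]+)\)""".r
    .findAllMatchIn(spec)
    .map(m => (m.group(1).toDouble, m.group(2)))
    .toSeq

// Draws one value, with each category's chance proportional to its weight.
def sampleCategory(categories: Seq[(Double, String)], rng: Random = new Random): String = {
  val target = rng.nextDouble() * categories.map { case (p, _) => p }.sum
  val cumulative = categories.scanLeft(0.0) { case (acc, (p, _)) => acc + p }.tail
  categories.zip(cumulative)
    .collectFirst { case ((_, value), cum) if cum >= target => value }
    .getOrElse(categories.last._2)
}
```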
The probabilities to use for each distinct value can be estimated by looking at the frequency of each value from a sample of the full table. Another benefit of storing the probabilities of specific values in configuration is removing the need to re-build the code between ETL runs during development.
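As a sketch of that estimation step, the relative frequency of each value in a sample can be computed and rendered in the config format shown above. An in-memory Seq is used here for illustration; on a project the sample would typically be drawn from the raw table via Spark, and the probabilities are rounded to two decimal places for readability:

```scala
// Sketch: estimate per-value probabilities from a sample of a real column and
// render them in the "(probability,value)" config format, most frequent first.
def toCategoryConfig(sample: Seq[String]): String = {
  val total = sample.size.toDouble
  sample.groupBy(identity).toSeq
    .map { case (value, rows) => (math.round(rows.size / total * 100) / 100.0, value) }
    .sortBy { case (probability, _) => -probability }
    .map { case (probability, value) => s"($probability,$value)" }
    .mkString
}
```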
Creating Connected Networks
Generating data that leads to entities linked across data sources is required to create realistic-looking networks in non-production environments. To ensure linkage of entities between data sets, values for the raw fields used to create compound elements can be sampled from a GeneratedEntities object. The following will create a single pool of entities which can be provided to the generators for separate data sources:
val generatedEntities: GeneratedEntities = generateCommonEntities(
  numberOfPeople = 100,
  numberOfAddresses = 100,
  numberOfBusinesses = 100,
  numberOfPhoneNumbers = 100,
  numberOfEmails = 100)

val customerRecords: Dataset[Customer] = new CustomerGenerator(config, generatedEntities).generateRecords.toDS
val watchlistRecords: Dataset[Watchlist] = new WatchlistGenerator(config, generatedEntities).generateRecords.toDS
The generators for each of these data sources can then obtain raw values from these entities via:
for {
  customer_id <- id.option(1)
  genPerson <- Gen.oneOf(generatedEntities.names)
  full_name <- genPerson.fullName.option(1)
  forename <- genPerson.forename.option(1)
  family_name <- genPerson.familyName.option(1)
} yield CustomerName(customer_id, full_name, forename, family_name)
This will ensure overlap in the values used to build individual names in the Customer and Watchlist documents.
Data Injection
Projects with detection use cases will commonly implement network scores that only trigger if certain connections and conditions occur. Such networks are difficult to produce using the standard approaches above but can be created using the "data injection" method. This works by creating sequences of data in the format of the data source models, with fields specifically populated to ensure entity resolution. This data can then be appended to the output of the create case class step for the given data source. The below minimal example could be used to create a network in which a customer document is connected to a watchlist document via a linked individual entity:
import com.quantexa.scoring.test.ZeroElement

val customer1 = ZeroElement[Customer].copy(
  individualName = "John Smith",
  customerStartDate = "2020-01-01",
  dateOfBirth = "1990-01-01",
  nationality = "Australian")

val watchlist1 = ZeroElement[Watchlist].copy(
  individualName = "John Smith",
  watchlistCategory = "Crime",
  dateOfBirth = "1990-01-01",
  nationality = "Australian")

val customerInjection: Dataset[Customer] = Seq(customer1).toDS.as[Customer]
val watchlistInjection: Dataset[Watchlist] = Seq(watchlist1).toDS.as[Watchlist]
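The injected records are then appended to the generated output for the same source before downstream ETL runs. On Spark Datasets this append is a union; the sketch below shows the same operation on plain sequences, with a simplified and hypothetical Customer case class, so it runs without a SparkSession:

```scala
// Simplified, hypothetical Customer shape for illustration only.
case class Customer(individualName: String, nationality: String)

val generatedCustomers: Seq[Customer] = Seq(Customer("Jane Doe", "British"))
val injectedCustomers: Seq[Customer] = Seq(Customer("John Smith", "Australian"))

// Equivalent to customerRecords.union(customerInjection) on Spark Datasets:
// the injected rows flow through the rest of the pipeline alongside the
// generated rows.
val allCustomers: Seq[Customer] = generatedCustomers ++ injectedCustomers
```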
Additional information can be found in the following documentation site pages: