Description

GenRocket can generate virtually any volume and variety of synthetic test data, but some use cases require existing production data. Production data may contain sensitive values that must be secured, and application testing usually needs only a subset of that data rather than all of it.


GenRocket uses a process called Synthetic Data Masking (SDM) to mask sensitive data values with synthetic data values in files and databases. Data Subsetting capabilities are also available for databases. SDM and subsetting can be used together or independently. Both work based on the specifications provided by the user. 



Note: GenRocket's SDM and Subset features are not an ETL tool. SDM with subsetting cannot be used for the following:

  • Data Sanitation, Correcting (Cleansing), and Curation
  • Data Consolidation (Consolidate Data Across Different Databases)
  • ETL (typically used to consolidate data into a single location for easier analytics and/or storage)



What is Synthetic Data Masking (SDM)?

  • SDM is the process of dynamically masking sensitive data values with synthetic test data. It can be used for the following:  
    • Databases - Occurs before insertion into the destination database. The G-Migration+ feature can be used to perform SDM for databases.

    • Files - Occurs at the time of test data generation. A Masking Receiver must be added to the Domain to mask files.


  • The synthetically generated values have similar characteristics to the real value but cannot be traced back because they are synthetic. SDM maintains the structure of the value to ensure it remains usable.

  • Unlike many other TDM and synthetic TDM providers, GenRocket does not need to look at the actual data in production, which makes it more secure. Production data is never exposed or "read"; only the metadata is.
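To make the idea of "maintaining the structure of the value" concrete, here is a minimal Python sketch. It is only an illustration of structure-preserving replacement and is not GenRocket's implementation; the function name mask_preserving_structure and the sample value are hypothetical.

```python
import random
import string

def mask_preserving_structure(value: str, rng: random.Random) -> str:
    """Replace each letter or digit with a random character of the same kind.

    Hypothetical helper for illustration only; it is not a GenRocket API.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-' or '@' in place
    return "".join(out)

rng = random.Random(42)  # seeded so repeated runs give the same result
masked = mask_preserving_structure("123-45-6789", rng)
print(masked)  # same length and dash positions, different digits
```

The masked value keeps the length and separator positions of the original, so a field validated as an SSN-shaped string would still accept it.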


What is Data Subsetting?

  • Data subsetting queries a subset of data within a source database and inserts it into a destination database based on defined subsetting conditions.

  • The G-Migration+ feature can be used to perform data subsetting. 
     

File Masking Capabilities

GenRocket provides file masking capabilities through Receivers for the following file types:


File Type | Masking Receiver | Description
Delimited Files | DelimitedFileMaskReceiver | Masks targeted sensitive values within a delimited file (e.g., CSV, TXT) using SDM.
Fixed File | FileMaskMultiBlockReceiver | Masks targeted sensitive values within a fixed file using SDM.
JSON | JSONFileMaskReceiver | Masks targeted sensitive values within a template JSON file, writing synthetic data values to a new JSON file using SDM.
ORC (Hadoop) | ORCFileMaskReceiver | Masks targeted Attributes (i.e., columns) within a given ORC (Hadoop) file using SDM.
EDI | EDIFileMaskReceiver | Uses an existing EDI document as a template to create one or more EDI documents by replacing the sensitive data with synthetically generated data at the positions configured in the Receiver.


Each of the above file masking Receivers works slightly differently, but they all follow a similar sequence of steps:

  1. Create a Project with the Default Project Version
  2. Create a Default Domain and Add Attributes
  3. Define the Masking Receiver Parameters
  4. Create Default Scenario 
  5. Download Scenario and Execute


Delimited File Masking Example

A user has set up the DelimitedFileMaskReceiver to mask three data values within the source file: lastName, username, and password.



In the output file, sensitive data values have been masked with synthetic test data. 
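As a rough illustration of what this example does conceptually, the following Python sketch masks the lastName, username, and password columns of a small delimited file while leaving other columns untouched. It does not use the DelimitedFileMaskReceiver or any GenRocket API; the sample file layout (id, firstName, and the three masked columns), the replacement name list, and the function mask_delimited are all assumptions made for the sketch.

```python
import csv
import io
import random
import string

# Hypothetical source file; only the three masked column names come from the example.
SOURCE = (
    "id,firstName,lastName,username,password\n"
    "1,Ann,Smith,jsmith,hunter2\n"
    "2,Bob,Jones,mjones,letmein\n"
)

# Hypothetical pool of synthetic last names.
LAST_NAMES = ["Alder", "Birch", "Cedar", "Dorset"]

def mask_delimited(source_text: str, rng: random.Random) -> str:
    """Return a copy of the delimited text with the sensitive columns replaced."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["lastName"] = rng.choice(LAST_NAMES)
        row["username"] = f"user{rng.randrange(10000):04d}"
        row["password"] = "".join(
            rng.choice(string.ascii_letters + string.digits) for _ in range(12)
        )
        writer.writerow(row)  # untouched columns are written through unchanged
    return out.getvalue()

masked_text = mask_delimited(SOURCE, random.Random(7))
print(masked_text)
```

The output keeps the same header, row count, and non-sensitive columns as the source, which is the property the example above relies on.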


Database Data Subsetting and Masking Capabilities

The G-Migration+ feature is used to perform data subsetting and SDM for databases. Please note that the source and destination databases must be the same database type (e.g., MySQL to MySQL); however, table schemas can vary between the source and destination databases.

 

Supported Databases (Subsetting and Masking)

Supported databases include: 

  • Oracle
  • MS SQL Server
  • MySQL Server
  • DB2
  • PostgreSQL
  • Sybase


What Actions Can Users Perform?

Users can perform the following actions: 

  • Data Subsetting Only
  • Synthetic Data Masking Only
  • Data Subsetting and Synthetic Data Masking


Synthetic Data Masking (SDM) Only

The user adds one or more tables from the imported schema and selects sensitive columns within the table. A Scenario is created with Attributes for each sensitive column and is used to insert synthetic test data into the destination database. 

Example - The customer table's date of birth (dob), last name, phone number, and ssn columns have been marked as sensitive data columns. Synthetic data values will be inserted into the destination database for these columns. 


Data Subsetting Only

The user adds a table from the imported table schema and then adds subsetting conditions. Subsetting conditions can only be added to one table and include the following: 


Where clause | A filter/condition that is applied to migrate a subset of data.
% of Rows | Defines a percentage of rows (e.g., 25%, 50%).
# of Rows | Defines a fixed number of rows.


Example - Subset will start at Customer record id '251' and only contain 100 rows of data along with their associated records in related tables. The last included record will have a customer id of '350'.
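The arithmetic in this example can be checked with a small Python sketch that uses an in-memory SQLite database. This is not how G-Migration+ is configured or run; the customer and orders tables, their columns, and the assumption of contiguous ids from 1 to 500 are hypothetical and exist only to show a Where clause combined with a row count, plus the related child rows.

```python
import sqlite3

# Hypothetical source database with a parent table and one related table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, last_name TEXT)")
src.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customer(id))"
)
src.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(i, f"name{i}") for i in range(1, 501)],
)
src.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i) for i in range(1, 501)])

# Subsetting condition: start at id 251 and take 100 rows.
customers = src.execute(
    "SELECT id, last_name FROM customer WHERE id >= ? ORDER BY id LIMIT ?",
    (251, 100),
).fetchall()
ids = [row[0] for row in customers]

# Pull the associated rows from the related table.
placeholders = ",".join("?" * len(ids))
orders = src.execute(
    f"SELECT id, customer_id FROM orders WHERE customer_id IN ({placeholders})",
    ids,
).fetchall()

# Insert the subset into a destination database with the same tables.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, last_name TEXT)")
dst.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
dst.executemany("INSERT INTO customer VALUES (?, ?)", customers)
dst.executemany("INSERT INTO orders VALUES (?, ?)", orders)

print(ids[0], ids[-1], len(ids))  # 251 350 100
```

With contiguous ids, 100 rows starting at 251 end at 350, which matches the last customer id stated in the example.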


Data Subsetting and SDM

As discussed above, the user adds subsetting conditions for one table and selects sensitive table columns in one or more tables.


Example - Subset will start at Customer record id '51' and contain 100 rows of data. Customer information will be masked during subsetting (e.g., Last Name, Date of Birth, SSN, Phone Number).