Data Subsetting and Synthetic Data Masking for Databases

GenRocket provides Data Subsetting and Masking capabilities for databases through a feature we call "G-Migration+." This feature provides the following functionality when working with identical databases (e.g., MySQL to MySQL): 

  • Data Subsetting - Query a subset of data within a source database and insert it into a destination database based on defined subsetting conditions.

  • Synthetic Data Masking (SDM)- Dynamically mask sensitive data values with synthetic test data before insertion into the destination database. The user chooses the sensitive columns within each table.
     
  • Test Data Augmentation - Make modifications and insert a richer data set into the destination database. This can be done by adding/modifying Generators and using Test Data Cases (G-Cases).


G-Migration+ Overview Video (04:10)

This video demonstrates the power and flexibility of G-Migration+ to determine a subset of data and transfer it from a source database to a destination database while using synthetic data replacement.

Brief Example

In the example below, a subset of data from a Customer table and associated records in related tables will be migrated to the destination database. 

  • Only records starting with a Customer.id after '50' will be migrated.
  • Only 50% of records with an id greater than '50' will be migrated (Limit Rows). 
  • SDM will be performed for the dob, last_name, phone_number, and ssn columns in the customer table prior to insertion.



In This Article


What Subscription Tiers Have Access to G-Migration+?

The G-Migration+ feature is available with the following GenRocket editions: 

  • Professional Licenses (Legacy)
  • Enterprise Licenses (Legacy)
  • Premium Edition
  • Corporate Edition
  • Enterprise Edition
  • Unlimited Edition


GenRocket Methodology Phase and Roles

  • Model and Design Phases
  • GenRocket Power User (e.g., Org Admin, Test Data Engineer (TDE)

What Databases are Supported for G-Migration+?

The following databases are currently supported for intelligent data migration:

  • MySQL
  • MSSQL
  • PostgreSQL
  • Oracle
  • DB2


Important: The data migration can only be performed between the same database types. For example, a MySQL database to a MySQL database.


Sensitive Data Remains Untouched on the Production Database

With G-Migration+, your sensitive data remains untouched and safely on your production database. Three processes occur during data migration: 

  1. Non-sensitive data is obtained from the production database using queries.
  2. Synthetic data is generated and merged with the non-sensitive data within each specified table and column.
  3. Merged data is inserted into the testing database using the same data model and format. 


What are the Benefits of Using G-Migration+?

G-Migration+ provides the following benefits for achieving a testing objective: 

  • Sensitive data values are replaced with synthetically generated values before insertion.
  • It provides the potential to augment and insert a richer data set into the destination database for testing.
  • Ability to only grab the needed information for the testing objective rather than large amounts of unnecessary data.


When Should G-Migration+ be Used? 

G-Migration should be used to meet the following needs:

  • Migrate a subset of data where sensitive data values are replaced using SDM.
  • Migrate a subset of data from a production database with only non-sensitive data values.
  • Migrate a subset of data or with augmented test data.


Note: This feature is designed for migrating a subset of data, not for full data migrations.


How Does G-Migration+ Work?

During data subsetting and/or masking, data is read from the source database tables into a set of temporary files on the local machine where the migration is performed. The data is then inserted into the destination database tables. 


The G-Migration+ configuration determines where the temporary files are stored on the user's local computer. These files are deleted once the process has been completed. 

It is important to note that data is not read from any source database table columns that have been marked as sensitive. Instead, synthetic data is inserted into the destination database in its place, and the sensitive data remains untouched in the source database tables. 


Step-by-Step Instructions


Prerequisite 1 - Install GenRocket Runtime

GenRocket Runtime must be installed on the machine where data migration will be performed. For more information, use the links provided below:  


Prerequisite 2 - Source Database and Destination Database Property Files

For GenRocket to perform a data migration, configuration files must be present to connect to the source and destination databases via JDBC. Two properties files must be present on the user's local computer and contain the appropriate database connection information. For example, they can be called source.properties and destination.properties


These files can be placed within a subdirectory and must be present to connect to each database during the data migration. For more information on how to create these configuration files, please click here.


Step 1 - Access the G-Migration+ Platform

  • Create or select an existing Test Data Project. 
  • Select a Project Version if multiple versions exist.
  • Next, expand the Self Serve Menu Options drop-down menu and select G-Migration+.


Step 2 - Create a G-Migration+ Configuration

A Project Version must have a G-Migration+ Configuration to perform Data Subsetting and Synthetic Data Masking (SDM). This configuration file contains important details, including the source and destination database connection information. Added tables, data subsetting conditions, and selected sensitive columns are also specific to this configuration.


Users can optionally specify a Schema Name so that data can be subset across different schemas. A schema name can be provided for the Source database only, the Target database only, or both. Please note that the behavior will be different depending on where a schema name is entered. 


Note: A Default Domain and Default Scenario will be automatically created in the selected Project Version. The selected Partition Receiver will be automatically assigned to the Default Domain. 


Step 3 - Import the Table Schema(s)

Next, the user must import the Table Schema (i.e., Database Schema) into G-Migration+ to configure what tables will be used for Data Subsetting and/or Synthetic Data Masking (SDM). 


A database schema can be created using the genrocket -xts command. If this has not already been done, please do so before continuing. To learn more about XTS files and how to create them, click here.


Once created, the user must import the XTS file using the Import Table Schema button.


Users can view the imported schema by clicking on the Show Table Schema button and manage the table schema by clicking on the Manage Table Schema button.


 


Note: After the import, tables can be excluded by adding them to the Exclusions Tab. Tables that do not have subsetting conditions and are not descendants can also be selected for a full load in the Full Load Tab.


Step 4 - (Optional) Add Custom Referential Integrity to Table Schemas

Users have the option to add custom referential integrity to imported table schemas. This is most commonly done when referential integrity does not exist between tables in the source database. The defined referential integrity only applies during subsetting/masking and does not impact the actual source database. 

Click Manage Referential Integrity to add referential integrity for one or more tables. For more information, click here.



A sample configuration is shown below: 



Step 5 - Add Tables, Select Columns, and Add Subsetting Conditions

Tables must be added to the G-Migration+ Configuration to perform Data Subsetting and/or Synthetic Data Masking (SDM). Use the Manage Tables button to add a table and optionally select sensitive columns within that table. 


Note: Tables must be selected to perform subsetting and masking (separately or together). Below the steps are discussed separately to show that subsetting and masking can be performed separately with G-Migration+.


Data Subsetting Steps

The basic steps are described below. Additional links are provided at the end of this article. 

  1. The user selects the G-Migration+ Configuration.
  2. The user adds the table to the G-Migration+ Configuration.
  3. The user clicks the Edit icon to add data subsetting conditions for a specific table (Where Clause, Limit Rows, Test Data Case). Note: Only one added table can have data subsetting conditions. 



Synthetic Data Masking (SDM) Steps

The basic steps are described below. Additional links are provided at the end of this article. 

  1. The user adds the table to the G-Migration+ Configuration.
  2. The user selects sensitive columns for each added table.


  3. A Domain and Scenario are created in the Project for each table with sensitive columns. Note: The Domain and Scenario are used to perform SDM. 


Step 6 - (Optional) Make Test Data Design Changes

Users can replace Generators and change Generator Parameter configurations for any created Scenario. They can also create a G-Case Suite and G-Cases within the selected Project/Project Version.


Generator Example

A customer table is being migrated that contains an SSN, which has been chosen as a sensitive column. Whenever sensitive columns are selected for a table, a Domain and Scenario are automatically created within the Project Version. 

  • The Customer Domain contains an SSN Attribute.
  • The SSNGen Generator has been assigned with the default parameter configuration.
  • A user wants the SSNs to follow a pattern; however, the default configuration generates random data. 
  • The user clicks on the Customer Domain within the G-Migration+ Dashboard.



  • The user selects the SSN Attribute and changes the generationType parameter from 'random' to 'sequential.'


Step 7 - Download the Required Files

Note: This step is not required when using G-Repository Server and Client. These features automatically manage the required download files and ensure the most recent file is always available to users. 


Depending on how the G-Migration+ Configuration has been configured, users will need to download specific files and place them in their designated locations. This includes the following: 

  • G-Migration+ Configuration
  • Default Scenario and any other created Scenarios
  • Test Data Cases (when used)


The G-Migration+ Configuration and any required Scenarios can be downloaded as a ZIP file by clicking on the Download All Configuration Files icon below. Test Data Cases will need to be downloaded separately. 
 


Alternatively, the Download (Cloud) icon can be used to download any individual component for test data generation.


The Default Scenario (and any other Scenario files) must be placed in the user's output directory as specified here within the GenRocket web platform. 



The G-Migration+ Configuration and Test Data Case Suite (or individual case) must be located at the path specified in the user's home directory (unless another location has been specified in the configuration).



Step 8 - Run the Command Line in a Terminal or Command Window

After downloading all items mentioned and placing them in the right location, use the following command line in a Command or Terminal Window to complete the migration.

genrocket -gmigp <G-Migration+ Name>.gredb


G-Migration+ Troubleshooting

The following articles can be used to learn more about errors received in GenRocket Runtime when subsetting and/or masking data: 

Article

Error

MySQL Insert Error - Loading Local Data is DisabledLoading local data is disabled; this must be enabled on both the client and server side.

Note: Error received when attempting to migrate a subset of data between two MySQL databases.


Additional "How To" Information

The following articles can be used to view step-by-step instructions on how to perform the steps outlined in this article:

Topic

Description

G-Migration+ Management DashboardLearn more about the G-Migration+ user interface. 
How to Create a G-Migration+ ConfigurationLearn more about creating and managing G-Migration+ Configurations.
How to Import and Manage Table SchemasLearn more about importing Table Schemas.
How to Add Custom Referential Integrity to Table Schemas
Learn more about customizing referential integrity for imported tables. 
Data Subsetting - How to Add Tables and Subsetting ConditionsLearn more about adding tables and configuring data subsetting conditions. 
Data Masking - How to Add and Manage Sensitive Columns for TablesLearn more about adding tables and selecting columns for data masking. 
How to Select Tables for Full Migration During Data Subsetting
Learn more about selected tables for a full load during data subsetting. 

G-Migration+ Use Cases

The following use cases provide users with a step-by-step walkthrough on how to perform data subsetting and masking with the G-Migration+ feature:


Article

Description

Data Subsetting Only Use Case for DatabasesShows how a user can migrate a subset of data based on a WHERE clause and row limit. 
Synthetic Data Masking (SDM) Use Case Example for DatabasesShows how a user can mask sensitive data values for specific table columns before inserting the data into the destination database.