Subsetting and Synthetic Data Masking for Databases

While synthetically generated test data can be used for application testing and other usage, such as AI training data, there are situations where you need to use existing data. You may need to secure existing data to ensure it doesn't contain any PII information, or you may only require some data for your objectives.


GenRocket provides Data Subsetting and "Masking" capabilities for databases using a self-serve functionality called "G-Migration+".  

Unlike many test data management tools that use the traditional 'masking' methods, which can be reverse engineered, GenRocket uses "Synthetic Data Replacement," which replaces the sensitive values with synthetically generated values; these synthetically generated values cannot be reverse engineered to determine the original value.  

With G-Migration+, there are also multiple options available to subset data from source to destination database: 

  • Subset based on specific value using "Where" clause
  • % of rows (can also be used along with "Where" clause)
  • # of rows (can also be used along with "Where" clause)


Both subset and "masking" functionality can be used by themselves or in conjunction, as described in the table below:


Data Subsetting OnlyQuery a subset of data within a source database and insert it into a destination database using defined subsetting conditions ("Where," % of rows, # of rows)
Synthetic Data Masking (SDM) OnlyDynamically mask sensitive data values with synthetic test data before insertion into the destination database. The user chooses the sensitive columns within each table.
Data Subsetting and SDMMask and subset data set together by providing both the masking and subset parameters in a G-Migration+ Configuration.
Test Data AugmentationMake modifications and insert a richer data set into the destination database. This can be done by adding/modifying Generators and using Test Data Cases (G-Cases).


G-Migration+ Overview Video (04:10)

This video demonstrates the power and flexibility of G-Migration+ to subset and mask data dynamically from source to destination database:


In This Article



What are the Benefits of Using G-Migration+?

  • Sensitive data values are replaced with synthetically generated values before making the usage of existing data secure. Synthetically generated data cannot be reverse engineered to find the original true value.
  • Masking is done without GenRocket having to 'read" the actual sensitive data. Instead, it uses user-defined sensitive columns from imported metadata.  - Dynamic data masking
  • Flexibility of subsetting data based on different parameters to grab the needed information for the testing objective rather than large amounts of the dataset, saving you lots of storage space. 
  • Using both masking and subset together instead of separately to improve efficiency
  • It provides the potential to augment and insert a richer data set into the destination database for testing.
  • Whether masking or subset, the actual data is never uploaded/stored or ready by GenRocket cloud. All masking/subset functions happen in the customer environment behind their firewall, making it completely secure.


What Databases are Supported for Masking and Subset?

  • MySQL
  • MSSQL
  • PostgreSQL
  • Oracle
  • DB2
  • Sybase


Important: The data migration can only be performed between the same database types. For example, a MySQL database to a MySQL database.

G-Migration+ Deployment Diagram in Test Environment

The diagram below shows G-Migration+ deployed within a Test Environment along with the G-Repository Server and the GenRocket Engine. G-Migration+ performs the following operations during a migration:

  • Load a GMigrationConfig file
  • Launch migration threads
  • Query data from a source DB
  • Generate synthetic data for masking
  • Mask/replace sensitive data synthetically
  • Write data to block files
  • Bulk load data to destination DB
  • Delete block files 



How Does G-Migration+ Subset Data? 

  • User uploads metadata of the source database to subset (No customer data is actually uploaded)
  • Define the Subset conditions and parameters ("Where," % of rows, # of rows).
  • Download the G-Migration+ configuration files, which are simply the subset instructions set. The actual subset happens in the customer environment. 
  • When the migration is run, data that needs to be subset and any "non-sensitive" data will be read from the source database tables into temporary "dump" files on the local machine where the migration is performed. The "dump" file cannot be accessed by the user 
  • The G-Migration+ configuration determines where the temporary files are stored on the user's local computer. This location is defined by two parameters:
    • Data-Dump Directory
    • Data-Dump SubDir 
  • The data is then inserted into the destination database tables, and the temporary files are removed after completion. 

How Does G-Migration+ Mask Sensitive Data Values?

  • User uploads metadata of the source database to subset (No customer data is actually uploaded)
  • User defines the sensitive columns (from the uploaded metadata) to be 'masked
    • It is important to note that sensitive data values are NEVER read or stored from any source database table columns that have been marked as sensitive. 
  • Instead, in the temporary dump file, sensitive columns are identified as "to be masked" and replaced with 100% synthetic data using an automatically created Domain and Scenario. Sensitive data remains untouched in the source database table. 
  • The data is then inserted into the destination database tables, and the temporary files are removed after completion. 


How to Access G-Migration+

  • Expand the Self Serve Menu Options and select G-Migration+.


Required Prerequisite Steps

  1. Install GenRocket Runtime on the machine where the migration will be performed. 
  2. Create a Source and a Destination Database Property File. Click here to learn more.
    • Files can be placed within a subdirectory. 
    • Must be present to connect to each database via JDBC and perform the migration. 
  3. Use the -xts command to create an XTS file for the Source Database. 
    • This file contains the source database's schema for importing into GenRocket. 
    • Click here for more information. 
  4. Create a new Project Version. 
  5. Navigate to G-Migration+ from the "self-service" dropdown menu, as shown above.


How Does G-Migration+ Work? 

  1. Import the Source Database Table Schema
  2. Create a G-Migration+ Configuration
    • Automatically creates a Default Domain and Scenario for the migration.
    • Users can optionally specify a Schema Name to migrate across different schemas. 
  3. (Optional) Add Customer Referential Integrity to Table Schemas
    • Does not impact the source database. 
    • The defined referential integrity is used dynamically when the migration is performed. 
  4. (Optional) Exclude Tables from the Migration
  5. (Optional) Select Tables for Full Load
    • The subsetting table and related child tables cannot be chosen for full load.
    • Select tables will be fully loaded during the migration regardless of the subset condition.
  6. Add Tables for Subsetting and/or SDM
  7. Select Sensitive Columns for SDM 
    • A Domain and Scenario will automatically be created for each table with selected sensitive columns.
  8. Add Subsetting Conditions (where clause, limit)
  9. (Optional) Make Test Data Design Changes (e.g., replace Generators, create G-Cases).
  10. Download the required files (e.g., Scenario, G-Migration+ Configuration,  G-Cases).
    • SDM Only (or with Subset) - Download Default Scenario, each Table Scenario, and G-Migration+ configuration
    • Subsetting Only - Download Default Scenario and G-Migration+ Configuration.
  11. Place downloaded Scenario files in the path defined for the resource.output.directory. 
  12. Run the command at the command line to generate data.


Simple Subsetting and Masking Example

In the example below, a subset of data from a Customer table and associated records in related tables will be migrated to the destination database. 

  • Only records starting with a Customer.id after '50' will be migrated.
  • Only 50% of records with an id greater than '50' will be migrated (Limit Rows). 
  • SDM will be performed for the dob, last_name, phone_number, and ssn columns in the customer table prior to insertion.


G-Migration+ Command

genrocket -gmigp <G-Migration+ Name>.gredb


G-Migration+ Troubleshooting

The following articles can be used to learn more about errors received in GenRocket Runtime when subsetting and/or masking data:

ArticleError
MySQL Insert Error - Loading Local Data is DisabledLoading local data is disabled; this must be enabled on both the client and server side.

Note: Error received when attempting to migrate a subset of data between two MySQL databases.


Additional "How To" Information

The following articles can be used to view step-by-step instructions on how to perform the steps outlined in this article:

TopicDescription
G-Migration+ Management DashboardLearn more about the G-Migration+ user interface. 
How to Create a G-Migration+ ConfigurationLearn more about creating and managing G-Migration+ Configurations.
How to Import and Manage Table SchemasLearn more about importing Table Schemas.
How to Add Custom Referential Integrity to Table SchemasLearn more about customizing referential integrity for imported tables. 
Data Subsetting - How to Add Tables and Subsetting ConditionsLearn more about adding tables and configuring data subsetting conditions. 
Data Masking - How to Add and Manage Sensitive Columns for TablesLearn more about adding tables and selecting columns for data masking. 
How to Select Tables for Full Migration During Data SubsettingLearn more about selected tables for a full load during data subsetting. 

G-Migration+ Use Cases

The following use cases provide users with a step-by-step walkthrough on how to perform data subsetting and masking with the G-Migration+ feature:


ArticleDescription
Data Subsetting Only Use Case for DatabasesShows how a user can migrate a subset of data based on a WHERE clause and row limit. 
Synthetic Data Masking (SDM) Use Case Example for DatabasesShows how a user can mask sensitive data values for specific table columns before inserting the data into the destination database.