While GenRocket can generate virtually any volume and variety of synthetic test data, some use cases require existing production data. Production data may contain sensitive values, so that data must be secured. Application testing also does not need all of that data, only a subset of it.
GenRocket uses a process called Synthetic Data Masking (SDM) to mask sensitive data values with synthetic data values in files and databases. Data Subsetting capabilities are also available for databases. SDM and subsetting can be used together or independently. Both work
Note: GenRocket's SDM and Subset features are not ETL tools. SDM with subsetting cannot be used for the following:
- Data Sanitation, Correcting (Cleansing), and Curation
- Data Consolidation (Consolidate Data Across Different Databases)
- ETL - Typically used to consolidate into a single location for easier analytics and/or storage
In This Article
- What is Synthetic Data Masking (SDM)?
- What is Data Subsetting?
- File Masking Capabilities
- Database Data Subsetting and Masking Capabilities
What is Synthetic Data Masking (SDM)?
- The synthetically generated values have similar characteristics to the real value but cannot be traced back because they are synthetic. SDM maintains the structure of the value to ensure it remains usable.
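To make "maintains the structure of the value" concrete, here is a minimal Python sketch of structure-preserving masking. It is an illustration only and not GenRocket's implementation: the function name `mask_preserving_structure` and the character-by-character rule are assumptions chosen for this example.

```python
import random
import string


def mask_preserving_structure(value: str, rng: random.Random) -> str:
    """Swap each digit for a random digit and each letter for a random
    letter of the same case; separators and spacing are left untouched."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep dashes, spaces, and other punctuation
    return "".join(out)


rng = random.Random(42)  # fixed seed so repeated runs give the same output
masked_ssn = mask_preserving_structure("123-45-6789", rng)
masked_name = mask_preserving_structure("O'Brien", rng)
print(masked_ssn, masked_name)
```

The masked SSN keeps the `NNN-NN-NNNN` shape, so downstream validation that checks the format still passes, while the digits themselves are random and carry no link to the original value.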
What is Data Subsetting?
File Masking Capabilities
GenRocket provides file masking capabilities through Receivers for the following file types:
|File Type||Masking Receiver||Description|
|Delimited Files||DelimitedFileMaskReceiver||Masks targeted sensitive values within a delimited file (e.g., CSV, TXT) using SDM.|
|Fixed File||FileMaskMultiBlockReceiver||Masks targeted sensitive values within a fixed file using SDM.|
|JSON||JSONFileMaskReceiver||Masks targeted sensitive values within a template JSON file and writes the synthetic data values to a new JSON file using SDM.|
Each of the above file masking receivers works slightly differently, but they all follow a similar sequence of steps:
- Create a Project with the Default Project Version
- Create a Default Domain and Add Attributes
- Define the Masking Receiver Parameters
- Create a Default Scenario
- Download Scenario and Execute
Delimited File Masking Example
In the output file, sensitive data values have been masked with synthetic test data.
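As a rough illustration of what masking a delimited file involves, the Python sketch below reads delimited text, replaces the values in the columns marked as sensitive, and writes the other columns through unchanged. It does not use or reproduce the DelimitedFileMaskReceiver; the column names, the sample rows, and the helper functions are all hypothetical.

```python
import csv
import io
import random
import string

SENSITIVE = {"last_name", "ssn"}  # hypothetical sensitive column names


def synthetic_value(value, rng):
    """Digit-for-digit, letter-for-letter replacement that keeps the format."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


def mask_delimited(text, sensitive, rng, delimiter=","):
    """Return a copy of the delimited text with sensitive columns masked."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in reader.fieldnames:
            if col in sensitive:
                row[col] = synthetic_value(row[col], rng)
        writer.writerow(row)
    return out.getvalue()


sample = "id,last_name,ssn\n1,Smith,123-45-6789\n2,Jones,987-65-4321\n"
masked = mask_delimited(sample, SENSITIVE, random.Random(7))
print(masked)
```

The header row and the `id` column come through unchanged, and only the targeted columns receive synthetic values.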
Database Data Subsetting and Masking Capabilities
The G-Migration+ feature is used to perform data subsetting and SDM for databases. Please note that the source and destination databases must be of the same type (e.g., MySQL to MySQL); however, table schemas can vary between the source and destination databases.
Supported Databases (Subsetting and Masking)
Supported databases include:
- MS SQL Server
- MySQL Server
What Actions Can Users Perform?
Users can perform the following actions:
- Data Subsetting Only
- Synthetic Data Masking Only
- Data Subsetting and Synthetic Data Masking
Synthetic Data Masking (SDM) Only
The user adds one or more tables from the imported schema and selects sensitive columns within the table. A Scenario is created with Attributes for each sensitive column and is used to insert synthetic test data into the destination database.
Example - The customer table's date of birth (dob), last name, phone number, and ssn columns have been marked as sensitive data columns. Synthetic data values will be inserted into the destination database for these columns.
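The idea in this example can be sketched in Python with two in-memory SQLite databases standing in for the source and destination. This is an assumption-laden illustration, not how G-Migration+ works internally: the table layout, the column names, and the `synthetic_row` generator are hypothetical.

```python
import random
import sqlite3
from datetime import date, timedelta

SCHEMA = """CREATE TABLE customer (
    id INTEGER PRIMARY KEY, first_name TEXT,
    dob TEXT, last_name TEXT, phone TEXT, ssn TEXT)"""


def synthetic_row(rng):
    """Generate synthetic values for the four sensitive columns."""
    dob = date(1950, 1, 1) + timedelta(days=rng.randrange(365 * 50))
    return (
        dob.isoformat(),
        "Customer" + str(rng.randrange(10000)),
        "555-%03d-%04d" % (rng.randrange(1000), rng.randrange(10000)),
        "%03d-%02d-%04d" % (rng.randrange(1000), rng.randrange(100),
                            rng.randrange(10000)),
    )


def copy_with_masking(src, dst, rng):
    """Copy customers, replacing dob, last_name, phone, and ssn on the way."""
    rows = src.execute("SELECT id, first_name FROM customer ORDER BY id").fetchall()
    for id_, first_name in rows:
        dst.execute("INSERT INTO customer VALUES (?, ?, ?, ?, ?, ?)",
                    (id_, first_name) + synthetic_row(rng))
    dst.commit()


src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute(SCHEMA)
dst.execute(SCHEMA)
src.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Ann", "1980-02-03", "Smith", "415-555-0101", "123-45-6789"),
    (2, "Bob", "1975-11-30", "Jones", "212-555-0199", "987-65-4321"),
])
copy_with_masking(src, dst, random.Random(1))
masked_rows = dst.execute("SELECT * FROM customer ORDER BY id").fetchall()
print(masked_rows)
```

Non-sensitive columns (`id`, `first_name`) are copied as-is, so keys and relationships stay intact, while the sensitive columns are never read from the source at all.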
Data Subsetting Only
The user adds a table from the imported table schema and then adds subsetting conditions. Subsetting conditions can only be added to one table and include the following:
|Where clause||A filter/condition that is applied to migrate a subset of data.|
|% of Rows||Defines a percentage of rows (e.g., 25%, 50%).|
|# of Rows||Defines a constant number value of rows.|
Example - The subset will start at Customer record id '251' and contain only 100 rows of data, along with their associated records in related tables. The last included record will have a customer id of '350'.
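The arithmetic in this example (start at id 251, take 100 rows, end at id 350) and the idea of pulling along related records can be sketched with plain SQL. The Python snippet below uses an in-memory SQLite database with hypothetical `customer` and `orders` tables and contiguous ids; it illustrates the concept only and is not how G-Migration+ builds its subset.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id));
""")
# 500 customers with contiguous ids, one order per customer.
src.executemany("INSERT INTO customer VALUES (?, ?)",
                [(i, f"Name{i}") for i in range(1, 501)])
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, i) for i in range(1, 501)])

# Where clause (id >= 251) combined with a fixed number of rows (100).
customers = src.execute(
    "SELECT id, last_name FROM customer WHERE id >= ? ORDER BY id LIMIT ?",
    (251, 100),
).fetchall()
ids = [row[0] for row in customers]

# Associated records in the related table follow the selected customers.
placeholders = ",".join("?" * len(ids))
orders = src.execute(
    f"SELECT id, customer_id FROM orders WHERE customer_id IN ({placeholders})",
    ids,
).fetchall()

print(ids[0], ids[-1], len(customers), len(orders))
```

With contiguous ids, the last customer in the subset is 251 + 100 - 1 = 350, which matches the example. If ids had gaps, the subset would still hold 100 rows but would end at a higher id.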
Data Subsetting and SDM
As discussed above, the user adds subsetting conditions for one table and selects sensitive table columns in one or more tables.
Example - The subset will start at record '51' and contain 100 rows of data. Customer information (e.g., Last Name, Date of Birth, SSN, Phone Number) will be masked during subsetting.