What is Data Column Profiling? 

Data Column Profiling is a GenRocket feature designed to automatically identify Personally Identifiable Information (PII) within your datasets by scanning column headers, not the actual data. GenRocket checks each data column name to see if it CONTAINS keyword values that are defined at the GenRocket and Organization levels. 


The user is provided with all matches and can then choose which sensitive columns to replace with synthetic test data using GenRocket’s Synthetic Data Replacement (SDR). This capability allows teams to maintain high-quality, compliant, and secure test data without manually combing through large and complex datasets.



What Features Use Data Column Profiling? 

  • G-Migration+ - Migrate a subset of records to an identical database and mask/replace any specified PII columns. The G-Migration+ configuration defines which data columns are PII that need to be replaced with synthetic test data. Users can perform this task manually or utilize the Data Column Profiling feature for automatic detection of PII. 

  • In-Place Masking (IPM) Engine (In Development/Beta Testing) - Used to perform in-place masking for all data values within a specified database table. The IPM Engine requires G-Migration+ to know which data columns require in-place masking for the defined table.


Feature Key Benefits

  1. Automated PII Detection - GenRocket intelligently scans column names for built-in and organization-defined profile names (e.g., user_id, email). Data Column Profiling accelerates and standardizes the identification of sensitive data columns. It also saves time in large-scale, repetitive workflows, such as CI/CD pipelines.

  2. Enhanced Data Privacy Compliance - Identifying PII early helps organizations meet regulatory requirements, such as HIPAA, and reduces the risk of non-compliance and data exposure.

  3. Accommodates Unique Organizational Naming Conventions - Define organization-specific PII identifiers based on your data environment. GenRocket adapts to your schema.

  4. Reduced Risk of Data Breaches or Exposure - Replacing sensitive data early in the data life cycle decreases risk.


How Does It Work?

  1. Profile Column Headers - GenRocket analyzes table schemas and column headers for known or custom-defined profile terms.

  2. Identify PII Candidates - Any column whose header contains one of these terms is considered a match and is presented to the user.

  3. User Selection - The user then chooses which flagged columns require Synthetic Data Replacement (SDR).

  4. Synthetic Data Replacement (SDR) - GenRocket replaces sensitive column values with realistic, non-identifiable data.
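
The four steps above can be pictured with a short Python sketch. It is an illustration only, not GenRocket's implementation: the names PROFILE_NAMES, profile_columns, and replace_columns are hypothetical, the profile list is a made-up subset, and treating the check as a case-insensitive substring match on the header is an assumption.

```python
# Hypothetical subset of profile names; the real built-in list is maintained by GenRocket.
PROFILE_NAMES = ["email", "ssn", "city", "address"]


def profile_columns(headers, profile_names):
    """Steps 1-2: map each header to the profile names it contains.

    Assumption: a case-insensitive substring check on the header text only;
    the data values are never inspected.
    """
    matches = {}
    for header in headers:
        hits = [name for name in profile_names if name.lower() in header.lower()]
        if hits:
            matches[header] = hits
    return matches


def replace_columns(rows, selected, make_value):
    """Step 4: return copies of the rows with the selected columns overwritten."""
    return [
        {col: (make_value(col, i) if col in selected else val) for col, val in row.items()}
        for i, row in enumerate(rows)
    ]


rows = [
    {"order_id": 1, "user_email": "real.person@example.com", "home_city": "Springfield"},
    {"order_id": 2, "user_email": "someone.else@example.com", "home_city": "Shelbyville"},
]

# Steps 1-2: profile the headers and collect PII candidates.
candidates = profile_columns(list(rows[0].keys()), PROFILE_NAMES)

# Step 3: the user reviews the candidates and picks a subset (here, only the email column).
selected = {"user_email"}

# Step 4: stand-in generator for synthetic values; real SDR output would be more realistic.
masked = replace_columns(rows, selected, lambda col, i: f"{col}_{i}@test.invalid")
```

In this sketch both user_email and home_city are reported as candidates, but only the column the user selected is replaced, which mirrors the review step between detection and replacement.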


GenRocket Level Data Profile Names

GenRocket provides 175 built-in data profile names for columns that commonly contain sensitive or PII data values. During data profiling, GenRocket automatically checks whether any of these names appear in the data column headers. Below are a few examples: 



Organization Level Data Profile Names

Additionally, you can add more data profile names at the Organization level. Complete the following steps to do so: 

  1. Go to My Organization > Data Profile Names.
  2. Select Add Profile Name.
  3. Enter the Name and click Save.



Examples of Detected Columns

Actual Column Header    Profile Match
user_email              email
client_ssn              ssn
home_city               city
street_address          address
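
The matches listed above are consistent with a simple contains check against the combined GenRocket-level and Organization-level names. The sketch below reproduces them under that assumption. Both name sets are hypothetical excerpts (member_ref is an invented organization-specific name), match_profiles is an illustrative helper, and case-insensitive matching is assumed rather than documented behavior.

```python
# Hypothetical excerpts; neither set is the actual GenRocket configuration.
GENROCKET_LEVEL = {"email", "ssn", "city", "address"}
ORGANIZATION_LEVEL = {"member_ref"}  # invented example of an organization-specific name


def match_profiles(header, profile_names):
    """Return the profile names contained in the header, sorted for stable output."""
    lowered = header.lower()
    return sorted(name for name in profile_names if name.lower() in lowered)


# Organization-level names extend the built-in names; both are checked together.
all_names = GENROCKET_LEVEL | ORGANIZATION_LEVEL

headers = [
    "user_email",
    "client_ssn",
    "home_city",
    "street_address",
    "member_ref_id",
    "order_total",
]
report = {header: match_profiles(header, all_names) for header in headers}

for header, hits in report.items():
    print(f"{header:<16} {', '.join(hits) or '(no match)'}")
```

Under this substring assumption, a header such as capacity would also match city, which is one reason the flagged columns are presented to the user for review rather than replaced automatically.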


Example Use Case Scenarios

These use case scenarios illustrate how GenRocket’s data column profiling, combined with Synthetic Data Replacement (SDR), supports scalability, consistency, compliance, and automation in modern data workflows.

  1. Development and Testing Environments - Easily detect and replace PII before provisioning test environments with real data. This helps avoid privacy risks while maintaining data integrity for QA and UAT. 
     
  2. Data Migration and Onboarding - Identify sensitive information when ingesting new data or migrating legacy systems. 

  3. CI/CD Pipelines - Integrate Data Column Profiling into automated pipelines to ensure that every dataset delivered to staging or testing environments is free of PII. 

  4. Data Analysis and Reporting - Enable analysts to examine production-like datasets without accessing real PII, thus preserving privacy while allowing insights.

  5. Third-Party Data Sharing - Before sharing datasets with partners, vendors, or contractors, profile and mask all sensitive fields to enforce secure, contract-compliant sharing.

  6. Standardizing PII Detection Across Departments - GenRocket supports organization-defined profile names, enabling a centralized standard for identifying PII across various team datasets.

  7. Synthetic Data Simulation for AI Training - Ensure that all data used to train AI models is realistic synthetic data that contains no PII.
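
For the CI/CD scenario in the list above, one way to picture an automated check is a small gate that fails a pipeline step when a flagged column has not been configured for replacement. This is a hypothetical sketch written in plain Python, not a GenRocket API: PROFILE_NAMES, unhandled_pii_columns, gate, and the replaced_columns set are all assumptions made for illustration, as is the case-insensitive contains check.

```python
import csv
import io

# Hypothetical subset of profile names used only for this illustration.
PROFILE_NAMES = ["email", "ssn", "city", "address"]


def unhandled_pii_columns(headers, profile_names, replaced_columns):
    """Return flagged headers that are not in the set configured for replacement."""
    flagged = [
        header
        for header in headers
        if any(name.lower() in header.lower() for name in profile_names)
    ]
    return [header for header in flagged if header not in replaced_columns]


def gate(csv_text, replaced_columns):
    """Return an exit code: 0 when every flagged column is handled, 1 otherwise."""
    # Only the first row (the column headers) is read; data values are not inspected.
    headers = next(csv.reader(io.StringIO(csv_text)))
    leftovers = unhandled_pii_columns(headers, PROFILE_NAMES, replaced_columns)
    for column in leftovers:
        print(f"unhandled PII candidate: {column}")
    return 1 if leftovers else 0


sample = "order_id,user_email,client_ssn\n1,a@test.invalid,000-00-0000\n"

# client_ssn is flagged but not listed as replaced, so the gate reports a failure.
exit_code = gate(sample, replaced_columns={"user_email"})
```

A pipeline step could pass the returned value to sys.exit so that the job stops whenever a PII candidate has not been reviewed or replaced.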


G-Migration+ Data Column Profiling Steps

The steps for using G-Migration+ with Data Column Profiling are identical to the standard G-Migration+ steps, except for how PII columns are identified and selected. 


Users will see an additional "Profile Columns" option for the selected table in the G-Migration+ configuration. To see step-by-step instructions for using Data Column Profiling in G-Migration+, click here.