Description

The ORCFileMaskReceiver will mask targeted Attributes (columns) within a given ORC file using Synthetic Data Masking (SDM). 


Note about loopCount: 

  • The loopCount variable determines how many source data records will be considered for masking and saved in the output file.

  • Alternatively, the ORCLoopSetGen Generator can be assigned to an Attribute in the Domain to automatically set the Domain loopCount to the number of all records in the source file.

Additional Notes: 

  • For date Attributes, Integer values must be used. The value is interpreted as epoch days (the number of days since 1970-01-01).
  • For timestamp Attributes, Long values must be used. The desired DateTime can be converted to milliseconds.



What is an ORC File? 

  • Optimized Row Columnar (ORC) is a file format where data is stored in columns and compressed. This results in smaller disk reads.


When Should the ORCFileMaskReceiver be Used?

  • Any time data values within a source ORC file need to be masked with synthetic data.


Receiver Parameters

The ORCFileMaskReceiver uses the following parameters. Items marked with an asterisk (*) are required.

  • sourcePath* - Defines the base path where the source file to be masked is located. The default value is "#{resource.output.directory}".
  • sourceSubDirectory - Defines an optional subdirectory under sourcePath where the source file is located. 
  • sourceFileName* - Defines the name of the file to be masked. 
  • destPath* - Defines the base path where the clean masked file will be stored.
  • destSubDirectory* - Defines an optional subdirectory under destPath where the clean masked file will be stored.
  • destFileName - Defines the name of the output file to be generated under the destination path. If not provided, the source file name is used.
  • batchSize* - Defines the number of records to be considered in one batch during the reading/writing of data. The default value is 1024. 
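
The way these parameters combine into file locations can be sketched in a few lines of Python. This is an illustration only, not GenRocket code: the function name resolve_paths and the base directory /home/user/output are hypothetical, and the assumption is that each subdirectory is simply appended to its base path.

```python
from pathlib import PurePosixPath

def resolve_paths(source_path, source_file_name, dest_path,
                  source_sub_directory=None, dest_sub_directory=None,
                  dest_file_name=None):
    """Build the source and destination file paths from the Receiver parameters.

    Assumption: subdirectories are appended to their base path, and the
    destination file name falls back to the source file name when omitted.
    """
    src = PurePosixPath(source_path)
    if source_sub_directory:
        src = src / source_sub_directory
    src = src / source_file_name

    dst = PurePosixPath(dest_path)
    if dest_sub_directory:
        dst = dst / dest_sub_directory
    dst = dst / (dest_file_name or source_file_name)
    return src, dst

# Hypothetical base directory standing in for #{resource.output.directory}.
src, dst = resolve_paths("/home/user/output", "Sample.orc", "/home/user/output",
                         source_sub_directory="source_orc",
                         dest_sub_directory="dest_orc")
print(src)  # /home/user/output/source_orc/Sample.orc
print(dst)  # /home/user/output/dest_orc/Sample.orc
```

Because destFileName was left out, the sketch reuses Sample.orc as the output name, matching the fallback described for that parameter.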


Receiver Attribute Property Keys

The Receiver defines one property key that can be modified on any of its associated Domain Attributes:

  • include - Determines if the Attribute will be included in the output.


Example User Story

A user wants to mask sensitive data within an existing ORC file. They will need to complete the following:


Prerequisites

  1. The user has already installed GenRocket Runtime and downloaded their Profile.
  2. The user has created a Project with the default Project Version.


Step 1 - Create a Domain with Attributes

A Domain must be created with the Attributes (columns of ORC file) that should be masked. For this example, a user does the following: 

  1. Creates a Domain titled "MaskValues" with the required Attributes for masking.
  2. Sets the Domain's loopCount to "900". This means that 900 records of source data will be considered for masking.
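
To see how loopCount relates to the Receiver's batchSize parameter, the arithmetic can be sketched as follows. This is a rough illustration, assuming records are read and written in consecutive batches of at most batchSize records; the function name batches_needed is hypothetical and the Receiver's actual internals may differ.

```python
import math

def batches_needed(loop_count: int, batch_size: int = 1024) -> int:
    """Number of batches required to cover loop_count records.

    Assumption: each batch holds at most batch_size records and the
    final batch may be only partially filled.
    """
    if loop_count < 0 or batch_size <= 0:
        raise ValueError("loop_count must be >= 0 and batch_size must be > 0")
    return math.ceil(loop_count / batch_size)

print(batches_needed(900))    # 1
print(batches_needed(5000))   # 5
```

Under this assumption, the 900 records in this example fit within a single batch at the default batchSize of 1024.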


Note: The Domain and Attributes can easily be created using Scratch Pad. Other Domain creation and import methods are also available and can be viewed here.


The MaskValues Domain has the following Attributes, which will be used for masking data values:



Step 2 - Check Each Attribute's Original Name

Next, the user checks each Attribute's Original Name to ensure it matches the column name in the source ORC file. This can be done within the Attribute Dashboard.


Step 3 - Configure the Generators for the Added Attributes 

GenRocket will automatically assign a Generator to each Attribute. The assigned Generator may need to be replaced to generate the appropriate data. Additionally, one or more Generator parameters may need to be changed.


String Type Attributes

For standard string type Attributes, any Generator can be used. In the example below, the ConstantGen Generator has been assigned to display 'sampleName'.


Date Type Attributes

For date type Attributes, use Integer values. The value is interpreted as epoch days (the number of days since 1970-01-01).


Generators:


These two Generators have been linked in the image below for a date Attribute. 

  • gen1 - generates a date
  • gen2 - gives the number of days between the date produced by gen1 and 1970-01-01 (epoch days).
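
The value that gen2 should produce can be verified by hand. The sketch below is plain Python rather than a GenRocket Generator, and the function name to_epoch_days is hypothetical; it only shows the epoch-days arithmetic described above.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def to_epoch_days(d: date) -> int:
    """Return the number of days between 1970-01-01 and the given date."""
    return (d - EPOCH).days

print(to_epoch_days(date(1970, 1, 2)))  # 1
print(to_epoch_days(date(2024, 1, 1)))  # 19723
```

Dates before 1970-01-01 yield negative values with this calculation.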



Timestamp Type Attributes

For timestamp-type Attributes, use Long values. The desired DateTime can be converted to milliseconds.

Generators:


These two Generators have been linked in the image below for a modifiedOn (timestamp type) Attribute. 

  • gen1 - generates a date
  • gen2 - converts the date generated by gen1 to milliseconds
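
The Long value for a timestamp Attribute can be checked in the same way. The sketch below is plain Python, not a GenRocket Generator; the function name to_epoch_millis is hypothetical, and it assumes the milliseconds are counted from 1970-01-01T00:00:00 UTC, which the text above does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(dt: datetime) -> int:
    """Return whole milliseconds between the UTC epoch and a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Integer division on timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)

print(to_epoch_millis(datetime(2024, 1, 1, tzinfo=timezone.utc)))  # 1704067200000
```

Requiring a timezone-aware datetime avoids silently interpreting the value in the local time zone of the machine running the check.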


Note: For more information on assigning and re-assigning Generators, please look at this article: How do I assign a Generator to a single Attribute?


Step 4 - Add the ORCFileMaskReceiver to the Domain

The user adds the ORCFileMaskReceiver to the MaskValues Domain.


They then configure the Receiver's parameters. For this example, they have made changes to the following: 

  • sourceSubDirectory = source_orc 
  • sourceFileName = Sample.orc
  • destSubDirectory = dest_orc

Note: The source subdirectory and a source file with the defined name must exist in the user's resource.output.directory path; otherwise, the process will fail with an error.
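
A quick pre-flight check before running the Scenario can catch a missing subdirectory or file early. This is a minimal sketch in plain Python, separate from GenRocket itself; the function name check_source is hypothetical, and the output directory passed in must be the actual location that resource.output.directory resolves to on the user's machine.

```python
from pathlib import Path

def check_source(output_dir, sub_directory, file_name):
    """Return a list of problems; an empty list means the source file was found."""
    problems = []
    sub = Path(output_dir) / sub_directory
    if not sub.is_dir():
        problems.append(f"missing subdirectory: {sub}")
    elif not (sub / file_name).is_file():
        problems.append(f"missing source file: {sub / file_name}")
    return problems

# Hypothetical output directory; replace with the real resource.output.directory value.
for problem in check_source("/home/user/output", "source_orc", "Sample.orc"):
    print(problem)
```

If nothing is printed, both the source_orc subdirectory and Sample.orc were found at the expected location.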



Step 5 - Create and Download the Scenario

  • The user creates a Scenario for the MaskValues Domain.


  • Next, the user downloads the Scenario to their local machine.


    (or)


  • The Scenario file should be placed in the location defined in this Organization Resource:


Step 6 - Generate the Masked ORC Output File

Run the following command at the command line: 
genrocket -r <ScenarioName.grs>


Example