Description

The ParquetFileReceiver writes Domain Attribute values to a file in the Parquet format. Parquet stores data in a flat columnar format, which is generally more efficient in storage and read performance than row-based formats such as CSV.




Receiver Parameters

The following parameters can be defined for the ParquetFileReceiver. Items with an asterisk (*) are required: 

  • *path - Defines the location to store the newly generated Parquet output file.
  • subDir - Defines the sub-directory under the path to store the newly generated Parquet output file.
  • *fileName - Defines the name of the Parquet output file.
  • *blockSize - Defines the size, in bytes, of a row group buffered in memory while writing. This limits memory usage during the write; larger values consume more memory. The default size is 134217728 bytes (128 MiB).
  • *pageSize - Defines the size, in bytes, of a page, which is the unit of compression. A block is composed of pages, and a page is the smallest unit that must be read fully to access a single record. If this value is too small, compression deteriorates. The default size is 1048576 bytes (1 MiB).
  • *compressionCodecName - Defines the compression algorithm used to compress pages. The compression algorithms that GenRocket supports are UNCOMPRESSED, SNAPPY, and GZIP. 
  • enableValidation - Specifies whether schema validation should be turned on. The default value is "false."
  • *enableDictionary - Defines the Boolean value to enable/disable dictionary encoding. It should be either true or false.
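As a rough illustration of how the parameters above fit together, the sketch below collects them in a plain Python dictionary and checks the rules stated in the list (required items and supported codecs). The dictionary, the `check_parameters` helper, and the sample path and file name are hypothetical; in GenRocket these values are set on the Receiver, not through Python.

```python
# Hypothetical illustration only: the names mirror the parameter list above,
# but this is not a GenRocket API.
REQUIRED = {"path", "fileName", "blockSize", "pageSize",
            "compressionCodecName", "enableDictionary"}
SUPPORTED_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP"}


def check_parameters(params):
    """Return a list of problems found in a parameter dictionary."""
    problems = [f"missing required parameter: {name}"
                for name in sorted(REQUIRED - set(params))]
    codec = params.get("compressionCodecName")
    if codec is not None and codec not in SUPPORTED_CODECS:
        problems.append(f"unsupported codec: {codec}")
    for name in ("blockSize", "pageSize"):
        value = params.get(name)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{name} must be a positive number of bytes")
    return problems


example = {
    "path": "/home/user/output",        # assumed sample location
    "fileName": "Address",              # assumed sample name
    "blockSize": 128 * 1024 * 1024,     # 134217728 bytes, the stated default
    "pageSize": 1024 * 1024,            # 1048576 bytes, the stated default
    "compressionCodecName": "SNAPPY",
    "enableDictionary": True,
}

print(check_parameters(example))  # prints []
```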


Receiver Attribute Property Keys

The Receiver defines six property keys that can be modified on any of its associated Domain Attributes:

  • columnName - Defines the column name as it will be output into the Parquet file.
  • include - Determines if the Attribute will be included as a column in the output. Available options are "true" and "false."
  • columnType - Defines the column data type. Available options are string, null, boolean, int, long, float, double, and bytes.
  • logicalType - Defines the logical type for the data. Available options are date, timestamp, and decimal.
  • precision - Defines the precision for decimal numbers. Precision is the number of digits within the decimal value.
  • scale - Defines the scale for decimal numbers. Scale is the number of digits to the right of the decimal point within a decimal value.

Example of Setting Receiver Property Key Values

The example image below shows the property key view for the set of Attributes of a Domain using the ParquetFileReceiver.




File Config Tab

The File Config Tab is used to configure what event will trigger file creation and the naming configuration for generated files. 


Constant has been chosen as the event in the example below. A file will be created for every one-hundred records. Each created file will have the defined naming convention (e.g., Address1.parquet, Address2.parquet, Address3.parquet). The number will increment for each generated file. 



Note: For more information on how to use the File Config Tab, click here.


Directory Config Tab

The Directory Config Tab is used to configure what event will trigger directory creation and the naming configuration for generated directories. 


Constant has been chosen as the event in the example below. A directory will be created for every ten files that are generated. Each created directory will have the defined naming convention (e.g., AddressFiles1, AddressFiles2, AddressFiles3). The number will increment for each generated directory. 


Note: For more information on how to use the Directory Config Tab, click here. 
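The two examples above (a new file every one-hundred records, a new directory every ten files) can be sketched as simple integer arithmetic. This is only an approximation under stated assumptions: the `output_location` function is hypothetical, and it assumes that each directory holds the files created during its interval and that file numbers keep incrementing across directories, neither of which is confirmed here.

```python
def output_location(record_number, records_per_file=100, files_per_dir=10,
                    file_prefix="Address", dir_prefix="AddressFiles"):
    """Return the assumed directory/file for a 1-based record number."""
    # Records 1-100 go to file 1, records 101-200 to file 2, and so on.
    file_number = (record_number - 1) // records_per_file + 1
    # Files 1-10 go to directory 1, files 11-20 to directory 2, and so on.
    dir_number = (file_number - 1) // files_per_dir + 1
    return f"{dir_prefix}{dir_number}/{file_prefix}{file_number}.parquet"


print(output_location(1))     # AddressFiles1/Address1.parquet
print(output_location(101))   # AddressFiles1/Address2.parquet
print(output_location(1001))  # AddressFiles2/Address11.parquet
```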


Generating Logical Data Types

Additional steps must be performed when generating logical data types in Parquet output format. Users must modify the Attribute's Property Keys within the Receiver and assign/configure a Generator that generates the proper data type (e.g., int, long, decimal). 


These steps specifically apply when generating the following data values: 

  • Dates
  • Timestamps
  • DateTimes
  • Decimals


Use Case 1 - Generating Dates

A user wants to generate an account creation date in the Parquet file format. They have created a Domain Attribute titled "acctCreationDate" that will be used to generate this value.


The user must configure the property keys appropriately to generate a logical date. They must also assign and configure a Generator that generates the "int" data type.


Step 1 - Set Property Keys for the Attribute

For dates, the following property keys will need to be changed within the ParquetFileReceiver:

  • columnType - int
  • logicalType - date


Step 2 - Generator Assignment and Configuration

The date is based on the Unix epoch 01-01-1970. Any Generator that generates an "int" data type can be used to generate the required values. For example, a RangeGen or RandomGen Generator can be used. 


The generated value is added as days to the Unix epoch date of 01-01-1970. For example, if "2" is the value, the date will be 2 days after 01-01-1970 (i.e., 01-03-1970).
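To check which date a given "int" value will produce, the days-since-epoch arithmetic can be reproduced with Python's standard library. This is a small verification sketch; the `days_to_date` helper is a hypothetical name, not part of GenRocket.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_to_date(days):
    """Interpret an int as a number of days after the Unix epoch date."""
    return EPOCH + timedelta(days=days)


# A generated value of 2 lands two days after 01-01-1970.
print(days_to_date(2).strftime("%m-%d-%Y"))  # 01-03-1970
```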


RandomGen Generator Example

This Generator will generate a random number between 1 and 10000. That value is added as days to the Unix epoch date of 01-01-1970.


RangeGen Generator Example

The Generator will generate a range of numbers in ascending order starting at 1. The value increases by 3 for each record, so the date for each record is as follows: 

  • Record 1 - 1 day from epoch (01-02-1970)
  • Record 2 - 4 days from epoch (01-05-1970)
  • Record 3 - 7 days from epoch (01-08-1970)
  • Record 4 - 10 days from epoch (01-11-1970)
  • ...


Use Case 2 - Generating Timestamps

A user wants to generate a timestamp value in the Parquet file format. They have created a Domain Attribute titled "timestamp" that will be used to generate this value.



The user must configure the property keys appropriately. They must also assign and configure a Generator that generates the "long" data type.

Step 1 - Set Property Keys for the Attribute

For timestamps, the following property keys will need to be changed: 

  • columnType - long
  • logicalType - timestamp


Step 2 - Generator Assignment and Configuration

The timestamp is based on the Unix epoch 01-01-1970 in milliseconds. Any Generator that generates a "long" data type can be used to generate the required value. For example, a RangeGen or RandomGen Generator can be used. 


The random or range value is added as milliseconds to the Unix epoch (01-01-1970 00:00:00.000) to generate a timestamp in the Parquet file format. 


RandomGen Generator Example

This Generator will generate a random number between 100 and 10000. That value is added as milliseconds to the Unix epoch date of 01-01-1970.


RangeGen Generator Example

The Generator will generate a range of numbers in ascending order starting at 1. The value increases by 500 for each record, so the timestamp for each record is as follows: 

  • Record 1 - 1970-01-01 00:00:00.001
  • Record 2 - 1970-01-01 00:00:00.501 
  • Record 3 - 1970-01-01 00:00:01.001 
  • Record 4 - 1970-01-01 00:00:01.501 
  • ...
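The record values listed above can be reproduced by treating each generated "long" as milliseconds after the epoch. The sketch below assumes the timestamps are read as UTC; the `millis_to_timestamp` helper is a hypothetical name used only for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are interpreted in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_timestamp(millis):
    """Format a milliseconds-since-epoch value as YYYY-MM-DD HH:MM:SS.mmm."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.strftime("%Y-%m-%d %H:%M:%S.") + f"{moment.microsecond // 1000:03d}"


# Start at 1 and step by 500, as in the RangeGen example.
for record, millis in enumerate(range(1, 2000, 500), start=1):
    print(f"Record {record} - {millis_to_timestamp(millis)}")
```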


TimestampGen Generator Example

The TimestampGen Generator can also be used to generate a timestamp value. 



Use Case 3 - Generating DateTime Values

A user wants to generate a DateTime value in the Parquet file format. They have created a Domain Attribute titled "dateTime" that will be used to generate this value.



The user must configure the property keys appropriately. They must also assign and configure a Generator that generates the "long" data type.

Step 1 - Set Property Keys for the Attribute

DateTime values are represented as a Timestamp. To generate these types of values, these property keys must be configured for the Attribute generating the value: 

  • columnType - long
  • logicalType - timestamp


Step 2 - Generator Assignment and Configuration

DateTime values use "timestamp" as the logical type, which is based on the Unix epoch 01-01-1970 in milliseconds. 

Any Generator that generates a "long" data type can be used to generate the required value. In the example below, the FlexibleDateRangeGen is used. 
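Because a DateTime is stored as milliseconds since the epoch, it can help to see what "long" value a particular date and time corresponds to. The following sketch performs that conversion with the standard library; the `to_epoch_millis` helper is hypothetical, and treating naive values as UTC is an assumption made for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Return whole milliseconds between the Unix epoch and the given datetime."""
    if moment.tzinfo is None:
        # Assumption: values without a time zone are treated as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis(datetime(1970, 1, 2)))           # 86400000
print(to_epoch_millis(datetime(2021, 1, 1, 0, 0, 0)))  # 1609459200000
```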



Sample Output

The output appears as shown below in a Parquet file viewer: 


Use Case 4 - Generating Decimals

A user wants to generate a decimal value in the Parquet file format. They have created a Domain Attribute titled "balance" that will be used to generate this value.



The user must configure the property keys appropriately. They must also assign and configure a Generator that generates the "decimal" data type.


Step 1 - Set Property Keys for the Attribute

For decimal values, the following property keys will need to be changed: 

  • columnType - bytes
  • logicalType - decimal
  • precision - appropriate integer value (e.g., 5)
  • scale - appropriate integer value (e.g., 2)


Below, the balance Attribute has a precision of "5" and a scale of "2" (e.g., 999.55).
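The pairing of a "bytes" column type with a decimal logical type matches the convention used by the Avro and Parquet specifications, where a decimal is stored as its unscaled integer in big-endian two's-complement form. The sketch below illustrates that convention for a precision of 5 and a scale of 2. It is an assumption that the Receiver encodes values this way; the `encode_decimal` and `decode_decimal` helpers are hypothetical names.

```python
from decimal import Decimal


def encode_decimal(value, precision, scale):
    """Encode a Decimal as a big-endian two's-complement unscaled integer."""
    quantized = value.quantize(Decimal(1).scaleb(-scale))
    if quantized != value:
        raise ValueError(f"{value} has more than {scale} fractional digits")
    unscaled = int(quantized.scaleb(scale))        # 999.55 -> 99955
    if len(str(abs(unscaled))) > precision:
        raise ValueError(f"{value} exceeds precision {precision}")
    length = (unscaled.bit_length() + 8) // 8      # leaves room for the sign bit
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def decode_decimal(raw, scale):
    """Reverse encode_decimal using the column's scale."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)


raw = encode_decimal(Decimal("999.55"), precision=5, scale=2)
print(raw.hex())                 # 018673
print(decode_decimal(raw, 2))    # 999.55
```

A value such as 1000.00 would be rejected here because its unscaled form (100000) needs six digits, one more than the precision allows.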


Step 2 - Generator Assignment and Configuration

Any Generator that generates a decimal value can be used for generating logical decimal values. For example, the RandomDecimalGen or RangeDecimalGen Generator can be used. 


RandomDecimalGen Generator Example

The Generator will generate random decimal values between 100 and 999 in the format #0.00.


RangeDecimalGen Generator Example

The Generator will generate values in ascending order, starting at 100, in the format #0.00. The value increases by 0.15 for each record, so the value for each record is as follows: 

  • Record 1 - 100.00
  • Record 2 - 100.15
  • Record 3 - 100.30 
  • Record 4 - 100.45
  • ...
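The sequence above can be checked with Python's `decimal` module, which keeps two fractional digits exact at every step (binary floating point is not guaranteed to). This is only a verification sketch of the arithmetic, not of how RangeDecimalGen is implemented; the variable names are chosen for illustration.

```python
from decimal import Decimal

start = Decimal("100.00")
step = Decimal("0.15")

# The first four values of the range: start, start + step, start + 2 * step, ...
values = [start + step * i for i in range(4)]

for record, value in enumerate(values, start=1):
    print(f"Record {record} - {value}")
```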