Description
The ParquetFileReceiver writes Domain Attribute values to a file in the Parquet format. Parquet stores data in a flat columnar layout, which is generally more efficient in storage and read performance than row-oriented formats such as CSV.
In This Article
- Receiver Parameters
- Receiver Attribute Property Keys
- Example of Setting Receiver Property Key Values
- File and Directory Config Tabs
- Generating Logical Data Types
Receiver Parameters
The following parameters can be defined for the ParquetFileReceiver. Items with an asterisk (*) are required:
- *path - Defines the location to store the newly generated Parquet output file.
- subDir - Defines the sub-directory under the path to store the newly generated Parquet output file.
- *fileName - Defines the name of the Parquet output file.
- *blockSize - Defines the size of a row group that is buffered in memory. This limits memory usage while writing to the file; larger values consume more memory when writing. The default size is 134217728 bytes (128 MB).
- *pageSize - Defines the page size used for compression. A block is composed of pages, and a page is the smallest unit that must be read fully to access a single record. If this value is too small, compression will deteriorate. The default size is 1048576 bytes (1 MB).
- *compressionCodecName - Defines the compression algorithm used to compress pages. The compression algorithms that GenRocket supports are UNCOMPRESSED, SNAPPY, and GZIP.
- enableValidation - Specifies whether schema validation should be turned on. The default value is "false."
- *enableDictionary - Defines the Boolean value to enable/disable dictionary encoding. It should be either true or false.
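The parameter list above can be checked before a run with a small script. The sketch below is only an illustration in Python: the helper name `check_receiver_parameters`, the dictionary layout, and the sample path and file name are assumptions made for this example and are not part of GenRocket. Only the parameter names, required flags, defaults, and codec names come from the list above.

```python
# Hypothetical helper, not a GenRocket API. It mirrors the documented
# parameter list so a configuration can be sanity-checked up front.
REQUIRED = {"path", "fileName", "blockSize", "pageSize",
            "compressionCodecName", "enableDictionary"}
OPTIONAL = {"subDir", "enableValidation"}
CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP"}

# Defaults as documented above.
DEFAULTS = {
    "blockSize": 134217728,   # 128 * 1024 * 1024 bytes
    "pageSize": 1048576,      # 1024 * 1024 bytes
    "enableValidation": "false",
}


def check_receiver_parameters(params):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for name in sorted(REQUIRED - set(params)):
        problems.append(f"missing required parameter: {name}")
    for name in sorted(set(params) - REQUIRED - OPTIONAL):
        problems.append(f"unknown parameter: {name}")
    codec = params.get("compressionCodecName")
    if codec is not None and codec not in CODECS:
        problems.append(f"unsupported compressionCodecName: {codec}")
    for name in ("blockSize", "pageSize"):
        value = params.get(name)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{name} must be a positive number of bytes")
    for name in ("enableValidation", "enableDictionary"):
        value = params.get(name)
        if value is not None and str(value).lower() not in ("true", "false"):
            problems.append(f"{name} must be true or false")
    return problems


example = {
    "path": "/tmp/output",            # hypothetical location
    "fileName": "accounts.parquet",   # hypothetical file name
    "blockSize": DEFAULTS["blockSize"],
    "pageSize": DEFAULTS["pageSize"],
    "compressionCodecName": "SNAPPY",
    "enableDictionary": "true",
}
print(check_receiver_parameters(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report every misconfigured parameter at once.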
Receiver Attribute Property Keys
The Receiver defines six property keys that can be modified on any of its associated Domain Attributes:
- columnName - Defines the column name as it will be output into the Parquet file.
- include - Determines if the Attribute will be included as a column in the output. Available options are "true" and "false."
- columnType - Defines the column data type. Available options are string, null, boolean, int, long, float, double, and bytes.
- logicalType - Defines the logical type for the data. Available options are date, timestamp, and decimal.
- precision - Defines the precision for decimal numbers. Precision is the number of digits within the decimal value.
- scale - Defines the scale for decimal numbers. Scale is the number of digits to the right of the decimal point within a decimal value.
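As a rough illustration of how these property keys relate to each other, the Python sketch below checks one Attribute's keys. The helper `check_property_keys` is hypothetical and not a GenRocket API. The allowed values come from the list above, the columnType/logicalType pairings are the ones used in the use cases later in this article, and the rule that scale may not exceed precision is the usual constraint for decimal types.

```python
# Hypothetical helper, not a GenRocket API.
COLUMN_TYPES = {"string", "null", "boolean", "int", "long",
                "float", "double", "bytes"}

# logicalType -> columnType, as paired in the use cases of this article.
LOGICAL_TO_COLUMN = {"date": "int", "timestamp": "long", "decimal": "bytes"}


def check_property_keys(keys):
    """Return a list of problems found in one Attribute's property keys."""
    problems = []
    column_type = keys.get("columnType")
    logical_type = keys.get("logicalType")
    if column_type not in COLUMN_TYPES:
        problems.append(f"unknown columnType: {column_type}")
    if logical_type is None:
        return problems  # plain column, nothing more to check
    expected = LOGICAL_TO_COLUMN.get(logical_type)
    if expected is None:
        problems.append(f"unknown logicalType: {logical_type}")
    elif column_type != expected:
        problems.append(
            f"logicalType {logical_type} expects columnType {expected}")
    if logical_type == "decimal":
        precision, scale = keys.get("precision"), keys.get("scale")
        if not isinstance(precision, int) or not isinstance(scale, int):
            problems.append("decimal needs integer precision and scale")
        elif not (precision > 0 and 0 <= scale <= precision):
            problems.append(
                "decimal needs precision > 0 and 0 <= scale <= precision")
    return problems


print(check_property_keys({"columnName": "balance", "include": "true",
                           "columnType": "bytes", "logicalType": "decimal",
                           "precision": 5, "scale": 2}))
```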
Example of Setting Receiver Property Key Values
The example image below shows the property key view for the set of Attributes of a Domain using the ParquetFileReceiver.
File and Directory Config Tabs
The File and Directory Config Tabs are used to configure which event triggers file/directory creation and how generated files/directories are named. For more information, see the documentation on the File and Directory Config Tabs.
Generating Logical Data Types
Additional steps must be performed when generating logical data types in Parquet output format. Users must modify the Attribute's Property Keys within the Receiver and assign/configure a Generator that generates the proper data type (e.g., int, long, decimal).
These steps specifically apply when generating the following data values:
- Dates
- Timestamps
- DateTimes
- Decimals
Use Case 1 - Generating Dates
A user wants to generate an account creation date in the Parquet file format. They have created a Domain Attribute titled "acctCreationDate" that will be used to generate this value.
The user must configure the property keys appropriately to generate a logical date. They must also assign and configure a Generator that generates the "int" data type.
Step 1 - Set Property Keys for the Attribute
For dates, the following property keys will need to be changed within the ParquetFileReceiver:
- columnType - int
- logicalType - date
Step 2 - Generator Assignment and Configuration
The date is based on the Unix epoch 01-01-1970. Any Generator that generates an "int" data type can be used to generate the required values. For example, a RangeGen or RandomGen Generator can be used.
The generated value is added as a number of days to the Unix epoch date of 01-01-1970. For example, if the value is "2", the date will be 2 days after 01-01-1970 (i.e., 01-03-1970).
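The day arithmetic can be tried out with Python's standard library. This is a minimal sketch for checking values by hand; the function names `days_to_date` and `date_to_days` are chosen for this example and do not exist in GenRocket.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def days_to_date(days):
    """Interpret an int as days since the Unix epoch, as a logical date does."""
    return UNIX_EPOCH + timedelta(days=days)


def date_to_days(value):
    """Inverse: the int a Generator must produce to obtain a given date."""
    return (value - UNIX_EPOCH).days


print(days_to_date(2).strftime("%m-%d-%Y"))  # 01-03-1970
print(date_to_days(date(1970, 1, 3)))        # 2
```

The inverse function is useful for choosing Generator bounds: compute the day numbers of the earliest and latest dates wanted, then use them as the minimum and maximum of the int Generator.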
RandomGen Generator Example
This Generator will generate a random number between 1 and 10000. That value is the number of days added to the Unix epoch date of 01-01-1970.
RangeGen Generator Example
The Generator will generate a range of numbers in ascending order starting at 1. The value will jump by 3 for each record. In this example, the date will be changed as shown below for each record:
- Record 1 - 1 day from epoch (01-02-1970)
- Record 2 - 4 days from epoch (01-05-1970)
- Record 3 - 7 days from epoch (01-08-1970)
- Record 4 - 10 days from epoch (01-11-1970)
- ...
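The record-by-record dates listed above can be reproduced with a short Python sketch. It only imitates an ascending range that starts at 1 and increases by 3; `range_dates` is a name made up for this example and is not how the RangeGen Generator is implemented.

```python
from datetime import date, timedelta
from itertools import count, islice

UNIX_EPOCH = date(1970, 1, 1)


def range_dates(start, step, records):
    """Return (record number, days from epoch, MM-DD-YYYY) for each record."""
    rows = []
    days_values = islice(count(start, step), records)
    for record, days in enumerate(days_values, start=1):
        day = UNIX_EPOCH + timedelta(days=days)
        rows.append((record, days, day.strftime("%m-%d-%Y")))
    return rows


for record, days, text in range_dates(start=1, step=3, records=4):
    print(f"Record {record} - {days} days from epoch ({text})")
```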
Use Case 2 - Generating Timestamps
A user wants to generate a timestamp value in the Parquet file format. They have created a Domain Attribute titled "timestamp" that will be used to generate this value.
The user must configure the property keys appropriately. They must also assign and configure a Generator that generates the "long" data type.
Step 1 - Set Property Keys for the Attribute
For timestamps, the following property keys will need to be changed:
- columnType - long
- logicalType - timestamp
Step 2 - Generator Assignment and Configuration
The timestamp is based on the Unix epoch 01-01-1970 in milliseconds. Any Generator that generates a "long" data type can be used to generate the required value. For example, a RangeGen or RandomGen Generator can be used.
The random or range value is added as milliseconds to the Unix epoch (01-01-1970 00:00:00.000) to produce the timestamp written to the Parquet file.
RandomGen Generator Example
This Generator will generate a random number between 100 and 10000. That value is the number of milliseconds added to the Unix epoch date of 01-01-1970.
RangeGen Generator Example
The Generator will generate a range of numbers in ascending order starting at 1. The value will jump by 500 for each record. In this example, the timestamp will be changed as shown below for each record:
- Record 1 - 1970-01-01 00:00:00.001
- Record 2 - 1970-01-01 00:00:00.501
- Record 3 - 1970-01-01 00:00:01.001
- Record 4 - 1970-01-01 00:00:01.501
- ...
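The timestamps in the list above follow directly from adding each generated value, in milliseconds, to the epoch. The Python sketch below reproduces them. It assumes the values are interpreted in UTC, which this article does not state, and the function names are chosen for this example only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the epoch is taken in UTC.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_timestamp(millis):
    """Interpret a long as milliseconds since the Unix epoch."""
    return UNIX_EPOCH + timedelta(milliseconds=millis)


def format_timestamp(ts):
    """Format as YYYY-MM-DD HH:MM:SS.mmm (millisecond precision)."""
    return ts.strftime("%Y-%m-%d %H:%M:%S.") + f"{ts.microsecond // 1000:03d}"


# A range starting at 1 and increasing by 500, as in the example above.
for record, millis in enumerate(range(1, 2000, 500), start=1):
    print(f"Record {record} - {format_timestamp(millis_to_timestamp(millis))}")
```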
TimestampGen Generator Example
The TimestampGen Generator can also be used to generate a timestamp value.
Use Case 3 - Generating DateTime Values
A user wants to generate a DateTime value in the Parquet file format. They have created a Domain Attribute titled "dateTime" that will be used to generate this value.
The user must configure the property keys appropriately. They must also assign and configure a Generator that generates the "long" data type.
Step 1 - Set Property Keys for the Attribute
DateTime values are represented as a Timestamp. To generate these types of values, these property keys must be configured for the Attribute generating the value:
- columnType - long
- logicalType - timestamp
Step 2 - Generator Assignment and Configuration
DateTime values use "timestamp" as the logical type, which is based on the Unix epoch 01-01-1970 in milliseconds.
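To see which "long" value corresponds to a given date and time, the conversion can be sketched in Python as below. The function `datetime_to_millis` is a name invented for this example, and the use of UTC is an assumption, since the article does not say which time zone applies.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the epoch and all inputs are expressed in UTC.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_millis(value):
    """Whole milliseconds between the Unix epoch and a timezone-aware datetime."""
    return (value - UNIX_EPOCH) // timedelta(milliseconds=1)


one_day_later = datetime(1970, 1, 2, tzinfo=timezone.utc)
print(datetime_to_millis(one_day_later))  # 86400000
```

The input must be timezone-aware; subtracting a naive datetime from the aware epoch raises a TypeError in Python.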
Any Generator that generates a "long" data type can be used to generate the required value. In the example below, the FlexibleDateRangeGen is used.
Sample Output
The output appears as shown below in a Parquet file viewer:
Use Case 4 - Generating Decimals
A user wants to generate a decimal value in the Parquet file format. They have created a Domain Attribute titled "balance" that will be used to generate this value.
The user must configure the property keys appropriately. They must also assign and configure a Generator that generates the "decimal" data type.
Step 1 - Set Property Keys for the Attribute
For decimal values, the following property keys will need to be changed:
- columnType - bytes
- logicalType - decimal
- precision - appropriate integer value (e.g., 5)
- scale - appropriate integer value (e.g., 2)
Below, the balance Attribute has a precision of "5" and a scale of "2" (e.g., 999.55).
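For background on why the column type is bytes: in the Parquet format, a decimal stored in a binary column is the unscaled integer (the value with the decimal point removed) encoded as big-endian two's complement, with precision and scale kept in the schema. The Python sketch below illustrates that encoding for 999.55. Whether the Receiver performs this conversion in exactly this way is an assumption, and the function names are invented for this example.

```python
from decimal import Decimal


def decimal_to_bytes(value, precision, scale):
    """Encode a Decimal as big-endian two's-complement unscaled bytes."""
    quantum = Decimal(1).scaleb(-scale)        # e.g. 0.01 for scale 2
    scaled = value.quantize(quantum)
    if scaled != value:
        raise ValueError(f"{value} has more than {scale} fractional digits")
    unscaled = int(scaled.scaleb(scale))       # 999.55 -> 99955
    if len(str(abs(unscaled))) > precision:
        raise ValueError(f"{value} needs more than {precision} digits")
    # Enough bytes to hold the value plus its sign bit.
    length = (unscaled.bit_length() + 8) // 8
    return unscaled.to_bytes(length, "big", signed=True)


def bytes_to_decimal(raw, scale):
    """Decode the bytes back into a Decimal using the column's scale."""
    return Decimal(int.from_bytes(raw, "big", signed=True)).scaleb(-scale)


raw = decimal_to_bytes(Decimal("999.55"), precision=5, scale=2)
print(raw.hex())                 # 018673
print(bytes_to_decimal(raw, 2))  # 999.55
```

With a precision of 5 and a scale of 2, the largest value that fits is 999.99; a value such as 1000.00 needs six digits and is rejected by the sketch.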
Step 2 - Generator Assignment and Configuration
Any Generator that generates a decimal value can be used for generating logical decimal values. For example, the RandomDecimalGen or RangeDecimalGen Generator can be used.
RandomDecimalGen Generator Example
The Generator will generate random decimal values between 100 and 999 with this format #0.00.
RangeDecimalGen Generator Example
The Generator will generate a value starting at 100 with this format #0.00. The value will jump by 0.15 for each record in ascending order. For this example, the value will be changed as shown below for each record:
- Record 1 - 100.00
- Record 2 - 100.15
- Record 3 - 100.30
- Record 4 - 100.45
- ...
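The sequence above can be reproduced with Python's `decimal` module, which keeps the 0.15 step exact instead of accumulating binary floating-point error. This is only a sketch of the arithmetic; `decimal_range` is a name made up for this example and is not the RangeDecimalGen implementation.

```python
from decimal import Decimal


def decimal_range(start, step, records):
    """Return the first values of an ascending decimal range as #0.00 text."""
    start, step = Decimal(start), Decimal(step)
    two_places = Decimal("0.01")
    return [str((start + i * step).quantize(two_places))
            for i in range(records)]


for record, value in enumerate(decimal_range("100", "0.15", 4), start=1):
    print(f"Record {record} - {value}")
```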