Description

The ParquetPartitionReceiver writes generated data to one or more Parquet files, with the work partitioned across multiple instances by the GenRocket GPartition Engine. This allows huge amounts of data to be generated quickly and in parallel. The generated files are meant to be merged into a single file by the GenRocket ParquetPartitionFileMergeReceiver.
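
As a rough illustration of the partitioning idea, the sketch below splits a total record count into contiguous ranges, one per server/instance pair. It is not GenRocket's actual algorithm: the function name partition_ranges, the 1-based numbering, and the even-split strategy are all assumptions made for this example.

```python
def partition_ranges(total_records, servers, instances_per_server):
    """Split records 1..total_records into one contiguous range per
    (server, instance) pair. Hypothetical helper, not a GenRocket API."""
    slots = servers * instances_per_server
    base, extra = divmod(total_records, slots)
    ranges = {}
    start = 1
    slot = 0
    for server in range(1, servers + 1):          # assumed 1-based numbering
        for instance in range(1, instances_per_server + 1):
            # The first `extra` slots take one additional record each.
            count = base + (1 if slot < extra else 0)
            ranges[(server, instance)] = (start, start + count - 1)
            start += count
            slot += 1
    return ranges


if __name__ == "__main__":
    for key, value in partition_ranges(10, 2, 2).items():
        print(key, value)
```

Each pair can then generate its own range independently, which is what makes parallel generation possible before the merge step.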


Note: To learn more about the GenRocket GPartition engine, see the GPartition engine documentation.


Parameters

The following parameters can be defined for the ParquetPartitionReceiver. Items with an asterisk (*) are required. 

  • outputPath* - Defines the base path where data files will be stored.
  • outputSubDir - Defines an optional subdirectory under the outputPath where generated data will be stored.
  • blockSize* - Defines the size, in bytes, of a row group (block) buffered in memory. This limits memory usage while writing to the file; larger values consume more memory when writing. The default size is 134217728 bytes (128 MiB).
  • pageSize* - Defines the page size, in bytes. A block is composed of pages, and the page is the unit of compression: when reading, each page can be decompressed independently, and a page is the smallest unit that must be read in full to access a single record. If this value is too small, compression will deteriorate. The default size is 1048576 bytes (1 MiB).
  • compressionCodecName* - Defines the compression algorithm used to compress pages. The compression algorithms that GenRocket supports are UNCOMPRESSED, SNAPPY, and GZIP.
  • enableValidation - Specifies whether schema validation should be turned on.
  • enableDictionary* - Specifies whether dictionary encoding is enabled. The value must be either true or false.
  • filesPerDirectory* - Defines the number of files that will be created in each directory.
  • recordsPerFile* - Defines the number of records that will be stored in each file.
  • serverNumber* - Defines the number of the server instance on which the Receiver runs. The Receiver uses it to determine the output directory structure in which it deposits the generated data files. Because this Receiver is meant for Scenarios run by the GenRocket GPartition engine to generate huge amounts of data, the GPartition engine sets this parameter automatically.
  • instanceNumber* - Defines the number of the runtime instance, on a given server instance, on which the Receiver runs. The Receiver uses it to determine the output directory structure in which it deposits the generated data files. Like serverNumber, this parameter is set automatically by the GPartition engine.
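
To make the parameter list concrete, here is a minimal sketch that checks a parameter set against the rules stated above (required items, supported codecs, true/false for enableDictionary) and estimates how many files and directories a run would produce. The helper names, the example values, the output path, and the ceiling-division layout estimate are assumptions for illustration, not GenRocket behavior.

```python
REQUIRED = (
    "outputPath", "blockSize", "pageSize", "compressionCodecName",
    "enableDictionary", "filesPerDirectory", "recordsPerFile",
    "serverNumber", "instanceNumber",
)
SUPPORTED_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP"}


def check_parameters(params):
    """Return a list of problems found in a parameter dict (hypothetical helper)."""
    problems = [f"missing required parameter: {name}"
                for name in REQUIRED if name not in params]
    codec = params.get("compressionCodecName")
    if codec is not None and codec not in SUPPORTED_CODECS:
        problems.append(f"unsupported compressionCodecName: {codec}")
    flag = params.get("enableDictionary")
    if flag is not None and flag not in ("true", "false"):
        problems.append("enableDictionary must be true or false")
    return problems


def estimate_layout(total_records, params):
    """Estimate (files, directories) using ceiling division (an assumption)."""
    files = -(-total_records // params["recordsPerFile"])
    directories = -(-files // params["filesPerDirectory"])
    return files, directories


# Example values only; the path and counts are made up for this sketch.
example = {
    "outputPath": "/tmp/genrocket/output",
    "blockSize": 134217728,      # default: 128 MiB
    "pageSize": 1048576,         # default: 1 MiB
    "compressionCodecName": "SNAPPY",
    "enableDictionary": "true",
    "filesPerDirectory": 4,
    "recordsPerFile": 100000,
    "serverNumber": 1,           # normally set by the GPartition engine
    "instanceNumber": 1,         # normally set by the GPartition engine
}

if __name__ == "__main__":
    print(check_parameters(example))
    print(estimate_layout(1_000_000, example))
```

With these example values, 1,000,000 records at 100,000 records per file give 10 files, which at 4 files per directory need 3 directories.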



Receiver Attribute Property Keys

The Receiver defines three property keys that can be modified on any of its associated Domain Attributes:

  • columnName - Defines the column name as it will be output into the Parquet file. 
  • include - Determines if the Attribute will be included as a column in the output.
  • columnType - Defines the column data type.
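
The effect of these keys can be sketched as follows. The example assumes that an Attribute with no overrides is included and keeps its own name as the column name; those defaults, the helper build_row, the sample Attribute names, and the type label "INT64" are placeholders for illustration, not documented GenRocket values.

```python
def build_row(record, attribute_properties):
    """Apply columnName and include overrides to one generated record.
    Hypothetical helper; defaults are assumptions for this sketch."""
    row = {}
    for attribute, value in record.items():
        props = attribute_properties.get(attribute, {})
        if not props.get("include", True):
            continue  # Attribute is left out of the output
        row[props.get("columnName", attribute)] = value
    return row


def build_schema(attribute_properties):
    """List (column name, column type) pairs for Attributes that set a columnType."""
    return [
        (props.get("columnName", attribute), props["columnType"])
        for attribute, props in attribute_properties.items()
        if props.get("include", True) and "columnType" in props
    ]


# Sample overrides; names and the type label are made up.
properties = {
    "id": {"columnName": "user_id", "columnType": "INT64"},
    "internalNote": {"include": False},
}
record = {"id": 1, "firstName": "Ada", "internalNote": "x"}

if __name__ == "__main__":
    print(build_row(record, properties))
    print(build_schema(properties))
```

Here id is renamed to user_id, firstName passes through unchanged, and internalNote is dropped because include is false.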