The ParquetSegmentMergeReceiver merges different segments created by the SegmentDataCreatorReceiver to generate complex nested Parquet output.
When should this Receiver be used?
- Any time you want to generate nested Parquet output.
When should this Receiver not be used?
- Any time you want to generate another type of nested output.
Are any additional items required to use this Receiver?
- SegmentDataCreatorReceiver - Must be assigned to each Domain except the Merge Domain.
- Note: The ParquetSegmentMergeReceiver should only be assigned to the Merge Domain that merges the generated segments.
- Configuration File - Used to determine the Parquet output format. This is typically named "config.xml" but can be named differently.
- Note: When named differently, the name must also be changed for the configName parameter within the Receiver.
List of Steps to Generate Nested Parquet Output
- Set up a Project with Domains, Parent-Child Relationships, and Scenarios.
- For each Domain except the Merge Domain, add and configure the SegmentDataCreatorReceiver.
- Create a Merge Domain with just an id Attribute.
- Note: This may have already been done if Domains were imported.
- Add the ParquetSegmentMergeReceiver to the Merge Domain and configure it, as discussed in this article.
- Create a Scenario Chain with all Domain Scenarios. Make sure the Merge Scenario is in the last position.
- Create a Configuration File as discussed in this article.
- Download the Scenario Chain to your local computer.
- Download the Configuration file to your local computer.
- Note: The configuration file must be placed in the location defined by the Receiver's configPath parameter (and configSubDir, if it is set).
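As a rough illustration of where the configuration file is expected, the sketch below joins the configPath, configSubDir, and configName parameters into one path. The join order (path, then optional subdirectory, then file name) is an assumption based on the parameter descriptions in this article, and the function name and sample directories are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def config_file_location(config_path: str,
                         config_name: str = "config.xml",
                         config_sub_dir: Optional[str] = None) -> PurePosixPath:
    """Join configPath, the optional configSubDir, and configName.

    Assumption: the Receiver looks for the file at
    configPath/configSubDir/configName. This is not confirmed behavior.
    """
    base = PurePosixPath(config_path)
    if config_sub_dir:
        base = base / config_sub_dir
    return base / config_name


# Hypothetical directories, for illustration only.
print(config_file_location("/home/user/genrocket/config"))
print(config_file_location("/home/user/genrocket/config", "parquet-config.xml", "parquet"))
```

If the file is renamed, the same name has to be passed as configName, which is why it is a parameter here rather than a constant.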
The ParquetSegmentMergeReceiver uses the following parameters. Items marked with an asterisk (*) are required.
- outputPath* - Defines the location to store the newly generated nested Parquet file.
- outputSubDirectory* - Defines the prefix name of subdirectories that are auto-created under the outputPath and then appended with a number (e.g., data1, data2, data3).
- configPath* - Defines the location where the configuration file is stored.
- configSubDir - The subdirectory under the configPath where the configuration and template files are stored.
- configName* - Defines the name of the configuration file.
- includeRootName* - Defines whether to include the root Domain name or not.
- filesPerOutputSubDir - Defines the number of files to be generated per output subdirectory.
- segmentPath* - Defines the path to the segment directory where all segment subdirectories can be found.
- segmentSubDirectory* - Defines the subdirectory under the segmentPath where segment files can be found.
- nullValue* - Defines the value for null.
- overrideFileName - Overrides the output file name given in the configuration file. It also allows the output file name to be modified through the Engine API.
- deleteOutputSubDir - Defines whether to delete the output subdirectory before generating a new output file.
- outputFormatType - Defines whether the output file format is expanded or collapsed.
- blockSize* - The block size is the size of a row group being buffered in memory. This limits memory usage while writing into the file. Larger values will consume more memory when writing. The default size is 134217728 bytes.
- pageSize* - The page size is for compression. When reading, each page can be decompressed independently. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. The default size is 1048576 bytes.
- compressionCodecName* - The compression algorithm used to compress pages.
- enableValidation* - Specifies whether schema validation should be turned on.
- enableDictionary* - Boolean value that enables or disables dictionary encoding. It must be either true or false.
- recordsPerFile - Defines the number of records in each output file.
- deleteSegmentDir - Defines whether to delete the segments directory or not.
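The parameter list above can be checked mechanically before a run. The sketch below is not part of the product; it only collects the names marked with an asterisk and reports which ones are absent from a hypothetical parameter dictionary. It also works out the default sizes: 134217728 bytes is 128 MiB and 1048576 bytes is 1 MiB, so the default block size is 128 times the default page size. All sample values are assumptions.

```python
# Parameters marked with an asterisk in the list above.
REQUIRED_PARAMETERS = (
    "outputPath", "outputSubDirectory", "configPath", "configName",
    "includeRootName", "segmentPath", "segmentSubDirectory", "nullValue",
    "blockSize", "pageSize", "compressionCodecName",
    "enableValidation", "enableDictionary",
)

DEFAULT_BLOCK_SIZE = 134217728  # bytes, 128 * 1024 * 1024
DEFAULT_PAGE_SIZE = 1048576     # bytes, 1024 * 1024


def missing_required(parameters: dict) -> list:
    """Return the required parameter names that are not present."""
    return [name for name in REQUIRED_PARAMETERS if name not in parameters]


def block_to_page_ratio(block_size: int = DEFAULT_BLOCK_SIZE,
                        page_size: int = DEFAULT_PAGE_SIZE) -> int:
    """Size ratio between one block (row group) and one page."""
    return block_size // page_size


# Hypothetical, deliberately incomplete parameter set.
example = {
    "outputPath": "/tmp/output",
    "outputSubDirectory": "data",
    "configPath": "/tmp/config",
    "configName": "config.xml",
    "segmentPath": "/tmp/segments",
    "blockSize": DEFAULT_BLOCK_SIZE,
    "pageSize": DEFAULT_PAGE_SIZE,
}

print(missing_required(example))
print(block_to_page_ratio())  # 128
```

The ratio is only a size comparison. The number of pages actually written per block depends on the data and on compression.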
Receiver Attributes Property Keys
There are no property keys necessary for this Receiver.
File Config Tab
The File Config Tab is used to configure what event will trigger file creation and the naming configuration for generated files.
Constant has been chosen as the event in the example below. A file will be created for every one-hundred records. Each created file will have the defined naming convention (e.g., Address1.parquet, Address2.parquet, Address3.parquet). The number will increment for each generated file.
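The naming rule described above (a new file every one hundred records, with an incrementing number) comes down to integer division. The sketch below shows the arithmetic only. The function name is hypothetical, and treating the record index as zero-based is an assumption.

```python
def file_name_for_record(record_index: int,
                         records_per_file: int = 100,
                         prefix: str = "Address",
                         extension: str = ".parquet") -> str:
    """Map a zero-based record index to the file that would hold it."""
    if record_index < 0:
        raise ValueError("record_index must be zero or greater")
    file_number = record_index // records_per_file + 1
    return f"{prefix}{file_number}{extension}"


print(file_name_for_record(0))    # Address1.parquet
print(file_name_for_record(99))   # Address1.parquet
print(file_name_for_record(100))  # Address2.parquet
print(file_name_for_record(250))  # Address3.parquet
```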
Note: For more information on how to use the File Config Tab, click here.
Directory Config Tab
The Directory Config Tab is used to configure what event will trigger directory creation and the naming configuration for generated directories.
Constant has been chosen as the event in the example below. A directory will be created for every ten files that are generated. Each created directory will have the defined naming convention (e.g., AddressFiles1, AddressFiles2, AddressFiles3). The number will increment for each generated directory.
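The directory rule works the same way one level up: every ten files start a new directory. The sketch below is only a model of that arithmetic, with a hypothetical function name and one-based file numbers assumed.

```python
def directory_for_file(file_number: int,
                       files_per_directory: int = 10,
                       prefix: str = "AddressFiles") -> str:
    """Map a one-based file number to the directory that would hold it."""
    if file_number < 1:
        raise ValueError("file_number must be one or greater")
    directory_number = (file_number - 1) // files_per_directory + 1
    return f"{prefix}{directory_number}"


print(directory_for_file(1))   # AddressFiles1
print(directory_for_file(10))  # AddressFiles1
print(directory_for_file(11))  # AddressFiles2
print(directory_for_file(25))  # AddressFiles3
```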
Note: For more information on how to use the Directory Config Tab, click here.
The ParquetSegmentMergeReceiver requires a configuration file to help facilitate the formatting of the data output.
Example Configuration File
The example configuration file below defines the following:
- fileNameSegments - The fileNameSegment tag defines the file naming convention for the Parquet file that is being generated. For example:
- Parquet-1.parquet, Parquet-2.parquet, and so on
- segments - The segment files from the segment tag will be loaded and used to create the merged output.
- segmentsHierarchy - Defines the hierarchical structure of the Domains.
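To make the idea of merging segments along a hierarchy more concrete, the sketch below nests flat child rows under their parent rows. It is a conceptual model only, not the Receiver's algorithm or file format. The Domain names (User, Address), the key names (id, parentId), and the function name are all hypothetical.

```python
from collections import defaultdict


def merge_segments(parents, children, child_name,
                   parent_key="id", child_parent_key="parentId"):
    """Nest each child row under the parent row whose key it references."""
    grouped = defaultdict(list)
    for child in children:
        # Drop the linking key so only the child's own values are nested.
        row = {k: v for k, v in child.items() if k != child_parent_key}
        grouped[child[child_parent_key]].append(row)
    return [
        {**parent, child_name: grouped.get(parent[parent_key], [])}
        for parent in parents
    ]


# Hypothetical segment data for a parent Domain and a child Domain.
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
addresses = [
    {"parentId": 1, "city": "Austin"},
    {"parentId": 1, "city": "Boston"},
]

for record in merge_segments(users, addresses, "Address"):
    print(record)
```

In this model a parent with no matching children gets an empty list. How the Receiver represents that case is not stated in this article.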
Steps to Create a Configuration File
- Within the Project Dashboard, select the Configuration Management Tab in the Management Pane.
- Then click on the New Configuration button.
- Select the Parquet configuration type and then click the Select button.
- A form will open. Enter the details and select the Segments/Domains from the drop-down. The form includes:
- Name - Name used to identify the configuration file within the Project.
- Config File Name - Name the Receiver will look for when generating test data. It should be used in the configName parameter for the Receiver.
- Output File Name Format - Defines the format used for the generated output file(s).
- Segment Files - Defines the segments that will be used to create nested Parquet output.
- Click the Save button once finished.
- A new form will open. Select a Domain from the drop-down.
- Selecting a Domain lists its Attributes.
- Users can select one of three options for how selected Domain objects are shown:
- Array Always - Domain object always shows in an array.
- Array Only When Greater Than 1 - Domain objects show as an array only when the loop count is greater than one.
- Show as List - The Domain data displays as a list of literals. Only one Attribute within the Domain can be selected when this option is selected.
- Select the Attributes to be included in the final output file for the selected Domain, as shown in the screenshot below.
- Optionally, users can use the Array and Null checkboxes for individual Attributes within a selected Domain.
- For each Attribute, users can select a Data Type. The default selection is "string". Options include string, boolean, bytes, double, float, int, long, null, and fixed.
Note: Additional options will be available in the Logical Type column when int, long, or fixed are selected. In these instances, the user can select the timestamp, date, or decimal logical type.
- If more than one Domain and its Attributes will be added, use the Save & Next button. Click the Save button once the last Domain and Attributes have been added.
- Drag and drop the Domains to set up segment hierarchy as in the screenshot below. Once finished, click the Done button.
- The configuration will be ready to download.
- Download it by clicking the Cloud icon within the Configuration Management Tab.
- Place the downloaded file in the directory where the Receiver expects it (see the configPath parameter).
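The three display options described in the steps above (Array Always, Array Only When Greater Than 1, Show as List) can be pictured with the small sketch below. It only models the shapes those descriptions imply. The option strings, the function name, and the result for zero rows are assumptions, not product behavior.

```python
def shape_domain(rows, option, list_attribute=None):
    """Shape a Domain's rows according to one of the three display options."""
    if option == "array_always":
        return list(rows)
    if option == "array_only_when_greater_than_1":
        if len(rows) > 1:
            return list(rows)
        # Assumption: a single row is emitted as one object, none as None.
        return rows[0] if rows else None
    if option == "show_as_list":
        if list_attribute is None:
            raise ValueError("show_as_list needs exactly one Attribute name")
        return [row[list_attribute] for row in rows]
    raise ValueError(f"unknown option: {option}")


one = [{"city": "Austin"}]
two = [{"city": "Austin"}, {"city": "Boston"}]

print(shape_domain(one, "array_always"))
print(shape_domain(one, "array_only_when_greater_than_1"))
print(shape_domain(two, "array_only_when_greater_than_1"))
print(shape_domain(two, "show_as_list", list_attribute="city"))
```

Show as List takes a single Attribute name because, as noted in the steps, only one Attribute can be selected with that option.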