The ParquetSegmentMergeReceiver merges different segments created by the SegmentDataCreatorReceiver to generate complex nested Parquet output.
This article shows how to generate a complex, nested Parquet file with the GenRocket web platform. Complete the following steps:
Step 1: Create a Project in the GenRocket web platform
Click on the New Project button within the Projects pane to create a new Project. For detailed steps on how to create a Project, please click here.
For this example, a Project titled "ParquetDemo" has been created. A default Project Version is created automatically along with the Project.
Step 2: Import the JSON File
Parquet has a JSON-like data model and can therefore be represented as JSON. Each JSON object can be represented as a GenRocket Domain, and each Domain generates a segment of data for that Domain only. All the segments are then merged into a single Parquet file by the ParquetSegmentMergeReceiver with the help of a Merge Domain.
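To make the object-to-Domain mapping concrete, here is a minimal Python sketch that lists every JSON object nested inside a sample document. The sample document, its field names, and the `object_paths` helper are hypothetical and used only for illustration; this is not GenRocket's import logic, only a way to see which objects in a file would each correspond to a segment.

```python
import json

# Hypothetical nested document; the field names are illustrative only.
sample = json.loads("""
{
  "customer": {
    "id": 1,
    "name": "Ada",
    "address": {"city": "Austin", "zip": "78701"},
    "orders": [
      {"orderId": 10, "total": 25.5}
    ]
  }
}
""")


def object_paths(node, path=""):
    """Yield the dotted path of every JSON object nested inside `node`."""
    if isinstance(node, dict):
        if path:
            yield path
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from object_paths(value, child)
    elif isinstance(node, list) and node:
        # An array of objects shares one shape, so inspect the first element only.
        yield from object_paths(node[0], path)


segments = list(object_paths(sample))
print(segments)  # ['customer', 'customer.address', 'customer.orders']
```

In this sketch the document contains three objects, so under the mapping described above one would expect three data segments plus the Merge Domain that combines them.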
To set up the Project, you can import a JSON file with the required segments for generating the Nested Parquet File Format. Complete the following steps to import the JSON file:
- Click on the New Domain menu within the Project Dashboard.
- Select the Import from JSON option.
- Click on the Choose File button. Browse to the location of the file and select the file.
- Choose Parquet for the Output File Format and click the Save button.
This will create all the Domains needed to generate the Nested Parquet File including the Merge Domain needed to merge the individual segments.
A Scenario will be automatically created for each Domain in the Project and a Scenario Chain will be created to run all Scenarios in sequence to generate the required data.
Additionally, a Configuration File will be created during the import process, which is required by the Merge Domain.
Note: It may take a few minutes for the Domains, Scenarios, Scenario Chain, and Configuration File to be created automatically.
Step 3: Download the ParquetConfig.xml File
Select the Configuration Management tab within the Project Dashboard and then click on the Download (Cloud) icon to download the ParquetConfig.xml file to your computer.
The ParquetConfig.xml file will need to be placed in a Config folder within your resource output directory.
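If you prefer to script this step, the following Python sketch copies a downloaded file into a Config folder under an output directory, creating the folder if it does not exist. The `place_config` helper and all paths are hypothetical; the demo uses temporary directories and a placeholder file, so substitute your actual download location and resource output directory.

```python
import shutil
import tempfile
from pathlib import Path


def place_config(downloaded_file: Path, output_dir: Path) -> Path:
    """Copy the downloaded file into <output_dir>/Config and return the new path."""
    config_dir = output_dir / "Config"
    config_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(downloaded_file, config_dir / downloaded_file.name))


# Demonstration with temporary directories standing in for real locations.
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "Downloads"
    downloads.mkdir()
    source = downloads / "ParquetConfig.xml"
    source.write_text("<placeholder/>")  # stand-in content, not a real configuration

    placed = place_config(source, Path(tmp) / "output")
    placed_ok = placed.is_file() and placed.parent.name == "Config"
    print(placed_ok)
```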
Step 4: Download the Scenario Chain
Click on the Download (Cloud) icon within the Scenario Chains pane to download the Scenario Chain to your local computer.
Step 5: Run the Scenario Chain
Note: Before running the Scenario Chain, make certain that GenRocket Runtime and the following Jars have been updated: Engine Jar and Receiver Jar. For more information on how to update individual GenRocket Jars, click here.
The following command line must be run in a Command Window or Terminal Session.
<ScenarioChainName> should be replaced with the actual name of the Scenario Chain. The command for this example would appear as shown below:
Step 6: View Generated Files
This example will generate one nested Parquet file named "PARQUET-1.parquet".
To validate the Parquet file, you can download Apache Spark by clicking here and then clicking the Download Spark link.
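Before opening the file in Spark, a quick sanity check is possible with plain Python. The Parquet format places the 4-byte magic number `PAR1` at both the start and the end of an unencrypted file, so the sketch below checks for it. This only confirms that the file looks like Parquet; it does not validate the schema or the data. The `looks_like_parquet` helper is hypothetical, and the demo files are stand-ins written to a temporary directory, not real GenRocket output.

```python
import os
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: Path) -> bool:
    """Return True if the file starts and ends with the Parquet magic number."""
    # Smallest possible layout: header magic (4) + footer length (4) + footer magic (4).
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demonstration with stand-in files; these are not real Parquet data.
with tempfile.TemporaryDirectory() as tmp:
    fake_parquet = Path(tmp) / "fake.parquet"
    fake_parquet.write_bytes(PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC)
    other = Path(tmp) / "other.txt"
    other.write_bytes(b"not a parquet file")

    good = looks_like_parquet(fake_parquet)
    bad = looks_like_parquet(other)
    print(good, bad)  # True False
```

To check the generated output, call `looks_like_parquet(Path("PARQUET-1.parquet"))` with the path adjusted to wherever the file was written on your machine.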