Description

Many types of testing do not require large volumes of data. The GenRocket Engine generates approximately 10,000 rows per second and hands the data off to the respective Receiver.

For databases specifically, the Receiver communicates via JDBC in batches of records (typically 1,000 records per batch). 10,000 rows of data are sufficient for most use cases. However, generating large volumes of patterned or realistic, unique data is necessary for some use cases. 
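To put these figures in perspective, the sketch below estimates how long a single instance would take at the approximate rate quoted above. It is back-of-the-envelope arithmetic only: the function name is illustrative, and the estimate ignores Receiver, disk, and network overhead.

```python
ROWS_PER_SECOND = 10_000  # approximate engine throughput quoted above

def estimate_seconds(total_rows, rows_per_second=ROWS_PER_SECOND):
    """Rough single-instance generation time, ignoring Receiver and network overhead."""
    return total_rows / rows_per_second

for rows in (10_000, 1_000_000, 1_000_000_000):
    seconds = estimate_seconds(rows)
    print(f"{rows:>13,} rows: ~{seconds:,.0f} s (~{seconds / 3600:,.1f} h)")
```

At that rate, one billion rows would take roughly 100,000 seconds (about 28 hours) on a single instance, which is the kind of workload the features in this article are intended for.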


Factors such as the volume of data, type of data, and where it is being generated can also impact test data generation speed and performance. This article provides information about the following: 

  1. Available Features for Generating Large Volumes of Data
  2. Recommendations for When to Use the Available Features
  3. Speed and Performance Factors for Optimizing Test Data Generation



Features for Generating Large Volumes of Data

Additional GenRocket features can and should be used when generating millions or billions of rows of data. These features speed up test generation for one or multiple Domains (depending on the feature).

Partition Engine

The Partition Engine partitions the load across multiple GenRocket instances running within a given server. When generating enormous amounts of test data, the load can be partitioned across multiple servers, each running multiple GenRocket instances.

Question

Answer

When should the Partition Engine be used?
Any time a user needs to generate hundreds of millions, billions, or even trillions of rows of test data for a Domain Scenario.

Recommended For
  • Bulk loading data into specific databases (e.g., MySQL, MSSQL, Oracle, PostgreSQL, DB2).

  • Generating large amounts of data in specific file formats (e.g., Delimited File, Parquet)

Not Recommended For
  • Nested Data Generation where multiple Scenarios are chained together. 

Can it be used with Scenarios?
Yes

Can it be used with Scenario Chains?
Not at this time

Can it be used with Scenario Chain Sets?
Not at this time

Can it be used when dependencies (parent/child relationships) exist between Domains?
Yes
Can it be used with small, medium, and large amounts of data?
Large amounts of data for a Scenario
How does the Partition Engine work?
Data generation for one Domain is split across multiple threads, and each thread generates its share of the data at the same time.

The volume of the test data generation must be evenly distributed across all instances.


The values produced by sequential generation must be unique and increasing across all instances.

Example: Generate 100,000 Records for a User Domain over 5 Threads

  • Thread 1 - Records 1 to 20,000
  • Thread 2 - Records 20,001 to 40,000
  • Thread 3 - Records 40,001 to 60,000
  • Thread 4 - Records 60,001 to 80,000
  • Thread 5 - Records 80,001 to 100,000
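
The even split in this example can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, not the Partition Engine's implementation, and the function name `partition_ranges` is hypothetical.

```python
def partition_ranges(total_records, instances):
    """Split 1..total_records into contiguous, non-overlapping ranges, one per instance."""
    if total_records % instances != 0:
        # The article requires the volume to be evenly distributed across instances.
        raise ValueError("volume must divide evenly across instances")
    size = total_records // instances
    return [(i * size + 1, (i + 1) * size) for i in range(instances)]

for n, (start, end) in enumerate(partition_ranges(100_000, 5), start=1):
    print(f"Thread {n}: records {start:,} to {end:,}")
```

Because each range starts one past the previous range's end, sequential values stay unique and increasing across all instances.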

What is the recommended approach for generating Domain data when dependencies are present?
When dependencies exist, it is recommended to generate data in the following order:
  1. Generate data for the Root Domain.
  2. Generate data for each Domain that other Domains are dependent on. 
  3. Generate data for each remaining Domain that has no dependencies.


Please remember that test data generation must be performed separately for each Domain Scenario. It can become more complicated when many dependencies exist between Domains.
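
One way to work out a valid generation order when many dependencies exist is a topological sort, which places every parent Domain before the Domains that depend on it. The sketch below uses Python's standard `graphlib` module; the Domain names are hypothetical, and this is not a GenRocket feature.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical Domains: each key lists the parent Domains it depends on.
depends_on = {
    "Customer": [],
    "Account": ["Customer"],
    "Transaction": ["Account"],
    "AuditLog": [],
}

# static_order() yields parents before their children.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Each Domain in the resulting list would then be generated as its own Domain Scenario, in that order.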

What Receivers can be used with the Partition Engine?
Bulk Load Receivers are used with the Partition Engine. Please see the Bulk Load Receivers section of this article for more information.

What is the Attribute Optimizer?
This is a flag built into the Partition Engine.

When the flag is set to "true," it examines the Parent/Child relationships to determine whether any Generator in a parent is not being referenced by any of the children.


If so, it turns off the Generators for Attributes that are generating data but are not referenced by a child, since the data they generate is never used. This makes test data generation slightly faster.
 

Important Note about System Configuration
The user needs to have the appropriate system configuration to support the number of threads being executed at the same time.

Where can I learn more about the Partition Engine?
Please look at this knowledge base article: What is the GenRocket Partition Engine?



Scenario Thread Engine

The Scenario Thread Engine provides another way to increase test data generation speed: it simultaneously executes multiple Scenarios within a Scenario Chain or Scenario Chain Set across multiple threads.


Question

Answer

When should the Scenario Thread Engine be used?
Any time a user needs to generate large volumes of data faster for multiple Domains where the order of execution does not matter.

Note: Data from one Scenario cannot be dependent on data from another Scenario. For example, if Domain B requires data generated from Domain A to generate its data, then this feature cannot be used. 

Recommended For
  • Nested File Generation - The segments created by the SegmentDataCreatorReceiver are not dependent on one another. Examples include nested XML, JSON, Delimited Files, Avro, and Parquet.

  • EDI Test Data Generation - Can be used to speed up test data generation of EDI Documents because nested segments are being generated with a Scenario Chain Set.

Not Recommended For
  • Inserting Test Data into a Database - Because the order in which data is generated and populated in the database matters, this asynchronous approach will not work.

  • SQLInsertReceiver - Should not be used when Scenarios contain this Receiver. This is because the order of execution is significant.

Can it be used with Scenarios?
No
Can it be used with Scenario Chains?
Yes
Can it be used with Scenario Chain Sets?
Yes
Can it be used when Parent/Child Relationships (dependencies) have been set between Domains?
Yes, when the order of data generation does not matter.

Note: It should not be used when the order of data generation matters. 

Can it be used with small, medium, and large amounts of data?
Small, medium, or large amounts of data.
Can the Scenario Thread Engine be used with the Partition Engine?
No
How does the Scenario Thread Engine work?
It simultaneously executes multiple Scenarios within a Scenario Chain or Scenario Chain Set across multiple threads.

Example:
If a user specifies 10 threads, it will run 10 Scenarios simultaneously. When each one is finished, it will grab another Scenario. This process will continue until all test data has been generated. 

Instead of running 500 Scenarios one after another, the user is now running 10 Scenarios at a time. This increases the speed of test data generation.
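
The "finish one, grab the next" behavior described in this example resembles a fixed-size worker pool. The following sketch illustrates that pattern with Python's standard library; it is an analogy rather than the Scenario Thread Engine itself, and `run_scenario` and the Scenario names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def run_scenario(name):
    """Placeholder for executing one independent Scenario."""
    return f"{name} done on {threading.current_thread().name}"

scenarios = [f"Scenario{i}" for i in range(1, 501)]  # 500 hypothetical Scenarios

# 10 worker threads: each picks up the next Scenario as soon as it finishes one.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_scenario, scenarios))

print(len(results), "Scenarios completed")
```

The pattern only works because no Scenario waits on another's output, which matches the requirement that the order of execution must not matter.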
Important Note about System Configuration
The user needs to have the appropriate system configuration to support the number of threads being executed at the same time (e.g., the number of Scenarios in a Chain or Chain Set). 

Example:
The user is executing 10 Scenarios at the same time within a Scenario Chain. The system must have enough memory to keep 10 Scenarios in memory at the same time and enough CPUs to support that execution. 

Where can I learn more about the Scenario Thread Engine?
Please look at this knowledge base article: What is the Scenario Thread Engine?



Bulk Load Receivers

Bulk Load Receivers can populate a large amount of data into data warehouses (Teradata, MongoDB, Cassandra, etc.) or unstructured databases faster than through JDBC. These Receivers allow users to generate the data to a given database's native bulk load format. 


Question

Answer

When should Bulk Load Receivers be used? 

Use for the following: 

  • Populating large volumes of data into a data warehouse
  • Populating large volumes of data into an unstructured database
  • Some Bulk Load Receivers can be used with or without the Partition Engine

What is defined in these Receivers? 

The selected Receiver defines the following:

  • File format for a particular database (e.g., MySQL, PostgreSQL, Oracle, MSSQL, Parquet, DB2, Delimited).

  • The number of files generated within a subdirectory for each instance during test data generation.

  • The number of rows of data to be generated within each file.

How do Bulk Load Receivers work?
These Receivers write the data generated for each instance to a subdirectory underneath that specific instance.

Based on the defined number of files per directory and number of records per file in the Receiver's parameters, the Receiver will do the following:
  1. On the disk partition, in a particular subdirectory, it creates a new directory.
  2. It begins inserting files into the created directory within that subdirectory.
  3. As each file hits the defined number of data rows, the file is closed, and another is created until the number of files for a subdirectory is reached.
  4. It creates another subdirectory and continues to generate files.
  5. The process repeats until all data has been generated for each defined instance. 
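
The file-rolling steps above determine how many files and subdirectories one instance produces. The sketch below works through that arithmetic; the function and parameter names are illustrative and are not the Receiver's actual parameter names.

```python
import math

def bulk_load_layout(total_rows, rows_per_file, files_per_directory):
    """Return (directories, files, rows_in_last_file) for one instance."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    files = math.ceil(total_rows / rows_per_file)
    directories = math.ceil(files / files_per_directory)
    rows_in_last_file = total_rows - (files - 1) * rows_per_file
    return directories, files, rows_in_last_file

# 20 million rows, 100,000 rows per file, 50 files per subdirectory
print(bulk_load_layout(20_000_000, 100_000, 50))  # (4, 200, 100000)
```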

Each database takes data in what is called delimited Bulk Loading format. Two files are usually created.
  1. A file containing the delimited data
  2. A file that describes what that delimited data looks like


The database reads the descriptor file to identify the data file, its columns, and so on. It then loads that large amount of data into the database very quickly.


What Bulk Load Receivers are available?
Bulk Load Data with Partition Engine

Bulk Load Data into a Database (with or without Partition Engine)
These two Receivers are considered Bulk Load Receivers as well. 
  • SQLFileInsertReceiver
  • OracleFileInsertReceiver


Both should only be used for smaller loop counts and can load the data while maintaining referential integrity.




Speed and Performance Factors

Several factors can affect the speed of test data generation, even when one of the above features is being used. 


These non-GenRocket performance factors can slow down test data generation:

Factor

Considerations and Recommendations

Operating System
Some operating systems are faster than others:
  • Linux - Fastest
  • macOS - Medium
  • Windows - Slowest


Recommendations for Better Performance
  • Try generating test data on a machine with a faster operating system. For example, if on Windows, try macOS or Linux.

System Memory and CPUs
The system memory amount must be sufficient to support the number of CPUs.

Minimum Recommended Number of CPUs and RAM
  • 4 Core CPU 
  • 8GB RAM

Recommended in Modern Testing Environments
  • 4 Core CPU
  • 16GB RAM

Recommendations for Better Performance
  • Increase System Memory
  • Increase the Number of CPUs



The following network factors also affect test data generation. The impact is typically most significant when systems are in remote locations.


Factor

Considerations and Recommendations

Network Speed
This is how fast data is transferred from one system to another over the network. The following factors affect network speed:
  • Transfer Technology - What is being used to communicate over the network
  • Bandwidth - Available data capacity for a connection
  • Utilization - The number of users and type of traffic utilizing available bandwidth
  • Latency - Any delays in network communication.
  • Location - Systems in the same physical location will experience faster speeds than those in remote locations.


Note: Bandwidth, Utilization, and Latency are discussed in more detail in this table.

Recommended Actions for Better Performance
  • Periodically run an end-to-end speed test to gain insight into available bandwidth and the degree of latency for a given connection.

  • Scheduling tests to avoid busy hours can help mitigate the impact of network-related performance issues.

Network Bandwidth
This is the maximum amount of data that can be transferred over a network in a given time (typically 1 second).

Bandwidth can vary significantly over a network path, and each network segment can provide a different level of bandwidth.

Users, systems, and devices often share the same connection and thus share bandwidth. Some take more bandwidth than others. 

If many users, systems, and devices share the same connection, this will decrease overall bandwidth and reduce speed. 

Network Utilization
This is the percentage of network bandwidth being consumed by network traffic.

Higher traffic decreases speed. This is especially true when you have low bandwidth.

Network Latency
As the number of hops along the path increases, so does the amount of latency or delay in end-to-end data delivery.

This includes any security checks and communications between remote systems or users.
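
Latency matters because it is paid on every round trip. As a simplified model, assume each batch of records costs one network round trip; the sketch below estimates the time spent waiting on the network alone. The latency values are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def round_trip_overhead_seconds(total_records, batch_size, latency_ms):
    """Time spent waiting on the network if each batch costs one round trip."""
    batches = -(-total_records // batch_size)  # ceiling division
    return batches * latency_ms / 1000

# 1,000,000 records sent in batches of 1,000
local = round_trip_overhead_seconds(1_000_000, 1_000, latency_ms=1)     # assumed nearby system
remote = round_trip_overhead_seconds(1_000_000, 1_000, latency_ms=250)  # assumed distant system
print(local, remote)  # 1.0 250.0
```

Under these assumptions the same workload waits 250 times longer on the distant connection, before bandwidth or utilization is even considered.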

 


These performance factors can also impact the speed of test data generation for any amount of data.


Factor

Considerations and Recommendations

Database Location
The speed will depend on where the database is located:
  • Fastest - Database on the user's local machine with GenRocket Runtime. Connecting directly with JDBC works fine for this.

  • Slower - GenRocket Runtime is installed on the user's local machine, and the database is remote. Connecting directly with JDBC is not recommended.

    Example
    GenRocket Runtime (Located in India) -> Corporate Firewall -> Remote Database (Located in the United States)
 
Recommendation 1 - Add GenRocket Runtime to the Same Side as the Database
Adding GenRocket Runtime to the same side as the database will increase performance.

Example:
Firewall -> GenRocket Runtime Data <-> Database (Remote Location)


Recommendation 2 - GenRocket Multi-User Server (GMUS)
Install a GenRocket Multi-User Server (GMUS) on a machine within the same environment as the test database. The GMUS does not have to be on the same machine as the database, only within the same location, so that the distance to the database is much shorter and the connection stays secure inside that environment.

Testers can send commands to the GMUS via REST as to which Scenario to run (this can include a G-Case). The GMUS will use the GenRocket engine to load and execute the instructions of the Scenario within the local environment. 

The GenericSQLInsertReceiver will securely connect to the database via JDBC within the local environment and should be able to send batches of data to the database efficiently. Performance still depends on how well your database has been configured to receive the data (Primary Key, Indexes, Foreign Keys, etc.).

Example:
User (API of GMUS /rest/scenario) -> Firewalls (Optional) -> GenRocket Runtime (GMUS) <-> Remote Database

Recommendation 3 - Generate an SQL file and Upload it to the Database
Generate an SQL file and then upload it to the database. Use the SQLFileInsertReceiver to write ANSI SQL insert statements (single or batch) to a file.

A second Receiver (FTPReceiver, SFTPReceiver, S3Receiver, etc.) can be added to send and deposit the resulting file on a machine within the same secure location as the test database.

This alternative requires a solution within your testing environment to read the SQL insert statements from the file. Most databases have this capability built in; it's just a matter of calling the database with the proper command.

Example:
GenRocket Runtime generates file locally -> Upload into Remote Database

Recommendation 4 - Implement a Better-Designed JDBC Driver
Implement a type of JDBC driver better designed to work securely over long distances with large batches of data.

JDBC Connections
When generating data, multiple calls are made back and forth between GenRocket Runtime and the database. The number of calls depends on the defined batch count, which determines how many records are sent in each batch.
  • The batch count is defined in the properties file used for the JDBC connection.
  • It should be set according to available memory.
  • If it is set too high for the available memory, the user will receive an OutOfMemoryException.


For example, if 10,000 records are being generated and the batch count is set to 1,000, then 10 batches will be sent.
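
The batching described here can be pictured as slicing a stream of records into fixed-size groups. The generic sketch below shows the idea; it is not GenRocket's Receiver code, and the helper name `batches` is hypothetical.

```python
from itertools import islice

def batches(records, batch_count):
    """Yield successive lists of at most batch_count records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, batch_count))
        if not chunk:
            return
        yield chunk

sent = [len(b) for b in batches(range(10_000), 1_000)]
print(len(sent), "batches")  # 10 batches
```

Only one batch is held in memory at a time, which is why a larger batch count needs more available memory.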


Database Indexing

Indexes are data structures that allow rapid access to data tables instead of sequentially examining each record to find a given row of data. 

  • Poor indexing can be a source of poor performance for data-intensive applications. 

  • Adding indexes without proper analysis can cause insert, update and delete functions to take longer when a large number of indexes need to be updated.

  • If the file is not indexed, each database operation must sequentially scan the entire data file to perform the right operation on the right record.
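
The difference between a sequential scan and an indexed lookup can be observed with any database's query planner. The sketch below uses SQLite from Python's standard library purely as a stand-in; the table and index names are made up, and other databases expose their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

def plan(sql):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

lookup = "SELECT * FROM users WHERE email = 'user500@example.com'"
before = plan(lookup)  # no index yet: the planner scans the whole table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(lookup)   # with the index: the planner searches via idx_users_email
print(before)
print(after)
```

The trade-off noted above still applies: every additional index must also be updated on each insert, so bulk loads into heavily indexed tables take longer.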

Network Speed
A slower network will decrease test data generation speed for querying or inserting data into a remote database.
Server Hardware

Hardware-related performance problems are unlikely for a database maintained on a production system. 


However, production data that is moved to an under-resourced test server will result in a performance hit.


Database Size

Database size multiplies the impact of every other performance issue (slow hardware, slow network, poorly indexed database, etc.). These issues are amplified when large volumes of data are involved.

Database Queries
The number of queries and type of queries impact test data generation speed.
  • Query Before - Faster because the query is performed before and placed in memory.
  • Query Each - Slower because the query is performed for each iteration during test data generation.

Additional Reading
How to Optimise Your Database for Loading Test Data Faster
Generating Data for a Remote Database



Query-Related 

These factors apply regardless of what is being queried (database, CSV file, Excel file).

Factor

Considerations and Recommendations

Number of G-Queries
A larger number of G-Queries will slow down test data generation.

Recommendations
Evaluate the G-Queries to determine if any can be eliminated or other changes can be made to improve test data generation.

Number of Query Generators
A larger number of Query Generators will also slow down test data generation. This is because more information is being stored in memory while test data generation is occurring.

Recommendations
Evaluate what Query Generators are being used to determine if any can be eliminated or if other changes can be made to improve test data generation.

Query Each vs. Query Before
Query Each - The query will occur for each iteration. This will slow down test data generation.

Query Before - The query will occur at the beginning and only once (typically faster).


Recommendations
If Query Each is being used for Generators or G-Queries, try using Query Before to increase speed and performance.
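
The cost difference between the two modes comes down to how many times the query runs. The sketch below imitates both patterns against an in-memory SQLite table and counts the queries; it illustrates the general idea only, and the table, data, and helper names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT)")
conn.executemany("INSERT INTO country VALUES (?)", [("US",), ("IN",), ("DE",)])

stats = {"queries": 0}

def run_query():
    """Run the lookup query and count how often it is executed."""
    stats["queries"] += 1
    return [row[0] for row in conn.execute("SELECT code FROM country ORDER BY code")]

ITERATIONS = 1_000

# "Query Each" pattern: the query runs again for every generated row.
stats["queries"] = 0
each_rows = [run_query()[i % 3] for i in range(ITERATIONS)]
each_queries = stats["queries"]

# "Query Before" pattern: the query runs once and rows are served from memory.
stats["queries"] = 0
cached = run_query()
before_rows = [cached[i % 3] for i in range(ITERATIONS)]
before_queries = stats["queries"]

print(each_queries, before_queries)  # 1000 1
```

Both patterns produce the same values here because the table does not change during generation; the once-only query simply avoids 999 repeated lookups.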



Specific file formats are faster than others. This section provides details on any file format factors to consider for better speed and performance: 

Factor

Considerations and Recommendations

Excel vs. Delimited File Format
Excel Files are generally slower than the Delimited File format.

Generating Data
Generating data in an Excel sheet will be slower. Recommended steps: 
  1. Generate a delimited file (e.g., Comma Separated text file).
  2. Then open the file in Excel to import it.
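
As a generic illustration of step 1, the sketch below writes a small comma-separated file using Python's standard `csv` module. The column names and rows are placeholder data, and this is not how a GenRocket Receiver is configured.

```python
import csv
import io

rows = [{"id": i, "name": f"user{i}"} for i in range(1, 4)]  # placeholder data

# StringIO keeps the example self-contained; use open("users.csv", "w", newline="")
# to write an actual file that Excel can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```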

Reading Data
Reading data from an Excel File (using ListExcelGen or another Excel Generator) is slower than reading data from a plain delimited file (ListCSV or CSVToMap, etc.).



Other Factors

This section contains any additional factors that can affect speed and performance:

Factor

Considerations and Recommendations

Memory Generators
Memory Generators or Generators generating unique data save some information in memory. Performance can be slowed down based on the following:
  • Number of CPUs
  • Available Memory
  • Generator Configuration

Number of Attributes
Domains having a large number of Attributes can slow down test data generation.

G-Map and CSV Files/Databases
G-Map stores temporary data that maps data generated by one Project to another Project that needs it. Using CSV files or databases to store this intermediate data can be slow.