The Unstructured Synthetic Test Data Problem

Companies rely on document-heavy workflows that must operate accurately and efficiently. As part of testing and training, they require high volumes of quality data without exposing sensitive company or customer information. Unstructured data presents an operational challenge, requiring secure and scalable processing.

To understand why this is such a challenge, it's essential to clarify the difference between unstructured data and structured data. This understanding sets the stage for evaluating relevant solutions.

  • Unstructured data lacks a defined model and structure, and comes in many forms, such as PDFs, images, text, and voice. 
  • Structured data has a defined model that is easily modeled in a GenRocket project. Examples include relational databases and CSV files, which have columns and rows.

With these data types defined, let's look at the Unstructured Data Accelerator (UDA) and how unstructured data workflows benefit from synthetic document generation. The diagram below shows that structured data is just the tip of the iceberg, which is why the ability to test unstructured data workflows is important.


What is the Unstructured Data Accelerator (UDA)?

The Unstructured Data Accelerator (UDA) is a GenRocket feature that quickly generates large volumes of safe, realistic PDF files for testing. It combines structured and unstructured data in a template to produce PDF documents containing synthetic test data, which makes it easy to cover positive, negative, and edge cases, as well as all required permutations for testing unstructured data workflows.



How Does It Work?

  1. Model - Users upload a PDF source file containing unstructured data, which is used to create a customizable template. The template is exported and used to model their Project. 
  2. Design - Users design their test data by assigning different Generators, adjusting parameters, and setting up G-Cases (volume and variety).
  3. Deploy - Users download the necessary files and generate their PDF documents using GenRocket Runtime, an API call, or a script, depending on how GenRocket is deployed.
  4. Manage - Users create new versions as needed when the source document changes so that these changes are reflected in testing. 



Supported UDA Formats

  • Digital PDFs - Forms, Reports, Invoices, Bank Statements, etc.
  • Other formats (e.g., images, voice) to be supported at a later time.

Benefits of UDA

  • Protect Sensitive Data by generating privacy-compliant, realistic PDFs.
  • Speed Up Testing by generating unstructured test data quickly without data cleanup.
  • Improve Coverage by simulating diverse formats to test more scenarios and edge cases.
  • Enable Safe Sharing by securely sharing synthetic unstructured data to reduce risk.
  • Support AI/Automation by using synthetic data to improve OCR accuracy and reduce bias.
  • Generate Large Volumes by using one template for all PDF testing scenarios.
  • Scale across CI/CD pipelines.


Industry-Based Use Case Examples

UDA supports use cases in banking, insurance, and healthcare document workflows.

Banking and Financial

Category | Example UDA Applications | Purpose / Benefit
Loan Applications | Loan forms, income statements, pay stubs | Test digital loan workflows
Account Statements | Bank statements with realistic transactions | Validate data extraction and parsing
Customer Onboarding | Identification documents, address proofs, consent forms | Simulate identity verification for onboarding
Reports / Confirmations | Investment reports, trade confirmations, and summaries | QA testing and AI modeling
Invoices and Receipts | Varied invoice formats | Test OCR and payment automation systems


Insurance

Category | Example UDA Applications | Purpose / Benefit
Claims Processing | Claim forms, damage reports, policy documents | Workflow and AI model testing
Policy Management | Policy PDFs for renewals and templates | Validate consistency and personalization
Fraud Detection | Realistic but fake or altered claims | Train systems to detect document fraud


Healthcare

Category | Example UDA Applications | Purpose / Benefit
Patient Records | Medical reports, discharge summaries, and lab results | Safe testing and AI training
Billing and Invoice | Claims, Explanation of Benefits (EOBs), Invoices | Validate billing workflows
Clinical Trials | Consent forms, trial reports, visit notes | Support validation and compliance
Regulatory Reporting | HIPAA/FDA or audit-ready reports | Test compliance processes
Patient Communications | Appointment summaries, reminders, and discharge letters | QA for communication workflows


PDF Support Guidelines

To ensure optimal template creation and synthetic data generation, please review the following requirements and current limitations:


Category | Guideline | Details / Notes
Supported Formats | Digital PDFs only | Scanned or image-based PDFs are not currently supported.
File Structure | One document per template | Do not combine multiple PDFs into one file.
PDF Version | 1.7 or newer recommended | Ensures the best compatibility.
Language | English content only | Other languages are not supported at this time.
Fonts | Standard fonts | Custom or unusual fonts may cause rendering and display issues and require additional support.
Password Protection | Remove all passwords | Protected PDFs cannot be processed.
Security Settings | Avoid watermarks, copy protection, or editing restrictions | Such files may not process correctly.
Interactive Elements | Not supported | Embedded videos, audio, or form fields are excluded.
Complex Visual Elements | Charts and graphs are preserved visually | Dynamic synthetic data is not supported in visual elements.
Page Count | Up to 20 pages recommended | Larger files may require extended processing time.
Data Generation | Only synthetic data that Generators can produce is supported | No other type of data is supported at this time.
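
Some of the file-level guidelines above (PDF version, password protection, page count) can be roughly checked before a source file is uploaded. The following is a minimal sketch and not part of GenRocket: the function name preflight is hypothetical, and the checks are byte-level heuristics. In particular, the page count can be under-reported for PDFs that store page objects in compressed object streams, and scanned or image-based content is not detected at all.

```python
import re

MAX_RECOMMENDED_PAGES = 20        # from the Page Count guideline
MIN_RECOMMENDED_VERSION = (1, 7)  # from the PDF Version guideline


def preflight(pdf_bytes: bytes) -> list:
    """Return a list of warnings; an empty list means nothing was flagged.

    Hypothetical helper for illustration only, not a GenRocket API.
    """
    warnings = []

    # A PDF file starts with a header such as "%PDF-1.7".
    header = re.match(rb"%PDF-(\d+)\.(\d+)", pdf_bytes)
    if header is None:
        return ["not a PDF: missing %PDF header"]
    version = (int(header.group(1)), int(header.group(2)))
    if version < MIN_RECOMMENDED_VERSION:
        warnings.append(f"PDF version {version[0]}.{version[1]} is older than 1.7")

    # Encrypted (password-protected or restricted) PDFs reference an
    # /Encrypt dictionary. Heuristic: look for that key in the raw bytes.
    if b"/Encrypt" in pdf_bytes:
        warnings.append("file appears to be encrypted or restricted")

    # Heuristic page count: page objects are tagged "/Type /Page", while
    # page-tree nodes use "/Type /Pages" and are skipped by the lookahead.
    pages = len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", pdf_bytes))
    if pages > MAX_RECOMMENDED_PAGES:
        warnings.append(
            f"{pages} pages exceeds the recommended {MAX_RECOMMENDED_PAGES}"
        )

    return warnings
```

A typical call would read the file first, for example preflight(open("statement.pdf", "rb").read()), where statement.pdf is a placeholder name. An empty result only means these particular heuristics found nothing; it does not guarantee that the file meets every guideline.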


Best Practices

  • Use a sample PDF that matches the production layout. 
  • Version templates so that template changes are reflected in Project Versions.
  • Use documents with clear text and standard layouts.
  • Please submit clean, well-formatted digital PDFs for the best results.
  • Split merged fields in the PDF (e.g., "Name/Date"). 
  • Avoid unnecessary graphics, watermarks, or logos. 
  • Ensure all fonts are embedded or use standard system fonts.
  • Remove any security restrictions before uploading.
  • Use realistic sample data in the source PDF.
  • Validate modeled data for accuracy (e.g., names, types, and data patterns).
  • Verify by generating a small batch of PDFs. 


Dependencies

  • Java version 8, 11, 17, or 21
  • Node.js version 20.18.0 (covered in the environment setup article)
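
As a rough illustration, the installed versions could be compared against the list above with a small script. This is a sketch under assumptions: the helper names are hypothetical, and the article does not say whether Node.js versions other than 20.18.0 also work, so the script only reports what it finds. It relies on two general facts: `java -version` prints to stderr, and Java 8 reports itself as "1.8.x".

```python
import re
import subprocess
from typing import Optional

SUPPORTED_JAVA_MAJORS = {8, 11, 17, 21}  # from the Dependencies list
LISTED_NODE_VERSION = "20.18.0"          # from the Dependencies list


def java_major(version_output: str) -> Optional[int]:
    """Extract the major version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier use the "1.8.0_292" scheme; later releases
    # use "17.0.2" or simply "21".
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def node_version(version_output: str) -> Optional[str]:
    """Extract "20.18.0" from `node --version` output such as "v20.18.0"."""
    match = re.match(r"v?(\d+\.\d+\.\d+)", version_output.strip())
    return match.group(1) if match else None


def report_installed_versions() -> None:
    """Run both tools and print how they compare with the listed versions.

    Raises FileNotFoundError if java or node is not on the PATH.
    """
    java = subprocess.run(["java", "-version"], capture_output=True, text=True)
    node = subprocess.run(["node", "--version"], capture_output=True, text=True)
    major = java_major(java.stderr)  # java writes its version to stderr
    found_node = node_version(node.stdout)
    print("Java major:", major, "listed:", major in SUPPORTED_JAVA_MAJORS)
    print("Node.js:", found_node, "listed:", found_node == LISTED_NODE_VERSION)
```

Calling report_installed_versions() on the machine that will run GenRocket Runtime gives a quick comparison before working through the environment setup article.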

Prerequisites

  • UDA - Import from PDF must be enabled for the organization.
    • Reach out to support@genrocket.com to enable this feature.
  • GenRocket Runtime must be installed and properly configured.
  • Initial Environment Setup Steps must be completed.
  • Follow the provided PDF Support Guidelines and Best Practices discussed earlier in this article.


PDF Creation Workflow

  • Step 1 - Ensure all prerequisites listed above are complete before continuing.
  • Step 2 - Use the Import from PDF option to choose and upload the source PDF file.
  • Step 3 - Use the Template Editor to make configuration changes to the template before exporting. 
  • Step 4 - Make changes in the Project to further customize the type of data that is generated (e.g., Generator Tuning, G-Case Creation).
  • Step 5 - Change the GrRoot Domain loopCount or, as recommended, use G-Cases to generate different volumes.
  • Step 6 - Check GrRoot Domain List Selection for the Config File (click here for a full walkthrough).
    • Select the Configuration Management tab, then the Modify Elements (Hamburger) icon.
    • Select the Edit icon for the GrRoot Domain.
    • Select List When More Than One and Save.
  • Step 7 - Download the required files for PDF Document generation and move them to the appropriate directory location. 
    • G-Case or G-Case Library (Optional)
    • Scenario Chain (contains Scenarios for all Domains)
    • Config file as configured in the JSONSegmentMergeReceiver. Place in the Config folder within the specified output directory. 
    • PDF Config File - Download from the PDFTemplateReceiver configuration and place in the same location as the Config file above. 
    • PDF Template - Download from the PDFTemplateReceiver configuration and place in the appropriate folder, as specified in the receiver. The default folder name is 'templates'. 
  • Step 8 - Generate data at the command line with genrocket -r or use another method (e.g., API call).
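
Before running Step 8, it can help to confirm that the files from Step 7 are where the receivers expect them. The sketch below is illustrative only. All file names passed in are placeholders, the folder name "config" and the location of the 'templates' folder under the output directory are assumptions (use whatever the receivers specify), and the command builder assumes that genrocket -r is given the path of the downloaded Scenario Chain file.

```python
from pathlib import Path


def missing_files(output_dir, config_name, pdf_config_name, template_name,
                  config_folder="config", template_folder="templates"):
    """Return the expected paths that do not exist yet.

    Hypothetical helper: folder names and layout are assumptions and
    should match what is configured in the receivers.
    """
    root = Path(output_dir)
    expected = [
        root / config_folder / config_name,      # JSONSegmentMergeReceiver config file
        root / config_folder / pdf_config_name,  # PDF config file, same folder
        root / template_folder / template_name,  # PDF template
    ]
    return [path for path in expected if not path.is_file()]


def generation_command(scenario_file):
    """Build the Step 8 command line as an argument list."""
    return ["genrocket", "-r", str(scenario_file)]
```

If missing_files returns an empty list, the argument list from generation_command can be passed to subprocess.run, or the equivalent command can be typed at the command line.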


Additional Information

Article Link | Description
Unstructured Data Accelerator (UDA) - FAQs | See common questions and answers for UDA.
UDA - Environment Initial Setup Steps | Learn the steps you need to complete before you can begin using UDA.
UDA Template Editor - Import & Customize PDF for Project Setup | Learn how to import a PDF source file and customize the created template in the editor before exporting it for project setup.
UDA - Bank Statement PDF Generation Walkthrough | See a step-by-step example from importing a source PDF to generating synthetic bank statements.