Automated PDF testing

jonjbar · 2024-05-14 16:09:30

I believe you briefly mentioned that automated PDF testing would be a great addition to avoid regressions with new code. Do you have any specific ideas or requirements in mind (such as being compatible with both FPC and Delphi, being cross-platform, not using external tools...)?

I thought about this a little, and here is what I believe could work:
- Create a repository of trusted source code that produces PDF documents covering the various features of SynPDF
- Use a tool such as ImageMagick to extract each of their pages as PNG images
- Have automated tests compare the newly generated PNGs with the reference PNGs and fail if the difference is too big
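The last step above could be sketched like this in Python. It is only an illustration: the names `diff_ratio` and `page_passes` and the 1% default threshold are made up, and the pages are assumed to be already decoded into flat lists of RGB tuples (loading the PNG files is left out).

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # one RGB pixel


def diff_ratio(reference: Sequence[Pixel], candidate: Sequence[Pixel]) -> float:
    """Return the fraction (0.0 to 1.0) of pixels that differ between two pages."""
    if len(reference) != len(candidate):
        raise ValueError("pages have different pixel counts")
    if not reference:
        return 0.0
    changed = sum(1 for ref, new in zip(reference, candidate) if ref != new)
    return changed / len(reference)


def page_passes(reference: Sequence[Pixel], candidate: Sequence[Pixel],
                max_ratio: float = 0.01) -> bool:
    """Hypothetical pass/fail rule: at most max_ratio of the pixels may differ."""
    return diff_ratio(reference, candidate) <= max_ratio
```

What counts as "too big" would have to be tuned per document; the threshold here is just a placeholder.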

Any thoughts? Would you accept such a contribution?

ab · 2024-05-14 16:21:11

With PNG images and bitmap rasterization from a vector format like PDF, it is very unlikely that you will be able to make proper comparisons.

In mORMot 1, we have basic SynPDF validation using a simple fixed EMF input.
And even with this, we need to validate several hashes, depending on the system it renders on.
So in mORMot 2, we have not added any such basic test yet.
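To illustrate the "several hashes" idea in general terms (this is not the actual mORMot 1 test): a check can accept any digest from a set of known-good ones, one per rendering environment. SHA-256, the table name `ACCEPTED_HASHES`, and the placeholder byte strings are all assumptions made for this sketch.

```python
import hashlib

# Hypothetical table: output name -> set of accepted SHA-256 digests,
# one per rendering environment whose output was verified by hand.
# The byte strings stand in for real PDF content.
ACCEPTED_HASHES = {
    "sample.pdf": {
        hashlib.sha256(b"rendering on system A").hexdigest(),
        hashlib.sha256(b"rendering on system B").hexdigest(),
    },
}


def is_known_output(name: str, content: bytes, accepted=ACCEPTED_HASHES) -> bool:
    """Return True when the content hashes to one of the accepted digests."""
    digest = hashlib.sha256(content).hexdigest()
    return digest in accepted.get(name, set())
```

The drawback is the one described above: every new system that renders slightly differently needs another accepted digest.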

But we are open to any contribution, of course.

rvk · 2024-05-15 10:52:31

Following up on the other topic: I found a directory here with lots of EMF files to test.
https://github.com/kakwa/libemf2svg/tre … ources/emf

If you put this emf directory in the bugreport directory from the other topic and apply the code change listed at the end of this post, it will create a PDF file for every EMF file in that directory.
Many of the resulting PDFs end up corrupt.
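A very cheap way to flag the corrupt ones automatically could be a structural smoke test like the following. It is a sketch only: it checks for the `%PDF-` header and an `%%EOF` marker near the end of the file, which catches truncated or empty output but does not validate the PDF contents. The function names are made up.

```python
from pathlib import Path


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF header at the start, %%EOF marker near the end."""
    if not data.startswith(b"%PDF-"):
        return False
    # The end-of-file marker is expected within the last bytes of the file.
    return b"%%EOF" in data[-1024:]


def find_suspect_pdfs(directory: str) -> list:
    """Return the names of all .pdf files under directory that fail the check."""
    return sorted(
        path.name
        for path in Path(directory).rglob("*.pdf")
        if not looks_like_pdf(path.read_bytes())
    )
```

A file that passes can still render wrongly, so this would only be a first filter before any visual comparison.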

Others render strangely: 000 and 015 are missing the axes, and 040 also has lots of things wrong.
This indeed shows the need for automated tests.

BTW, comparing images extracted from the PDFs of a test run with baseline images would be very hard, I imagine. They would never be pixel-perfect.
And before that, you would first need to fix all the problems that already exist with the above EMFs before using them for regression testing.
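One common way to soften the pixel-perfect problem is to ignore small per-channel differences (anti-aliasing noise) and only count pixels that differ noticeably. A minimal sketch, assuming pages are already decoded into lists of RGB tuples; the function names and the tolerance of 8 are arbitrary choices for illustration:

```python
def pixels_close(a, b, tolerance=8):
    """True when every RGB channel differs by at most `tolerance` (0-255 scale)."""
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


def count_significant_diffs(reference, candidate, tolerance=8):
    """Count pixels whose difference exceeds the per-channel tolerance."""
    if len(reference) != len(candidate):
        raise ValueError("pages have different pixel counts")
    return sum(
        1
        for ref, new in zip(reference, candidate)
        if not pixels_close(ref, new, tolerance)
    )
```

This does not help when a renderer shifts content by a pixel or two, so it reduces the problem rather than removing it.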

uses
  mormot.ui.pdf, ShellApi, System.IOUtils, System.Types;

procedure TForm1.Button1Click(Sender: TObject);

  procedure ProcessAllFilesInDirectory(const Directory: string);
  var
    Files: TStringDynArray;
    FileName: string;
  begin
    if not TDirectory.Exists(Directory) then exit;
    Files := TDirectory.GetFiles(Directory, '*.emf', TSearchOption.soAllDirectories);
    for FileName in Files do
    begin
      Self.DoConvertMetafileToPdf(Filename, ChangeFileExt(Filename, '.pdf'));
    end;
  end;

begin
  ProcessAllFilesInDirectory(ExtractFilePath(Application.ExeName) + 'emf');
  //Self.DoConvertMetafileToPdf(ExtractFilePath(Application.ExeName) + 'bogus.wmf', ExtractFilePath(Application.ExeName) + 'bogus.pdf');
end;

test-040.pdf

test-015.pdf

Last edited by rvk (2024-05-15 10:53:09)

jonjbar · 2024-05-15 13:42:20

As a starting point, here is a Python script that I've created with the help of ChatGPT. Here is what it does:
- It extracts all pages from a reference PDF and a generated PDF as PNG images
- It compares the number of pages and fails if they differ
- It compares each page and, for each of them, outputs the difference as both a difference image and a percentage
- It outputs the final result as consistent, clear text for easy integration with automated tests

So I suppose we could create multiple small command-line programs that produce PDFs using SynPDF and test most parts of the library, including metafile conversion. Those programs generate the PDF in a path specified by arguments, so that they can be used to generate the reference PDFs at first (and update them if needed), and to re-generate them in the correct folder during automated tests.
Then the Python script is called for each file in the reference folder and fails based on specific conditions.
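The driver for that loop could look roughly like this. It is a sketch under assumptions: the helper names are made up, the comparison script is assumed to take `<reference_pdf> <generated_pdf> <output_folder>` as in the script in this post, and a reference PDF with no generated counterpart is treated as a failure to report.

```python
import os
import sys


def plan_comparisons(reference_names, generated_names):
    """Pair each reference PDF with the generated PDF of the same name.

    Returns (pairs, missing): names present on both sides, and reference
    names that have no generated counterpart.
    """
    generated = set(generated_names)
    reference = sorted(n for n in reference_names if n.lower().endswith(".pdf"))
    pairs = [n for n in reference if n in generated]
    missing = [n for n in reference if n not in generated]
    return pairs, missing


def build_command(script, reference_dir, generated_dir, output_dir, name):
    """Build the argument list for one run of the comparison script."""
    stem = os.path.splitext(name)[0]
    return [
        sys.executable,
        script,
        os.path.join(reference_dir, name),
        os.path.join(generated_dir, name),
        os.path.join(output_dir, stem),  # one output folder per document
    ]
```

Each command can then be passed to `subprocess.run`, with the names coming from `os.listdir` on the two folders.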

Requirements: pip install PyMuPDF Pillow numpy termcolor

Script:

import fitz  # PyMuPDF
from PIL import Image, ImageChops
import numpy as np
import os
import shutil
from termcolor import colored

# Function to clear the content of a folder or create it if it does not exist
def clear_folder(folder):
    if os.path.exists(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)

# Function to convert PDF pages to PNG images and save them in the output folder
def convert_pdf_to_png(pdf_path, output_folder):
    clear_folder(output_folder)
    pdf_document = fitz.open(pdf_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        pix = page.get_pixmap()
        # Zero-pad the page number so that sorted() keeps pages in numeric order
        output_path = f"{output_folder}/page_{page_num + 1:04d}.png"
        pix.save(output_path)
    pdf_document.close()

# Function to compare two images and save the difference image if specified
def compare_images(img1_path, img2_path, diff_img_path=None):
    img1 = Image.open(img1_path).convert('RGB')
    img2 = Image.open(img2_path).convert('RGB')

    # Check if page sizes match
    if img1.size != img2.size:
        return False, 100.0, "Error: Page sizes do not match"

    diff = ImageChops.difference(img1, img2)
    
    # Save the difference image if a path is provided
    if diff_img_path:
        diff.save(diff_img_path)

    np_diff = np.array(diff)
    # Count pixels where at least one RGB channel differs, so the
    # percentage cannot exceed 100%
    diff_count = np.count_nonzero(np_diff.any(axis=2))

    total_pixels = np_diff.shape[0] * np_diff.shape[1]
    diff_percentage = (diff_count / total_pixels) * 100

    return diff_count == 0, diff_percentage, None

# Function to display the final result summary
def display_final_result_summary(all_match, total_diff_percentage, num_pages, page_results, error_message=None):
    if error_message:
        final_status = "NOT OK"
        color = 'red'
        avg_diff_percentage = 100.0
    else:
        avg_diff_percentage = total_diff_percentage / num_pages
        final_status = "OK" if all_match else "Partial"
        if any(status == "Error" for _, _, status in page_results):
            final_status = "NOT OK"
            color = 'red'
        else:
            color = 'green' if final_status == "OK" else 'yellow'

    # Output final result summary
    print("\nFinal result summary:")
    print(colored(f"Average difference percentage: {avg_diff_percentage:.2f}%", color))
    print(colored(f"Result: {final_status}", color))
    if error_message:
        print(colored(error_message, 'red'))

# Main function to handle the PDF comparison process
def main(reference_pdf, generated_pdf, output_folder):
    # Check if the reference PDF exists
    if not os.path.exists(reference_pdf):
        error_message = f"Error: Reference PDF '{reference_pdf}' not found."
        print(colored(error_message, 'red'))
        display_final_result_summary(False, 0, 0, [], error_message)
        return

    # Check if the generated PDF exists
    if not os.path.exists(generated_pdf):
        error_message = f"Error: Generated PDF '{generated_pdf}' not found."
        print(colored(error_message, 'red'))
        display_final_result_summary(False, 0, 0, [], error_message)
        return

    # Define folders for reference, generated, and difference images
    reference_folder = f"{output_folder}/reference"
    generated_folder = f"{output_folder}/generated"
    diff_folder = f"{output_folder}/differences"
    
    # Clear or create the folders
    clear_folder(reference_folder)
    clear_folder(generated_folder)
    clear_folder(diff_folder)

    # Convert PDFs to PNG images
    convert_pdf_to_png(reference_pdf, reference_folder)
    convert_pdf_to_png(generated_pdf, generated_folder)

    # Get the list of image files
    reference_files = sorted([f"{reference_folder}/{file}" for file in os.listdir(reference_folder)])
    generated_files = sorted([f"{generated_folder}/{file}" for file in os.listdir(generated_folder)])

    # Check if the number of pages (images) match
    if len(reference_files) != len(generated_files):
        error_message = "Error: PDFs have a different number of pages."
        print(colored(error_message, 'red'))
        print(f"Reference PDF has {len(reference_files)} pages.")
        print(f"Generated PDF has {len(generated_files)} pages.")
        display_final_result_summary(False, 0, 0, [], error_message)
        return

    all_match = True
    total_diff_percentage = 0
    page_results = []

    # Compare each page and collect results
    for i, (ref_img, gen_img) in enumerate(zip(reference_files, generated_files)):
        diff_img_path = f"{diff_folder}/diff_{os.path.basename(ref_img)}"
        match, diff_percentage, error = compare_images(ref_img, gen_img, diff_img_path)
        total_diff_percentage += diff_percentage

        if error:
            print(colored(f"Page {i + 1}: {error}", 'red'))
            all_match = False
            page_results.append((i + 1, diff_percentage, "Error"))
        else:
            page_status = "OK" if match else "Partial"
            page_results.append((i + 1, diff_percentage, page_status))
            if not match:
                all_match = False

    # Output page-by-page results
    print("Page-by-page differences:")
    for page_num, diff_percentage, status in page_results:
        if status == "OK":
            color = 'green'
        elif status == "Partial":
            color = 'yellow'
        else:
            color = 'red'
        print(colored(f"Page {page_num}: {diff_percentage:.2f}% difference - {status}", color))

    # Display final result summary
    display_final_result_summary(all_match, total_diff_percentage, len(reference_files), page_results)

if __name__ == "__main__":
    import sys

    # Ensure the correct number of arguments are provided
    if len(sys.argv) != 4:
        print("Usage: python script.py <reference_pdf> <generated_pdf> <output_folder>")
        sys.exit(1)

    # Get the input arguments
    reference_pdf = sys.argv[1]
    generated_pdf = sys.argv[2]
    output_folder = sys.argv[3]

    # Create the output folder if it does not exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Run the main function
    main(reference_pdf, generated_pdf, output_folder)
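Since the script reports its verdict as text, a test runner needs to turn that text into a pass/fail signal. One possible sketch: strip the ANSI color codes that termcolor may add, look for the `Result: ...` line, and map it to an exit code. The helper names are made up, and treating "Partial" as a failure is an assumption, not a rule stated in this thread.

```python
import re

# Matches ANSI color sequences such as "\x1b[32m" and "\x1b[0m"
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def parse_result(output: str):
    """Return the status from the script's 'Result: ...' line, or None."""
    for line in ANSI_ESCAPE.sub("", output).splitlines():
        if line.startswith("Result: "):
            return line[len("Result: "):].strip()
    return None


def exit_code_for(status) -> int:
    """Hypothetical policy: only a full match ('OK') counts as success."""
    return 0 if status == "OK" else 1
```

Usage would be something like `sys.exit(exit_code_for(parse_result(captured_output)))` in whatever wrapper runs the comparison.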

mORMot Open Source
