Document pipelines: from DOCX to PDF/A-3 to signed archive in one API call
A document pipeline that converts input documents to compliant archival output sounds like a solved problem. In practice, it produces silent failures at every step: fonts that appear correct on screen but fail PDF/A validation, text layers that get destroyed during conversion, metadata that is present but technically wrong, and timestamps applied to documents that are not yet in their final form.
This article walks through the full conversion chain, what goes wrong at each stage, and what a reliable pipeline looks like.
The full chain
Source document (DOCX / HTML / existing PDF)
↓ conversion
PDF (rendered with correct fonts and layout)
↓ PDF/A-3 conformance
PDF/A-3 (embedded fonts, color profiles, XMP metadata, no external references)
↓ structured attachment (for hybrid invoices)
PDF/A-3 with embedded XML (Factur-X, ZUGFeRD, XRechnung attachment)
↓ digital signature
Signed PDF/A-3 (PAdES signature, LTV information embedded)
↓ RFC 3161 timestamp
Timestamped PDF/A-3 (RFC 3161 token from qualified EU TSA)
↓ archive + audit trail
Final archive (WORM storage + hash-chained audit trail + evidence pack)
Each arrow is a step that can introduce non-conformance. Each step should include validation before the next step runs.
Step 1: source to PDF
The most common conversion route is DOCX to PDF. The risk here is font substitution: if a font used in the DOCX is not available on the conversion server, the renderer silently substitutes a different one. The output looks similar on screen but uses a different font than specified, often with shifted line breaks or pagination. Later PDF/A validation is unlikely to catch this, because a substituted font that is properly embedded still conforms; validation fails only when the substitute is itself not embedded or lacks glyphs the text needs. The substitution therefore has to be detected at conversion time.
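One way to detect a missing font before conversion is to read the font table that a DOCX carries in word/fontTable.xml and compare it with the fonts installed on the conversion host. The sketch below is a minimal illustration, not a complete check: the function names are hypothetical, the font table lists declared fonts (which can include fonts the text never uses), and the caller is assumed to supply the set of installed families.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/fontTable.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fonts_declared_in_docx(docx_bytes: bytes) -> set[str]:
    """Return the font names listed in the DOCX font table, if there is one."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        if "word/fontTable.xml" not in archive.namelist():
            return set()
        root = ET.fromstring(archive.read("word/fontTable.xml"))
    names = set()
    for font in root.iter(f"{{{W_NS}}}font"):
        name = font.get(f"{{{W_NS}}}name")
        if name:
            names.add(name)
    return names


def missing_fonts(docx_bytes: bytes, installed: set[str]) -> set[str]:
    """Fonts the document declares that the conversion host does not have."""
    installed_lower = {name.lower() for name in installed}
    return {
        name
        for name in fonts_declared_in_docx(docx_bytes)
        if name.lower() not in installed_lower
    }
```

If missing_fonts returns a non-empty set, the pipeline can refuse to convert instead of producing a substituted layout.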
The mitigation: ensure the conversion environment has all required fonts installed, or use a conversion approach that embeds fonts at conversion time. LibreOffice headless and Gotenberg (a Docker-packaged HTTP API around LibreOffice and Chromium) are reliable open-source options for DOCX to PDF conversion. Gotenberg in particular provides a predictable isolated environment.
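A headless LibreOffice conversion can be driven from Python as sketched below. This assumes the soffice binary is on the PATH; the function names and the timeout are illustrative choices. The output check is there because a conversion can finish without writing the expected file.

```python
import subprocess
from pathlib import Path


def soffice_command(source: Path, out_dir: Path) -> list[str]:
    """Build the LibreOffice headless command line for a PDF conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path, timeout_s: int = 120) -> Path:
    """Run the conversion and return the path of the PDF it produced."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(soffice_command(source, out_dir), check=True, timeout=timeout_s)
    result = out_dir / (source.stem + ".pdf")
    if not result.is_file():
        # Treat a missing output file as a hard failure, not a warning.
        raise RuntimeError(f"LibreOffice produced no output for {source}")
    return result
```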
Do not convert DOCX to PDF by printing to a PDF driver. Print drivers often rasterize text, which destroys the text layer and produces a document that looks identical but is no longer machine-readable or searchable. An image-only PDF can still pass PDF/A-3b validation, so the loss goes unnoticed; recovering the text requires OCR, which introduces additional error.
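For HTML to PDF, rendering with a headless browser (for example headless Chromium) is generally more reliable for layout fidelity than wkhtmltopdf, which is built on an old Qt WebKit engine and does not support much of modern CSS. Check that every font the HTML uses is either loaded with the page as a web font or installed on the rendering host; a font that is only named in CSS falls back silently, just as in the DOCX case.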
Step 2: PDF to PDF/A-3
PDF/A-3 conformance is not automatically achieved by outputting a PDF from a PDF/A-3-aware library. Several things can break:
Missing or incorrect XMP metadata: PDF/A requires a self-describing XMP block that identifies the conformance level. The required fields:
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>3</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
Without this block, a document is not technically PDF/A-3 regardless of its structure. Many converters produce PDF/A-looking output but omit or misconfigure the XMP block.
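A quick pre-check on the XMP packet can catch a missing or wrong identification before the full validator runs. The sketch below assumes the XMP packet has already been extracted from the PDF as a string (extraction is out of scope here); the function names are hypothetical. It accepts the properties written either as child elements, as in the snippet above, or as attributes on rdf:Description, since both serializations occur in practice.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def _property(description: ET.Element, namespace: str, name: str) -> Optional[str]:
    """Read an XMP property stored as an attribute or as a child element."""
    qualified = f"{{{namespace}}}{name}"
    value = description.get(qualified)
    if value is None:
        value = description.findtext(qualified)
    return value.strip() if value else None


def read_pdfa_identification(xmp_packet: str) -> tuple[Optional[str], Optional[str]]:
    """Return (part, conformance) as declared in the XMP packet."""
    root = ET.fromstring(xmp_packet)
    part = conformance = None
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        part = part or _property(description, PDFAID_NS, "part")
        conformance = conformance or _property(description, PDFAID_NS, "conformance")
    return part, conformance


def declares_pdfa3(xmp_packet: str, allowed_levels: tuple = ("A", "B", "U")) -> bool:
    """True when the packet identifies the file as PDF/A-3 at an allowed level."""
    part, conformance = read_pdfa_identification(xmp_packet)
    return part == "3" and conformance in allowed_levels
```

This only confirms what the file claims about itself. Whether the claim is true is still a question for the validator.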
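Unembedded fonts: even if conversion embedded fonts, some converters leave Type1 or TrueType fonts only partially embedded (metrics only, without the glyph data). PDF/A requires the font program to be embedded for every glyph the document renders; subsetting is allowed, but a font that is only referenced by name is not.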
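Unsupported color spaces: device-dependent color (DeviceRGB, DeviceCMYK, DeviceGray) is only allowed when the document declares a matching output intent with an embedded ICC profile. An untagged RGB image in a document with no RGB output intent fails PDF/A validation. The simplest fix is to convert images to sRGB and either embed the profile with each image or declare an sRGB output intent for the document.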
Transparency: PDF/A-1 prohibits transparency. PDF/A-2 and PDF/A-3 allow it but require specific handling. Many source documents use soft shadows or translucent overlays that must be flattened before PDF/A-1 output.
Encryption: PDF/A prohibits encryption. A PDF with a user or owner password is not PDF/A-compliant regardless of other properties.
Run VeraPDF after this step to confirm conformance before proceeding. VeraPDF is the reference validator, used by national archives and regulators across Europe. It is open source and can be run in a headless pipeline.
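A headless validation gate could look like the sketch below. It rests on several assumptions that should be checked against the installed veraPDF version: that the CLI is on the PATH as verapdf, that it accepts --flavour 3b, that its default output is a machine-readable XML report on stdout, and that the report marks each result with an isCompliant attribute. The sample report shape used here is illustrative, not copied from real output.

```python
import subprocess
import xml.etree.ElementTree as ET


def verapdf_command(pdf_path: str, flavour: str = "3b") -> list[str]:
    """Build the validator command line (flag names are assumptions to verify)."""
    return ["verapdf", "--flavour", flavour, pdf_path]


def report_is_compliant(report_xml: str) -> bool:
    """True only if the report contains verdicts and every verdict is compliant."""
    root = ET.fromstring(report_xml)
    verdicts = [
        element.get("isCompliant")
        for element in root.iter()
        if element.get("isCompliant") is not None
    ]
    # An empty report is treated as a failure, never as a pass.
    return bool(verdicts) and all(verdict == "true" for verdict in verdicts)


def validate_pdfa(pdf_path: str, timeout_s: int = 300) -> bool:
    """Run the validator; unparseable output raises instead of passing silently."""
    completed = subprocess.run(
        verapdf_command(pdf_path), capture_output=True, text=True, timeout=timeout_s
    )
    return report_is_compliant(completed.stdout)
```

Parsing the report rather than trusting the exit code keeps the gate explicit about what counts as a pass.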
Step 3: embedding structured XML (hybrid documents)
For invoices (Factur-X, ZUGFeRD, XRechnung), the PDF/A-3 must contain the XML invoice as an attachment with specific metadata.
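The attachment relationship (the AFRelationship entry on the embedded file specification) should be Alternative for full invoice profiles such as EN 16931, not Unspecified. The Factur-X specification ties the permitted values to the profile, with Data intended for the reduced MINIMUM and BASIC WL profiles, and a wrong value is one of the most common conformance failures in PDF/A-3 invoice generators.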
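The required XMP properties for the Factur-X attachment, which must also be declared through a PDF/A extension schema:
<rdf:Description rdf:about="" xmlns:fx="urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#">
<fx:ConformanceLevel>EN 16931</fx:ConformanceLevel>
<fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>
<fx:DocumentType>INVOICE</fx:DocumentType>
<fx:Version>1.0</fx:Version>
</rdf:Description>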
The filename must be exactly factur-x.xml for Factur-X, and the XMP metadata must reference this filename. Validators that check Factur-X conformance (not just PDF/A-3) will reject documents where the filename or the XMP reference is wrong.
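The filename rule can be checked mechanically before a profile validator runs. The sketch below assumes the XMP packet and the list of embedded file names have already been extracted from the PDF; the function name and the wording of the problem messages are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FX_NS = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#"
REQUIRED = ("ConformanceLevel", "DocumentFileName", "DocumentType", "Version")
EXPECTED_FILENAME = "factur-x.xml"


def facturx_xmp_problems(xmp_packet: str, attachment_names: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    root = ET.fromstring(xmp_packet)
    found = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for name in REQUIRED:
            qualified = f"{{{FX_NS}}}{name}"
            # Properties may be serialized as attributes or as child elements.
            value = description.get(qualified) or description.findtext(qualified)
            if value:
                found[name] = value.strip()

    problems = [f"missing fx:{name}" for name in REQUIRED if name not in found]
    declared = found.get("DocumentFileName")
    if declared is not None and declared != EXPECTED_FILENAME:
        problems.append(f"fx:DocumentFileName is {declared!r}, expected {EXPECTED_FILENAME!r}")
    if EXPECTED_FILENAME not in attachment_names:
        problems.append(f"no embedded file named {EXPECTED_FILENAME!r}")
    return problems
```

This covers only the metadata and filename agreement described above; the XML content itself still needs a Factur-X or ZUGFeRD profile validator.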
Step 4: digital signature
Apply a PAdES (PDF Advanced Electronic Signatures) signature, not a basic PDF signature. PAdES is defined in ETSI EN 319 142 and supports LTV embedding.
At signing time, embed the full certificate chain and the OCSP response or CRL snapshot that proves the signing certificate was valid at the moment of signing. This is the long-term validation (LTV) information, typically stored in the PDF's Document Security Store (DSS) dictionary. Without it, the signature may become unverifiable when the signing certificate expires.
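Do not timestamp before signing. The correct order is: sign first, then timestamp the signature. A timestamp applied before signing cannot cover the signature, so it proves only that the unsigned document existed at that time, not when the signature was made. Anything written into the file after signing, including a timestamp, must be added as an incremental update; rewriting the signed bytes invalidates the signature.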
Step 5: RFC 3161 timestamp
Request the timestamp immediately after signing, while the signing certificate is still verifiable online. The timestamp is applied to the signed document’s hash and covers both the document content and the signature.
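For the standalone token, the request sent to the TSA is a small DER structure that carries only the SHA-256 digest of the signed file. The sketch below builds that request by hand to show what is transmitted: the document itself never leaves the pipeline. It is a minimal illustration under stated assumptions: the function names are hypothetical, no nonce or policy is included (a production request would normally add a nonce), and the TSA URL is whatever the chosen provider publishes. Embedding a timestamp inside the PDF is a separate task for a PAdES-capable library.

```python
import hashlib
import urllib.request

# DER AlgorithmIdentifier for SHA-256 (OID 2.16.840.1.101.3.4.2.1, NULL parameters)
SHA256_ALGORITHM_ID = bytes.fromhex("300d06096086480165030402010500")


def _tlv(tag: int, content: bytes) -> bytes:
    """Encode one DER element; short-form lengths are enough for this request."""
    if len(content) >= 128:
        raise ValueError("content too long for short-form DER length")
    return bytes([tag, len(content)]) + content


def build_timestamp_request(signed_pdf: bytes) -> bytes:
    """Build an RFC 3161 TimeStampReq over the SHA-256 digest of the signed file."""
    digest = hashlib.sha256(signed_pdf).digest()
    message_imprint = _tlv(0x30, SHA256_ALGORITHM_ID + _tlv(0x04, digest))
    version = _tlv(0x02, b"\x01")   # INTEGER 1
    cert_req = _tlv(0x01, b"\xff")  # BOOLEAN TRUE: ask the TSA to include its certificate
    return _tlv(0x30, version + message_imprint + cert_req)


def tsa_http_request(tsa_url: str, ts_request: bytes) -> urllib.request.Request:
    """Wrap the request for HTTP transport as described in RFC 3161."""
    return urllib.request.Request(
        tsa_url,
        data=ts_request,
        headers={"Content-Type": "application/timestamp-query"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the query. The response body is a TimeStampResp whose status and message imprint should be verified before the token is stored.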
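Store the timestamp token inside the PDF, either as a signature timestamp within the signature itself or as a document timestamp added by incremental update. The DSS dictionary holds the certificates and revocation data used for validation, not the token. Also store the token in the Legal Evidence Pack as a standalone file for independent verification.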
Step 6: archive and audit trail
Write the final document to WORM storage with the retention period locked. Append the archiving event to the hash-chained audit trail.
The audit trail entry for the archive step should include: the document hash, the storage location, the retention period, the timestamp of the archiving event, and the hash of the previous audit entry. This produces a tamper-evident record that the document was archived in this exact form at this exact time.
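The chaining described above can be sketched in a few lines. The field names, the all-zero genesis value, and the JSON canonicalization (sorted keys, compact separators) are assumptions made for this illustration; the point is that each entry commits to the hash of the one before it.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(entry: dict) -> str:
    """Hash an entry over a canonical JSON form so the result is reproducible."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def append_archive_event(trail: list, document: bytes, location: str,
                         retention_until: str) -> dict:
    """Append an archiving event that links to the previous entry's hash."""
    entry = {
        "event": "archived",
        "document_sha256": hashlib.sha256(document).hexdigest(),
        "storage_location": location,
        "retention_until": retention_until,
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "previous_entry_hash": entry_hash(trail[-1]) if trail else GENESIS_HASH,
    }
    trail.append(entry)
    return entry


def chain_is_intact(trail: list) -> bool:
    """Recompute every link; any edited or removed earlier entry breaks the chain."""
    expected = GENESIS_HASH
    for entry in trail:
        if entry["previous_entry_hash"] != expected:
            return False
        expected = entry_hash(entry)
    return True
```

A chain like this detects changes to any entry that has a successor. The hash of the newest entry still has to be anchored somewhere outside the trail, for example by timestamping it, or the last entry can be altered unnoticed.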
Validation at each step
The failure mode to avoid is silent non-conformance: a document that goes through all six steps and produces a file that looks correct but fails validation when presented to an auditor.
Validate at steps 2, 3, and 5:
- After PDF/A-3 conversion: VeraPDF for PDF/A-3 conformance
- After XML embedding (if applicable): Factur-X or ZUGFeRD profile validator
- After timestamping: PAdES verifier (DSS or equivalent) for signature and timestamp validity
If validation fails at any step, halt the pipeline and return a structured error. Do not proceed to archiving with a non-conformant document.
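The halt-on-failure rule can be expressed as a small runner in which every step carries an optional validator. The sketch below is a generic illustration with hypothetical names, not the implementation of any particular product: each step transforms the document bytes, the validator returns a problem description or None, and the first failure stops the run with a structured error.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepFailure:
    step: str    # name of the step that failed
    detail: str  # what the step or its validator reported


# (name, transform, validator); the validator returns a problem string or None
Step = tuple[str, Callable[[bytes], bytes], Optional[Callable[[bytes], Optional[str]]]]


def run_pipeline(document: bytes, steps: list) -> tuple[Optional[bytes], Optional[StepFailure]]:
    """Run steps in order; stop at the first exception or validation problem."""
    for name, transform, validate in steps:
        try:
            document = transform(document)
        except Exception as exc:
            return None, StepFailure(name, f"step raised: {exc}")
        if validate is not None:
            problem = validate(document)
            if problem is not None:
                # Later steps (signing, timestamping, archiving) never run.
                return None, StepFailure(name, problem)
    return document, None
```

Returning the failure as data rather than raising lets the caller pass the failing step and its detail straight back in an API response.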
SealDoc as a document pipeline
The SealDoc API implements the full chain described above as a single endpoint. You submit the source document (DOCX, HTML, or existing PDF) along with the structured metadata (invoice data, party information, retention category). The API runs conversion, PDF/A-3 validation, XML embedding, signing, timestamping, and archiving, returning the evidence pack on success or a structured validation error on failure.
Each step in the pipeline is implemented to fail fast: if VeraPDF returns a conformance error, the pipeline stops and returns the specific clause that was violated rather than proceeding to the next step with a non-conformant document. The evidence pack is only generated when all validation steps have passed.
For organizations that need to integrate document archiving into existing workflows, the API supports webhooks that notify on completion or failure, and the evidence pack is retrievable as a ZIP containing all components in open, independently verifiable formats.