a 'mooh' point

clearly an IBM drone

Validating OOXML documents

The question “is this a valid document?” is tricky. In the end it comes down to the conformance clauses of the specification that the document claims to follow. The conformance clauses of OOXML are listed in Section 2 of Part 1 and Section 2 of Part 4. There are also conformance clauses for Part 2 and Part 3, but they are not really relevant for this post.

The basic requirements for “document validity” can be summarized in these two points:

  • The markup must conform to the schemas of the specification
  • The markup must obey any semantic and syntactic constraints of the specification

The first bullet is the easy one to check, because all you need to do is validate the markup against the schemas. The second bullet is much harder, and it is almost impossible to perform such a validation automatically.

But since the first requirement is so easy to test, one could argue that, at the bare minimum, a document producer MUST be able to create documents that are valid against the corresponding schemas.

To be able to test this, I implemented an OOXML document schema validator.

It turned out that I should have done this from “day one” of my work with OOXML, because implementing a validator revealed a lot about how a document is structured and put together, knowledge that really comes in handy when implementing a document generator.

My approach was this:

  • Implement a tool for validating documents that is as generic as possible
  • Use the latest, approved version of the schemas
  • Implement a web front-end to allow anyone to use it from anywhere.
  • Open-source the stuff

Originally I based the validator on OpenXML SDK 2.0, but during the implementation I changed my mind for three reasons. First, the SDK seemed a bit too “Microsoft Office dependent”. Second, I could not get access to all the information in the OPC package that I needed for validation, since the SDK hides some of it (and rightly so, if you ask me). Third, it turned out that I didn’t need the SDK at all: it is based on System.IO.Packaging, and since I use .NET for the implementation, I found System.IO.Packaging itself to be a much better tool for the job.
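
To give an idea of the package-level information involved, here is a minimal Python sketch (not the validator's actual .NET code) that reads [Content_Types].xml from the ZIP container and resolves the declared media type of every part. The function names and the short KNOWN_CONTENT_TYPES set are assumptions made for illustration; a real check would compare against the full list of media types in the specification.

```python
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Illustrative subset only (assumption); the specification lists many more.
KNOWN_CONTENT_TYPES = {
    "application/vnd.openxmlformats-package.relationships+xml",
    "application/vnd.openxmlformats-package.core-properties+xml",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
}


def part_content_types(package):
    """Map each part name in the package to its declared content type (or None)."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
        # <Default> maps a file extension, <Override> maps one specific part name.
        defaults = {d.get("Extension").lower(): d.get("ContentType")
                    for d in root.findall(f"{{{CT_NS}}}Default")}
        overrides = {o.get("PartName").lower(): o.get("ContentType")
                     for o in root.findall(f"{{{CT_NS}}}Override")}
        result = {}
        for name in zf.namelist():
            if name == "[Content_Types].xml" or name.endswith("/"):
                continue  # skip the content types stream itself and directory entries
            basename = name.rsplit("/", 1)[-1]
            ext = basename.rsplit(".", 1)[-1].lower() if "." in basename else ""
            part = "/" + name
            # An Override for the part wins over the Default for its extension.
            result[part] = overrides.get(part.lower(), defaults.get(ext))
        return result


def unlisted_media_types(package):
    """Return (part, content type) pairs whose type is not in the known set."""
    return sorted((part, ctype)
                  for part, ctype in part_content_types(package).items()
                  if ctype not in KNOWN_CONTENT_TYPES)
```

Calling unlisted_media_types("some.docx") returns the parts whose media type is either undeclared (None) or not in the known set. Part names are compared case-insensitively here, which is a simplification of the OPC naming rules.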

What does it do?

The validator performs these tasks:

  1. It checks whether the media type (MIME type) of each part is listed in the specification
  2. It checks whether each relationship type used in the relationship files is listed in the specification
  3. It checks whether each referenced part exists in the package in the correct location
  4. It checks the content (markup) of each part against the transitional schemas of the specification
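
Checks 2 and 3 in the list above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the validator's own code: check_relationships and the KNOWN_REL_TYPES subset are hypothetical, and targets are resolved as plain relative paths without handling percent-encoding or fragments.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OD = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"

# Illustrative subset only (assumption); the specification lists many more.
KNOWN_REL_TYPES = {
    OD + "officeDocument",
    OD + "styles",
    OD + "hyperlink",
    "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties",
}


def check_relationships(package):
    """Return a list of (rels file, problem, detail) tuples for a package."""
    problems = []
    with zipfile.ZipFile(package) as zf:
        names = {n.lower() for n in zf.namelist()}
        for rels_name in zf.namelist():
            rels_dir = posixpath.dirname(rels_name)
            if not rels_name.endswith(".rels") or posixpath.basename(rels_dir) != "_rels":
                continue
            # "word/_rels/document.xml.rels" belongs to "word/document.xml",
            # so relative targets are resolved against "word".
            base_dir = posixpath.dirname(rels_dir)
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.findall(f"{{{REL_NS}}}Relationship"):
                rel_type = rel.get("Type")
                target = rel.get("Target", "")
                if rel_type not in KNOWN_REL_TYPES:
                    problems.append((rels_name, "unknown relationship type", rel_type))
                if rel.get("TargetMode") == "External":
                    continue  # external targets are not parts of the package
                if target.startswith("/"):
                    resolved = target.lstrip("/")
                else:
                    resolved = posixpath.normpath(posixpath.join(base_dir, target))
                if resolved.lower() not in names:
                    problems.append((rels_name, "missing target", resolved))
    return problems
```

An empty list from check_relationships("some.docx") means every relationship uses a known type and every internal target exists in the package.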

What doesn’t it do?

The validator does not do the following things:

  • Support validation of documents containing extensions using MCE
  • Support documents in files with an extension other than “docx”, “xlsx” or “pptx”
  • Support validation against the strict schemas of OOXML
  • Support validation of the “root” package entry, being the file [Content_Types].xml
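
A caller could screen for the first two limitations before submitting a file. The sketch below is a rough heuristic in Python, not part of the validator: out_of_scope_reasons is a hypothetical helper, and it detects MCE simply by searching each XML part for the markup compatibility namespace URI as UTF-8 bytes, so it would miss parts stored in other encodings.

```python
import os
import zipfile

MCE_NS = b"http://schemas.openxmlformats.org/markup-compatibility/2006"
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def out_of_scope_reasons(filename, package=None):
    """List reasons why a file falls outside the validator's stated scope.

    `package` may be a file-like object; if omitted, `filename` is opened.
    """
    reasons = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        reasons.append(f"unsupported file extension: {ext or '(none)'}")
    with zipfile.ZipFile(package if package is not None else filename) as zf:
        for name in zf.namelist():
            # Only look at XML parts and relationship files.
            if not name.lower().endswith((".xml", ".rels")):
                continue
            if MCE_NS in zf.read(name):
                reasons.append(f"part uses the MCE namespace: {name}")
    return reasons
```

An empty list means neither of these two limitations applies; it says nothing about whether the document is valid.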

Other tools:

As you probably know, Alex Brown has made the “office-o-tron”, which is a SAX/Java-based document validator. The differences between his tool and mine are summarized here:

Task                                          Office-o-tron   OOXML Validator
Validates OOXML documents                     x               x
Validates OOXML <T> (transitional) documents  x               x
Validates OOXML <S> (strict) documents        -               -
Validates against ISO/IEC 29500:2008          x               -
Validates against ISO/IEC 29500:2008 COR 1    -               x
Supports MCE                                  -               -
Inspects package of document (ZIP container)  -               x
Validates ODF documents                       x               -


The only major difference (OOXML-wise) is really that office-o-tron validates against the base schemas of ISO/IEC 29500:2008, whereas the OOXML Validator validates against the schemas with the first set of approved corrigenda (COR1) applied. Whenever a new set has been approved as either an amendment or a corrigendum, the schema sets will be updated accordingly.

I'll update this article with some of the details revealed during the creation of the validator - until then, have fun.