a 'mooh' point

clearly an IBM drone

Correct according to spec or implementation?

In the recent SC34 WG4-meeting in Stockholm, validators quickly became the talk of the town - so to speak. As I am sure you all know, Alex Brown made the office-o-tron some time ago - a validator targeting both ODF and OOXML in their ISO-editions. A few weeks ago I made a validator myself - but mine only targets OOXML in its "latest-improved-and-approved-transitional-version". Alex Brown's is written in Java and mine is written in C#.

Anyways - both Alex and I had some lengthy discussions with Microsoft about our validators and the errors they report. The thing is - there is a bug in the OOXML-specification dealing with how to specify relationship type for the part containing document properties like "author", "created on", "last modified on" etc. This part is a "central" part in OOXML, and to the best of my knowledge, there is not a single implementation out there that doesn't use this part for storing these so-called "core properties".

If you have tried to validate an OOXML-file in my validator, you'd probably have encountered this error:

Checking relationshiptype http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties ...

RelationshipType http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties is not valid. It should have been http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties.

In OOXML the "glue" tying the document and its various parts together is "relationship types". So for a given media-type (content type), a relationship type has to be used to properly register it in the package. A few relationship types are defined for the common parts of OOXML documents, i.e. for wordprocessing files, for spreadsheets, for presentations, for headers, footers etc. Some of these are defined in Part 1, section 15 and this is where the bug is. It is obviously a typo, and it has already been included in our list of fixes for the next batch.
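To make the check concrete, here is a minimal sketch in Python (not the validator's actual C# code) of how a relationship file could be tested against a list of known relationship types. The function name `check_relationship_types`, the sample markup and the one-entry whitelist are assumptions made for illustration; the two URIs are the ones quoted in the error message above.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship files such as _rels/.rels.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical whitelist with a single entry: the URI the validator expects,
# copied from the error message above. A real validator would load every
# relationship type listed in the specification.
KNOWN_TYPES = {
    "http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties",
}

# Illustrative relationship file using the URI that the validator rejects.
SAMPLE_RELS = """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
                Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
                Target="docProps/core.xml"/>
</Relationships>"""


def check_relationship_types(rels_xml, known_types):
    """Return (Id, Type) pairs for relationships whose Type is not whitelisted."""
    root = ET.fromstring(rels_xml)
    unknown = []
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        rel_type = rel.get("Type")
        if rel_type not in known_types:
            unknown.append((rel.get("Id"), rel_type))
    return unknown


for rel_id, rel_type in check_relationship_types(SAMPLE_RELS, KNOWN_TYPES):
    print(f"RelationshipType {rel_type} ({rel_id}) is not valid.")
```

With a strict whitelist like this, the single relationship in the sample is reported, which is the behaviour described above for documents that store their core properties this way.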

The trick is - this has rather drastic consequences - at least from a validation point of view. Because a typo in this area will affect almost every implementation of OOXML that persists these basic data-chunks.

The thing is ... each and every document created by Microsoft Office will likely fail due to this bug in the specification.

So what are you gonna do?

Well, we discussed several different approaches.

One was simply to correct my validator to not report this error. I don't really like this idea, since it opens a flood gate of other scenarios where small corrections should take place. Also, if I did want to go down that road, it would require a strategy for handling these things, since I wouldn't want to correct any one error based on what Microsoft Office does - being an IBM drone and all. As of yet, I haven't been able to come up with such a strategy.

A second was to report warnings instead of errors in areas where "known bugs" were already on our to-do list of future corrections. I am not sure I like this either, since it makes the validator almost impossible to maintain and it muddies the results: no one will be able to figure out whether a warning was simply a warning or a "down-graded error".

A third option is to do nothing.

I like that.

If you have tried to validate the same document using my validator and Alex's, you'd probably have noticed that Alex's validator emits many more errors than mine. This is because I use the schemas with the first batch of corrections (the so-called COR1-set). I'll update the schemas whenever the next batch of corrections is approved by either SC34 or JTC1. Alex's validator uses the schemas that were in the originally approved version of ISO/IEC 29500:2008. So my validator is already pretty "graceful" as it is.

Another reason that I like the idea of "doing nothing" is that it emphasizes a crucial point: A document should be valid according to the spec and not according to whatever implementation one considers "reference". There are other standards out there where we have a strange mixture of behaviour defined in the specification and behaviour buried in a "reference implementation". I don't know about you - but I'd rather have the spec be "the truth" than several gigs of source-code from whatever implementation is the pet app-du-jour at the moment.

Additionally, this shows us that all the implementations that handle this have failed in terms of feeding their experiences back to the standardisation organisation maintaining the specification. They will all have encountered this issue - but failed to report it ... unless, of course:

  • they haven't looked in the spec at all [0]
  • they haven't bothered to validate their documents

The puzzling thing is - Alex and Gareth discovered this bug in January 2010 and Alex's validator has been reporting this error for months now. I guess the answer to why none of the implementers of OOXML has reported this bug is ... blowing in the wind.

So what I am trying to say is this: My validator stays the way it is - validating documents according to the spec. If any vendor discovers a problem that is clearly an error in the spec, they should prioritize notifying us about it so we can correct it (which we will).

[0] Truth be told, prioritizing "make it work with the most important implementation" is not unheard of. I myself, when I created my first ODF-files, didn't look in the ODF-spec. I reverse-engineered ODF-documents created by OOo since I only cared about whether OOo would eat it or not. Other implementations insist on not "supporting OOXML" but "supporting Microsoft Office output".

Validating OOXML documents

The question of “is this a valid document?” is tricky. At the end of the day it comes down to the conformance clauses of the specification for the kind of document being considered. The conformance clauses of OOXML are listed in Section 2 of Part 1 and Section 2 of Part 4. There are also conformance clauses for Part 2 and Part 3, but they are not really relevant for this post.

The basic requirements for “document validity” can be summarized in these two points:

  • The markup must conform to the schemas of the specification
  • The markup must conform to any semantic and syntactic constraints of the specification

The first bullet is the easy one to check – because all you need to do is to validate the markup. The second bullet is much harder and it is almost impossible to automatically perform such a validation.

But since the first requirement is so easy to test, one could argue that at the bare minimum, a document producer MUST be able to create documents that are valid to the corresponding schemas.

To be able to test this, I implemented an OOXML document schema validator.

It turned out that I should have done this from “day one” of my work with OOXML, because trying to implement a validator revealed a lot of information about how the document is structured and put together, and that knowledge really comes in handy when trying to implement a document generator.

My approach was this:

  • Implement a tool for validating documents that is as generic as possible
  • Use the latest, approved version of the schemas
  • Implement a web front-end to allow anyone to use it from anywhere.
  • Open-source the stuff

Originally I based the validator on OpenXML SDK 2.0, but during the implementation I realized that, first of all, it seemed a bit too “Microsoft Office dependent”. Secondly, I could not get access to all the information in the OPC-package that I needed to validate, since the SDK hides some of this information (and rightly so, if you ask me). Thirdly, it turned out that I didn’t need the SDK at all. OpenXML SDK is based on System.IO.Packaging, and since I use .Net to implement this, I found System.IO.Packaging itself to be a much better tool for the job.

What does it do?

The validator performs these tasks:

  1. It checks if the media type (MIME type) of each part is listed in the specification
  2. It checks if the relationship-type of the relationship file is listed in the specification
  3. It checks if each part referenced exists in the package in the correct location
  4. It checks the content (markup) of each part against the transitional schemas of the specification
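Step 3 can be sketched as follows. This is a minimal Python illustration rather than the validator's own C# code, and it only looks at the package-level relationship file `_rels/.rels`. The names `missing_package_targets` and `build_sample_package` and the sample package contents are assumptions made for the example. Note that it compares ZIP item names exactly, which is a simplification of how OPC part names are matched.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship files such as _rels/.rels.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Illustrative package-level relationships: one target exists, one does not.
SAMPLE_RELS = """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
                Target="word/document.xml"/>
  <Relationship Id="rId2"
                Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
                Target="docProps/core.xml"/>
</Relationships>"""


def build_sample_package():
    """Build a tiny in-memory ZIP that deliberately lacks docProps/core.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", SAMPLE_RELS)
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def missing_package_targets(package_bytes):
    """Return internal targets of _rels/.rels that are absent from the ZIP."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        names = set(zf.namelist())
        root = ET.fromstring(zf.read("_rels/.rels"))
    missing = []
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("TargetMode") == "External":
            continue  # external targets live outside the package
        # Package-level targets are resolved against the package root.
        resolved = posixpath.normpath(posixpath.join("/", rel.get("Target")))
        part_name = resolved.lstrip("/")
        if part_name not in names:
            missing.append(part_name)
    return missing


print(missing_package_targets(build_sample_package()))
```

A fuller check would repeat this for every part's own relationship file (for example `word/_rels/document.xml.rels`), resolving targets relative to the folder of the source part instead of the package root.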

What doesn’t it do?

The validator does not do the following things:

  • Support validation of documents containing extensions using MCE
  • Support documents in files with extensions other than “docx”, “xlsx” or “pptx”.
  • Support validation against the strict schemas of OOXML
  • Support validation of the “root” package entry, being the file [Content_Types].xml

Other tools:

As you probably know, Alex Brown has made the “office-o-tron”, which is a SAX/Java-based document validator. The differences between this tool and mine are summarized here:

Task                                            Office-o-tron   OOXML Validator
Validates OOXML documents                       x               x
Validates OOXML <T> documents                   x               x
Validates OOXML <S> documents                   -               -
Validates against ISO/IEC 29500:2008            x               -
Validates against ISO/IEC 29500:2008 COR 1      -               x
Supports MCE                                    -               -
Inspects package of document (ZIP container)    x               -
Validates ODF documents                         x               -

The only major difference (OOXML-wise) is really that office-o-tron validates against the core, base schemas of ISO/IEC 29500:2008 whereas the OOXML Validator validates against the set of schemas with the first set of approved corrigenda (COR1). Whenever a new set has been approved as either an amendment or a corrigendum, the schema sets will be updated accordingly.

I'll update this article with some of the details revealed during the creation of the validator - until then, have fun.