a 'mooh' point

clearly an IBM drone

OpenXml SDK released as OSS

Yesterday I was notified that the OpenXml SDK had been released as an Open Source Project by Microsoft.

Back in summer 2008 I attended a workshop in Redmond, WA regarding the future support of OOXML and ODF in Microsoft Office. I remember sitting down with one of the PMs at the time – talking to him about what a wonderful idea it would be to release the OpenXml SDK as OSS. I also remember how frustrating it was to be told – between the lines – that “it ain’t gonna happen”.

Now – almost exactly 6 years later, they have finally listened (btw, I am in no way trying to take credit for “making” Microsoft OSS the OpenXml SDK – it was completely their decision to do it). But it does seem to confirm a trend in Microsoft – where the revenue cows (Office, Windows, and Servers etc.) are kept closed, but the tooling around them, the stuff that ties them all together – is with increasing frequency being released as open source.

The OpenXml SDK is released under the auspices of “MS Open Tech” – in other words: Doug Mahugh and friends. Eric White has been an integral part of making this happen. Kudos to all of them from here :-).

The source code is available on GitHub and is free for everyone to look at and download. The license is Apache 2.0. It remains to be seen whether they will accept pull requests, but I cannot imagine why they should not.

Now, I haven’t had the time to dig into the code in much detail yet, but I will do this in the following weeks. One thing I will look deeply into is the .Validate()-method of the toolkit. It validates the content of the OOXML-document being worked on – or at least it should. But anyone who has tried to run a document through e.g. my validator on http://29500.idippedut.dk or Alex Brown’s at https://code.google.com/p/officeotron/ will have found out that a document – even with a “clean” result from .Validate() – is not necessarily valid according to the schemas of OOXML. It turns out that it does not validate against the spec – it validates against the supported functionality of Microsoft Office. Now, that is a completely valid (no pun intended) approach from the SDK, since most people working with OOXML at the end of the day need interoperability with Microsoft Office.

But now with the SDK being released to a larger number of developers, I guess it would be appropriate to expand or “fix” the validation-method. One possible improvement could be to allow validation against a range of XML schemas. Another would be to allow validation after having processed the document by applying MCE to the content. A third improvement would be to remove the dependency on WindowsBase.dll (and thereby System.IO.Packaging). I have a theory that the reason why the OpenXml SDK is not available on Windows Phone is this exact dll, and it would be nice to be able to manipulate OOXML-documents in memory on WP.

We’ll see what will happen to it in the future – what would you like to have changed in the SDK?

Correct according to spec or implementation?

In the recent SC34 WG4-meeting in Stockholm, validators quickly became the talk of the town - so to speak. As I am sure you all know, Alex Brown made the office-o-tron some time ago - a validator targeting both ODF and OOXML in their ISO-editions. A few weeks ago I myself made a validator - but mine only targets OOXML in its "latest-improved-and-approved-transitional-version". Alex Brown's is written in Java and mine is written in C#.

Anyways - both Alex and I had some lengthy discussions with Microsoft about our validators and the errors they report. The thing is - there is a bug in the OOXML-specification dealing with how to specify relationship type for the part containing document properties like "author", "created on", "last modified on" etc. This part is a "central" part in OOXML, and to the best of my knowledge, there is not a single implementation out there that doesn't use this part for storing these so-called "core properties".

If you have tried to validate an OOXML-file in my validator, you'd probably have encountered this error:

Checking relationshiptype http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties ...

RelationshipType http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties is not valid. It should have been http://schemas.openxmlformats.org/officedocument/2006/relationships/metadata/core-properties.

In OOXML the "glue" tying the document and its various parts together is "relationship types". So for a given media-type (content type), a relationship type has to be used to properly register it in the package. A few relationship types are defined for the common parts of OOXML documents, i.e. for wordprocessing files, for spreadsheets, for presentations, for headers, footers etc. Some of these are defined in Part 1, section 15 and this is where the bug is. It is obviously a typo, and it has already been included in our list of fixes for the next batch.
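To make the check concrete, here is a minimal sketch in Python of how such a relationship-type test could look. It is not the code of either validator (mine is written in C#, Alex's in Java); the function names and the sample package are made up for illustration, and the only facts taken from above are the two URIs in the error message. It assumes the package-level relationships live in the part `_rels/.rels`, as they do in OPC packages.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of the relationships part (_rels/.rels) in an OPC package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# The two URIs quoted in the validator output above.
USED_BY_IMPLEMENTATIONS = RELS_NS + "/metadata/core-properties"
EXPECTED_BY_VALIDATOR = (
    "http://schemas.openxmlformats.org/officedocument/2006/"
    "relationships/metadata/core-properties"
)


def check_core_properties_relationship(package_bytes):
    """Return error strings for core-properties relationships (hypothetical helper)."""
    errors = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("_rels/.rels"))
    for rel in root.iter("{%s}Relationship" % RELS_NS):
        rel_type = rel.get("Type")
        if rel_type == USED_BY_IMPLEMENTATIONS:
            errors.append(
                "RelationshipType %s is not valid. It should have been %s."
                % (rel_type, EXPECTED_BY_VALIDATOR)
            )
    return errors


def build_sample_package():
    """Build a tiny in-memory package that uses the relationship type seen in the wild."""
    rels = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<Relationships xmlns="%s">'
        '<Relationship Id="rId1" Type="%s" Target="docProps/core.xml"/>'
        "</Relationships>" % (RELS_NS, USED_BY_IMPLEMENTATIONS)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("_rels/.rels", rels)
    return buffer.getvalue()


if __name__ == "__main__":
    for error in check_core_properties_relationship(build_sample_package()):
        print(error)
```

The sketch only looks at one relationship type; a real validator would compare every relationship against the full set defined in the spec.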

The trick is - this has rather drastic consequences - at least from a validation point of view. Because a typo in this area will affect almost every implementation of OOXML that persists these basic data-chunks.

The thing is ... each and every document created by Microsoft Office will likely fail due to this bug in the specification.

So what are you gonna do?

Well, we discussed several different approaches.

One was simply to correct my validator to not report this error. I don't really like this idea, since it opens a flood gate of other scenarios where small corrections should take place. Also, if I did want to go down that road, it would require a strategy for handling these things, since I wouldn't want to correct any one error based on what Microsoft Office does - being an IBM drone and all. As of yet, I haven't been able to come up with such a strategy.

A second was to report warnings instead of errors in areas where "known bugs" were already on our to-do list of future corrections. I am not sure I like this either, since it makes the validator almost impossible to maintain and it muddies the results, since no-one will be able to figure out if a warning was simply a warning or a "down-graded error".
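A small Python sketch shows why the second option muddies the results. Everything in it is hypothetical: the rule names, the `Finding` type and the known-bugs list are invented for illustration and do not come from any actual validator.

```python
from dataclasses import dataclass

# Hypothetical identifiers of spec bugs already scheduled for correction.
KNOWN_SPEC_BUGS = {"core-properties-relationship-type"}


@dataclass
class Finding:
    rule: str
    severity: str  # "error" or "warning"


def downgrade_known_bugs(findings):
    """Turn errors caused by known spec bugs into warnings."""
    result = []
    for finding in findings:
        if finding.severity == "error" and finding.rule in KNOWN_SPEC_BUGS:
            result.append(Finding(finding.rule, "warning"))
        else:
            result.append(finding)
    return result


findings = [
    Finding("core-properties-relationship-type", "error"),  # real spec violation
    Finding("some-genuine-warning", "warning"),             # ordinary warning
]
report = downgrade_known_bugs(findings)
```

After the downgrade both entries in `report` carry the severity "warning", so nothing in the report itself distinguishes the genuine warning from the down-graded error. Keeping them apart would mean maintaining the known-bugs list alongside every spec correction.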

A third option is to do nothing.

I like that.

If you have tried to validate the same document using my validator and Alex's you'd probably have noticed that Alex's validator emits many more errors than mine. This is due to the fact that I use the schemas with the first batch of corrections (the so-called COR1-set). I'll update the schemas whenever the next batch of corrections is approved by either SC34 or JTC1. Alex's validator uses the schemas that were in the originally approved version of ISO/IEC 29500:2008. So my validator is already pretty "graceful" as it is.

Another reason that I like the idea of "doing nothing" is that it emphasizes a crucial point: A document should be valid according to the spec and not according to whatever implementation one considers "reference". There are other standards out there where we have a strange mixture of behaviour defined in the specification and behaviour buried in a "reference implementation". I don't know about you - but I'd rather have the spec be "the truth" than several gigs of source-code from whatever implementation is the pet app du jour at the moment.

Additionally, this shows us that all the implementations that handle this have failed in terms of feeding their experiences back to the standardisation organisation maintaining the specification. They will all have encountered this issue - but failed to report it ... unless, of course

  • they haven't looked in the spec at all [0]
  • they haven't bothered to validate their documents

The puzzling thing is - Alex and Gareth discovered this bug in January 2010 and Alex's validator has been reporting this error for months now. I guess the answer to why none of the implementers of OOXML has reported this bug is ... blowing in the wind.

So what I am trying to say is this: My validator stays the way it is - validating documents according to the spec. If any vendor discovers a problem that is clearly an error in the spec, they should prioritize notifying us about it so we can correct it (which we will).



[0] Truth be told, prioritizing "make it work with the most important implementation" is not unheard of. I myself, when I created my first ODF-files, didn't look in the ODF-spec. I reverse-engineered ODF-documents created by OOo since I only cared about whether OOo would eat it or not. Other implementations insist on not "supporting OOXML" but "supporting Microsoft Office output".