a 'mooh' point

clearly an IBM drone

The complexity of SpreadsheetML - oh the sheer joy of it!

Having a bit of time on my hands while attending the SC34/WG4-meeting in Okinawa, I thought I'd write up a blog post I have wanted to write in quite some time.

The reason for me doing this was a requirement I am often presented by CIBER's customers - export my data to Excel. The data they want us to export are traditionally grouped into three categories:

  • Text (strings)
  • Numbers
  • Dates

Creating cells with numbers and text are really a no-brainer in OOXML. It is a bit more complicated when it comes to dates, because dates in e.g. ISO 8601-format are not as such supported as "built-in cell data types" in SpreadsheetML. Instead, dates are presented by styling content in number-cells. This means that to be able to display a date in SpreadsheetML, you need to be know "a bit" about styling in spreadsheets.

Now, as some of you remember, representation of dates in spreadsheets using OOXML is done in "serial form" meaning that dates are stored as numbers. These numbers are also known as "Julian days" - not to be mistaken with the "Julian Calendar". In even other words a date is represented as the number of days since some starting point in time.

So if I wanted to store the date "December 20nd 2009" in OOXML, I would have to convert it to a "julian representation" - in this case "40167". This is really just a minor annoyance - the conversion is trivial and a no-brainer. However - the fun has not started yet.

If you look at the markup required, it would have to be like this:

[code:xml]<sheetData>
  <row r="1">
    <c r="A1">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

So this will give me a cell with a serial representation of 2009-12-22. However, if I open this in an OOXML-compliant application, it will display "40167". As I mentioned above, it turns out that displaying the serial representation as a "proper date" requires styling of the cell content.

The key is an attribute on the <c>-element I omitted in the example above.

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

The "s"-attribute specified the style for the given cell. The specefication says this for this particular attribute:

The index of this cell's style. Style records are stored in the Styles Part.

Ok - cool so the good thing here is, that we now know what the attribute is used for. The bad thing is that we don't know anything about "how".

Styles for SpreadsheetML are described in section 3.8. The complete section is about 110 pages and it describes at length each element name and attribute but again it more answers "what" than "how".

(I just talked to another delegate about if a standard should describe both the hows and the whats, and it seems that the jury is still out on that one, so these are simply my personal observations of using the specification to solve a concrete problem).

So in figuring out how to do this, a good starting point would be to look at the list of valid child elements. These are defined as

[code:xml]<complexType name="CT_Stylesheet">
  <sequence>
    <element name="numFmts" type="CT_NumFmts" minOccurs="0" maxOccurs="1"/>
    <element name="fonts" type="CT_Fonts" minOccurs="0" maxOccurs="1"/>
    <element name="fills" type="CT_Fills" minOccurs="0" maxOccurs="1"/>
    <element name="borders" type="CT_Borders" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyleXfs" type="CT_CellStyleXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellXfs" type="CT_CellXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyles" type="CT_CellStyles" minOccurs="0" maxOccurs="1"/>
    <element name="dxfs" type="CT_Dxfs" minOccurs="0" maxOccurs="1"/>
    <element name="tableStyles" type="CT_TableStyles" minOccurs="0" maxOccurs="1"/>
    <element name="colors" type="CT_Colors" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
</complexType>[/code]

The elements that should (ahem) draw attention to them are "cellStyles", "cellStyleXfs" and "cellXfs".So, if you want to apply formatting directly to a cell, look at e.g. the element <cellXfs> defined in section 3.8.10. It says (in abstract)

This element contains the master formatting records (xf) which define the formatting applied to cells in this workbook. These records are the starting point for determining the formatting for a cell. Cells in the Sheet Part reference the xf records by zero-based index.

The <cellXfs>-element has a child element called <xf>. The element is defined as

[code:xml]<complexType name="CT_Xf">
  <sequence>
    <element name="alignment" type="CT_CellAlignment" minOccurs="0" maxOccurs="1"/>
    <element name="protection" type="CT_CellProtection" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
  <attribute name="numFmtId" type="ST_NumFmtId" use="optional"/>
  <attribute name="fontId" type="ST_FontId" use="optional"/>
  <attribute name="fillId" type="ST_FillId" use="optional"/>
  <attribute name="borderId" type="ST_BorderId" use="optional"/>
  <attribute name="xfId" type="ST_CellStyleXfId" use="optional"/>
  <attribute name="quotePrefix" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="pivotButton" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="applyNumberFormat" type="xsd:boolean" use="optional"/>
  <attribute name="applyFont" type="xsd:boolean" use="optional"/>
  <attribute name="applyFill" type="xsd:boolean" use="optional"/>
  <attribute name="applyBorder" type="xsd:boolean" use="optional"/>
  <attribute name="applyAlignment" type="xsd:boolean" use="optional"/>
  <attribute name="applyProtection" type="xsd:boolean" use="optional"/>
</complexType>[/code]

The attribute you want here is "numFmtId". The attribute is described as "Id of the number format (numFmt) record used for this cell format".

(are we getting there soon?)

Anywho, going to the reference of numFmt will lead you to paragraph 3.8.30 numFmt (Number Format) and it will tell you, that some of the values of the attribute are implied. That's really just another way of saying "reserved values". 

ID
formatCode
 
 0
 General
 1  0
 2  0.00
 3  #,##0
 4  #,##0.00
 9  0%
 10  0.00%
 11  0.00E+00
 12  # ?/?
 13  # ??/??
 14  mm-dd-yy
 15  d-mmm-yy
 16  d-mmm
 17  mmm-yy
 18  h:mm AM/PM
 19  h:mm:ss AM/PM
 20  h:mm
 21  h:mm:ss
 22  m/d/yy h:mm
 37  #,##0 ;(#,##0)
 38  #,##0 ;[Red](#,##0)
 39  #,##0.00 ;(#,##0.00)
 40  #,##0.00 ;[Red](#,##0.00
 45  mm:ss
 46  [h]:mm:ss
 47  mmss.0
 48  ##0.0E+0
 49  @


It looks like id 15 could be the one we are looking for. So I'm gonna add this number format to the xf-elements's numFmt-attribute and create this xml-fragment:

[code:xml]<cellXfs count="2">
  <xf numFmtId="15" (...)  />
</cellXfs>[/code]

Behold - it actually works. When I load this in Microsoft Office 2007, it will display this:



So what have I learned here (apart from the astounding complexity of this relatively trivial task)? Well, to display a date using SpreadsheetML, you need to know a bit about SpreadsheetML styles. You will also need to do a fair amount of digging in the specification as well as in existing OOXML-files, since I could not find this information anywhere. Luckily for you, the content of this blog is licensed under Creative Commons attribution license, so feel free to use it however you should wish to do so.

To sum it all up, you will need the following items to display a cell in SpreadsheetML:

1. The cell fragment

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

Notice that the cell is styled using the attribute "s" with a value of "0".

2. The style part

[code:xml]<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <cellXfs count="1">
    <xf numFmtId="15" (...) />
  </cellXfs>
</styleSheet>[/code]

Notice that index "0" of the <cellXfs>-collection has a numFmt-attribute with the value "15" resulting in displaying the date correctly.

I have created a small test file based on the walk-through above and it is available here: test_dates.xlsx (2.25 kb).

And in other news:

So, you might ask, how is this done using other document formats? Well, it turns out to be drastically less complex.

ODF

[code:xml]<table:table-row>
  <table:table-cell office:value-type="date" office:date-value="2009-12-20">
    <text:p>20-12-09</text:p>
  </table:table-cell>
</table:table-row>[/code]

OOXML IS29500

[code:xml]<sheetData>
  <row r="1">
    <c r="C4" t="d">
      <v>1976-11-22T08:30Z</v>
    </c>
  </row>
</sheetData>[/code] 

Both examples above should require no additional formatting.

You might also ask, if this could have been done in any other way in OOXML? Well, as far as I read the specification, there is no way around the style-part-trouble. But you could create your own number formatting if you should wish so. I would actually prefer this angle, since it would be a step away from pre-determined (implied) values in styles and keep the package content self-contained.

You know, this could actually be the basis for a nice new defect report for WG4: "Remove all implied values in the specification and move them to the transitional Part 4".

Is there an end of it?

I know this was quite a lenghty post - but is it of any value at all - and would you like more of these investigative posts in the future?

Smile

What's up Japan!

Jeeez ... has it been a long time since I last wrote a blog-entry here. It's not so much that I didn't want to write something ... but I have found myself pre-occupied with other tasks at the grinding mill. It also seems to me that most of the other participants of the disussions have done the same thing - maybe with on exception.

Anywho - in other news, OASIS submitted their response to the ISO/IEC JTC1/SC34-defect report.It is in form of the document "Open Document Format for Office Applications (OpenDocument) 1.0 Errata 01". The Danish mirror-committee to JTC1/SC34 talked about it at our meeting last Friday and we will look into the response as soon as possible. I have made a kindof-thorough look-through of the document and I was able to confirm and accept most of the corrections. A few were a bit odd, but it's not a big deal. To me, the most important thing is to move on, have the "chapter" on ODF 1.0 in ISO/IEC closed and now concentrate on development of ODF 1.2 and"ODFNext" or whatever the latest friendly name of ODF 1.2++ is. It is my feeling that the Danish mirror committee concurs with me here, so I would suspect us to approve the response from OASIS before the JTC1/SC34-plenary in Pragh in the end of March 2009. If anyone in JTC1/SC34 need a helping hand editing the response and turning it into a COR, please let me know.

On more thing, though. Can anyone tell me if the proposed changes to IS26300 in the Errata 01 document are all included in ODF 1.2?

So what about Japan?

Well, next week the first meeting of WG4 in SC34 will take place in Okinawa, Japan. The draft agenda can be seen at the Japanese SC34 website and is also listed here:

  1. Opening - 2009-01-28 10:00

  2. Roll Call of Delegates
  3. Adoption of the Agenda

  4. Overview of the JTC 1 Maintenance Process (WG 4 N0012)
  5. Defect Reports (WG 4 N 0015)

  6. Comment Collection Form
  7. Schedule for Reprints or Technical Corrigenda

  8. Accessing the SC 34/WG4 Email Archive and Document Repository (WG4 N0014)
  9. Future Meetings (F2F and Teleconferences)

  10. Any other business

  11. Closing

So we are basically going to bee looking at what to do next. How will we structure our work? How will we keep the pace up and (rather importantly), how will we collect suggestions or defects from the public. I know that it is important to the Danish mirror committee that the widest possible audience will be heard, so I am looking forward to some interesting discussions here.

We will of course take a look at the defects that have already made their way to our system. There are currently about 50 single defects reported, some by ECMA and some by various national bodies. The defects range from spelling errors through decisions from Geneva not being implemented correctly to errors in the XML-schemas for OOXML. Denmark will sadly not be able to contribute at this time, due to "shortage of labour" but we still hope that we will have something by the end of March.

And finally - I have not asked for a full list of participants to the meeting just yet, but the last figure I heard was about 20 people in total. That's a lot - but still less than the 120 we were in Geneva.

Smile

PS: Is it cold in Okinawa?

ISO publishes OOXML

ISO today decided to make OOXML ISO-edition (IS29500:2008) publically available.

You can choose to purchase a physical print of the specification or you can download it for free at the ITTF website.

Smile

 

Finally!

 

*sigh*

Microsoft Office 2007 - now with ODF-support

On October 22nd a long awaited email popped into my mailbox  - news of the release of first beta of Microsoft Office 2007 SP2. The reason for me longing to get my hands on this piece of software (and I have, in vain, tried to squize each and every single Microsoft employee I could to get it earlier) was not that it is a Service Pace for my current office application. Nor is it that I should now expect a more stable software package, because I am not troubled by instability in my everyday work with Microsoft Office.

My interest is caused by the fact that Microsoft Office 2007 SP2 includes support for ODF 1.1, and to be frank, it is not really because Microsoft has now chosen to support ODF natively in Microsoft Office - I am sure most would agree with me that they should have supported ODF a loooong time ago.

No, what will be interesting to see will be what it will mean for interoperability via ODF.

It's the standards, stupid

It has long been a public secret that you were walking in egg-shells when exchanging ODF-documents between ODF-supporting applications that are not somehow based/cloned from OpenOffice. Of course it is possible to exchange "BUI-documents" (yes, it is a acronym I have invented for this. It means Bold, Underline and Italics and represents rather simple documents without too much fancy pancy stuff in it.) but the best experience is when using OO spin-offs.

This makes perfect sense. When using the same program, you will get the least amount of problems. This is in essense the text-book/Page1 elevator pitch for Microsoft Office sales people

And this is exactly why ODF-support in Microsoft Office 2007 is interesting - it is the first major productivity application not based on OpenOffice that promises native ODF-support.

Now some people seem to think that as long as you use an open standard like ODF, PDF or OOXML, "interoperability" is somehow included. It is as if they are trying to apply some sort of Kant'ish "Das ding an sich"-thinking when they argue that achieved interoperability is somehow an intrinsic, guaranteed feature of an open standard. The funny thing is that every time I hear these arguments I always try (or fail, rather) to find a nice way of saying that they have understood squat of the problem and that they should try to work seriously with the subject at hand before speaking so bluntly about it.

The truth is of course somewhat different and this is why I genuinely applaud the work done with the OIIC in OASIS. The truth is that an open standard enables or facilitates good interoperability and that this potential is bigger for an open standard than for a closed standard. It is clear that both ODF and OOXML provide for better interoperability than the proprietary binary DOC-formats, but reversely the binary DOC-formats are also proof that fairly good interoperability is also possible when using non-open document formats. The world is not - once again - black/white, because it is clear that an open standard is not a requirement for interoperability - but it certainly helps a lot.

My point here is

Interoperability is not created by the standards. It is created in the applications based on the standard

All applications have bugs/quirks

This is the reason this is not about the standards - rather, it's about the applications. We are now in the situation that we have two big players supporting ODF (to a varying degree). But they will propably do it in different ways. We are now in a situation where we no longer have the luxury of the major ODF-producing/consuming applications being built on the same engine. My expectation is therefore that we will experience interoperability-problems with the ODF-applications, because Microsoft Office will likely do some things differently than the OpenOffice-clones (but comply to the ODF-spec at the same time).

This is why I asked Microsoft these two questions when I attented the first DII workshop in late July 2008 (they recently held another one but I did not attend).

1. How have you handled the possibility of using application specific settings in ODF?

As you know ODF has (and now also OOXML after BRM #¤"¤%¤#¤#&"#¤#"¤#¤%, thank you very much!) the so-called "config-item-set"-elements, which are used by the current ODF-implementations to store application specific behaviour. The problem with these elements and attributes is that they are not specified in the ODF spec, so there is really no obvious way to figure out what to do with the binary printer-blob that Lotus Symphony stores in ODF-documents produced by it. The short reply from Microsoft was: "We don't use it" and if you open the settings.xml-file in the ODF-package, it is empty. This is all fine and dandy - only problem is that you risk loosing information when exchanging documents.

2. How have you handled known bugs, features in other, major ODF-applications?

All applications have bugs - including ODF-supporting applications, so my question was perfectly legitimate. Again the answer was: "We don't handle it". With this answer Microsoft gets in line with alle the other application manufacturers that don't handle their competitor's bugs. There is e.g. a "bug" in KSpread's implementation of formulas (specifically the LOG-method). This is not handled by OpenOffice.org - even though it is fairly well known.The consequence is that strange things might happen when exchanging spreadsheets between KSpread and OOo Calc.

It didn't really matter before, 'cause not that many people use KSpread - but this picture is about to change with ODF-support in Microsoft Office 2007.

The bigger picture

I you will allow me to use one of my favorite, stupid expressions, then let's for a moment "step into the helicopter to see the bigger picture".

Because I believe that Microsoft's implementation of ODF will mean interoperability-problems using ODF-files in the short term. But I also think that it will mean better ODF-support on a broad scale - in the long run.

I have previously dealt with the MathML-support of OpenOffice.org which is slightly buggy. The ODF-spec says this about mathematical content:

Mathematical content is represented by MathML 2.0

And that's it.

As you might remember, the problems with OOo's MathML-support are due to the fact that OpenOffice.org requires a DOCTYPE-declaration in the MathML-object to display it. Also it seems that OOo will only display a certain kind of MathML. I have documented this in a previous post, but the short story here is that a simple mathematical equation in an ODF-document created using Microsoft Office 2007 SP2 will not display in OOo 3.0 nor Lotus Symphony 1.0 The ODF-file is perfectly valid and so is the MathML-fragment (tested using jing and the RelaxNG-schemas for ODF 1.1 and MathML as well as the MathML-tool from W3C, Amaya).

This example serves to illustrate my point: Microsoft's implementation of ODF will mean better support for ODF in the long run, because it forces existing problems in the applications to surface - and they can then be fixed.

And a small note for the trigger-happy ones: This is not due to the fact that Microsoft has implemented ODF - merely it is due to the fact that we will now have a new, major implementation of ODF to exchange documents with.

The problems described above have propably existed for years but no-one have noticed since most people use some kind of OpenOffice-clone for creation and display of ODF-documents. Now, on the other hand, errors in the applications (including in Microsoft Office) will be very obvious and the pressure to fix them will be much bigger. I also predict that Microsoft will have to speed up the release cycle of updates to their productivity-applications supporting ODF - at least when it comes to hotfixes of known problems. I don't think anyone will settle for bi-annual service packs for fixing trivial errors with big impact on productivity and interoperability.

Only remaining question now is: when will SP2 make it into Microsoft Office 2007? When it snows in Seattle?

(btw, I watched Grey's Anatomy yesterday, and according to them, it does snow in Seattle from time to time!)

JTC1/SC34 WG4 appointed Danish expert

On Friday, October 24th the Danish mirror-committee to JTC1/SC34 had its bi-monthly meeting. On the agenda was, amongst other things, assignment of participants to the newly created working groups in JTC1/SC34, WG4 and WG5.

For those of you not familiar with the establishment of these two groups, WG4 will deal with maintenance and development of OOXML. WG5 will work to "Develop principles of, and guidelines for, interoperability among documents represented using heterogeneous ISO/IEC document file formats." So the latter WG is not really about translating between document formats such as ODF and OOXML. No, it is about creating some guidelines that all (future or present) document formats could use as inspiration when designing the formats to be "interoperable".

I think the prospects of this could be really, really good and I hope as many stakeholders as possible chooses to join the work. It would be great to have som kind of guidelines for interoperability comparable to the Accessibility-guidelines from W3C (those that was added to OOXML during the BRM in Geneva).

We did not get any confirmed pledges to participate from the members of the Danish committee, but I was very pleased to hear that both ORACLE Denmark as well as the Technical University of Denmark would investigate if they could join the working group.

More interesting to me was assignment of participants for Working Group 4 to develop and maintain OOXML. Not surprisingly (since most of the participants of the committee are much more "anti-OOXML" than "pro-ODF" this point of the agenda received far less attention. We have in CIBER Denmark discussed for quite some time if we should join the working group, and we have reached the conclusion that we would. We do this of the following reasons:

  1. We believe that we would be able to deliver some technical skills that would be valuable to the work around OOXML
  2. We believe that it is important that development and maintenance of OOXML is not done exclusively by ECMA under the "ISO brand" and
  3. we believe that it is important to create a Danish "foot-print" on the development of the document format
So when the committee was asked if anyone would join, CIBER stepped up to the plate. I am happy to say that both the potential commitment of ORACLE Denmark and Technical University of Denmark and the confirmed commitment from CIBER received unanimous support from the other committee members.

So now what?

well, the first draft of the agenda for the meeting in Okinawa has been posted on the SC34-website. At present the agenda is this:

Draft agenda

  1. Opening - 2009-01-28 10:00
  2. Roll call of Delegates
  3. Adoption of the Agenda
  4. Defect Reports
  5. Any other business
  6. Closing

I think we will also talk about what to actually do in the foreseeable future both with respect to handling of defect reports and future maintenance. One of the things I will not accept (and I hope nor will the other appointed experts) is that the working group will primarily focus our time on defect handling - all while ECMA works on new stuff for OOXML and eventually dumping this on our table. So we will need to establish some sort of agreement around this.

Also we will need to talk about future places to meet. Next meeting will likely be held in Pragh, and I would like to some how make sure that future meetings are held in cities near major airport hubs around the world. It will take me about 24 hours to travel from Copenhagen to Okinawa, and that travel period would be cut in two, if the meeting was held in e.g. Tokyo or Kyoto. This is not a criticisme of the Japaneese decision to have the meeting in Okinawa, but I believe we would indirectly encourage more participation if the required travelling was not so extensive.

Oh ... and did anyone notice that I was only mentioned in the "Small news"-section of Alex Brown's recent post "More Standards news"? This really helps keeping both feet solidly on the ground and not thinking too much of myself.

Wink

IS 29500 has been sent to ITTF for publication

This email just landed in my mailbox this morning:

ISO/IEC JTC1/SC34 N1080
Final Text for ISO/IEC 29500-1, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 1: Fundamentals and Markup Language Reference
Status: This text has been submitted to ITTF for publication. It is circulated to the SC 34 members for information.
 
ISO/IEC JTC1/SC34 N1081
Final Text for ISO/IEC 29500-2, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 2: Open Packaging Conventions
Status: This text has been submitted to ITTF for publication. It is circulated to the SC 34 members for information.
 
ISO/IEC JTC1/SC34 N1082
Final Text for ISO/IEC 29500-3, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 3: Markup Compatibility and Extensibility
Status: This text has been submitted to ITTF for publication. It is circulated to the SC 34 members for information.
 
ISO/IEC JTC1/SC34 N1083
Final Text for ISO/IEC 29500-4, Information technology --Document description and processing languages -- Office Open XML File Formats -- Part 4: Transitional Migration Features Due date: --
Status: This text has been submitted to ITTF for publication. It is circulated to the SC 34 members for information.
 
This will finally make it possible for the NBs of ISO to verify that the editorial instructions from the BRM has made it into the final text. I have not yet had the time to investigate and verify that the Danish changes has been implemented, but I am sure lots of blogging will take place over the next days.
 
Smile

Day one of IS29500?

On August 15th 2008 ISO/IEC gave their "Go ahead" on the appeal against the IS29500-approval and the process leading to it. The decision was covered almost everywhere and the phrase that caused the most speculation was this:

According to the ISO/IEC rules, DIS 29500 can now proceed to publication as an ISO/IEC International Standard. This is expected to take place within the next few weeks on completion of final processing of the document, and subject to no further appeals against the decision.

(my emphasis)

So the battle was clearly not over since the appeal itself could also be appealed. The question was: until when? Then on September 1st news broke that the appealing countries would not appeal the decision to overthrow the appeals. Since it is my understanding that only the appeallants could appeal overthrowing the appeal (confused, anyone?), I suppose the case was finally closed.

But we are still waiting for the revised text from ITTF. I would imagine that they would hold the text until the period for appealing the appeal-overthrow was over with ... but when is that? This morning it occured to me that if the period was 30 days - today is the first working day after the deadline.

Could this be it then? Could today be the "Birthday" of IS29500?

Microsoft, its time to deliver

Just before OOXML was approved in JTC1/SC34, a lot of us spent a lot of time discussing the differences of between Sun's CNS, IBM's ISP and Microsoft's OSP. Specifically, a thread on Oliver Bell's blog dealt with this topic. The post was called "The OSP will apply to future versions of DIS29500". Oliver said

For developers wanting to use the ISO/IEC DIS29500 specification this has raised some questions around exactly what level of support Microsoft will pledge to future versions of the OpenXML specification as it continues to evolve through the ISO process.

This is an important issue, and to date I don’t think we have been clear enough around our intent in this area. This has come up in internal discussions several times recently and today a decision was taken to make a public statement to continue to make the intellectual property that developers or users may need available to future versions.

The statement will appear on http://microsoft.com shortly

This was in late March 2008. I just checked the OSP-page and this change has still not been applied to the OSP. The text still says:

Q: Does this OSP apply to all versions of the standard, including future revisions?

A: The Open Specification Promise applies to all existing versions of the specification(s) designated on the public list posted at http://www.microsoft.com/interop/osp/, unless otherwise noted with respect to a particular specification (see, for example, specific notes related to web services specifications).

Then in late May Microsoft announced their support of ODF in Microsoft Office 12 and joining ODF TC.I myself wrote a bit on it on my blog, and I made the following list of things Microsoft wanted to do:

  1. Microsoft will join OASIS ODF TC
  2. Microsoft will include ODF in their list of specifications covered by the Open Specification Promise (OSP)
  3. Microsoft will include full, native support for ODF 1.1 in Microsoft Office 14 and in Microsoft Office 12 SP2 - scheduled for Q2 2009. Microsoft Office 12 SP2 will have built-in support for the three most widely used ISO-standards for document formats, e.g. OOXML, ODF and PDF.

Well, I clearly misunderstood something with regards to OSP covering ODF, because that has not happened (yet). I was under the impression that it was a requirement when joining OASIS, but maybe Rob is right in saying that the OASIS IPR-policy participants in OASIS-work are required to sign actually trumps the ISP for IBM and perhaps also the OSP from Microsoft. On a funny note, I was actually quoted in their press release praising their modifications to specifications covered by their OSP ... but maybe they changed their mind.

Still, I think it would be a good move by Microsoft to include ODF in their OSP. As I wrote at that time

One of the aspects of the discussion that never really surfaced was that if IBM has software patents covering ODF - some of them quite possibly cover parts of OOXML as well. But the ISP of IBM does not mention OOXML - it only mentions ODF. This leaves me as a developer in quite a legal pickle, because by implementing OOXML I am covered by the OSP - but I am not covered by IBM's ISP (and vice versa). To me as a developer, Microsoft's coverage of ODF in their OSP is a good move, because it should remove all legal worries I might have around stepping into SW-patent covered territory.

This is still true, dear Microsoft.

I all bad, then? Well no - Microsoft recently won praise from no other than Groklaw with expanding their FAQ on their OSP - now specifically  making it clear that the OSP covers GPL-licensed implementations. Groklaw seemed so confused by the "good news" they had to ask: "Are pigs flying, or what?"

Smile

So Microsoft - what are you going do?

DII ODF workshop - the good stuff

... continued from DII-workshop in Redmond - round-table discussions.

So - let's get down to what was the real purpose of going to Redmond - apart from the great breakfast I had at Lowell's in Farmer's market in Seattle - to test the pre-alpha version of Microsoft Office 2007 SP2 and its ODF-support.

(let me start by appologizing for the late post, but I lost my USB-drive with my test-files on, and I didn't find it until a few days ago)

I have already listed some of the findings of the day in my previous post, so I'll try to get into more detail here.

What did I do?

Well, we would have some hands-on time with the latest build of Microsoft Office 2007 SP2 (apparently directly from a developer's machine) so I brought a bunch of documents I have worked on before - some of them was from the application interop-work I participated in in Fall 2007 for the Danish National IT- and Telecommunication Agency. Others I have created myself. I performed the following steps for each file:

  1. Load the ODF-file in OpenOffice.org 2.4
  2. Create a PDF-file of the document using a PDF printer driver (CutePDF)
  3. Load the ODF-file in Microsoft Office 2007 SP2
  4. Do a "Save as ODF" and prefix the original filename with "MSO". According to the Microsoft project managers I talked to, this would ensure I actually saved a version of the ODF-file that had been processed by the internal object model of Microsoft Office 2007 SP2.
  5. Create a PDF-file of the document using a PDF printer driver (CutePDF)

Below I have listed for each document the following data:

Original file: somefile
Original file New file
Generator: SomeApplication PDF Generator: Microsoft Office 2007 SP2 PDF

For each I will include some tech remarks on interesting subjects - if any.

There are a couple of things to note on a general level before we get started. Microsoft has chosen to follow implementation of ODF "by the book" in the sense that they have not looked so much about bugs or "features" in competing applications. This has the peculiar effect that perfectly legitimate ODF-files produced by Microsoft Office 2007 SP2 might not properly in competing applications. For more general ideas of what they did, you should check out Dennis Hamilton's post from the workshop. It is by far the most comprehensive of the ones posted since last week.

Original file: Testfile_03.odt
Original file New file
Generator: OpenOffice.org/2.4$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is an ODT-file with an embedded ODS-spreadsheet. Loading this file into Microsoft Office shows a nice red cross and no spreadsheet. An inspection of the ODT-file shows that the content is pretty much preserved including the embedded ODS-spreadsheet. But when looking at the manifest file, the following appears:

[code=xml]<manifest:file-entry
 manifest:media-type="application/x-openoffice-gdimetafile;windows_formatname="GDIMetaFile""
 manifest:full-path="ObjectReplacements/Object 1"
/>[/code]

It is the location of the graphical representation of the embedded spreadsheet. The media-type seems to be an old StarView Metafile format (confirm, anyone?) and Microsoft Word doesn't understand this image format - hence the red cross. This example highlights one of the points of bad interoperability: Small errors can cause big problems. Everything but the missing image is preserved, but the document becomes useless regardless of this "small" error".

Original file: Testfile_07.odt
Original file New file
Generator: OpenOffice.org/2.0$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is included in the "Self-assesment"-package from the Danish National IT- and Telecom Agency. Loading the letter into Microsoft Office 2007 initially appears to produce an identical file, but even though the content itself is preserved, there are still areas with problems.

  1. There is a border around the logo image in the header
  2. The height of the header is not completely preserved
  3. The "right margin" (which is really a stretched text box) is gone since the text box is wrapped around the text instead of being preserved in its full length
  4. Page numbering is gone on the last page
A funny note: if you load the file generated by Microsoft Office 2007 in OOo 2.4, it loads perfectly fine as the original document. This suggests that the problems encountered by loading it in Microsoft Office 2007 are not problems with converting ODF to the internal object model of Microsoft Office 2007 but instead problems in the layout engines.

Original file: Testfile_08.odt
Original file New file
Generator: OpenOffice.org/2.2$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This is another document from the Self-assessment package. It contains a few different features; a TOC, colored text, text boxes, a drawing, an embedded spreadsheet as well as some change-modification. This generated document is kind of messy. The content has been "shuffled" around and again we have the problem with Microsoft Office 2007 SP2 not understanding the GDIMetafile image format. The embedded objects are fine themselves - the graphical representation of them is not.

Original file: Testfile_10.odt
Original file New file
Generator: Jesper Lund Stocholm PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is another one of my own files that I have created earlier. It contains a mathematical formula in MathML. When loading it in Microsoft Office 2007 SP2, the mathematical formula simply dissapears. I am kind of lost on the reason for this. It is not the DOCTYPE-declaration used by OOo (see next file for those details) so maybe it is the construction of my ODT-file that poses an issue for them.

Original file: Testfile_11.odt
Original file New file
Generator: OpenOffice.org/2.4$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is almost identical to the one above - but it is generated by OOo 2.4 instead of me and carries all the styling and configuration that comes with it. Here the file and the mathematical content loads just fine. But an interesting thing happens when saving it again. The MathML-fragment is slightly altered from

[code=xml]<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE math:math PUBLIC "-//OpenOffice.org//DTD Modified W3C MathML 1.01//EN" "math.dtd">
<math:math xmlns:math="http://www.w3.org/1998/Math/MathML">
 <math:semantics>
  <math:mrow>
   <math:mi>cos</math:mi>
   <math:mrow>
    <math:mrow>
     <math:mo math:stretchy="false">(</math:mo>
     <math:mfrac>
      <math:mo math:stretchy="false">π</math:mo>
      <math:mn>4</math:mn>
     </math:mfrac>
     <math:mo math:stretchy="false">)</math:mo>
    </math:mrow>
    <math:mo math:stretchy="false">=</math:mo>
    <math:mfrac>
     <math:msqrt>
      <math:mn>2</math:mn>
     </math:msqrt>
     <math:mn>2</math:mn>
    </math:mfrac>
   </math:mrow>
  </math:mrow>
  <math:annotation math:encoding="StarMath 5.0">
    cos({%pi} over {4} ) = {sqrt{2} } over {2}
  </math:annotation>
 </math:semantics>
</math:math>[/code]

to

[code=xml]<?xml version="1.0" encoding="UTF-8"?>
<mml:math
  xmlns:mml="http://www.w3.org/1998/Math/MathML"
  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
  <mml:mi mathvariant="normal">c</mml:mi>
  <mml:mi mathvariant="normal">o</mml:mi>
  <mml:mi mathvariant="normal">s</mml:mi>
  <mml:mo>(</mml:mo>
  <mml:mfrac>
    <mml:mrow>
      <mml:mi>π</mml:mi>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>4</mml:mn>
    </mml:mrow>
  </mml:mfrac>
  <mml:mo>)</mml:mo>
  <mml:mo>=</mml:mo>
  <mml:mfrac>
    <mml:mrow>
      <mml:msqrt>
        <mml:mn>2</mml:mn>
      </mml:msqrt>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>2</mml:mn>
    </mml:mrow>
  </mml:mfrac>
</mml:math>[/code]

The clever reader will notice that the semantic annotations used by OOo are removed from the MathML-fragment. The MathML is in general altered a bit, but it is not that big changes - most of them are visual things related to styling. The problem is that this MathML is un-consumable for OOo. The MathML-fragment produced by Microsoft Office 2007 SP2 is valid MathML (validated using Amaya) and even though I add the required !DOCTYPE, it still won't load in OOo.

Original file: Testfile_13.odt
Original file New file
Generator: OpenOffice.org/2.0$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

(file has been removed at the request of the originator of the file )

This file is a bit more complex, and as with Testfile_08 it consists of a lot of different parts. Key issues here is failure to read GDIMetaFiles, borders around images, errors in visual presentation of numbering/bulleted lists and lines being much thicker than in the original file. There is really nothing new in this file - just that it confirms the problems identified with Testfile_08.

Original file: Testfile_14.odt
Original file New file
Generator: OpenOffice.org/2.3$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

This file is one of those template-files that are used a lot almost everywhere. You know, someone has created a "standard" document with correct header, footer and images, and this file is then distributed in the organisation. The conversion is actually almost error-free. There is a slight error with respect to border around images and rendering of them, but that is just about it.

Original file: Testfile_20.ods
Original file New file
Generator: OpenOffice.org/2.4$Win32 PDF Generator: Microsoft Office 2007 SP2 PDF

Remarks

(both PFD-files have been created by OOo 2.4/Win32)

I created the file above to illustrate what would happen when working with spreadsheets. I used the infamous CEILING-function, but I was at that time not aware that Microsoft Office 2007 SP2 would throw out formulas from "unknown namespaces". Hence there is very little change - only the visible number of decimals after having been through Microsoft Office 2007 SP2 has been reduced to two. If you look in the XML generated, you will find one interesting thing, though:

[code=xml]<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<office:document-content
  xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
  xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
  xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
  xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
  xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
  xmlns:msoxl="http://schemas.microsoft.com/office/excel/formula"
  >
  (...)
</office:document-content>[/code]

Can you see it?

Conclusions

Well, the investigation above was done based on about 20 files tested and they were primarily text documents (and one spreadsheet). Some of them was created by me and some were created by various parts of the public sector in Denmark. I have only looked at about half of the files, but a few other files are also available shold you wish to play with them yourself. You can get them here: public.zip (3,02 mb).

Validation

I have made some effort to validate the ODF-files generated by Microsoft Office 2007 SP2. What I have done is to download the RelaxNG ODF 1.1-schemas from OASIS' website and I used JING to perform the schema-validation. Since there is a known bug in the schemas I have used JING with the "-i" flag set. Validating the structure of the package itself is a bit tricky (as reported by Rick Jellife) and I have not done that. I have done a schema-validation on the files "content.xml" and "styles.xml" based on the thought, that these are the most complex files in the package. The result of the validation is that all files generated by Microsoft Office 2007 SP2 are valid ODF 1.1-files. I piped the result of the validation into an output file available here for your viewing pleasure: output.txt (1,92 kb).

All in all I think Microsoft has done a pretty good job. Obviously there is still some way to go until it reaches production quality, but I was pleasantly surprised to see the big difference in conversion results compared with the results of the ODF Converter from SourceForge.net I have worked with earlier. There are a couple of things I would like to note, though:

Graphical representations of embedded objects

Microsoft Office 2007 SP2 has problems with reading the graphical representation of embedded objects if the file is created by OpenOffice. It seems that it simply doesn't support the GDIMetaFile-format used by OpenOffice (and its derivatives). I think the "nice" way to solve this would be to load the object (if supported) and render an image of it again. The dimension of the image is available in the <draw:frame>-element and could be used to determine the size of the image.

Embedded objects

I noticed that handling of embedded objects are done using a "don't touch"-approach, which means that when loading an ODF-file with an embedded object, the embedded object is simply copied and not touched by Microsoft Office 2007 SP2 (if they are not activated by the user). I think this is a good approach. Consuming applications should respect the "integrity" of the consumed package and not alter its content unless it has to.

mimetype

A funny little thing: The mimetype-file in the ODF-package is created using CAPITAL letters, i.e. the file will be called "MIMETYPE". This causes the OpenDocumentFellowship validator to fail since it cannot find the file (with non-capital letters). I have suggested to Microsoft to generate the file using non-capital letters to enhance interop and validation across platforms where some are "a bit more" case-sensitive than Windows.

config settings

Microsoft has chosen not to use the configuration elements otherwise to widely used by Lotus Symphony and OpenOffice.org . I am not sure if I think it is a good or a bad idea, but since they do not use the settings.xml-file at all, they should remove the file completely.

ISO says: continue with ISO/IEC 29500

This just in ...

The two ISO and IEC technical boards have given the go-ahead to publish ISO/IEC DIS 29500, Information technology – Office Open XML formats, as an ISO/IEC International Standard after appeals by four national standards bodies against the approval of the document failed to garner sufficient support.

Source: http://www.iso.org/iso/pressrelease.htm?refid=Ref1151

As I think you can imagine, I think this really good news ... more information to come.

Smile