a 'mooh' point

clearly an IBM drone

The complexity of SpreadsheetML - oh the sheer joy of it!

Having a bit of time on my hands while attending the SC34/WG4-meeting in Okinawa, I thought I'd write up a blog post I have wanted to write in quite some time.

The reason for me doing this was a requirement I am often presented by CIBER's customers - export my data to Excel. The data they want us to export are traditionally grouped into three categories:

  • Text (strings)
  • Numbers
  • Dates

Creating cells with numbers and text are really a no-brainer in OOXML. It is a bit more complicated when it comes to dates, because dates in e.g. ISO 8601-format are not as such supported as "built-in cell data types" in SpreadsheetML. Instead, dates are presented by styling content in number-cells. This means that to be able to display a date in SpreadsheetML, you need to be know "a bit" about styling in spreadsheets.

Now, as some of you remember, representation of dates in spreadsheets using OOXML is done in "serial form" meaning that dates are stored as numbers. These numbers are also known as "Julian days" - not to be mistaken with the "Julian Calendar". In even other words a date is represented as the number of days since some starting point in time.

So if I wanted to store the date "December 20nd 2009" in OOXML, I would have to convert it to a "julian representation" - in this case "40167". This is really just a minor annoyance - the conversion is trivial and a no-brainer. However - the fun has not started yet.

If you look at the markup required, it would have to be like this:

[code:xml]<sheetData>
  <row r="1">
    <c r="A1">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

So this will give me a cell with a serial representation of 2009-12-22. However, if I open this in an OOXML-compliant application, it will display "40167". As I mentioned above, it turns out that displaying the serial representation as a "proper date" requires styling of the cell content.

The key is an attribute on the <c>-element I omitted in the example above.

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

The "s"-attribute specified the style for the given cell. The specefication says this for this particular attribute:

The index of this cell's style. Style records are stored in the Styles Part.

Ok - cool so the good thing here is, that we now know what the attribute is used for. The bad thing is that we don't know anything about "how".

Styles for SpreadsheetML are described in section 3.8. The complete section is about 110 pages and it describes at length each element name and attribute but again it more answers "what" than "how".

(I just talked to another delegate about if a standard should describe both the hows and the whats, and it seems that the jury is still out on that one, so these are simply my personal observations of using the specification to solve a concrete problem).

So in figuring out how to do this, a good starting point would be to look at the list of valid child elements. These are defined as

[code:xml]<complexType name="CT_Stylesheet">
  <sequence>
    <element name="numFmts" type="CT_NumFmts" minOccurs="0" maxOccurs="1"/>
    <element name="fonts" type="CT_Fonts" minOccurs="0" maxOccurs="1"/>
    <element name="fills" type="CT_Fills" minOccurs="0" maxOccurs="1"/>
    <element name="borders" type="CT_Borders" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyleXfs" type="CT_CellStyleXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellXfs" type="CT_CellXfs" minOccurs="0" maxOccurs="1"/>
    <element name="cellStyles" type="CT_CellStyles" minOccurs="0" maxOccurs="1"/>
    <element name="dxfs" type="CT_Dxfs" minOccurs="0" maxOccurs="1"/>
    <element name="tableStyles" type="CT_TableStyles" minOccurs="0" maxOccurs="1"/>
    <element name="colors" type="CT_Colors" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
</complexType>[/code]

The elements that should (ahem) draw attention to them are "cellStyles", "cellStyleXfs" and "cellXfs".So, if you want to apply formatting directly to a cell, look at e.g. the element <cellXfs> defined in section 3.8.10. It says (in abstract)

This element contains the master formatting records (xf) which define the formatting applied to cells in this workbook. These records are the starting point for determining the formatting for a cell. Cells in the Sheet Part reference the xf records by zero-based index.

The <cellXfs>-element has a child element called <xf>. The element is defined as

[code:xml]<complexType name="CT_Xf">
  <sequence>
    <element name="alignment" type="CT_CellAlignment" minOccurs="0" maxOccurs="1"/>
    <element name="protection" type="CT_CellProtection" minOccurs="0" maxOccurs="1"/>
    <element name="extLst" type="CT_ExtensionList" minOccurs="0" maxOccurs="1"/>
  </sequence>
  <attribute name="numFmtId" type="ST_NumFmtId" use="optional"/>
  <attribute name="fontId" type="ST_FontId" use="optional"/>
  <attribute name="fillId" type="ST_FillId" use="optional"/>
  <attribute name="borderId" type="ST_BorderId" use="optional"/>
  <attribute name="xfId" type="ST_CellStyleXfId" use="optional"/>
  <attribute name="quotePrefix" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="pivotButton" type="xsd:boolean" use="optional" default="false"/>
  <attribute name="applyNumberFormat" type="xsd:boolean" use="optional"/>
  <attribute name="applyFont" type="xsd:boolean" use="optional"/>
  <attribute name="applyFill" type="xsd:boolean" use="optional"/>
  <attribute name="applyBorder" type="xsd:boolean" use="optional"/>
  <attribute name="applyAlignment" type="xsd:boolean" use="optional"/>
  <attribute name="applyProtection" type="xsd:boolean" use="optional"/>
</complexType>[/code]

The attribute you want here is "numFmtId". The attribute is described as "Id of the number format (numFmt) record used for this cell format".

(are we getting there soon?)

Anywho, going to the reference of numFmt will lead you to paragraph 3.8.30 numFmt (Number Format) and it will tell you, that some of the values of the attribute are implied. That's really just another way of saying "reserved values". 

ID
formatCode
 
 0
 General
 1  0
 2  0.00
 3  #,##0
 4  #,##0.00
 9  0%
 10  0.00%
 11  0.00E+00
 12  # ?/?
 13  # ??/??
 14  mm-dd-yy
 15  d-mmm-yy
 16  d-mmm
 17  mmm-yy
 18  h:mm AM/PM
 19  h:mm:ss AM/PM
 20  h:mm
 21  h:mm:ss
 22  m/d/yy h:mm
 37  #,##0 ;(#,##0)
 38  #,##0 ;[Red](#,##0)
 39  #,##0.00 ;(#,##0.00)
 40  #,##0.00 ;[Red](#,##0.00
 45  mm:ss
 46  [h]:mm:ss
 47  mmss.0
 48  ##0.0E+0
 49  @


It looks like id 15 could be the one we are looking for. So I'm gonna add this number format to the xf-elements's numFmt-attribute and create this xml-fragment:

[code:xml]<cellXfs count="2">
  <xf numFmtId="15" (...)  />
</cellXfs>[/code]

Behold - it actually works. When I load this in Microsoft Office 2007, it will display this:



So what have I learned here (apart from the astounding complexity of this relatively trivial task)? Well, to display a date using SpreadsheetML, you need to know a bit about SpreadsheetML styles. You will also need to do a fair amount of digging in the specification as well as in existing OOXML-files, since I could not find this information anywhere. Luckily for you, the content of this blog is licensed under Creative Commons attribution license, so feel free to use it however you should wish to do so.

To sum it all up, you will need the following items to display a cell in SpreadsheetML:

1. The cell fragment

[code:xml]<sheetData>
  <row r="1">
    <c r="A1" s="0">
      <v>40167</v>
    </c>
  </row>
</sheetData>[/code]

Notice that the cell is styled using the attribute "s" with a value of "0".

2. The style part

[code:xml]<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <cellXfs count="1">
    <xf numFmtId="15" (...) />
  </cellXfs>
</styleSheet>[/code]

Notice that index "0" of the <cellXfs>-collection has a numFmt-attribute with the value "15" resulting in displaying the date correctly.

I have created a small test file based on the walk-through above and it is available here: test_dates.xlsx (2.25 kb).

And in other news:

So, you might ask, how is this done using other document formats? Well, it turns out to be drastically less complex.

ODF

[code:xml]<table:table-row>
  <table:table-cell office:value-type="date" office:date-value="2009-12-20">
    <text:p>20-12-09</text:p>
  </table:table-cell>
</table:table-row>[/code]

OOXML IS29500

[code:xml]<sheetData>
  <row r="1">
    <c r="C4" t="d">
      <v>1976-11-22T08:30Z</v>
    </c>
  </row>
</sheetData>[/code] 

Both examples above should require no additional formatting.

You might also ask, if this could have been done in any other way in OOXML? Well, as far as I read the specification, there is no way around the style-part-trouble. But you could create your own number formatting if you should wish so. I would actually prefer this angle, since it would be a step away from pre-determined (implied) values in styles and keep the package content self-contained.

You know, this could actually be the basis for a nice new defect report for WG4: "Remove all implied values in the specification and move them to the transitional Part 4".

Is there an end of it?

I know this was quite a lenghty post - but is it of any value at all - and would you like more of these investigative posts in the future?

Smile

What's up Japan!

Jeeez ... has it been a long time since I last wrote a blog-entry here. It's not so much that I didn't want to write something ... but I have found myself pre-occupied with other tasks at the grinding mill. It also seems to me that most of the other participants of the disussions have done the same thing - maybe with on exception.

Anywho - in other news, OASIS submitted their response to the ISO/IEC JTC1/SC34-defect report.It is in form of the document "Open Document Format for Office Applications (OpenDocument) 1.0 Errata 01". The Danish mirror-committee to JTC1/SC34 talked about it at our meeting last Friday and we will look into the response as soon as possible. I have made a kindof-thorough look-through of the document and I was able to confirm and accept most of the corrections. A few were a bit odd, but it's not a big deal. To me, the most important thing is to move on, have the "chapter" on ODF 1.0 in ISO/IEC closed and now concentrate on development of ODF 1.2 and"ODFNext" or whatever the latest friendly name of ODF 1.2++ is. It is my feeling that the Danish mirror committee concurs with me here, so I would suspect us to approve the response from OASIS before the JTC1/SC34-plenary in Pragh in the end of March 2009. If anyone in JTC1/SC34 need a helping hand editing the response and turning it into a COR, please let me know.

On more thing, though. Can anyone tell me if the proposed changes to IS26300 in the Errata 01 document are all included in ODF 1.2?

So what about Japan?

Well, next week the first meeting of WG4 in SC34 will take place in Okinawa, Japan. The draft agenda can be seen at the Japanese SC34 website and is also listed here:

  1. Opening - 2009-01-28 10:00

  2. Roll Call of Delegates
  3. Adoption of the Agenda

  4. Overview of the JTC 1 Maintenance Process (WG 4 N0012)
  5. Defect Reports (WG 4 N 0015)

  6. Comment Collection Form
  7. Schedule for Reprints or Technical Corrigenda

  8. Accessing the SC 34/WG4 Email Archive and Document Repository (WG4 N0014)
  9. Future Meetings (F2F and Teleconferences)

  10. Any other business

  11. Closing

So we are basically going to bee looking at what to do next. How will we structure our work? How will we keep the pace up and (rather importantly), how will we collect suggestions or defects from the public. I know that it is important to the Danish mirror committee that the widest possible audience will be heard, so I am looking forward to some interesting discussions here.

We will of course take a look at the defects that have already made their way to our system. There are currently about 50 single defects reported, some by ECMA and some by various national bodies. The defects range from spelling errors through decisions from Geneva not being implemented correctly to errors in the XML-schemas for OOXML. Denmark will sadly not be able to contribute at this time, due to "shortage of labour" but we still hope that we will have something by the end of March.

And finally - I have not asked for a full list of participants to the meeting just yet, but the last figure I heard was about 20 people in total. That's a lot - but still less than the 120 we were in Geneva.

Smile

PS: Is it cold in Okinawa?