| 
 Introduction  | |
[ Back To Top ] 
With Office 2007, Microsoft decided to change the default 
application formats from old, proprietary, closed formats (DOC, XLS, PPT) to 
new, open and standardized XML formats (DOCX, XLSX and PPTX).  New formats share 
some similarities with old Office XML formats (WordML, SpreadsheetML) and some 
similarities with competing OpenOffice.org OpenDocument formats, but there are 
many differences.  Since new formats will be the default in Office 2007 and 
Microsoft Office is the most predominant office suite, these formats are 
destined to be popular and you will probably have to deal with them sooner or 
later. 
This article will explain the basics of Open XML file format 
and specifically XLSX format, the new format for Excel 2007.  Presented is a 
demo application which writes/reads tabular data to/from XLSX files.  The 
application is written in C# using Visual Studio 2005.  Created XLSX files can 
be opened using Excel 2007 Beta (we used build 
12.0.3820.1003). 
 | |
| Microsoft Open XML format | |
[ Back To Top ] 
Every Open XML file is essentially a ZIP archive containing 
many other files.  Office-specific data is stored in multiple XML files inside 
that archive.  This is in direct contrast with old WordML and SpreadsheetML 
formats which were single, non-compressed XML files.  Although more complex, the 
new approach offers a few benefits. 
•        You do not need to process entire files in order to 
extract specific data. 
•        Images and multimedia are now encoded in native 
format, not as text streams. 
•        Files are smaller as a result of compression and 
native multimedia storage. 
In Microsoft’s terminology, an open XML ZIP file is called a 
package.  Files inside that package are called parts.  It is important to know 
that every part has a defined content type and there are no default type 
presumptions based on the file extension.  Content type can describe anything, 
application XML, user XML, images, sounds, video or any other binary objects. 
 Every part must be connected to some other part using a relationship.  Inside 
the package are special XML files with “.rels” extension which define 
relationship between parts.  There is also a start part (sometimes called 
“root”, which is a bit misleading because graph containing all parts does not 
have to be a tree structure), so the entire structure looks like Figure 1. 
Figure 1 
To cut a long story short, in order to read the data from an 
Open XML file you need to: 
1)       Open the package as a ZIP archive; any standard ZIP 
library will do. 
2)       Find parts that contain data you want to read.  You 
can navigate through relationship graph (more complex) or you can presume that 
certain parts have a defined name and path (Microsoft can change that in the 
future). 
3)       Read parts you are interested in using standard XML 
library (if they are XML) or some other method (if they are images, sounds or of 
some other type). 
On the other hand, if you want to create a new Open XML file, 
you need to: 
1)       Create/get all necessary parts by using some 
standard XML library (if they are XML), by copying them or by using some other 
method. 
2)       Create all relationships by creating “.rels” 
files. 
3)       Create content types by creating a 
“[Content_Types].xml” file. 
4)       Package everything into a ZIP file with an 
appropriate extension (DOCX, XLSX or PPTX), any standard ZIP library will 
do. 
The whole story about packages, parts, content types and 
relations is the same for all Open XML documents (regardless of their 
originating application) and Microsoft refers to it as Open Packaging 
Conventions. 
 | |
| Excel 2007 Open XML specifics | |
[ Back To Top ] 
Excel 2007 extends on the basis of Open Packaging Conventions 
by adding its own application-specific XML types.  Reference schemas for all XML 
files used in Office can be downloaded from MSDN, but some things are still open 
to change until the final Excel 2007 release. 
We just want to write/read worksheet data, so we need to look 
in the folder “\xl\worksheets” inside XLSX file where all the worksheets are 
located.  For every worksheet there is a separate XML file, “sheet1.xml,” 
 “sheet2.xml” and so on.  When you open such a file you will notice that all of 
the sheet data is inside the <sheetData> element.  For every row there is 
a <row> element and for every cell there is a <c> element.  Finally, 
the value of the cell is stored in a <v> element. 
However, real world XML is never as simple as schoolbook XML. 
 You will notice that numbers get encoded as numbers inside <v> 
element. 
Listing 1 
<c r="A1">
<v>100</v> 
</c>
However, string value (like “John”) also gets encoded as a 
number.  
Listing 2   
<c r="B1" t="s">
<v>0</v> 
</c>
That is because MS Excel uses internal table of unique 
strings (for performance reasons).  Zero is an index of that string in an 
internal table of strings and attribute t="s" tells us that the underlying type 
is a string, not a number.  So where is the table of unique strings located?  It 
is in “\xl\sharedStrings.xml” XML file and contains all strings used in the 
entire workbook, not just a specific worksheet. 
This approach is used for many other things: cell styles, 
borders, charts, number formats, etc.  In fact, that becomes the major 
programming problem when working with XLSX files- updating and maintaining 
various tables of some unique Excel objects.  In this article we will just 
write/read data values, but if you require some complex formatting you should 
probably be using some commercial component which does all the tedious work for 
you. 
 | |
| Implementation | |
[ Back To Top ] 
Our demo is a Web Forms application (see Listing 2) written 
in C# using Visual Studio 2005. Since there is no support for ZIP files in .NET 
Framework 2.0 (only for ZIP algorithm), our demo is using an open-source ZIP 
library called SharpZipLib (available at: 
http://www.icsharpcode.net/OpenSource/SharpZipLib).  For demonstration purposes 
we will extract entire ZIP files to TEMP folder, so we can examine contents of 
that folder and files while debugging demo application.  In a real world 
application you may want to avoid extracting to a temporary folder and just read 
to/write from the ZIP file directly. 
For XML processing the choice is simple.  For reading XML 
files we use XmlTextReader class and for writing we use XmlTextWriter class. 
 Both come with .NET Framework, but you can also use any other XML processing 
library. 
Figure 2  
Data reading 
We want to read a simple “In.xlsx” file (in the “Input” 
folder) and copy its contents to the DataTable.  That file contains a list of 
people with their first and last names (text values) and their IDs (number 
values).  When “Read input .xlsx file” button in clicked, the following code is 
executed: 
Listing 3 
protected void buttonReadInput_Click(object sender, EventArgs e)
{
  // Input file name.
  string fileName = Request.PhysicalApplicationPath + @"..\Input\In.xlsx";
 
  // Delete contents of the temporary directory.
  ExcelRW.DeleteDirectoryContents(tempDir);
 
  // Unzip input XLSX file to the temporary directory.
  ExcelRW.UnzipFile(fileName, tempDir);
 
  // Open XML file with table of all unique strings used in the workbook..
  FileStream fs = new FileStream(tempDir + @"\xl\sharedStrings.xml",
    FileMode.Open, FileAccess.Read);
  /* ..and call helper method that parses that XML and returns an array of
    Strings*/
  ArrayList stringTable = ExcelRW.ReadStringTable(fs);
 
  // Get DataTable with people from session variable.
  DataTable data = (DataTable)Session["people"];
 
  // Open XML file with worksheet data..
  fs = new FileStream(tempDir + @"\xl\worksheets\sheet1.xml", FileMode.Open,
    FileAccess.Read);
  /* ..and call helper method that parses that XML and fills DataTable with
     values. */
  ExcelRW.ReadWorksheet(fs, stringTable, data);
 
  // Update GridView.
  this.SetDataBinding();
}
Nothing unusual happens here.  XLSX file is unzipped to the 
TEMP folder and then necessary XML parts (now files) are processed.  File 
“sharedStrings.xml” contains a global table of unique strings while file 
“sheet1.xml” contains data for the first sheet.  Helper methods are pretty 
straightforward XML reading code; you can download demo application code to 
examine them in more detail. 
If everything is OK, after you click the button all data will 
show up in the GridView. 
Data writing 
Now we want to write data from a DataTable to the “Out.xlsx” 
file in the “Output” folder.  When “Write output .xlsx file” button is clicked, 
the following code is executed: 
Listing 4 
protected void buttonWriteOutput_Click(object sender, EventArgs e)
{
  // Output file name.
  string fileName = Request.PhysicalApplicationPath + @"..\Output\Out.xlsx";
 
  // Delete contents of the temporary directory.
  ExcelRW.DeleteDirectoryContents(tempDir);
 
  // Unzip template XLSX file to the temporary directory.
  ExcelRW.UnzipFile(templateFile, tempDir);
 
  // We will need two string tables; a lookup Hashtable for fast searching and
  // an ordinary ArrayList where items are sorted by their index.
  Hashtable lookupTable;
 
  DataTable data = (DataTable)Session["people"];
 
  // Call helper methods which creates both tables from input data.
  ArrayList stringTable = ExcelRW.CreateStringTables(data, out lookupTable);
 
  // Create XML file..
  FileStream fs = new FileStream(tempDir + @"\xl\sharedStrings.xml",
    FileMode.Create);
  // ..and fill it with unique strings used in the workbook
  ExcelRW.WriteStringTable(fs, stringTable);
 
  // Create XML file..
  fs = new FileStream(tempDir + @"\xl\worksheets\sheet1.xml", FileMode.Create);
  // ..and fill it with rows and columns of the DataTable.
  ExcelRW.WriteWorksheet(fs, data, lookupTable);
 
  // ZIP temporary directory to the XLSX file.
  ExcelRW.ZipDirectory(tempDir, fileName);
 
  if (this.CheckBoxDownload.Checked)
  {
    // Stream XLS to browser.
    Response.Clear();
    Response.ContentType = "application/vnd.ms-excel";
    Response.AddHeader("Content-Disposition", "attachment; filename=Out.xlsx");
    Response.WriteFile(fileName);
    Response.End();
  }
}
This time code is a bit more complicated.  In order not to 
generate all necessary parts needed for XLSX file, we decide to use a template 
file.  We extract template file to the temporary folder and then just change XML 
parts containing shared string table and worksheet data.  All other parts, 
relationships and content types stay the same.  We do not need to generate any 
of that.  Note that we use two string tables: a lookup Hashtable for fast 
searching and an ordinary ArrayList where items are sorted by their index.  We 
could pull it out only with ArrayList, but then we would need to search entire 
ArrayList every time we add a new string (to check if it is already there). 
 CreateStringTables() helper method builds both string tables, the 
WriteStringTable() helper method writes string table XML and WriteWorksheet() 
helper method writes worksheet data XML.  
Again, download demo application code to examine helper 
methods in more detail. 
 | |
As always in programming, there is more than one method to 
achieve the same thing. 
You could use Office automation to start an instance of Excel 
2007 (or any other Office application) and then use interop calls to create a 
document and save it.  However, using automation has some drawbacks I have 
already written about (see at: http://www.gemboxsoftware.com/GBSpreadsheet.htm#Automation). 
The next version of .NET Framework (codename WinFX) will have 
support for Open Packaging Conventions (package handling and navigating the 
relationships), but it seems there will be no support for accessing application 
specific data so you will still need to process XML parts manually. 
As another option, you could use some third party component 
which will come with support for Open XML format.  This will probably cost you 
some money, but has the advantage that usually more than one format (for 
example; XLS, XLSX, CSV) are supported within the same API, so your application 
will be able to target different file formats using the same 
code. 
 | |
| Summary | |
[ Back To Top ] 
New Open XML formats are a big step forward compared with 
from old, proprietary, closed formats (DOC, XLS, PPT).  They are a little bit 
more complicated to generate than old Office XML formats, but they are much more 
powerful and flexible.  They are here to stay, so you should invest your time to 
learn how to work with them. 
 | 
Comments
Post a Comment