A Python module for extracting data from MS Excel™ spreadsheet files.
Backporting to Python 2.1 was partially funded by Journyx - provider of timesheet and project accounting solutions.
This module presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode. Earlier spreadsheets have a "codepage" number indicating the local representation; this is used to derive an "encoding" which is used to translate to Unicode.
In reality, there are no such things as dates in Excel spreadsheets. What you have are floating point numbers and pious hope. There are several problems with Excel dates:
(1) Dates are not stored as a separate data type; they are stored as floating point numbers and you have to rely on (a) the "number format" applied to them in Excel and/or (b) knowing which cells are supposed to have dates in them. This module helps with (a) by inspecting the format that has been applied to each number cell; if it appears to be a date format, the cell is classified as a date rather than a number. Feedback on this feature, especially from non-English-speaking locales, would be appreciated.
(2) Excel for Windows stores dates by default as the number of days (or fraction thereof) since 1899-12-31T00:00:00. Excel for Macintosh uses a default start date of 1904-01-01T00:00:00. The date system can be changed in Excel on a per-workbook basis (for example: Tools -> Options -> Calculation, tick the "1904 date system" box). This is of course a bad idea if there are already dates in the workbook. There is no good reason to change it even if there are no dates in the workbook. Which date system is in use is recorded in the workbook. A workbook transported from Windows to Macintosh (or vice versa) will work correctly with the host Excel. When using this module's xldate_as_tuple function to convert numbers from a workbook, you must use the datemode attribute of the Book object. If you guess, or make a judgement depending on where you believe the workbook was created, you run the risk of being 1462 days out of kilter.
Reference: http://support.microsoft.com/default.aspx?scid=KB;EN-US;q180162
(3) The Windows-default 1900-based date system works on the incorrect premise that 1900 was a leap year. It interprets the number 60 as meaning 1900-02-29, which is not a valid date. Consequently any number less than 61 is ambiguous. Example: is 59 the result of 1900-02-28 entered directly, or is it 1900-03-01 minus 2 days?
Reference: http://support.microsoft.com/default.aspx?scid=kb;en-us;214326
(4) The Macintosh-default 1904-based date system counts 1904-01-02 as day 1 and 1904-01-01 as day zero. Thus any number such that (0.0 <= number < 1.0) is ambiguous. Is 0.625 a time of day (15:00:00), independent of the calendar, or should it be interpreted as an instant on a particular day (1904-01-01T15:00:00)? The xldate_* functions in this module take the view that such a number is a calendar-independent time of day (like Python's datetime.time type) for both date systems. This is consistent with more recent Microsoft documentation (for example, the help file for Excel 2002 which says that the first day in the 1904 date system is 1904-01-02).
(5) Usage of the Excel DATE() function may leave strange dates in a spreadsheet. Quoting the help file, in respect of the 1900 date system: "If year is between 0 (zero) and 1899 (inclusive), Excel adds that value to 1900 to calculate the year. For example, DATE(108,1,2) returns January 2, 2008 (1900+108)." This gimmick, semi-defensible only for arguments up to 99 and only in the pre-Y2K-awareness era, means that DATE(1899, 12, 31) is interpreted as 3799-12-31.
For further information, please refer to the documentation for the xldate_* functions.
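The arithmetic behind points (2) to (4) can be sketched in plain Python. This is only an illustration, not this module's xldate_as_tuple: the helper name excel_number_to_datetime and the EPOCHS table are assumptions made for the example, and the sketch simply refuses the ambiguous ranges described above rather than guessing.

```python
from datetime import datetime, timedelta

# Hypothetical helper for illustration; it is not this module's
# xldate_as_tuple. datemode follows the Book attribute: 0 = 1900, 1 = 1904.
EPOCHS = {
    # Because Excel counts a phantom 1900-02-29, numbers >= 61 in the
    # 1900 system line up with an offset from 1899-12-30.
    0: datetime(1899, 12, 30),
    # Day zero of the 1904 system.
    1: datetime(1904, 1, 1),
}

def excel_number_to_datetime(xldate, datemode):
    if datemode not in EPOCHS:
        raise ValueError("datemode must be 0 or 1")
    if datemode == 0 and xldate < 61:
        # Point (3): anything below 61 is ambiguous in the 1900 system.
        raise ValueError("ambiguous in the 1900 system")
    if datemode == 1 and xldate < 1.0:
        # Point (4): below day 1 the number is read as a time of day.
        raise ValueError("time of day, not a calendar date")
    return EPOCHS[datemode] + timedelta(days=xldate)

# The same stored number read under each date system.
windows = excel_number_to_datetime(40000.625, 0)
mac = excel_number_to_datetime(40000.625, 1)
print(windows, mac, (mac - windows).days)
```

The two results differ by 1462 days, which is why the date system should be taken from the workbook's datemode rather than guessed.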
Contents of a "workbook".
For more information about this class, see The Book Class.
For debugging: dump the file's BIFF records in character and hexadecimal form.
Open a spreadsheet file for data extraction.
This dictionary can be used to produce a text version of the internal codes that Excel uses for error cells. Here are its contents:
0x00: '#NULL!', # Intersection of two cell ranges is empty
0x07: '#DIV/0!', # Division by zero
0x0F: '#VALUE!', # Wrong type of operand
0x17: '#REF!', # Illegal or deleted cell reference
0x1D: '#NAME?', # Wrong function or range name
0x24: '#NUM!', # Value range overflow
0x2A: '#N/A!', # Argument or function not available
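As a rough sketch of how such a dictionary might be used, the example below works on a local copy of the contents listed above so that it runs without the module. In the module the dictionary is the supplied error_text_from_code; the names ERROR_TEXT and describe_error, and the fallback text for unlisted codes, are assumptions for illustration.

```python
# Local copy of the contents listed above, for illustration only.
ERROR_TEXT = {
    0x00: '#NULL!',   # Intersection of two cell ranges is empty
    0x07: '#DIV/0!',  # Division by zero
    0x0F: '#VALUE!',  # Wrong type of operand
    0x17: '#REF!',    # Illegal or deleted cell reference
    0x1D: '#NAME?',   # Wrong function or range name
    0x24: '#NUM!',    # Value range overflow
    0x2A: '#N/A!',    # Argument or function not available
}

def describe_error(code):
    """Hypothetical helper: text for an error code, with a hex fallback."""
    return ERROR_TEXT.get(code, 'unknown error code 0x%02X' % code)

print(describe_error(0x07))
print(describe_error(0x99))
```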
Contains the data for one cell.
For more information about this class, see The Cell Class.
There is one and only one instance of an empty cell -- it's a singleton. This is it. You may use a test like "acell is empty_cell".
Contains the data for one worksheet.
For more information about this class, see The Sheet Class.
Convert an Excel number (presumed to represent a date, a datetime or a time) into a tuple suitable for feeding to datetime or mx.DateTime constructors.
Convert a date tuple (year, month, day) to an Excel date.
Convert a datetime tuple (year, month, day, hour, minute, second) to an Excel date value. For more details, refer to other xldate_from_*_tuple functions.
Convert a time tuple (hour, minute, second) to an Excel "date" value (fraction of a day).
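The "fraction of a day" idea can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the module's own conversion code: both function names are hypothetical, and the inverse rounds to the nearest whole second.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def time_tuple_to_day_fraction(hour, minute, second):
    """Hypothetical helper: (hour, minute, second) -> fraction of a day."""
    if not (0 <= hour < 24 and 0 <= minute < 60 and 0 <= second < 60):
        raise ValueError("time tuple out of range")
    return (hour * 3600 + minute * 60 + second) / float(SECONDS_PER_DAY)

def day_fraction_to_time_tuple(fraction):
    """Hypothetical inverse, rounded to the nearest second."""
    if not (0.0 <= fraction < 1.0):
        raise ValueError("fraction must satisfy 0.0 <= fraction < 1.0")
    # A fraction within half a second of 1.0 rounds up to 24:00:00 here;
    # a real implementation would have to decide how to treat that case.
    seconds = int(round(fraction * SECONDS_PER_DAY))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return (hours, minutes, secs)

print(time_tuple_to_day_fraction(15, 0, 0))
print(day_fraction_to_time_tuple(0.625))
```

The value 0.625 used here is the same 15:00:00 example discussed in the notes on Excel dates.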
Contents of a "workbook".
WARNING: You don't call this class yourself. You use the Book object that was returned when you called xlrd.open_workbook("myfile.xls").
Version of BIFF (Binary Interchange File Format) used to create the file. Latest is 8.0 (represented here as 80), introduced with Excel 97. Earliest supported by this module: 3.0 (represented as 30).
An integer denoting the character set used for strings in this file. For BIFF 8 and later, this will be 1200, meaning Unicode; more precisely, UTF_16_LE. For earlier versions, this is used to derive the appropriate Python encoding to be used to convert to Unicode. Examples: 1252 -> 'cp1252', 10000 -> 'mac_roman'
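A minimal sketch of the codepage-to-encoding idea follows. Only the pairs 1252 -> 'cp1252' and 10000 -> 'mac_roman' and the meaning of 1200 come from the description above; the function name encoding_from_codepage and the general 'cpNNNN' fallback are assumptions for illustration and may not match what the module does for other codepages.

```python
import codecs

def encoding_from_codepage(codepage):
    """Hypothetical lookup from an Excel codepage number to a Python codec name."""
    special = {
        1200: 'utf_16_le',   # Unicode, as used by BIFF 8
        10000: 'mac_roman',  # pair quoted in the text above
    }
    if codepage in special:
        name = special[codepage]
    else:
        # Assumption: other codepages map onto Python's cpNNNN codecs.
        name = 'cp%d' % codepage
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name

# The same word stored under three different codepages.
samples = [
    (1252, b'caf\xe9'),
    (10000, b'caf\x8e'),
    (1200, b'c\x00a\x00f\x00\xe9\x00'),
]
for codepage, raw in samples:
    print(codepage, raw.decode(encoding_from_codepage(codepage)))
```

All three byte strings decode to the same text, which shows why the codepage has to be known before pre-Unicode strings can be translated reliably.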
A tuple containing the (telephone system) country code for:
[0]: the user-interface setting when the file was created.
[1]: the regional settings.
Example: (1, 61) meaning (USA, Australia).
This information may give a clue to the correct encoding for an unknown codepage.
For a long list of observed values, refer to the OpenOffice.org documentation for the COUNTRY record.
Which date system was in force when this file was last saved.
0 => 1900 system (the Excel for Windows default).
1 => 1904 system (the Excel for Macintosh default).
The encoding that was derived from the codepage.
Time in seconds to extract the XLS image as a contiguous string (or mmap equivalent).
Time in seconds to parse the data from the contiguous string (or mmap equivalent).
The number of worksheets in the workbook.
What (if anything) is recorded as the name of the last user to save the file.
Contains the data for one cell.
WARNING: You don't call this class yourself. You access Cell objects via methods of the Sheet object(s) that you found in the Book object that was returned when you called xlrd.open_workbook("myfile.xls").
Cell objects have two attributes: ctype, which is an int, and value, whose type depends on ctype. The following table describes the types of cells and how their values are represented in Python.
Type symbol | Type number | Python value |
---|---|---|
XL_CELL_EMPTY | 0 | empty string u'' |
XL_CELL_TEXT | 1 | a Unicode string |
XL_CELL_NUMBER | 2 | float |
XL_CELL_DATE | 3 | float |
XL_CELL_BOOLEAN | 4 | int; 1 means TRUE, 0 means FALSE |
XL_CELL_ERROR | 5 | int representing internal Excel codes; for a text representation, refer to the supplied dictionary error_text_from_code |
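The table can be read as a dispatch on ctype. The sketch below is one possible way to turn a (ctype, value) pair into a friendlier Python value; it defines the type numbers locally from the table so that it runs on its own, and the function name to_python and its choices (None for empty cells, bool for booleans, a placeholder string for errors) are assumptions, not behaviour of this module.

```python
# Type numbers copied from the table above.
(XL_CELL_EMPTY, XL_CELL_TEXT, XL_CELL_NUMBER,
 XL_CELL_DATE, XL_CELL_BOOLEAN, XL_CELL_ERROR) = range(6)

def to_python(ctype, value):
    """Hypothetical helper: interpret a cell's value according to its ctype."""
    if ctype == XL_CELL_EMPTY:
        return None
    if ctype == XL_CELL_BOOLEAN:
        return bool(value)        # 1 means TRUE, 0 means FALSE
    if ctype == XL_CELL_ERROR:
        # A real caller would look the code up in error_text_from_code.
        return 'error code 0x%02X' % value
    if ctype in (XL_CELL_TEXT, XL_CELL_NUMBER, XL_CELL_DATE):
        # Dates stay as floats here; converting them needs the
        # workbook's datemode.
        return value
    raise ValueError('unknown cell type %r' % (ctype,))

cells = [(0, ''), (1, 'abc'), (2, 1.5), (3, 40000.625), (4, 1), (5, 0x07)]
print([to_python(ctype, value) for ctype, value in cells])
```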
Contains the data for one worksheet.
In the cell access functions, "rowx" is a row index, counting from zero, and "colx" is a column index, counting from zero. Negative values for row/column indexes and slice positions are supported in the expected fashion.
For information about cell types and cell values, refer to the documentation of the Cell class.
WARNING: You don't call this class yourself. You access Sheet objects via the Book object that was returned when you called xlrd.open_workbook("myfile.xls").
Type of the cell in the given row and column. Refer to the documentation of the Cell class.
Value of the cell in the given row and column.
Returns a sequence of the Cell objects in the given column.
Returns a slice of the Cell objects in the given column.
Name of sheet.
Number of columns in sheet. A column index is in range(thesheet.ncols).
Number of rows in sheet. A row index is in range(thesheet.nrows).
Returns a sequence of the Cell objects in the given row.
Returns a slice of the Cell objects in the given row.
Returns a sequence of the types of the cells in the given row.
Returns a sequence of the values of the cells in the given row.
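As an illustration of how the row and column counts bound the valid indexes, the sketch below walks every cell of a sheet-like object. It assumes the value accessor is called cell_value(rowx, colx), a name that does not appear in the descriptions above, and it uses a stand-in class, FakeSheet, so that the example runs without a spreadsheet file; neither FakeSheet nor sheet_to_rows is part of this module.

```python
class FakeSheet(object):
    """Stand-in for a Sheet object, backed by a list of equal-length rows."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows
        self.nrows = len(rows)
        self.ncols = len(rows[0]) if rows else 0

    def cell_value(self, rowx, colx):
        # List indexing gives negative indexes the usual Python meaning.
        return self._rows[rowx][colx]

def sheet_to_rows(sheet):
    """Hypothetical helper: collect all cell values as a list of row lists."""
    return [
        [sheet.cell_value(rowx, colx) for colx in range(sheet.ncols)]
        for rowx in range(sheet.nrows)
    ]

sheet = FakeSheet('Sheet1', [['id', 'qty'], [1.0, 3.0], [2.0, 5.0]])
print(sheet.name, sheet.nrows, sheet.ncols)
print(sheet_to_rows(sheet))
print(sheet.cell_value(-1, -1))  # last row, last column
```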