The xlrd Module

A Python module for extracting data from MS Excel™ spreadsheet files.

General information

Acknowledgements

Backporting to Python 2.1 was partially funded by Journyx - provider of timesheet and project accounting solutions.

Unicode

This module presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode. Earlier spreadsheets have a "codepage" number indicating the local representation; this is used to derive an "encoding" which is used to translate to Unicode.

Dates in Excel spreadsheets

In reality, there are no such things. What you have are floating point numbers and pious hope. There are several problems with Excel dates:

(1) Dates are not stored as a separate data type; they are stored as floating point numbers and you have to rely on (a) the "number format" applied to them in Excel and/or (b) knowing which cells are supposed to have dates in them. This module helps with (a) by inspecting the format that has been applied to each number cell; if it appears to be a date format, the cell is classified as a date rather than a number. Feedback on this feature, especially from non-English-speaking locales, would be appreciated.

(2) Excel for Windows stores dates by default as the number of days (or fraction thereof) since 1899-12-31T00:00:00. Excel for Macintosh uses a default start date of 1904-01-01T00:00:00. The date system can be changed in Excel on a per-workbook basis (for example: Tools -> Options -> Calculation, tick the "1904 date system" box). This is of course a bad idea if there are already dates in the workbook. There is no good reason to change it even if there are no dates in the workbook. Which date system is in use is recorded in the workbook. A workbook transported from Windows to Macintosh (or vice versa) will work correctly with the host Excel. When using this module's xldate_as_tuple function to convert numbers from a workbook, you must use the datemode attribute of the Book object. If you guess, or make a judgement depending on where you believe the workbook was created, you run the risk of being 1462 days out of kilter.

Reference: http://support.microsoft.com/default.aspx?scid=KB;EN-US;q180162

(3) The Windows-default 1900-based date system works on the incorrect premise that 1900 was a leap year. It interprets the number 60 as meaning 1900-02-29, which is not a valid date. Consequently any number less than 61 is ambiguous. Example: is 59 the result of 1900-02-28 entered directly, or is it 1900-03-01 minus 2 days?

Reference: http://support.microsoft.com/default.aspx?scid=kb;en-us;214326

(4) The Macintosh-default 1904-based date system counts 1904-01-02 as day 1 and 1904-01-01 as day zero. Thus any number such that (0.0 <= number < 1.0) is ambiguous. Is 0.625 a time of day (15:00:00), independent of the calendar, or should it be interpreted as an instant on a particular day (1904-01-01T15:00:00)? The xldate_* functions in this module take the view that such a number is a calendar-independent time of day (like Python's datetime.time type) for both date systems. This is consistent with more recent Microsoft documentation (for example, the help file for Excel 2002 which says that the first day in the 1904 date system is 1904-01-02).

(5) Usage of the Excel DATE() function may leave strange dates in a spreadsheet. Quoting the help file, in respect of the 1900 date system: "If year is between 0 (zero) and 1899 (inclusive), Excel adds that value to 1900 to calculate the year. For example, DATE(108,1,2) returns January 2, 2008 (1900+108)." This gimmick, semi-defensible only for arguments up to 99 and only in the pre-Y2K-awareness era, means that DATE(1899, 12, 31) is interpreted as 3799-12-31.
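
Problems (2) and (3) can be checked with plain calendar arithmetic. The following is a minimal sketch that uses only Python's datetime module, not xlrd; the variable names are illustrative and the "effective day 0" values are inferred from the serial numbers quoted above.

```python
from datetime import date, timedelta

# Problem (2): real calendar days between the two default start dates.
real_days = (date(1904, 1, 1) - date(1899, 12, 31)).days
# The 1900 system also counts the nonexistent 1900-02-29, hence one more.
serial_offset = real_days + 1

# Problem (3): two naive readings of serial number 59 in the 1900 system.
# Counting forward from day 1 = 1900-01-01 (so day 0 = 1899-12-31):
as_typed = date(1899, 12, 31) + timedelta(days=59)
# Counting back from 61 = 1900-03-01 (so effective day 0 = 1899-12-30):
as_computed = date(1899, 12, 30) + timedelta(days=59)

print(real_days, serial_offset)
print(as_typed, as_computed)
```

This prints 1461 1462 on the first line and 1900-02-28 1900-02-27 on the second: the two date systems differ by 1462 serial numbers, and the same number 59 maps to two different calendar days depending on how it was produced.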

For further information, please refer to the documentation for the xldate_* functions.
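
Problems (4) and (5) also reduce to short formulas. The sketch below is illustrative only: fraction_to_time and excel_date_year are hypothetical helper names, not part of this module, and excel_date_year covers only the rule quoted from the help file.

```python
from datetime import time

def fraction_to_time(fraction):
    """Read a day fraction (0.0 <= fraction < 1.0) as a time of day."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("expected 0.0 <= fraction < 1.0")
    seconds = int(round(fraction * 86400))
    # A fraction just under 1.0 could round up to a full day; clamp it.
    seconds = min(seconds, 86399)
    hour, remainder = divmod(seconds, 3600)
    minute, second = divmod(remainder, 60)
    return time(hour, minute, second)

def excel_date_year(year):
    """Year that DATE() uses in the 1900 system, per the quoted rule."""
    if 0 <= year <= 1899:
        return 1900 + year
    return year

print(fraction_to_time(0.625))
print(excel_date_year(108), excel_date_year(1899))
```

The first line printed is 15:00:00 and the second is 2008 3799, matching the examples in the text.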

Module Contents

Book(filename=None, file_contents=None, logfile=sys.stdout, verbosity=0, pickleable=True, use_mmap=USE_MMAP) (class)

Contents of a "workbook".

For more information about this class, see The Book Class.

dump(filename, outfile=sys.stdout)

For debugging: dump the file's BIFF records in character and hex form.

filename
The path to the file to be dumped.
outfile
An open file, to which the dump is written.

open_workbook(filename=None, logfile=sys.stdout, verbosity=0, pickleable=True, use_mmap=USE_MMAP, file_contents=None)

Open a spreadsheet file for data extraction.

filename
The path to the spreadsheet file to be opened.
logfile
An open file to which messages and diagnostics are written.
verbosity
Increases the volume of trace material written to the logfile.
pickleable
Default = True. Setting to False *may* cause use of array.array objects which save some memory but can't be pickled in Python 2.4 or earlier.
use_mmap
Whether to use the mmap module is determined heuristically. Use this arg to override the result. Current heuristic: mmap is used if it exists.
file_contents
The contents of the file, as a string or an mmap.mmap object or some other behave-alike object. If file_contents is supplied, filename will not be used, except (possibly) in messages.
Returns:
An instance of the Book class.
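
A typical call passes only the filename and then reads attributes of the returned Book. The following is a minimal sketch; workbook_overview is a hypothetical helper name, and the import is placed inside the function only so that the snippet can be loaded where xlrd is not installed.

```python
def workbook_overview(path):
    """Open the workbook at `path` and return a few Book-level facts."""
    import xlrd  # third-party module documented here

    book = xlrd.open_workbook(path)
    return {
        "nsheets": book.nsheets,
        "sheet_names": book.sheet_names(),
        "datemode": book.datemode,
    }
```

For example, workbook_overview("myfile.xls") returns a dictionary with the sheet count, the sheet names, and the date system in force.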

error_text_from_code (variable)

This dictionary can be used to produce a text version of the internal codes that Excel uses for error cells. Here are its contents:

0x00: '#NULL!',  # Intersection of two cell ranges is empty
0x07: '#DIV/0!', # Division by zero
0x0F: '#VALUE!', # Wrong type of operand
0x17: '#REF!',   # Illegal or deleted cell reference
0x1D: '#NAME?',  # Wrong function or range name
0x24: '#NUM!',   # Value range overflow
0x2A: '#N/A!',   # Argument or function not available

Cell(ctype, value) (class)

Contains the data for one cell.

For more information about this class, see The Cell Class.

empty_cell (variable)

There is one and only one instance of an empty cell -- it's a singleton. This is it. You may use a test like "acell is empty_cell".

Sheet(biff_version, position, logfile, pickleable=False, name='', number=0, verbosity=0) (class)

Contains the data for one worksheet.

For more information about this class, see The Sheet Class.

xldate_as_tuple(xldate, datemode)

Convert an Excel number (presumed to represent a date, a datetime or a time) into a tuple suitable for feeding to datetime or mx.DateTime constructors.

xldate
The Excel number
datemode
0: 1900-based, 1: 1904-based.
WARNING: when using this function to interpret the contents of a workbook, you should pass in the Book.datemode attribute of that workbook. Whether the workbook has ever been anywhere near a Macintosh is irrelevant.
Returns:
Gregorian (year, month, day, hour, minute, nearest_second).
Special case: if 0.0 <= xldate < 1.0, it is assumed to represent a time; (0, 0, 0, hour, minute, second) will be returned.
Note: 1904-01-01 is not regarded as a valid date in the datemode 1 system; its "serial number" is zero.
Raises XLDateNegative:
xldate < 0.00
Raises XLDateAmbiguous:
The 1900 leap-year problem (datemode == 0 and 1.0 <= xldate < 61.0)
Raises XLDateTooLarge:
Gregorian year 10000 or later
Raises XLDateBadDatemode:
datemode arg is neither 0 nor 1
Raises XLDateError:
Covers the 4 specific errors
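
The returned tuple cannot always be passed straight to a datetime constructor, because the time-only special case has a zero year, month and day. The sketch below shows one way to handle both shapes; tuple_to_python is a hypothetical helper name, not part of this module.

```python
from datetime import datetime, time

def tuple_to_python(date_tuple):
    """Turn an xldate_as_tuple-style 6-tuple into a datetime or a time."""
    year, month, day, hour, minute, second = date_tuple
    if (year, month, day) == (0, 0, 0):
        # Special case described above: a calendar-independent time of day.
        return time(hour, minute, second)
    return datetime(year, month, day, hour, minute, second)
```

With a real workbook the argument would come from xldate_as_tuple(cell_value, book.datemode), where book is the object returned by open_workbook.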

xldate_from_date_tuple((year, month, day), datemode)

Convert a date tuple (year, month, day) to an Excel date.

year
Gregorian year.
month
1 <= month <= 12
day
1 <= day <= last day of that (year, month)
datemode
0: 1900-based, 1: 1904-based.
Raises XLDateAmbiguous:
The 1900 leap-year problem (datemode == 0 and 1.0 <= xldate < 61.0)
Raises XLDateBadDatemode:
datemode arg is neither 0 nor 1
Raises XLDateBadTuple:
(year, month, day) is too early/late or has invalid component(s)
Raises XLDateError:
Covers the specific errors
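
The arithmetic behind this conversion can be sketched with Python's datetime module. This is an illustration of the rules stated in this document, not the module's actual implementation: sketch_serial_from_date is a hypothetical name, and it raises plain ValueError rather than the XLDate* exceptions.

```python
from datetime import date

# Effective day 0 for each datemode. For datemode 0 this holds only from
# 1900-03-01 onwards, because of the 1900 leap-year problem.
_EPOCH_ORDINAL = {
    0: date(1899, 12, 30).toordinal(),
    1: date(1904, 1, 1).toordinal(),
}

def sketch_serial_from_date(year, month, day, datemode):
    """Return the Excel serial number for a calendar date."""
    if datemode not in _EPOCH_ORDINAL:
        raise ValueError("datemode must be 0 or 1")
    # date() itself rejects invalid (year, month, day) combinations.
    serial = date(year, month, day).toordinal() - _EPOCH_ORDINAL[datemode]
    if datemode == 0 and serial < 61:
        raise ValueError("dates before 1900-03-01 are ambiguous or too early")
    if serial < 1:
        raise ValueError("date is too early for the 1904 system")
    return float(serial)
```

Under these assumptions, 2000-01-01 maps to 36526.0 in the 1900 system and to 35064.0 in the 1904 system, a difference of 1462.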

xldate_from_datetime_tuple(datetime_tuple, datemode)

Convert a datetime tuple (year, month, day, hour, minute, second) to an Excel date value. For more details, refer to other xldate_from_*_tuple functions.

datetime_tuple
(year, month, day, hour, minute, second)
datemode
0: 1900-based, 1: 1904-based.

xldate_from_time_tuple((hour, minute, second))

Convert a time tuple (hour, minute, second) to an Excel "date" value (fraction of a day).

hour
0 <= hour < 24
minute
0 <= minute < 60
second
0 <= second < 60
Raises XLDateBadTuple:
Out-of-range hour, minute, or second
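
The fraction of a day is simply the number of elapsed seconds divided by 86400. A minimal sketch of that formula follows; sketch_fraction_from_time is a hypothetical name, and it raises ValueError where the real function raises XLDateBadTuple.

```python
def sketch_fraction_from_time(hour, minute, second):
    """Return the fraction of a day represented by a time of day."""
    if not (0 <= hour < 24 and 0 <= minute < 60 and 0 <= second < 60):
        raise ValueError("hour, minute, or second out of range")
    return (hour * 3600 + minute * 60 + second) / 86400.0
```

For instance, (15, 0, 0) gives 0.625 and (6, 0, 0) gives 0.25.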

The Book Class

Book(filename=None, file_contents=None, logfile=sys.stdout, verbosity=0, pickleable=True, use_mmap=USE_MMAP) (class)

Contents of a "workbook".

WARNING: You don't call this class yourself. You use the Book object that was returned when you called xlrd.open_workbook("myfile.xls").

biff_version

Version of BIFF (Binary Interchange File Format) used to create the file. Latest is 8.0 (represented here as 80), introduced with Excel 97. Earliest supported by this module: 3.0 (represented as 30).

codepage

An integer denoting the character set used for strings in this file. For BIFF 8 and later, this will be 1200, meaning Unicode; more precisely, UTF_16_LE. For earlier versions, this is used to derive the appropriate Python encoding to be used to convert to Unicode. Examples: 1252 -> 'cp1252', 10000 -> 'mac_roman'
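
To illustrate what the derived encoding is used for, the sketch below decodes the same word from three byte representations. The lookup table contains only the codepages mentioned above; CODEPAGE_TO_ENCODING and decode_cell_bytes are hypothetical names, and the module's real codepage handling covers many more values.

```python
# Codepage-to-encoding pairs taken from the examples in the text.
CODEPAGE_TO_ENCODING = {
    1200: "utf_16_le",
    1252: "cp1252",
    10000: "mac_roman",
}

def decode_cell_bytes(raw, codepage):
    """Decode raw string bytes using the encoding implied by `codepage`."""
    encoding = CODEPAGE_TO_ENCODING[codepage]
    return raw.decode(encoding)

# The same text, "caf" followed by e-acute, stored three different ways.
from_windows = decode_cell_bytes(b"caf\xe9", 1252)
from_mac = decode_cell_bytes(b"caf\x8e", 10000)
from_biff8 = decode_cell_bytes(b"c\x00a\x00f\x00\xe9\x00", 1200)
```

All three results compare equal as Unicode strings, which is why callers of this module never need to deal with the codepage themselves.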

countries

A tuple containing the (telephone system) country code for:
[0]: the user-interface setting when the file was created.
[1]: the regional settings.
Example: (1, 61) meaning (USA, Australia). This information may give a clue to the correct encoding for an unknown codepage. For a long list of observed values, refer to the OpenOffice.org documentation for the COUNTRY record.

datemode

Which date system was in force when this file was last saved.
0 => 1900 system (the Excel for Windows default).
1 => 1904 system (the Excel for Macintosh default).

encoding

The encoding that was derived from the codepage.

load_time_stage_1

Time in seconds to extract the XLS image as a contiguous string (or mmap equivalent).

load_time_stage_2

Time in seconds to parse the data from the contiguous string (or mmap equivalent).

nsheets

The number of worksheets in the workbook.

sheet_by_index(sheetx)
sheetx
Sheet index in range(nsheets)
Returns:
An object of the Sheet class

sheet_by_name(sheet_name)
sheet_name
Name of sheet required
Returns:
An object of the Sheet class

sheet_names()
Returns:
A list of the names of the sheets in the book.

user_name

What (if anything) is recorded as the name of the last user to save the file.
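
The sheet-access methods and nsheets combine naturally into a loop over the workbook. A minimal sketch follows; sheet_shapes is a hypothetical helper name, and it accepts any object that provides the Book attributes documented above.

```python
def sheet_shapes(book):
    """Return (name, nrows, ncols) for every sheet in a Book-like object."""
    shapes = []
    for sheetx in range(book.nsheets):
        sheet = book.sheet_by_index(sheetx)
        shapes.append((sheet.name, sheet.nrows, sheet.ncols))
    return shapes
```

In practice the argument would be the object returned by xlrd.open_workbook("myfile.xls").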

The Cell Class

Cell(ctype, value) (class)

Contains the data for one cell.

WARNING: You don't call this class yourself. You access Cell objects via methods of the Sheet object(s) that you found in the Book object that was returned when you called xlrd.open_workbook("myfile.xls").

Cell objects have two attributes: ctype, which is an int, and value, whose type depends on ctype. The following table describes the types of cells and how their values are represented in Python.

Type symbol      Type number  Python value
XL_CELL_EMPTY    0            empty string u''
XL_CELL_TEXT     1            a Unicode string
XL_CELL_NUMBER   2            float
XL_CELL_DATE     3            float
XL_CELL_BOOLEAN  4            int; 1 means TRUE, 0 means FALSE
XL_CELL_ERROR    5            int representing internal Excel codes; for a text representation, refer to the supplied dictionary error_text_from_code
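
Code that consumes cells usually branches on ctype. The following is a minimal sketch of such a dispatch; describe_cell is a hypothetical helper name, the constants are redefined locally from the table above so that the snippet stands alone, and the error_text argument is expected to be a dictionary such as error_text_from_code.

```python
# Type numbers copied from the table above.
XL_CELL_EMPTY, XL_CELL_TEXT, XL_CELL_NUMBER = 0, 1, 2
XL_CELL_DATE, XL_CELL_BOOLEAN, XL_CELL_ERROR = 3, 4, 5

def describe_cell(ctype, value, error_text=None):
    """Map a (ctype, value) pair to a more convenient Python value."""
    if ctype == XL_CELL_EMPTY:
        return None
    if ctype == XL_CELL_BOOLEAN:
        return bool(value)
    if ctype == XL_CELL_ERROR:
        table = error_text or {}
        return table.get(value, "unknown error code 0x%02X" % value)
    # Text and numbers pass through unchanged. Dates also stay as floats
    # here; converting them requires the workbook's datemode.
    return value
```

For example, describe_cell(5, 0x07, {0x07: '#DIV/0!'}) returns '#DIV/0!', and describe_cell(4, 1) returns True.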

The Sheet Class

Sheet(biff_version, position, logfile, pickleable=False, name='', number=0, verbosity=0) (class)

Contains the data for one worksheet.

In the cell access functions, "rowx" is a row index, counting from zero, and "colx" is a column index, counting from zero. Negative values for row/column indexes and slice positions are supported in the expected fashion.

For information about cell types and cell values, refer to the documentation of the Cell class.

WARNING: You don't call this class yourself. You access Sheet objects via the Book object that was returned when you called xlrd.open_workbook("myfile.xls").

cell_type(rowx, colx)

Type of the cell in the given row and column. Refer to the documentation of the Cell class.

cell_value(rowx, colx)

Value of the cell in the given row and column.

col(colx)

Returns a sequence of the Cell objects in the given column.

col_slice(colx, start_rowx=0, end_rowx=None)

Returns a slice of the Cell objects in the given column.

name

Name of sheet.

ncols

Number of columns in sheet. A column index is in range(thesheet.ncols).

nrows

Number of rows in sheet. A row index is in range(thesheet.nrows).

row(rowx)

Returns a sequence of the Cell objects in the given row.

row_slice(rowx, start_colx=0, end_colx=None)

Returns a slice of the Cell objects in the given row.

row_types(rowx)

Returns a sequence of the types of the cells in the given row.

row_values(rowx)

Returns a sequence of the values of the cells in the given row.
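
The row-oriented methods can be combined to scan a whole sheet. The sketch below collects every cell that was classified as a date; find_date_cells is a hypothetical helper name, and the type number 3 is taken from the table in the Cell class documentation.

```python
XL_CELL_DATE = 3  # type number from the Cell class table

def find_date_cells(sheet):
    """Return (rowx, colx, value) for each date cell in a Sheet-like object."""
    found = []
    for rowx in range(sheet.nrows):
        types = sheet.row_types(rowx)
        values = sheet.row_values(rowx)
        for colx, (ctype, value) in enumerate(zip(types, values)):
            if ctype == XL_CELL_DATE:
                found.append((rowx, colx, value))
    return found
```

Each value found this way is still an Excel number; it can then be converted with xldate_as_tuple and the datemode attribute of the owning Book.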