RPARSEXML is a simple tool for transforming an XML document into a Python datastructure.
RPARSEXML requires the following Python libraries to be available for import:
string, pyRXP, types, time, pprint
The programmer interface to the parser is
import rparsexml
parsedstructure = rparsexml.parsexml(text)
where text is a string containing the XML markup to be parsed.
The top-level structure returned by parsexml always looks like this (for consistency with the structures shown below)
("", None, list, None)
The list will usually contain just one tag, corresponding to the top-level tag of the XML text.
RPARSEXML transforms a tag of form
<NAME ATTRIBUTES>CONTENT</NAME>
into a Python tuple of form
(name_string, attributes_dictionary, list_of_content_fragments, miscellaneous)
The miscellaneous slot does not currently contain anything useful. It is reserved for future features (such as line number annotations).
The attributes dictionary maps string names of attributes to their string values.
The attributes dictionary may be None if there are no attributes.
If there are no tags in the content, the content fragment list contains just the content string. If there are tags in the content, the list is made up of the string fragments that don't contain tags, intermixed with the tuple structures for the parsed tags. The strings and tuples appear in the list in the same order as in the content. For a tag with content, the list may be empty if the content is empty, but it will never be None (as it is for the tag form shown below).
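As a rough illustration of the mixed string/tuple content list, the sketch below gathers all text inside a tag. The helper name collect_text is hypothetical (it is not part of RPARSEXML), and the sample tuple is written by hand in the documented shape rather than produced by a parser run.

```python
def collect_text(tag):
    """Concatenate every string fragment inside a tag tuple, in document order."""
    name, attrs, content, misc = tag
    if content is None:
        # Empty-element form: there is no content list at all.
        return ""
    parts = []
    for fragment in content:
        if isinstance(fragment, tuple):
            parts.append(collect_text(fragment))  # nested tag: recurse into it
        else:
            parts.append(fragment)                # plain string fragment
    return "".join(parts)


# Hand-written example: text, a nested tag, then more text.
para = ("para", {"style": "h1"},
        ["hello ", ("b", None, ["big"], None), " world"], None)
print(collect_text(para))  # hello big world
```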
A tag of form
<NAME ATTRIBUTES/>
transforms into a Python tuple of form
(name_string, attributes_dictionary, list_of_content_fragments, miscellaneous)
where the list of contents is None.
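The difference between a content list of None and an empty list matters when the structure is turned back into markup. The following is a minimal sketch, assuming the four-slot tuple layout described above; to_xml is a hypothetical helper, not a function provided by RPARSEXML, and it uses only the standard library's xml.sax.saxutils for escaping.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(tag):
    """Serialize a tag tuple back to markup (illustrative helper only)."""
    name, attrs, content, misc = tag
    # attrs may be None when the tag has no attributes.
    attr_text = "".join(
        " %s=%s" % (key, quoteattr(value))
        for key, value in sorted((attrs or {}).items())
    )
    if content is None:
        # None marks the <NAME ATTRIBUTES/> form.
        return "<%s%s/>" % (name, attr_text)
    body = "".join(
        to_xml(item) if isinstance(item, tuple) else escape(item)
        for item in content
    )
    # A list (possibly empty) marks the <NAME>...</NAME> form.
    return "<%s%s>%s</%s>" % (name, attr_text, body, name)


print(to_xml(("frame", {"id": "main"}, None, None)))  # <frame id="main"/>
print(to_xml(("stylesheet", None, [], None)))         # <stylesheet></stylesheet>
```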
When the pyRXP extension module is available, RPARSEXML is a validating parser. It must therefore be able to locate any DTD (Document Type Definition) required, and it will report an error on any validation failure.
Here is a simple program that parses a small XML text in the RML (Report Markup Language) format.
text = """\
<?xml version="1.0" encoding="iso-8859-1" standalone="no" ?>
<!DOCTYPE document SYSTEM "rml.dtd">
<document filename="outfile.pdf">
 <template pagesize="(595, 842)" leftMargin="1in">
  <pagetemplate id="main">
   <frame id="main" x1="0.5in" y1="5.5in" width="6in" height="4.3in"/>
  </pagetemplate>
 </template>
 <stylesheet>
 </stylesheet>
 <story>
  <para style="h1">hello world</para>
 </story>
</document>"""
import rparsexml, pprint
parsed = rparsexml.parsexml(text)
pprint.pprint(parsed)
And here is its output (which simply dumps the parsed structure, nicely indented).
('',
 None,
 [('document',
   {'filename': 'outfile.pdf'},
   ['\012 ',
    ('template',
     {'author': '(unauthored)',
      'leftMargin': '1in',
      'pagesize': '(595, 842)',
      'title': '(untitled)'},
     ['\012 ',
      ('pagetemplate',
       {'id': 'main'},
       ['\012 ',
        ('frame',
         {'height': '4.3in',
          'id': 'main',
          'width': '6in',
          'x1': '0.5in',
          'y1': '5.5in'},
         None,
         None),
        '\012\011'],
       None),
      '\012 '],
     None),
    '\012 ',
    ('stylesheet', None, ['\012 '], None),
    '\012 ',
    ('story',
     None,
     ['\012 ', ('para', {'style': 'h1'}, ['hello world'], None), '\012 '],
     None),
    '\012'],
   None)],
 None)
Note that whitespace isn't trimmed out - all fragments such as "\012" (a 'newline' character) are included in the content fragments.
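A consumer that does not care about layout whitespace can filter it out after parsing. The sketch below is one possible way to do that; drop_whitespace is a hypothetical helper, not part of RPARSEXML, and it assumes whitespace-only fragments carry no meaning, which is not true for every document type.

```python
def drop_whitespace(tag):
    """Return a copy of a tag tuple without whitespace-only string fragments."""
    name, attrs, content, misc = tag
    if content is None:
        return tag  # empty-element form: nothing to filter
    cleaned = []
    for fragment in content:
        if isinstance(fragment, tuple):
            cleaned.append(drop_whitespace(fragment))  # filter nested tags too
        elif fragment.strip():
            cleaned.append(fragment)                   # keep real text unchanged
    return (name, attrs, cleaned, misc)


# Hand-written fragment in the documented shape.
story = ("story", None,
         ["\n ", ("para", {"style": "h1"}, ["hello world"], None), "\n "], None)
print(drop_whitespace(story))
# ('story', None, [('para', {'style': 'h1'}, ['hello world'], None)], None)
```

Text fragments that contain anything other than whitespace are kept exactly as parsed, including their leading and trailing whitespace.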
Normally, the output of this program will be further processed by another program (one that is not part of this package). This external program will need to understand the structure of the XML but will not need to handle any actual parsing.
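As a hedged sketch of what such an external program might look like, the following walks a parsed structure and yields every tag with a given name. The function find_tags is hypothetical, and the structure is a hand-written, shortened version of the sample output so that the example runs without the parser installed.

```python
def find_tags(tag, wanted):
    """Yield every tag tuple named `wanted`, depth first, in document order."""
    name, attrs, content, misc = tag
    if name == wanted:
        yield tag
    for fragment in content or []:  # content is None for empty-element tags
        if isinstance(fragment, tuple):
            for found in find_tags(fragment, wanted):
                yield found


# Hand-written structure, including the top-level wrapper with the empty name.
parsed = ("", None, [
    ("document", {"filename": "outfile.pdf"}, [
        "\n ",
        ("story", None,
         ["\n ", ("para", {"style": "h1"}, ["hello world"], None), "\n "],
         None),
        "\n",
    ], None),
], None)

for para in find_tags(parsed, "para"):
    print(para[1]["style"], para[2])
```

Because the top-level wrapper has the same four-slot shape as every other tag, the walker can start from the value returned by the parser without special-casing it.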