Created: 1998-04-04
Author: Fredrik Lundh
Copyright © 1998-2002 by Secret Labs AB

The sgmlop Module

Overview

The sgmlop module provides a simple and fast parser/lexer for reading XML documents. By design, the parsers provided by this module are very tolerant. They will parse virtually anything into a stream of start tags, entities, end tags, and text sections. If you need careful well-formedness checking, use expat instead.

The sgmlop module can also be used to parse SGML and HTML documents.

Concepts

FIXME: needs work

Patterns

FIXME: needs work

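import sgmlop

class xml_handler:
    def finish_starttag(self, tag, attrs):
        ...
    def finish_endtag(self, tag):
        ...
    def handle_data(self, data):
        ...

parser = sgmlop.XMLParser()

# register the target's handler methods with the parser
target = xml_handler()
parser.register(target)

# "document.xml" is a placeholder file name
file = open("document.xml")

# feed the document piece by piece, then flush and shut down
while 1:
    data = file.read(8192)
    if not data:
        break
    parser.feed(data)
parser.close()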

FIXME: cookbook: how to use entity resolvers

FIXME: cookbook: how to count lines

FIXME: cookbook: how to parse unicode strings

FIXME: cookbook: how to parse dtd

FIXME: cookbook: how to parse external entities

Classes

XMLParser

XMLParser()

Create an XML parser.

SGMLParser

SGMLParser()

Create an SGML parser.

Parser Methods

register

register(target)

Register a parser target object. This method looks up a number of target methods in this object, and registers them with the parser.

For a list of target methods used by this method, see the target interface description below.

feed

feed(string)

Feed a string (or string buffer) to the parser.

close

close()

Flush the parser buffers, and shut down the parser. This method should always be called after the last call to feed, to make sure all data has been returned.

This method also releases references to registered handler methods. To avoid memory leaks caused by cyclical references, you must call this method when done parsing.
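Since close both flushes the buffers and releases the handler references, one way to make sure it always runs is to wrap the feed loop in try/finally. The following is a minimal sketch, not part of sgmlop: feed_all is a hypothetical helper that works with any object providing feed and close, and RecordingParser is a stand-in used only so the snippet runs without sgmlop installed.

```python
import io


def feed_all(parser, stream, chunk_size=8192):
    """Feed a file-like object to the parser in chunks, then close it.

    feed_all is a hypothetical helper, not part of sgmlop.
    """
    try:
        while True:
            data = stream.read(chunk_size)
            if not data:
                break
            parser.feed(data)
    finally:
        # close() flushes the buffers and releases the handler
        # references, so call it even if feed() raised an exception.
        parser.close()


class RecordingParser:
    """Stand-in with the same feed/close methods, for demonstration only."""

    def __init__(self):
        self.chunks = []
        self.closed = False

    def feed(self, data):
        self.chunks.append(data)

    def close(self):
        self.closed = True


demo = RecordingParser()
feed_all(demo, io.StringIO("<doc>hello</doc>"), chunk_size=5)
```

With a real parser, the call would presumably look like feed_all(parser, file), after the target has been registered.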

parse

parse(string)

Same as feed followed by a close. Don't mix this method with feed and close; either call this method once for the entire document, or use feed/close to parse your document piece by piece.

Target Interface

The target object can implement one or more of the following methods. A typical target object should implement at least finish_starttag, finish_endtag, and handle_data.
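As an illustration, a target with just these three methods might look like the sketch below. OutlineTarget is a hypothetical class that records an indented outline of the document. Because the parser does not check well-formedness, the depth counter is clamped at zero in case of stray end tags. The methods are called by hand at the end so that the snippet runs without sgmlop; in real use, the instance would be passed to register.

```python
class OutlineTarget:
    """Record an indented outline of elements and their text.

    Hypothetical example target, not part of sgmlop.
    """

    def __init__(self):
        self.depth = 0
        self.lines = []

    def finish_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<%s>" % tag)
        self.depth += 1

    def finish_endtag(self, tag):
        # The parser is tolerant, so an end tag may arrive without a
        # matching start tag; never let the depth go negative.
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + text)


# Simulate the calls for <doc><p>hello</p></doc>
outline = OutlineTarget()
outline.finish_starttag("doc", {})
outline.finish_starttag("p", {})
outline.handle_data("hello")
outline.finish_endtag("p")
outline.finish_endtag("doc")
outline.finish_endtag("doc")  # stray end tag, tolerated
```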

finish_starttag

finish_starttag(tag, attrib)

Handle a start tag. The XML parser passes the attributes as a dictionary; the SGML parser passes them as a list of (key, value) tuples.
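A target that is meant to work with both parsers can normalize the attributes before using them. The sketch below relies on the fact that Python's dict constructor accepts either a mapping or a sequence of pairs. AttributeCollector is a hypothetical name, and the two calls at the end simulate what each parser would pass.

```python
class AttributeCollector:
    """Store start tags with their attributes normalized to a dictionary.

    Hypothetical example target, not part of sgmlop.
    """

    def __init__(self):
        self.tags = []

    def finish_starttag(self, tag, attrs):
        # dict() accepts both a mapping (XML parser) and a sequence
        # of (key, value) tuples (SGML parser).
        self.tags.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.finish_starttag("a", {"href": "index.html"})    # XML-style call
collector.finish_starttag("a", [("href", "index.html")])  # SGML-style call
```

Note that if a list contains the same key more than once, dict keeps only the last value.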

finish_endtag

finish_endtag(tag)

Handle an end tag.

handle_proc

handle_proc(instruction, content)

Handle a processing instruction. If omitted, processing instructions are ignored.

handle_special

handle_special(content)

Handle a special element, including the special elements that make up an internal DTD. If omitted, special elements are ignored.

FIXME: add more information here

handle_charref

handle_charref(ref)

Handle a decimal or hexadecimal character reference. You usually don't have to define this method; if it's not defined, the parser converts the reference to a character string and passes it to the handle_data method.
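If you do define handle_charref, the conversion is up to you. The sketch below assumes that ref is the text between "&#" and ";", for example "65" or "x41"; that format is an assumption and is not stated in this document. decode_charref and CharrefTarget are hypothetical names.

```python
def decode_charref(ref):
    """Convert a character reference body such as "65" or "x41" to a string.

    Assumes ref is the text between "&#" and ";" (an assumption, see above).
    """
    if ref[:1] in ("x", "X"):
        return chr(int(ref[1:], 16))
    return chr(int(ref, 10))


class CharrefTarget:
    """Collect text, decoding character references by hand."""

    def __init__(self):
        self.parts = []

    def handle_data(self, text):
        self.parts.append(text)

    def handle_charref(self, ref):
        try:
            self.parts.append(decode_charref(ref))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: keep the raw text.
            self.parts.append("&#%s;" % ref)


# Simulate the calls for "A&#66;&#x43;&#bogus;"
charrefs = CharrefTarget()
charrefs.handle_data("A")
charrefs.handle_charref("66")
charrefs.handle_charref("x43")
charrefs.handle_charref("bogus")
text = "".join(charrefs.parts)
```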

handle_entityref

handle_entityref(ref)

Handle a named entity reference in character data. If present, this method is also called for standard entities (gt, amp, etc.) and for malformed character entities.

If not defined, the parser resolves internal entities by itself, and uses the resolve_entityref method for other entities. The resulting string is then passed to the handle_data method instead.

If an entity cannot be resolved, it is ignored, unless running in strict mode.

resolve_entityref

resolve_entityref(ref) => string or None

Resolve a named entity reference. This is used for entities in attribute values, and also for character data, if handle_entityref is not defined.

If successful, this method should return a character string. If the entity should be ignored, return None. Otherwise, the method should raise a suitable exception.
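The three outcomes can be sketched with a simple lookup table. EntityTarget, its entity table, and the choice of KeyError are all assumptions made for illustration; this document does not say which exception types the parser expects.

```python
class EntityTarget:
    """Resolve a small, application-defined set of named entities.

    Hypothetical example target, not part of sgmlop.
    """

    # Illustrative tables; a real application would supply its own.
    entities = {"copy": "\u00a9", "nbsp": "\u00a0"}
    dropped = {"shy"}

    def resolve_entityref(self, ref):
        if ref in self.entities:
            # Successful lookup: return the replacement string.
            return self.entities[ref]
        if ref in self.dropped:
            # Entities the application chooses to drop.
            return None
        # Anything else is treated as an error (KeyError is an assumption).
        raise KeyError("undefined entity: %s" % ref)


resolver = EntityTarget()
copyright_sign = resolver.resolve_entityref("copy")
soft_hyphen = resolver.resolve_entityref("shy")
```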

FIXME: add support for external entities?

handle_data

handle_data(text)

Handle character data.

handle_cdata

handle_cdata(text)

Handle a CDATA section. If not defined, the character contents are passed to the handle_data method instead.

handle_comment

handle_comment(text)

Handle an XML comment. If not defined, comments are ignored.