Markup Streams
A stream is the common representation of markup as a stream of events.
Contents
1 Basics
A stream can be attained in a number of ways. It can be:
- the result of parsing XML or HTML text, or
- programmatically generated, or
- the result of selecting a subset of another stream filtered by an XPath expression.
For example, the functions XML() and HTML() can be used to convert literal XML or HTML text to a markup stream:
>>> from genshi import XML >>> stream = XML('<p class="intro">Some text and ' ... '<a href="http://example.org/">a link</a>.' ... '<br/></p>') >>> stream <genshi.core.Stream object at 0x6bef0>
The stream is the result of parsing the text into events. Each event is a tuple of the form (kind, data, pos), where:
- kind defines what kind of event it is (such as the start of an element, text, a comment, etc).
- data is the actual data associated with the event. How this looks depends on the event kind.
- pos is a (filename, lineno, column) tuple that describes where the event “comes from”.
>>> for kind, data, pos in stream: ... print kind, `data`, pos ... START (u'p', [(u'class', u'intro')]) ('<string>', 1, 0) TEXT u'Some text and ' ('<string>', 1, 31) START (u'a', [(u'href', u'http://example.org/')]) ('<string>', 1, 31) TEXT u'a link' ('<string>', 1, 67) END u'a' ('<string>', 1, 67) TEXT u'.' ('<string>', 1, 72) START (u'br', []) ('<string>', 1, 72) END u'br' ('<string>', 1, 77) END u'p' ('<string>', 1, 77)
2 Filtering
One important feature of markup streams is that you can apply filters to the stream, either filters that come with Genshi, or your own custom filters.
A filter is simply a callable that accepts the stream as parameter, and returns the filtered stream:
def noop(stream): """A filter that doesn't actually do anything with the stream.""" for kind, data, pos in stream: yield kind, data, pos
Filters can be applied in a number of ways. The simplest is to just call the filter directly:
stream = noop(stream)
The Stream class also provides a filter() method, which takes an arbitrary number of filter callables and applies them all:
stream = stream.filter(noop)
Finally, filters can also be applied using the bitwise or operator (|), which allows a syntax similar to pipes on Unix shells:
stream = stream | noop
One example of a filter included with Genshi is the HTMLSanitizer in genshi.filters. It processes a stream of HTML markup, and strips out any potentially dangerous constructs, such as Javascript event handlers. HTMLSanitizer is not a function, but rather a class that implements __call__, which means instances of the class are callable.
Both the filter() method and the pipe operator allow easy chaining of filters:
from genshi.filters import HTMLSanitizer stream = stream.filter(noop, HTMLSanitizer())
That is equivalent to:
stream = stream | noop | HTMLSanitizer()
3 Serialization
The Stream class provides two methods for serializing this list of events: serialize() and render(). The former is a generator that yields chunks of Markup objects (which are basically unicode strings that are considered safe for output on the web). The latter returns a single string, by default UTF-8 encoded.
Here's the output from serialize():
>>> for output in stream.serialize(): ... print `output` ... <Markup u'<p class="intro">'> <Markup u'Some text and '> <Markup u'<a href="http://example.org/">'> <Markup u'a link'> <Markup u'</a>'> <Markup u'.'> <Markup u'<br/>'> <Markup u'</p>'>
And here's the output from render():
>>> print stream.render() <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>
Both methods can be passed a method parameter that determines how exactly the events are serialzed to text. This parameter can be either “xml” (the default), “xhtml”, “html”, “text”, or a custom serializer class:
>>> print stream.render('html') <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>
Note how the <br> element isn't closed, which is the right thing to do for HTML.
In addition, the render() method takes an encoding parameter, which defaults to “UTF-8”. If set to None, the result will be a unicode string.
The different serializer classes in genshi.output can also be used directly:
>>> from genshi.filters import HTMLSanitizer >>> from genshi.output import TextSerializer >>> print TextSerializer()(HTMLSanitizer()(stream)) Some text and a link.
The pipe operator allows a nicer syntax:
>>> print stream | HTMLSanitizer() | TextSerializer() Some text and a link.
4 Using XPath
XPath can be used to extract a specific subset of the stream via the select() method:
>>> substream = stream.select('a') >>> substream <genshi.core.Stream object at 0x7118b0> >>> print substream <a href="http://example.org/">a link</a>
Often, streams cannot be reused: in the above example, the sub-stream is based on a generator. Once it has been serialized, it will have been fully consumed, and cannot be rendered again. To work around this, you can wrap such a stream in a list:
>>> from genshi import Stream >>> substream = Stream(list(stream.select('a'))) >>> substream <genshi.core.Stream object at 0x7118b0> >>> print substream <a href="http://example.org/">a link</a> >>> print substream.select('@href') http://example.org/ >>> print substream.select('text()') a link
See also: genshi.core, Documentation