Writing Extensions for Python-Markdown
======================================
Overview
--------
Python-Markdown includes an API for extension writers to plug their own
custom functionality and/or syntax into the parser. There are preprocessors
which allow you to alter the source before it is passed to the parser,
inline patterns which allow you to add, remove or override the syntax of
any inline elements, and postprocessors which allow munging of the
output of the parser before it is returned. If you really want to dive in,
there are also blockprocessors which are part of the core BlockParser.
As the parser builds an [ElementTree][] object which is later rendered
as Unicode text, there are also some helpers provided to ease manipulation of
the tree. Each part of the API is discussed in its respective section below.
Additionally, reading the source of some [[Available Extensions]] may be helpful.
For example, the [[Footnotes]] extension uses most of the features documented
here.
* [Preprocessors][]
* [InlinePatterns][]
* [Treeprocessors][]
* [Postprocessors][]
* [BlockParser][]
* [Working with the ElementTree][]
* [Integrating your code into Markdown][]
    * [extendMarkdown][]
    * [OrderedDict][]
    * [registerExtension][]
    * [Config Settings][]
    * [makeExtension][]

Preprocessors
-------------
Preprocessors munge the source text before it is passed into the Markdown
core. This is an excellent place to clean up bad syntax, or to extract things
the parser may otherwise choke on and perhaps even store them for later retrieval.
Preprocessors should inherit from ``markdown.Preprocessor`` and implement
a ``run`` method with one argument ``lines``. The ``run`` method of each
Preprocessor will be passed the entire source text as a list of Unicode strings.
Each string will contain one line of text. The ``run`` method should return
either that list, or an altered list of Unicode strings.
A pseudo example:

    class MyPreprocessor(markdown.Preprocessor):
        def run(self, lines):
            new_lines = []
            for line in lines:
                m = MYREGEX.match(line)
                if m:
                    # do stuff with the matched line
                    pass
                else:
                    new_lines.append(line)
            return new_lines
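To make the pseudo example concrete, here is a minimal sketch of a ``run``
method that extracts lines and stores them for later retrieval. The
``COMMENT_RE`` marker syntax, the ``stash`` attribute and the class name are
all hypothetical. The class has no base class so that the snippet runs on its
own; in a real extension it would inherit from ``markdown.Preprocessor`` as
described above.

```python
import re

# Hypothetical marker syntax: lines such as "%% a note" are treated as comments.
COMMENT_RE = re.compile(r'^%%\s*(.*)$')


class CommentStashPreprocessor(object):
    """Stand-in for a class that would inherit from markdown.Preprocessor."""

    def __init__(self):
        self.stash = []  # hypothetical storage for the extracted text

    def run(self, lines):
        new_lines = []
        for line in lines:
            m = COMMENT_RE.match(line)
            if m:
                # Keep the comment text for later and drop the line itself.
                self.stash.append(m.group(1))
            else:
                new_lines.append(line)
        return new_lines


pre = CommentStashPreprocessor()
result = pre.run(["Some *text*.", "%% remember me", "More text."])
```

Note that ``run`` returns a new list rather than mutating ``lines``, which
keeps the input intact for the caller.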

Inline Patterns
---------------
Inline Patterns implement the inline HTML element syntax for Markdown such as
``*emphasis*`` or ``[links](http://example.com)``. Pattern objects should be
instances of classes that inherit from ``markdown.Pattern`` or one of its
children. Each pattern object uses a single regular expression and must have
the following methods:

* ``getCompiledRegExp()``: Returns a compiled regular expression.
* ``handleMatch(m)``: Accepts a match object and returns an ElementTree
  element or a plain Unicode string.

Note that any regular expression returned by ``getCompiledRegExp`` must capture
the whole block. Therefore, they should all start with ``r'^(.*?)'`` and end
with ``r'(.*?)$'``. When using the default ``getCompiledRegExp()`` method
provided by the ``Pattern`` class, you can pass in a regular expression without
that wrapping and ``getCompiledRegExp`` will wrap your expression for you. This
means that the first group of your match will be ``m.group(2)``, as
``m.group(1)`` will match everything before the pattern.
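The group numbering described above can be checked with the standard ``re``
module alone. This sketch assumes the wrapper simply places ``^(.*?)`` in
front of the pattern and ``(.*?)$`` behind it; the exact wrapping and flags
used by ``Pattern`` may differ.

```python
import re

# An inline pattern as an extension author might write it (oversimplified).
pattern = r'\*([^*]+)\*'

# Assumed wrapping: a lazy group in front, a second lazy group behind,
# anchored to the whole string.
wrapped = re.compile(r'^(.*?)%s(.*?)$' % pattern, re.DOTALL)

m = wrapped.match('some *emphasized* text')
before = m.group(1)   # everything before the pattern
content = m.group(2)  # the first group of the author's own pattern
after = m.group(3)    # everything after the pattern
```

Because the wrapper adds a group in front, every group in the author's
pattern is shifted up by one.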
For an example, consider this simplified emphasis pattern:

    class EmphasisPattern(markdown.Pattern):
        def handleMatch(self, m):
            el = markdown.etree.Element('em')
            # group(1) is the text before the match; group(2) is the
            # first group of the pattern itself
            el.text = m.group(2)
            return el

As discussed in [Integrating Your Code Into Markdown][], an instance of this
class will need to be provided to Markdown. That instance would be created
like so:

    # an oversimplified regex
    MYPATTERN = r'\*([^*]+)\*'
    # pass in pattern and create instance
    emphasis = EmphasisPattern(MYPATTERN)

Actually, it would not be necessary to create that pattern (and not just because
a more sophisticated emphasis pattern already exists in Markdown). The fact is
that the example pattern is not very DRY. A pattern for `**strong**` text would
be almost identical, except that it would create a 'strong' element.
Therefore, Markdown provides a number of generic pattern classes that
provide some common functionality. For example, both emphasis and strong are
implemented with separate instances of the ``SimpleTagPattern`` listed below.
Feel free to use or extend any of these Pattern classes.

**Generic Pattern Classes**

* ``SimpleTextPattern(pattern)``:
    Returns the text of ``group(2)`` of ``pattern`` as a plain string.
* ``SimpleTagPattern(pattern, tag)``:
    Returns an element of type ``tag`` whose text is ``group(3)`` of
    ``pattern``. ``tag`` should be the name of an HTML element as a string
    (e.g. 'em').
* ``SubstituteTagPattern(pattern, tag)``:
    Returns an element of type ``tag`` with no children or text (e.g. 'br').

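To illustrate how one generic class can serve both emphasis and strong text,
here is a stand-alone sketch of a tag pattern. It is an illustrative stand-in,
not the library's ``SimpleTagPattern``: the class name, the wrapping of the
regular expression, and the two example patterns are assumptions. The
patterns capture their delimiter in their own first group, which is why the
content ends up in ``group(3)``.

```python
import re
import xml.etree.ElementTree as etree


class TagPatternSketch(object):
    """Illustrative stand-in; not the library's SimpleTagPattern."""

    def __init__(self, pattern, tag):
        # Wrap the pattern so that group(1) holds the text before the match.
        self.compiled_re = re.compile(r'^(.*?)%s(.*?)$' % pattern, re.DOTALL)
        self.tag = tag

    def getCompiledRegExp(self):
        return self.compiled_re

    def handleMatch(self, m):
        el = etree.Element(self.tag)
        el.text = m.group(3)
        return el


# Hypothetical patterns: group(2) is the delimiter, group(3) the content.
emphasis = TagPatternSketch(r'(\*)([^*]+)\2', 'em')
strong = TagPatternSketch(r'(\*\*)([^*]+)\2', 'strong')

em_el = emphasis.handleMatch(
    emphasis.getCompiledRegExp().match('an *important* word'))
strong_el = strong.handleMatch(
    strong.getCompiledRegExp().match('a **bold** claim'))
```

Only the pattern and the tag name differ between the two instances, so no
second class is needed.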
There may be other Pattern classes in the Markdown source that you could extend
or use as well. Read through the source and see if there is anything you can
use. You might even get a few ideas for different approaches to your specific
situation.

Treeprocessors
--------------
Treeprocessors manipulate an ElementTree object after it has passed through the
core BlockParser. This is where additional manipulation of the tree takes
place. Additionally, the InlineProcessor is a Treeprocessor which steps through
the tree and runs the InlinePatterns on the text of each Element in the tree.
A Treeprocessor should inherit from ``markdown.Treeprocessor`` and
override the ``run`` method, which takes one argument ``root`` (an ElementTree
object) and returns either that root element or a modified root element.
A pseudo example:

    class MyTreeprocessor(markdown.Treeprocessor):
        def run(self, root):
            # do stuff
            return my_modified_root

For specifics on manipulating the ElementTree, see
[Working with the ElementTree][] below.
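As a concrete version of the pseudo example, the following sketch adds a
``class`` attribute to every paragraph in a tree. It uses the standard
library's ElementTree directly and has no base class so that it runs on its
own; in a real extension the class would inherit from
``markdown.Treeprocessor``. The class name and the ``body`` value are
hypothetical.

```python
import xml.etree.ElementTree as etree


class ParagraphClassTreeprocessor(object):
    """Stand-in for a class that would inherit from markdown.Treeprocessor."""

    def run(self, root):
        # iter() walks the root element and all of its descendants.
        for el in root.iter('p'):
            el.set('class', 'body')  # hypothetical class name
        return root


root = etree.Element('div')
etree.SubElement(root, 'p').text = 'First paragraph.'
quote = etree.SubElement(root, 'blockquote')
etree.SubElement(quote, 'p').text = 'Quoted paragraph.'

new_root = ParagraphClassTreeprocessor().run(root)
classes = [p.get('class') for p in new_root.iter('p')]
```

Here the tree is modified in place and the same root is returned, which is
one of the two options the ``run`` contract allows.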

Postprocessors
--------------
Postprocessors manipulate the document after the ElementTree has been
serialized into a string. Postprocessors should be used to work with the
text just before output.
A Postprocessor should inherit from ``markdown.Postprocessor`` and
override the ``run`` method, which takes one argument ``text`` and returns a
Unicode string.
For example, this may be an appropriate place to add a table of
contents to a document:

    class TocPostprocessor(markdown.Postprocessor):
        def run(self, text):
            return MYMARKERRE.sub(MyToc, text)
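The example above leaves ``MYMARKERRE`` and ``MyToc`` undefined. One possible
way to fill them in is sketched below, assuming a ``[TOC]`` marker paragraph
and a table of contents built from ``h2`` headings; both choices, and the
``HEADING_RE`` helper, are hypothetical. The class has no base class so that
the snippet runs on its own.

```python
import re

# Hypothetical marker that an author would place in the source document.
MYMARKERRE = re.compile(r'<p>\[TOC\]</p>')
# Hypothetical helper: collect the text of second-level headings.
HEADING_RE = re.compile(r'<h2>(.*?)</h2>')


class TocPostprocessor(object):
    """Stand-in for a class that would inherit from markdown.Postprocessor."""

    def run(self, text):
        items = ''.join(
            '<li>%s</li>' % title for title in HEADING_RE.findall(text))
        toc = '<ul>%s</ul>' % items
        # A function replacement keeps backslashes in `toc` from being
        # interpreted as escapes by re.sub.
        return MYMARKERRE.sub(lambda m: toc, text)


html = '<p>[TOC]</p>\n<h2>Intro</h2>\n<p>Text.</p>\n<h2>Usage</h2>'
output = TocPostprocessor().run(html)
```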

BlockParser
-----------
Sometimes, pre/tree/postprocessors and Inline Patterns aren't going to do what
you need. Perhaps you want a new type of block that needs to be integrated
into the core parsing. In such a situation, you can add/change/remove
functionality of the core ``BlockParser``. The BlockParser is composed of a
number of Blockprocessors. The BlockParser steps through each block of text
(split by blank lines) and passes each block to the appropriate Blockprocessor.
That Blockprocessor parses the block and adds it to the ElementTree. The
[[Definition Lists]] extension would be a good example of an extension that
adds/modifies Blockprocessors.
A Blockprocessor should inherit from ``markdown.BlockProcessor`` and implement
both the ``test`` and ``run`` methods.
The ``test`` method is used by BlockParser to identify the type of block.
Therefore the ``test`` method must return a boolean value. If the test returns
``True``, then the BlockParser will call that Blockprocessor's ``run`` method.
If it returns ``False``, the BlockParser will move on to the next
BlockProcessor.
The **``test``** method takes two arguments:

* **``parent``**: The parent etree Element of the block. This can be useful as
  the block may need to be treated differently if it is inside a list, for
  example.
* **``block``**: A string of the current block of text. The test may be a
  simple string method (such as ``block.startswith(some_text)``) or a complex
  regular expression.

The **``run``** method takes two arguments:

* **``parent``**: A pointer to the parent etree Element of the block. The run
  method will most likely attach additional nodes to this parent. Note that
  nothing is returned by the method. The ElementTree object is altered in place.
* **``blocks``**: A list of all remaining blocks of the document. Your run
  method must remove (pop) the first block from the list (which is altered in
  place, not returned) and parse that block. You may find that a block of text
  legitimately contains multiple block types. Therefore, after processing the
  first type, your processor can insert the remaining text into the beginning
  of the ``blocks`` list for future parsing.

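The ``test``/``run`` contract can be sketched as follows. The ``!!! `` marker
syntax, the class name and the generated ``div`` are hypothetical, and the
class has no base class so that the snippet runs on its own; in a real
extension it would inherit from ``markdown.BlockProcessor``. The last three
lines imitate, very roughly, what the BlockParser does with each block.

```python
import xml.etree.ElementTree as etree


class NoteBlockProcessor(object):
    """Stand-in for a class that would inherit from markdown.BlockProcessor."""

    MARKER = '!!! '  # hypothetical syntax for a "note" block

    def test(self, parent, block):
        return block.startswith(self.MARKER)

    def run(self, parent, blocks):
        block = blocks.pop(0)  # remove the block being handled
        lines = block.split('\n')
        note = etree.SubElement(parent, 'div')
        note.set('class', 'note')
        note.text = lines[0][len(self.MARKER):]
        rest = '\n'.join(lines[1:])
        if rest:
            # Anything after the first line goes back to the front of the
            # list so that other processors can parse it.
            blocks.insert(0, rest)


parent = etree.Element('div')
blocks = ['!!! Careful here\nA normal paragraph.', 'Another block.']
proc = NoteBlockProcessor()
if proc.test(parent, blocks[0]):
    proc.run(parent, blocks)
```

Both ``parent`` and ``blocks`` are changed in place; ``run`` returns nothing.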
Please be aware that a single block can span multiple text blocks. For example,
the official Markdown syntax rules state that a blank line does not end a
code block. If the next block of text is also indented, then it is part of
the previous block. The BlockParser was specifically designed to
address these types of situations. If you look at the ``CodeBlockProcessor``
in the core, you will note that it checks the last child of the ``parent``.
If the last child is a code block (``<pre><code>...</code></pre>``), then it
appends that block to the previous code block rather than creating a new
code block.
Each BlockProcessor has the following utility methods available:

* **``lastChild(parent)``**: Returns the last child of the given etree Element
  or ``None`` if it has no children.
* **``detab(text)``**: Removes one level of indent (four spaces by default)
  from the front of each line of the given text string.
* **``looseDetab(text)``**: Removes one level of indent from the front of each
  line of the given text string. However, this method allows secondary lines
  to not be indented, as some parts of the Markdown syntax do.

Each BlockProcessor also has a pointer to the containing BlockParser instance at
``self.parser``, which can be used to check or alter the state of the parser.
The BlockParser tracks its state in a stack at ``parser.state``. The state
stack is an instance of the ``State`` class.
**``State``** is a subclass of ``list`` and has the additional methods:
* **``set(state)``**: Set a new state to string ``state``. The new state is
appended to the end of the stack.
* **``reset()``**: Step back one step in the stack. The last state at the end
is removed from the stack.
* **``isstate(state)``**: Test that the top (current) level of the stack is of
the given string ``state``.
Note that to ensure that the state stack doesn't become corrupted, each time a
state is set for a block, that state *must* be reset when the parser finishes
parsing that block.
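The stack behaviour described above can be illustrated with a small
re-implementation. This is a sketch written from the description, not the
library's own ``State`` class, and the ``'list'`` state name is only an
example. The ``try``/``finally`` pairing is one way to guarantee that every
``set`` is matched by a ``reset``, even if parsing raises an exception.

```python
class State(list):
    """Illustrative re-implementation of the stack described above."""

    def set(self, state):
        self.append(state)

    def reset(self):
        self.pop()

    def isstate(self, state):
        # An empty stack is never in any named state.
        return bool(self) and self[-1] == state


state = State()
state.set('list')
try:
    # A nested block would be parsed here.
    in_list = state.isstate('list')
finally:
    state.reset()  # always undo the matching set()
```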
An instance of the **``BlockParser``** is found at ``Markdown.parser``.
``BlockParser`` has the following methods:

* **``parseDocument(lines)``**: Given a list of lines, an ElementTree object is
  returned. This should be passed an entire document and is the only method
  the ``Markdown`` class calls directly.
* **``parseChunk(parent, text)``**: Parses a chunk of markdown text composed of
  multiple blocks and attaches those blocks to the ``parent`` Element. The
  ``parent`` is altered in place and nothing is returned. Extensions would
  most likely use this method for block parsing.
* **``parseBlocks(parent, blocks)``**: Parses a list of blocks of text and
  attaches those blocks to the ``parent`` Element. The ``parent`` is altered
  in place and nothing is returned. This method will generally only be used
  internally to recursively parse nested blocks of text.

While it is not recommended, an extension could subclass or completely replace
the ``BlockParser``. The new class would have to provide the same public API.
However, be aware that other extensions may expect the core parser provided
and will not work with such a drastically different parser.

Working with the ElementTree
----------------------------
As mentioned, the Markdown parser converts a source document to an
[ElementTree][] object before serializing that back to Unicode text.
Markdown has provided some helpers to ease that manipulation within the context
of the Markdown module.
First, to get access to the ElementTree module import ElementTree from
``markdown`` rather than importing it directly. This will ensure you are using
the same version of ElementTree as markdown. The module is named ``etree``
within Markdown.

    from markdown import etree

``markdown.etree`` tries to import ElementTree from any known location, first
as a standard library module (from ``xml.etree`` in Python 2.5), then as a third
party package (``elementtree``). In each instance, ``cElementTree`` is tried
first, then ``ElementTree`` if the faster C implementation is not available on
your system.
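A fallback chain of this kind might look roughly like the sketch below. It is
not the library's own import code: the ``import_etree`` function and the
``CANDIDATES`` tuple are hypothetical, and the module names are assumptions
based on the historical package layout. Modules that are not installed are
simply skipped.

```python
import importlib

# Candidate modules in the order described above (assumed names).
CANDIDATES = (
    'xml.etree.cElementTree',   # C implementation, standard library
    'xml.etree.ElementTree',    # pure Python, standard library
    'cElementTree',             # C implementation, third party package
    'elementtree.ElementTree',  # pure Python, third party package
)


def import_etree(candidates=CANDIDATES):
    """Return the first ElementTree implementation that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError('No ElementTree implementation could be imported')


etree = import_etree()
el = etree.Element('p')
```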
Sometimes you may want text inserted into an element to be parsed by
[InlinePatterns][]. In such a situation, simply insert the text as you normally
would and the text will be automatically run through the InlinePatterns.
However, if you do *not* want some text to be parsed by InlinePatterns,
then insert the text as an ``AtomicString``.
Here's a basic example which creates an HTML table (note that the contents of
the second cell (``td2``) will be run through InlinePatterns later):

    table = etree.Element("table")
    table.set("cellpadding", "2")                      # Set cellpadding to 2
    tr = etree.SubElement(table, "tr")                 # Add child tr to table
    td1 = etree.SubElement(tr, "td")                   # Add child td1 to tr
    td1.text = markdown.AtomicString("Cell content")   # Add plain text content
    td2 = etree.SubElement(tr, "td")                   # Add second td to tr
    td2.text = "*text* with **inline** formatting."    # Add markup text
    table.tail = "Text after table"                    # Add text after table

You can also manipulate an existing tree. Consider the following example which
adds a ``class`` attribute to ``a`` elements:

    def set_link_class(self, element):
        for child in element:
            if child.tag == "a":
                child.set("class", "myclass")  # set the class attribute
            self.set_link_class(child)  # run recursively on children

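The same recursion can be tried outside of Markdown with the standard
library's ElementTree. In this sketch the method is rewritten as a plain
function, and the ``css_class`` parameter and the sample tree are
hypothetical additions for illustration.

```python
import xml.etree.ElementTree as etree


def set_link_class(element, css_class='myclass'):
    """Add a class attribute to every ``a`` element below ``element``."""
    for child in element:
        if child.tag == 'a':
            child.set('class', css_class)
        set_link_class(child, css_class)  # run recursively on children


root = etree.Element('div')
p = etree.SubElement(root, 'p')
p.text = 'See '
link = etree.SubElement(p, 'a', {'href': 'http://example.com'})
link.text = 'the example'
link.tail = '.'

set_link_class(root)
html = etree.tostring(root, encoding='unicode')
```

The link sits two levels below the root, so it is only reached through the
recursive call.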
For more information about working with ElementTree see the ElementTree
[Documentation](http://effbot.org/zone/element-index.htm)
([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)).