diff options
author | Waylan Limberg <waylan@gmail.com> | 2011-08-23 10:53:49 -0400 |
---|---|---|
committer | Waylan Limberg <waylan@gmail.com> | 2011-08-23 10:53:49 -0400 |
commit | a6fd7b5cba69cad049046b3a372d0f2d6c495abe (patch) | |
tree | 42a8987d08eb5a05eb2b3cb1399540583e8cbf67 /docs/writing_extensions.md | |
parent | e5b2813bbf02710c7deb148896085a3dac4828dc (diff) | |
download | markdown-a6fd7b5cba69cad049046b3a372d0f2d6c495abe.tar.gz markdown-a6fd7b5cba69cad049046b3a372d0f2d6c495abe.tar.bz2 markdown-a6fd7b5cba69cad049046b3a372d0f2d6c495abe.zip |
Renamed *.txt -> *.md in docs.
Diffstat (limited to 'docs/writing_extensions.md')
-rw-r--r-- | docs/writing_extensions.md | 611 |
1 files changed, 611 insertions, 0 deletions
diff --git a/docs/writing_extensions.md b/docs/writing_extensions.md new file mode 100644 index 0000000..2ecd4c9 --- /dev/null +++ b/docs/writing_extensions.md @@ -0,0 +1,611 @@ +Writing Extensions for Python-Markdown +====================================== + +Overview +-------- + +Python-Markdown includes an API for extension writers to plug their own +custom functionality and/or syntax into the parser. There are preprocessors +which allow you to alter the source before it is passed to the parser, +inline patterns which allow you to add, remove or override the syntax of +any inline elements, and postprocessors which allow munging of the +output of the parser before it is returned. If you really want to dive in, +there are also blockprocessors which are part of the core BlockParser. + +As the parser builds an [ElementTree][] object which is later rendered +as Unicode text, there are also some helpers provided to ease manipulation of +the tree. Each part of the API is discussed in its respective section below. +Additionaly, reading the source of some [[Available Extensions]] may be helpful. +For example, the [[Footnotes]] extension uses most of the features documented +here. + +* [Preprocessors][] +* [InlinePatterns][] +* [Treeprocessors][] +* [Postprocessors][] +* [BlockParser][] +* [Working with the ElementTree][] +* [Integrating your code into Markdown][] + * [extendMarkdown][] + * [OrderedDict][] + * [registerExtension][] + * [Config Settings][] + * [makeExtension][] + +<h3 id="preprocessors">Preprocessors</h3> + +Preprocessors munge the source text before it is passed into the Markdown +core. This is an excellent place to clean up bad syntax, extract things the +parser may otherwise choke on and perhaps even store it for later retrieval. + +Preprocessors should inherit from ``markdown.preprocessors.Preprocessor`` and +implement a ``run`` method with one argument ``lines``. The ``run`` method of +each Preprocessor will be passed the entire source text as a list of Unicode +strings. Each string will contain one line of text. The ``run`` method should +return either that list, or an altered list of Unicode strings. + +A pseudo example: + + from markdown.preprocessors import Preprocessor + + class MyPreprocessor(Preprocessor): + def run(self, lines): + new_lines = [] + for line in lines: + m = MYREGEX.match(line) + if m: + # do stuff + else: + new_lines.append(line) + return new_lines + +<h3 id="inlinepatterns">Inline Patterns</h3> + +Inline Patterns implement the inline HTML element syntax for Markdown such as +``*emphasis*`` or ``[links](http://example.com)``. Pattern objects should be +instances of classes that inherit from ``markdown.inlinepatterns.Pattern`` or +one of its children. Each pattern object uses a single regular expression and +must have the following methods: + +* **``getCompiledRegExp()``**: + + Returns a compiled regular expression. + +* **``handleMatch(m)``**: + + Accepts a match object and returns an ElementTree element of a plain + Unicode string. + +Note that any regular expression returned by ``getCompiledRegExp`` must capture +the whole block. Therefore, they should all start with ``r'^(.*?)'`` and end +with ``r'(.*?)!'``. When using the default ``getCompiledRegExp()`` method +provided in the ``Pattern`` you can pass in a regular expression without that +and ``getCompiledRegExp`` will wrap your expression for you and set the +`re.DOTALL` and `re.UNICODE` flags. This means that the first group of your +match will be ``m.group(2)`` as ``m.group(1)`` will match everything before the +pattern. + +For an example, consider this simplified emphasis pattern: + + from markdown.inlinepatterns import Pattern + from markdown.util import etree + + class EmphasisPattern(Pattern): + def handleMatch(self, m): + el = etree.Element('em') + el.text = m.group(3) + return el + +As discussed in [Integrating Your Code Into Markdown][], an instance of this +class will need to be provided to Markdown. That instance would be created +like so: + + # an oversimplified regex + MYPATTERN = r'\*([^*]+)\*' + # pass in pattern and create instance + emphasis = EmphasisPattern(MYPATTERN) + +Actually it would not be necessary to create that pattern (and not just because +a more sophisticated emphasis pattern already exists in Markdown). The fact is, +that example pattern is not very DRY. A pattern for `**strong**` text would +be almost identical, with the exception that it would create a 'strong' element. +Therefore, Markdown provides a number of generic pattern classes that can +provide some common functionality. For example, both emphasis and strong are +implemented with separate instances of the ``SimpleTagPettern`` listed below. +Feel free to use or extend any of the Pattern classes found at `markdown.inlinepatterns`. + +**Generic Pattern Classes** + +* **``SimpleTextPattern(pattern)``**: + + Returns simple text of ``group(2)`` of a ``pattern``. + +* **``SimpleTagPattern(pattern, tag)``**: + + Returns an element of type "`tag`" with a text attribute of ``group(3)`` + of a ``pattern``. ``tag`` should be a string of a HTML element (i.e.: 'em'). + +* **``SubstituteTagPattern(pattern, tag)``**: + + Returns an element of type "`tag`" with no children or text (i.e.: 'br'). + +There may be other Pattern classes in the Markdown source that you could extend +or use as well. Read through the source and see if there is anything you can +use. You might even get a few ideas for different approaches to your specific +situation. + +<h3 id="treeprocessors">Treeprocessors</h3> + +Treeprocessors manipulate an ElemenTree object after it has passed through the +core BlockParser. This is where additional manipulation of the tree takes +place. Additionally, the InlineProcessor is a Treeprocessor which steps through +the tree and runs the InlinePatterns on the text of each Element in the tree. + +A Treeprocessor should inherit from ``markdown.treeprocessors.Treeprocessor``, +over-ride the ``run`` method which takes one argument ``root`` (an Elementree +object) and returns either that root element or a modified root element. + +A pseudo example: + + from markdown.treprocessors import Treeprocessor + + class MyTreeprocessor(Treeprocessor): + def run(self, root): + #do stuff + return my_modified_root + +For specifics on manipulating the ElementTree, see +[Working with the ElementTree][] below. + +<h3 id="postprocessors">Postprocessors</h3> + +Postprocessors manipulate the document after the ElementTree has been +serialized into a string. Postprocessors should be used to work with the +text just before output. + +A Postprocessor should inherit from ``markdown.postprocessors.Postprocessor`` +and over-ride the ``run`` method which takes one argument ``text`` and returns +a Unicode string. + +Postprocessors are run after the ElementTree has been serialized back into +Unicode text. For example, this may be an appropriate place to add a table of +contents to a document: + + from markdown.postprocessors import Postprocessor + + class TocPostprocessor(Postprocessor): + def run(self, text): + return MYMARKERRE.sub(MyToc, text) + +<h3 id="blockparser">BlockParser</h3> + +Sometimes, pre/tree/postprocessors and Inline Patterns aren't going to do what +you need. Perhaps you want a new type of block type that needs to be integrated +into the core parsing. In such a situation, you can add/change/remove +functionality of the core ``BlockParser``. The BlockParser is composed of a +number of Blockproccessors. The BlockParser steps through each block of text +(split by blank lines) and passes each block to the appropriate Blockprocessor. +That Blockprocessor parses the block and adds it to the ElementTree. The +[[Definition Lists]] extension would be a good example of an extension that +adds/modifies Blockprocessors. + +A Blockprocessor should inherit from ``markdown.blockprocessors.BlockProcessor`` +and implement both the ``test`` and ``run`` methods. + +The ``test`` method is used by BlockParser to identify the type of block. +Therefore the ``test`` method must return a boolean value. If the test returns +``True``, then the BlockParser will call that Blockprocessor's ``run`` method. +If it returns ``False``, the BlockParser will move on to the next +BlockProcessor. + +The **``test``** method takes two arguments: + +* **``parent``**: The parent etree Element of the block. This can be useful as + the block may need to be treated differently if it is inside a list, for + example. + +* **``block``**: A string of the current block of text. The test may be a + simple string method (such as ``block.startswith(some_text)``) or a complex + regular expression. + +The **``run``** method takes two arguments: + +* **``parent``**: A pointer to the parent etree Element of the block. The run + method will most likely attach additional nodes to this parent. Note that + nothing is returned by the method. The Elementree object is altered in place. + +* **``blocks``**: A list of all remaining blocks of the document. Your run + method must remove (pop) the first block from the list (which it altered in + place - not returned) and parse that block. You may find that a block of text + legitimately contains multiple block types. Therefore, after processing the + first type, your processor can insert the remaining text into the beginning + of the ``blocks`` list for future parsing. + +Please be aware that a single block can span multiple text blocks. For example, +The official Markdown syntax rules state that a blank line does not end a +Code Block. If the next block of text is also indented, then it is part of +the previous block. Therefore, the BlockParser was specifically designed to +address these types of situations. If you notice the ``CodeBlockProcessor``, +in the core, you will note that it checks the last child of the ``parent``. +If the last child is a code block (``<pre><code>...</code></pre>``), then it +appends that block to the previous code block rather than creating a new +code block. + +Each BlockProcessor has the following utility methods available: + +* **``lastChild(parent)``**: + + Returns the last child of the given etree Element or ``None`` if it had no + children. + +* **``detab(text)``**: + + Removes one level of indent (four spaces by default) from the front of each + line of the given text string. + +* **``looseDetab(text, level)``**: + + Removes "level" levels of indent (defaults to 1) from the front of each line + of the given text string. However, this methods allows secondary lines to + not be indented as does some parts of the Markdown syntax. + +Each BlockProcessor also has a pointer to the containing BlockParser instance at +``self.parser``, which can be used to check or alter the state of the parser. +The BlockParser tracks it's state in a stack at ``parser.state``. The state +stack is an instance of the ``State`` class. + +**``State``** is a subclass of ``list`` and has the additional methods: + +* **``set(state)``**: + + Set a new state to string ``state``. The new state is appended to the end + of the stack. + +* **``reset()``**: + + Step back one step in the stack. The last state at the end is removed from + the stack. + +* **``isstate(state)``**: + + Test that the top (current) level of the stack is of the given string + ``state``. + +Note that to ensure that the state stack doesn't become corrupted, each time a +state is set for a block, that state *must* be reset when the parser finishes +parsing that block. + +An instance of the **``BlockParser``** is found at ``Markdown.parser``. +``BlockParser`` has the following methods: + +* **``parseDocument(lines)``**: + + Given a list of lines, an ElementTree object is returned. This should be + passed an entire document and is the only method the ``Markdown`` class + calls directly. + +* **``parseChunk(parent, text)``**: + + Parses a chunk of markdown text composed of multiple blocks and attaches + those blocks to the ``parent`` Element. The ``parent`` is altered in place + and nothing is returned. Extensions would most likely use this method for + block parsing. + +* **``parseBlocks(parent, blocks)``**: + + Parses a list of blocks of text and attaches those blocks to the ``parent`` + Element. The ``parent`` is altered in place and nothing is returned. This + method will generally only be used internally to recursively parse nested + blocks of text. + +While is is not recommended, an extension could subclass or completely replace +the ``BlockParser``. The new class would have to provide the same public API. +However, be aware that other extensions may expect the core parser provided +and will not work with such a drastically different parser. + +<h3 id="working_with_et">Working with the ElementTree</h3> + +As mentioned, the Markdown parser converts a source document to an +[ElementTree][] object before serializing that back to Unicode text. +Markdown has provided some helpers to ease that manipulation within the context +of the Markdown module. + +First, to get access to the ElementTree module import ElementTree from +``markdown`` rather than importing it directly. This will ensure you are using +the same version of ElementTree as markdown. The module is found at +``markdown.util.etree`` within Markdown. + + from markdown.util import etree + +``markdown.util.etree`` tries to import ElementTree from any known location, +first as a standard library module (from ``xml.etree`` in Python 2.5), then as +a third party package (``Elementree``). In each instance, ``cElementTree`` is +tried first, then ``ElementTree`` if the faster C implementation is not +available on your system. + +Sometimes you may want text inserted into an element to be parsed by +[InlinePatterns][]. In such a situation, simply insert the text as you normally +would and the text will be automatically run through the InlinePatterns. +However, if you do *not* want some text to be parsed by InlinePatterns, +then insert the text as an ``AtomicString``. + + from markdown.util import AtomicString + some_element.text = AtomicString(some_text) + +Here's a basic example which creates an HTML table (note that the contents of +the second cell (``td2``) will be run through InlinePatterns latter): + + table = etree.Element("table") + table.set("cellpadding", "2") # Set cellpadding to 2 + tr = etree.SubElement(table, "tr") # Add child tr to table + td1 = etree.SubElement(tr, "td") # Add child td1 to tr + td1.text = markdown.AtomicString("Cell content") # Add plain text content + td2 = etree.SubElement(tr, "td") # Add second td to tr + td2.text = "*text* with **inline** formatting." # Add markup text + table.tail = "Text after table" # Add text after table + +You can also manipulate an existing tree. Consider the following example which +adds a ``class`` attribute to ``<a>`` elements: + + def set_link_class(self, element): + for child in element: + if child.tag == "a": + child.set("class", "myclass") #set the class attribute + set_link_class(child) # run recursively on children + +For more information about working with ElementTree see the ElementTree +[Documentation](http://effbot.org/zone/element-index.htm) +([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)). + +<h3 id="integrating_into_markdown">Integrating Your Code Into Markdown</h3> + +Once you have the various pieces of your extension built, you need to tell +Markdown about them and ensure that they are run in the proper sequence. +Markdown accepts a ``Extension`` instance for each extension. Therefore, you +will need to define a class that extends ``markdown.extensions.Extension`` and +over-rides the ``extendMarkdown`` method. Within this class you will manage +configuration options for your extension and attach the various processors and +patterns to the Markdown instance. + +It is important to note that the order of the various processors and patterns +matters. For example, if we replace ``http://...`` links with <a> elements, and +*then* try to deal with inline html, we will end up with a mess. Therefore, +the various types of processors and patterns are stored within an instance of +the Markdown class in [OrderedDict][]s. Your ``Extension`` class will need to +manipulate those OrderedDicts appropriately. You may insert instances of your +processors and patterns into the appropriate location in an OrderedDict, remove +a built-in instance, or replace a built-in instance with your own. + +<h4 id="extendmarkdown">extendMarkdown</h4> + +The ``extendMarkdown`` method of a ``markdown.extensions.Extension`` class +accepts two arguments: + +* **``md``**: + + A pointer to the instance of the Markdown class. You should use this to + access the [OrderedDict][]s of processors and patterns. They are found + under the following attributes: + + * ``md.preprocessors`` + * ``md.inlinePatterns`` + * ``md.parser.blockprocessors`` + * ``md.treepreprocessors`` + * ``md.postprocessors`` + + Some other things you may want to access in the markdown instance are: + + * ``md.htmlStash`` + * ``md.output_formats`` + * ``md.set_output_format()`` + * ``md.registerExtension()`` + * ``md.html_replacement_text`` + * ``md.tab_length`` + * ``md.enable_attributes`` + * ``md.smart_emphasis`` + +* **``md_globals``**: + + Contains all the various global variables within the markdown module. + +Of course, with access to those items, theoretically you have the option to +changing anything through various [monkey_patching][] techniques. However, you +should be aware that the various undocumented or private parts of markdown +may change without notice and your monkey_patches may break with a new release. +Therefore, what you really should be doing is inserting processors and patterns +into the markdown pipeline. Consider yourself warned. + +[monkey_patching]: http://en.wikipedia.org/wiki/Monkey_patch + +A simple example: + + from markdown.extensions import Extension + + class MyExtension(Extension): + def extendMarkdown(self, md, md_globals): + # Insert instance of 'mypattern' before 'references' pattern + md.inlinePatterns.add('mypattern', MyPattern(md), '<references') + +<h4 id="ordereddict">OrderedDict</h4> + +An OrderedDict is a dictionary like object that retains the order of it's +items. The items are ordered in the order in which they were appended to +the OrderedDict. However, an item can also be inserted into the OrderedDict +in a specific location in relation to the existing items. + +Think of OrderedDict as a combination of a list and a dictionary as it has +methods common to both. For example, you can get and set items using the +``od[key] = value`` syntax and the methods ``keys()``, ``values()``, and +``items()`` work as expected with the keys, values and items returned in the +proper order. At the same time, you can use ``insert()``, ``append()``, and +``index()`` as you would with a list. + +Generally speaking, within Markdown extensions you will be using the special +helper method ``add()`` to add additional items to an existing OrderedDict. + +The ``add()`` method accepts three arguments: + +* **``key``**: A string. The key is used for later reference to the item. + +* **``value``**: The object instance stored in this item. + +* **``location``**: Optional. The items location in relation to other items. + + Note that the location can consist of a few different values: + + * The special strings ``"_begin"`` and ``"_end"`` insert that item at the + beginning or end of the OrderedDict respectively. + + * A less-than sign (``<``) followed by an existing key (i.e.: + ``"<somekey"``) inserts that item before the existing key. + + * A greater-than sign (``>``) followed by an existing key (i.e.: + ``">somekey"``) inserts that item after the existing key. + +Consider the following example: + + >>> from markdown.odict import OrderedDict + >>> od = OrderedDict() + >>> od['one'] = 1 # The same as: od.add('one', 1, '_begin') + >>> od['three'] = 3 # The same as: od.add('three', 3, '>one') + >>> od['four'] = 4 # The same as: od.add('four', 4, '_end') + >>> od.items() + [("one", 1), ("three", 3), ("four", 4)] + +Note that when building an OrderedDict in order, the extra features of the +``add`` method offer no real value and are not necessary. However, when +manipulating an existing OrderedDict, ``add`` can be very helpful. So let's +insert another item into the OrderedDict. + + >>> od.add('two', 2, '>one') # Insert after 'one' + >>> od.values() + [1, 2, 3, 4] + +Now let's insert another item. + + >>> od.add('twohalf', 2.5, '<three') # Insert before 'three' + >>> od.keys() + ["one", "two", "twohalf", "three", "four"] + +Note that we also could have set the location of "twohalf" to be 'after two' +(i.e.: ``'>two'``). However, it's unlikely that you will have control over the +order in which extensions will be loaded, and this could affect the final +sorted order of an OrderedDict. For example, suppose an extension adding +'twohalf' in the above examples was loaded before a separate extension which +adds 'two'. You may need to take this into consideration when adding your +extension components to the various markdown OrderedDicts. + +Once an OrderedDict is created, the items are available via key: + + MyNode = od['somekey'] + +Therefore, to delete an existing item: + + del od['somekey'] + +To change the value of an existing item (leaving location unchanged): + + od['somekey'] = MyNewObject() + +To change the location of an existing item: + + t.link('somekey', '<otherkey') + +<h4 id="registerextension">registerExtension</h4> + +Some extensions may need to have their state reset between multiple runs of the +Markdown class. For example, consider the following use of the [[Footnotes]] +extension: + + md = markdown.Markdown(extensions=['footnotes']) + html1 = md.convert(text_with_footnote) + md.reset() + html2 = md.convert(text_without_footnote) + +Without calling ``reset``, the footnote definitions from the first document will +be inserted into the second document as they are still stored within the class +instance. Therefore the ``Extension`` class needs to define a ``reset`` method +that will reset the state of the extension (i.e.: ``self.footnotes = {}``). +However, as many extensions do not have a need for ``reset``, ``reset`` is only +called on extensions that are registered. + +To register an extension, call ``md.registerExtension`` from within your +``extendMarkdown`` method: + + + def extendMarkdown(self, md, md_globals): + md.registerExtension(self) + # insert processors and patterns here + +Then, each time ``reset`` is called on the Markdown instance, the ``reset`` +method of each registered extension will be called as well. You should also +note that ``reset`` will be called on each registered extension after it is +initialized the first time. Keep that in mind when over-riding the extension's +``reset`` method. + +<h4 id="configsettings">Config Settings</h4> + +If an extension uses any parameters that the user may want to change, +those parameters should be stored in ``self.config`` of your +``markdown.Extension`` class in the following format: + + self.config = {parameter_1_name : [value1, description1], + parameter_2_name : [value2, description2] } + +When stored this way the config parameters can be over-ridden from the +command line or at the time Markdown is initiated: + + markdown.py -x myextension(SOME_PARAM=2) inputfile.txt > output.txt + +Note that parameters should always be assumed to be set to string +values, and should be converted at run time. For example: + + i = int(self.getConfig("SOME_PARAM")) + +<h4 id="makeextension">makeExtension</h4> + +Each extension should ideally be placed in its own module starting +with the ``mdx_`` prefix (e.g. ``mdx_footnotes.py``). The module must +provide a module-level function called ``makeExtension`` that takes +an optional parameter consisting of a dictionary of configuration over-rides +and returns an instance of the extension. An example from the footnote +extension: + + def makeExtension(configs=None) : + return FootnoteExtension(configs=configs) + +By following the above example, when Markdown is passed the name of your +extension as a string (i.e.: ``'footnotes'``), it will automatically import +the module and call the ``makeExtension`` function initiating your extension. + +You may have noted that the extensions packaged with Python-Markdown do not +use the ``mdx_`` prefix in their module names. This is because they are all +part of the ``markdown.extensions`` package. Markdown will first try to import +from ``markdown.extensions.extname`` and upon failure, ``mdx_extname``. If both +fail, Markdown will continue without the extension. + +However, Markdown will also accept an already existing instance of an extension. +For example: + + import markdown + import myextension + configs = {...} + myext = myextension.MyExtension(configs=configs) + md = markdown.Markdown(extensions=[myext]) + +This is useful if you need to implement a large number of extensions with more +than one residing in a module. + +[Preprocessors]: #preprocessors +[InlinePatterns]: #inlinepatterns +[Treeprocessors]: #treeprocessors +[Postprocessors]: #postprocessors +[BlockParser]: #blockparser +[Working with the ElementTree]: #working_with_et +[Integrating your code into Markdown]: #integrating_into_markdown +[extendMarkdown]: #extendmarkdown +[OrderedDict]: #ordereddict +[registerExtension]: #registerextension +[Config Settings]: #configsettings +[makeExtension]: #makeextension +[ElementTree]: http://effbot.org/zone/element-index.htm |