From b62ddeda02fadcd09def9354eb2ef46a7562a106 Mon Sep 17 00:00:00 2001
From: Waylan Limberg
Date: Wed, 6 Dec 2017 23:18:29 -0500
Subject: Switch docs to MKDocs (#602)

Fixes #601.

Merged in 6f87b32 from the md3 branch and did a lot of cleanup.

Changes include:

* Removed old docs build tool, templates, etc.
* Added MkDocs config file, etc.
* filename.txt => filename.md
* pythonhost.org/Markdown => Python-Markdown.github.io
* Markdown lint and other cleanup.
* Automate pages deployment in makefile with `mkdocs gh-deploy`

Assumes a git remote is set up named "pages". Do

    git remote add pages https://github.com/Python-Markdown/Python-Markdown.github.io.git

... before running `make deploy` the first time.
---
 docs/extensions/api.md | 720 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 720 insertions(+)
 create mode 100644 docs/extensions/api.md

diff --git a/docs/extensions/api.md b/docs/extensions/api.md
new file mode 100644
index 0000000..3d8cfff
--- /dev/null
+++ b/docs/extensions/api.md
@@ -0,0 +1,720 @@
+title: Extensions API
+
+# Writing Extensions for Python-Markdown
+
+Python-Markdown includes an API for extension writers to plug their own
+custom functionality and/or syntax into the parser. There are Preprocessors
+which allow you to alter the source before it is passed to the parser,
+inline patterns which allow you to add, remove or override the syntax of
+any inline elements, and Postprocessors which allow munging of the
+output of the parser before it is returned. If you really want to dive in,
+there are also Blockprocessors which are part of the core BlockParser.
+
+As the parser builds an [ElementTree][ElementTree] object which is later rendered
+as Unicode text, there are also some helpers provided to ease manipulation of
+the tree. Each part of the API is discussed in its respective section below.
+Additionally, reading the source of some [Available Extensions][] may be
+helpful.
+For example, the [Footnotes][] extension uses most of the features
+documented here.
+
+## Preprocessors {: #preprocessors }
+
+Preprocessors munge the source text before it is passed into the Markdown
+core. This is an excellent place to clean up bad syntax, extract things the
+parser may otherwise choke on and perhaps even store them for later retrieval.
+
+Preprocessors should inherit from `markdown.preprocessors.Preprocessor` and
+implement a `run` method with one argument `lines`. The `run` method of
+each Preprocessor will be passed the entire source text as a list of Unicode
+strings. Each string will contain one line of text. The `run` method should
+return either that list, or an altered list of Unicode strings.
+
+A pseudo example:
+
+```python
+import re
+
+from markdown.preprocessors import Preprocessor
+
+# Placeholder pattern for illustration; substitute your own syntax.
+MYREGEX = re.compile(r'^%%\s*(.*)$')
+
+class MyPreprocessor(Preprocessor):
+    def run(self, lines):
+        new_lines = []
+        for line in lines:
+            m = MYREGEX.match(line)
+            if m:
+                # do stuff (here the matched line is simply dropped)
+                pass
+            else:
+                new_lines.append(line)
+        return new_lines
+```
+
+## Inline Patterns {: #inlinepatterns }
+
+Inline Patterns implement the inline HTML element syntax for Markdown such as
+`*emphasis*` or `[links](http://example.com)`. Pattern objects should be
+instances of classes that inherit from `markdown.inlinepatterns.Pattern` or
+one of its children. Each pattern object uses a single regular expression and
+must have the following methods:
+
+* **`getCompiledRegExp()`**:
+
+    Returns a compiled regular expression.
+
+* **`handleMatch(m)`**:
+
+    Accepts a match object and returns an ElementTree element or a plain
+    Unicode string.
+
+Also, Inline Patterns can define the property `ANCESTOR_EXCLUDES` with either
+a list or tuple of undesirable ancestors. The pattern should not match if it
+would cause the content to be a descendant of one of the defined tag names.
+
+Note that any regular expression returned by `getCompiledRegExp` must capture
+the whole block.
+Therefore, they should all start with `r'^(.*?)'` and end
+with `r'(.*)$'`. When using the default `getCompiledRegExp()` method
+provided in the `Pattern` class you can pass in a regular expression without
+that and `getCompiledRegExp` will wrap your expression for you and set the
+`re.DOTALL` and `re.UNICODE` flags. This means that the first group of your
+match will be `m.group(2)` as `m.group(1)` will match everything before the
+pattern.
+
+For an example, consider this simplified emphasis pattern:
+
+```python
+from markdown.inlinepatterns import Pattern
+from markdown.util import etree
+
+class EmphasisPattern(Pattern):
+    def handleMatch(self, m):
+        el = etree.Element('em')
+        # group(1) is the text before the pattern; group(2) is the first
+        # group of the pattern itself (the emphasized text).
+        el.text = m.group(2)
+        return el
+```
+
+As discussed in [Integrating Your Code Into Markdown][], an instance of this
+class will need to be provided to Markdown. That instance would be created
+like so:
+
+```python
+# an oversimplified regex
+MYPATTERN = r'\*([^*]+)\*'
+# pass in pattern and create instance
+emphasis = EmphasisPattern(MYPATTERN)
+```
+
+Actually it would not be necessary to create that pattern (and not just because
+a more sophisticated emphasis pattern already exists in Markdown). The fact is,
+that example pattern is not very DRY. A pattern for `**strong**` text would
+be almost identical, with the exception that it would create a 'strong' element.
+Therefore, Markdown provides a number of generic pattern classes that can
+provide some common functionality. For example, both emphasis and strong are
+implemented with separate instances of the `SimpleTagPattern` listed below.
+Feel free to use or extend any of the Pattern classes found at
+`markdown.inlinepatterns`.
+
+### Generic Pattern Classes
+
+* **`SimpleTextPattern(pattern)`**:
+
+    Returns simple text of `group(2)` of a `pattern`.
+
+* **`SimpleTagPattern(pattern, tag)`**:
+
+    Returns an element of type "`tag`" with a text attribute of `group(3)`
+    of a `pattern`.
+    `tag` should be a string of an HTML element (e.g. 'em').
+
+* **`SubstituteTagPattern(pattern, tag)`**:
+
+    Returns an element of type "`tag`" with no children or text (e.g. `br`).
+
+There may be other Pattern classes in the Markdown source that you could extend
+or use as well. Read through the source and see if there is anything you can
+use. You might even get a few ideas for different approaches to your specific
+situation.
+
+## Treeprocessors {: #treeprocessors }
+
+Treeprocessors manipulate an ElementTree object after it has passed through the
+core BlockParser. This is where additional manipulation of the tree takes
+place. Additionally, the InlineProcessor is a Treeprocessor which steps through
+the tree and runs the Inline Patterns on the text of each Element in the tree.
+
+A Treeprocessor should inherit from `markdown.treeprocessors.Treeprocessor`
+and override the `run` method, which takes one argument `root` (an ElementTree
+object) and either modifies that root element and returns `None` or returns a
+new ElementTree object.
+
+A pseudo example:
+
+```python
+from markdown.treeprocessors import Treeprocessor
+
+class MyTreeprocessor(Treeprocessor):
+    def run(self, root):
+        root.text = 'modified content'
+```
+
+Note that Python methods return `None` by default when no `return`
+statement is defined. Additionally, all Python variables refer to objects by
+reference. Therefore, the above `run` method modifies the `root` element
+in place and returns `None`. The changes made to the `root` element and its
+children are retained.
+
+Some may be inclined to return the modified `root` element. While that would
+work, it would cause a copy of the entire ElementTree to be generated each
+time the Treeprocessor is run. Therefore, it is generally expected that
+the `run` method would only return `None` or a new ElementTree object.
+
+For specifics on manipulating the ElementTree, see
+[Working with the ElementTree][workingwithetree] below.
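
The in-place behaviour described above can be tried without Markdown installed. The following is a minimal sketch, not the library's code: it uses the standard library's `xml.etree.ElementTree` directly, and `HeadingClassSketch` is a hypothetical stand-in for a class that would subclass `markdown.treeprocessors.Treeprocessor` in a real extension.

```python
import xml.etree.ElementTree as etree


class HeadingClassSketch:
    """Hypothetical stand-in for a Treeprocessor subclass."""

    def run(self, root):
        # Visit root and all of its descendants, tagging every <h1> element.
        for el in root.iter('h1'):
            el.set('class', 'title')
        # No return statement: the tree is modified in place.


# Build a tiny tree by hand, as the core parser would have done.
root = etree.Element('div')
heading = etree.SubElement(root, 'h1')
heading.text = 'Hello'
para = etree.SubElement(root, 'p')
para.text = 'World'

result = HeadingClassSketch().run(root)
html = etree.tostring(root, encoding='unicode')
print(html)
```

Because `run` returns `None`, the caller keeps working with the same `root` object, which now carries the new attribute.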
+
+## Postprocessors {: #postprocessors }
+
+Postprocessors manipulate the document after the ElementTree has been
+serialized into a string. Postprocessors should be used to work with the
+text just before output.
+
+A Postprocessor should inherit from `markdown.postprocessors.Postprocessor`
+and override the `run` method which takes one argument `text` and returns
+a Unicode string.
+
+For example, this may be an appropriate place to add a table of
+contents to a document:
+
+```python
+import re
+from markdown.postprocessors import Postprocessor
+
+# Placeholder marker and replacement, for illustration only.
+MYMARKERRE = re.compile(r'\[TOC\]')
+MyToc = '<div class="toc"></div>'
+
+class TocPostprocessor(Postprocessor):
+    def run(self, text):
+        return MYMARKERRE.sub(MyToc, text)
+```
+
+## BlockParser {: #blockparser }
+
+Sometimes, Preprocessors, Treeprocessors, Postprocessors, and Inline Patterns
+are not going to do what you need. Perhaps you want a new block type
+that needs to be integrated into the core parsing. In such a situation, you can
+add/change/remove functionality of the core `BlockParser`. The BlockParser is
+composed of a number of Blockprocessors. The BlockParser steps through each
+block of text (split by blank lines) and passes each block to the appropriate
+Blockprocessor. That Blockprocessor parses the block and adds it to the
+ElementTree. The
+[Definition Lists][] extension would be a good example of an extension that
+adds/modifies Blockprocessors.
+
+A Blockprocessor should inherit from `markdown.blockprocessors.BlockProcessor`
+and implement both the `test` and `run` methods.
+
+The `test` method is used by BlockParser to identify the type of block.
+Therefore the `test` method must return a Boolean value. If the test returns
+`True`, then the BlockParser will call that Blockprocessor's `run` method.
+If it returns `False`, the BlockParser will move on to the next
+Blockprocessor.
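
The `test`/`run` dispatch just described can be sketched with the standard library alone. This is an illustration of the protocol under simplifying assumptions, not the actual `BlockParser` implementation; `ParagraphSketch`, `RuleSketch` and `parse_blocks` are hypothetical names that do not exist in Markdown.

```python
import xml.etree.ElementTree as etree


class ParagraphSketch:
    """Fallback processor: accepts any block."""

    def test(self, parent, block):
        return True

    def run(self, parent, blocks):
        block = blocks.pop(0)  # consume the block being handled
        etree.SubElement(parent, 'p').text = block


class RuleSketch:
    """Treats a block consisting only of '---' as a horizontal rule."""

    def test(self, parent, block):
        return block.strip() == '---'

    def run(self, parent, blocks):
        blocks.pop(0)
        etree.SubElement(parent, 'hr')


def parse_blocks(parent, blocks, processors):
    # Offer the first remaining block to each processor in order; the first
    # one whose test() returns True gets to run().
    while blocks:
        for processor in processors:
            if processor.test(parent, blocks[0]):
                processor.run(parent, blocks)
                break
        else:
            # No processor accepted the block; drop it to avoid looping forever.
            blocks.pop(0)


root = etree.Element('div')
source = 'First paragraph\n\n---\n\nSecond paragraph'
parse_blocks(root, source.split('\n\n'), [RuleSketch(), ParagraphSketch()])
tags = [child.tag for child in root]
print(tags)
```

The order of the processor list matters here: the catch-all paragraph processor comes last so that more specific processors are asked first.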
+
+The **`test`** method takes two arguments:
+
+* **`parent`**: The parent ElementTree Element of the block. This can be useful
+    as the block may need to be treated differently if it is inside a list, for
+    example.
+
+* **`block`**: A string of the current block of text. The test may be a
+    simple string method (such as `block.startswith(some_text)`) or a complex
+    regular expression.
+
+The **`run`** method takes two arguments:
+
+* **`parent`**: A pointer to the parent ElementTree Element of the block. The run
+    method will most likely attach additional nodes to this parent. Note that
+    nothing is returned by the method. The ElementTree object is altered in place.
+
+* **`blocks`**: A list of all remaining blocks of the document. Your run
+    method must remove (pop) the first block from the list (which is altered in
+    place - not returned) and parse that block. You may find that a block of text
+    legitimately contains multiple block types. Therefore, after processing the
+    first type, your processor can insert the remaining text into the beginning
+    of the `blocks` list for future parsing.
+
+Please be aware that a single block can span multiple text blocks. For example,
+the official Markdown syntax rules state that a blank line does not end a
+Code Block. If the next block of text is also indented, then it is part of
+the previous block. Therefore, the BlockParser was specifically designed to
+address these types of situations. If you look at the `CodeBlockProcessor`
+in the core, you will notice that it checks the last child of the `parent`.
+If the last child is a code block (`<pre><code>...</code></pre>`), then it
+appends that block to the previous code block rather than creating a new
+code block.
+
+Each Blockprocessor has the following utility methods available:
+
+* **`lastChild(parent)`**:
+
+    Returns the last child of the given ElementTree Element or `None` if it
+    had no children.
+
+* **`detab(text)`**:
+
+    Removes one level of indent (four spaces by default) from the front of each
+    line of the given text string.
+
+* **`looseDetab(text, level)`**:
+
+    Removes "level" levels of indent (defaults to 1) from the front of each line
+    of the given text string. However, this method allows secondary lines to
+    not be indented, as do some parts of the Markdown syntax.
+
+Each Blockprocessor also has a pointer to the containing BlockParser instance at
+`self.parser`, which can be used to check or alter the state of the parser.
+The BlockParser tracks its state in a stack at `parser.state`. The state
+stack is an instance of the `State` class.
+
+**`State`** is a subclass of `list` and has the additional methods:
+
+* **`set(state)`**:
+
+    Set a new state to string `state`. The new state is appended to the end
+    of the stack.
+
+* **`reset()`**:
+
+    Step back one step in the stack. The last state at the end is removed from
+    the stack.
+
+* **`isstate(state)`**:
+
+    Test that the top (current) level of the stack is of the given string
+    `state`.
+
+Note that to ensure that the state stack does not become corrupted, each time a
+state is set for a block, that state *must* be reset when the parser finishes
+parsing that block.
+
+An instance of the **`BlockParser`** is found at `Markdown.parser`.
+`BlockParser` has the following methods:
+
+* **`parseDocument(lines)`**:
+
+    Given a list of lines, an ElementTree object is returned. This should be
+    passed an entire document and is the only method the `Markdown` class
+    calls directly.
+
+* **`parseChunk(parent, text)`**:
+
+    Parses a chunk of markdown text composed of multiple blocks and attaches
+    those blocks to the `parent` Element. The `parent` is altered in place
+    and nothing is returned. Extensions would most likely use this method for
+    block parsing.
+
+* **`parseBlocks(parent, blocks)`**:
+
+    Parses a list of blocks of text and attaches those blocks to the `parent`
+    Element. The `parent` is altered in place and nothing is returned. This
+    method will generally only be used internally to recursively parse nested
+    blocks of text.
+
+While it is not recommended, an extension could subclass or completely replace
+the `BlockParser`. The new class would have to provide the same public API.
+However, be aware that other extensions may expect the core parser provided
+and will not work with such a drastically different parser.
+
+## Working with the ElementTree {: #working_with_et }
+
+As mentioned, the Markdown parser converts a source document to an
+[ElementTree][ElementTree] object before serializing that back to Unicode text.
+Markdown provides some helpers to ease that manipulation within the context
+of the Markdown module.
+
+First, to get access to the ElementTree module, import ElementTree from
+`markdown` rather than importing it directly. This will ensure you are using
+the same version of ElementTree as markdown. The module is found at
+`markdown.util.etree` within Markdown.
+
+```python
+from markdown.util import etree
+```
+
+`markdown.util.etree` tries to import ElementTree from any known location,
+first as a standard library module (from `xml.etree` in Python 2.5), then as
+a third party package (ElementTree). In each instance, `cElementTree` is
+tried first, then ElementTree if the faster C implementation is not
+available on your system.
+
+Sometimes you may want text inserted into an element to be parsed by
+[Inline Patterns][].
+In such a situation, simply insert the text as you normally
+would and the text will be automatically run through the Inline Patterns.
+However, if you do *not* want some text to be parsed by Inline Patterns,
+then insert the text as an `AtomicString`.
+
+```python
+from markdown.util import AtomicString
+some_element.text = AtomicString(some_text)
+```
+
+Here's a basic example which creates an HTML table (note that the contents of
+the second cell (`td2`) will be run through Inline Patterns later):
+
+```python
+from markdown.util import etree, AtomicString
+
+table = etree.Element("table")
+table.set("cellpadding", "2")                      # Set cellpadding to 2
+tr = etree.SubElement(table, "tr")                 # Add child tr to table
+td1 = etree.SubElement(tr, "td")                   # Add child td1 to tr
+td1.text = AtomicString("Cell content")            # Add plain text content
+td2 = etree.SubElement(tr, "td")                   # Add second td to tr
+td2.text = "*text* with **inline** formatting."    # Add markup text
+table.tail = "Text after table"                    # Add text after table
+```
+
+You can also manipulate an existing tree. Consider the following example which
+adds a `class` attribute to `<a>` elements:
+
+```python
+def set_link_class(element):
+    for child in element:
+        if child.tag == "a":
+            child.set("class", "myclass")  # set the class attribute
+        set_link_class(child)  # run recursively on children
+```
+
+For more information about working with ElementTree see the ElementTree
+[Documentation](http://effbot.org/zone/element-index.htm)
+([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)).
+
+## Integrating Your Code Into Markdown {: #integrating_into_markdown }
+
+Once you have the various pieces of your extension built, you need to tell
+Markdown about them and ensure that they are run in the proper sequence.
+Markdown accepts an `Extension` instance for each extension. Therefore, you
+will need to define a class that extends `markdown.extensions.Extension` and
+overrides the `extendMarkdown` method.
+Within this class you will manage
+configuration options for your extension and attach the various processors and
+patterns to the Markdown instance.
+
+It is important to note that the order of the various processors and patterns
+matters. For example, if we replace `http://...` links with `<a>` elements,
+and *then* try to deal with inline HTML, we will end up with a mess.
+Therefore, the various types of processors and patterns are stored within an
+instance of the Markdown class in [OrderedDict][]s. Your `Extension` class
+will need to manipulate those OrderedDicts appropriately. You may insert
+instances of your processors and patterns into the appropriate location in an
+OrderedDict, remove a built-in instance, or replace a built-in instance with
+your own.
+
+### `extendMarkdown` {: #extendmarkdown }
+
+The `extendMarkdown` method of a `markdown.extensions.Extension` class
+accepts two arguments:
+
+* **`md`**:
+
+    A pointer to the instance of the Markdown class. You should use this to
+    access the [OrderedDict][]s of processors and patterns. They are found
+    under the following attributes:
+
+    * `md.preprocessors`
+    * `md.inlinePatterns`
+    * `md.parser.blockprocessors`
+    * `md.treeprocessors`
+    * `md.postprocessors`
+
+    Some other things you may want to access in the markdown instance are:
+
+    * `md.htmlStash`
+    * `md.output_formats`
+    * `md.set_output_format()`
+    * `md.output_format`
+    * `md.serializer`
+    * `md.registerExtension()`
+    * `md.html_replacement_text`
+    * `md.tab_length`
+    * `md.enable_attributes`
+    * `md.smart_emphasis`
+
+* **`md_globals`**:
+
+    Contains all the various global variables within the markdown module.
+
+!!! Warning
+    With access to the above items, theoretically you have the option to
+    change anything through various [monkey_patching][] techniques. However,
+    you should be aware that the various undocumented parts of markdown may
+    change without notice and your monkey_patches may break with a new release.
+    Therefore, what you really should be doing is inserting processors and
+    patterns into the markdown pipeline. Consider yourself warned!
+
+[monkey_patching]: http://en.wikipedia.org/wiki/Monkey_patch
+
+A simple example:
+
+```python
+from markdown.extensions import Extension
+
+class MyExtension(Extension):
+    def extendMarkdown(self, md, md_globals):
+        # Insert instance of 'mypattern' before 'references' pattern
+        md.inlinePatterns.add('mypattern', MyPattern(md), '<references')
+```
+
+The third argument to `add` is the location of the new item. A less-than
+sign (`<`) followed by an existing key (i.e.: `"<somekey"`) inserts that
+item before the existing key. A greater-than sign (`>`) followed by an
+existing key (i.e.: `">somekey"`) inserts that item after the existing key.
+The special strings `'_begin'` and `'_end'` insert the item at the beginning
+or end of the OrderedDict.
+
+Consider the following example:
+
+```pycon
+>>> from markdown.odict import OrderedDict
+>>> od = OrderedDict()
+>>> od['one'] = 1           # The same as: od.add('one', 1, '_begin')
+>>> od['three'] = 3         # The same as: od.add('three', 3, '>one')
+>>> od['four'] = 4          # The same as: od.add('four', 4, '_end')
+>>> od.items()
+[("one", 1), ("three", 3), ("four", 4)]
+```
+
+Note that when building an OrderedDict in order, the extra features of the
+`add` method offer no real value and are not necessary. However, when
+manipulating an existing OrderedDict, `add` can be very helpful. So let's
+insert another item into the OrderedDict.
+
+```pycon
+>>> od.add('two', 2, '>one')         # Insert after 'one'
+>>> od.values()
+[1, 2, 3, 4]
+```
+
+Now let's insert another item.
+
+```pycon
+>>> od.add('two-point-five', 2.5, '<three') # Insert before 'three'
+>>> od.keys()
+["one", "two", "two-point-five", "three", "four"]
+```
+
+Note that we also could have set the location of "two-point-five" to be 'after two'
+(i.e.: `'>two'`). However, it's unlikely that you will have control over the
+order in which extensions will be loaded, and this could affect the final
+sorted order of an OrderedDict. For example, suppose an extension adding
+"two-point-five" in the above examples was loaded before a separate extension
+which adds 'two'. You may need to take this into consideration when adding your
+extension components to the various markdown OrderedDicts.
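
To make the location strings concrete, here is a standalone sketch of the insertion rules applied to a plain list of `(key, value)` pairs. It is an assumption-laden illustration, not the code in `markdown.odict`; the helper `add_at` is hypothetical, and it assumes that a `<` prefix means "before the named key", mirroring the `>` prefix for "after".

```python
def add_at(items, key, value, location):
    """Insert (key, value) into a list of pairs according to `location`."""
    keys = [k for k, _ in items]
    if location == '_begin':
        index = 0
    elif location == '_end':
        index = len(items)
    elif location.startswith('<'):
        index = keys.index(location[1:])      # before the named key
    elif location.startswith('>'):
        index = keys.index(location[1:]) + 1  # after the named key
    else:
        raise ValueError('unsupported location: %r' % location)
    items.insert(index, (key, value))


od = []
add_at(od, 'one', 1, '_begin')
add_at(od, 'three', 3, '>one')
add_at(od, 'four', 4, '_end')
add_at(od, 'two', 2, '>one')
add_at(od, 'two-point-five', 2.5, '<three')
keys = [k for k, _ in od]
print(keys)
```

Note that `keys.index` raises `ValueError` when the referenced key is missing, which is one way the load-order problem mentioned above could surface in this sketch.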
+
+Once an OrderedDict is created, the items are available via key:
+
+```python
+MyNode = od['somekey']
+```
+
+Therefore, to delete an existing item:
+
+```python
+del od['somekey']
+```
+
+To change the value of an existing item (leaving location unchanged):
+
+```python
+od['somekey'] = MyNewObject()
+```
+
+To change the location of an existing item:
+
+```python
+t.link('somekey', '