Parser for RST documents
RST cannot be described by a context-free grammar, so the common parser approaches do not work.
Parser basics
-------------
We decided to implement a parser roughly following the scheme of common shift-reduce parsers with a dynamic lookahead.
There is a map of parser tokens to internal callback methods, which are called in the defined order when the main parser method reaches the respective token in the provided token array. Each shift method is called with the respective token and the array of subsequent, not yet handled, tokens.
These methods are expected to return one of three values: false, if the current token cannot be shifted by the called rule; true, if the token has been handled but no document node has been created from it; or a new ezcDocumentRstNode object, which is an AST node. When a shift method returns false, the next shift method in the array is called to handle the token.
The returned ezcDocumentRstNode objects are put on the document stack in the order they are found in the token array.
The reductions array maps node types to reduction callbacks, which are called when such a node has been added to the document stack. Each reduction method may either return false, if it could not handle the given node, or a new node. The reduction methods often manipulate the document stack, for example by searching backwards and aggregating nodes.
If a reduction method returns a node the parser reenters the reduction process with the new node.
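The dispatch scheme described above can be sketched in a few lines of PHP. This is a minimal illustration only, not the actual ezcDocumentRstParser code: the classes `SketchNode` and `SketchParser`, the token layout (plain arrays with `type` and `content` keys) and the `paragraph_end` marker node are all assumptions made for the example.

```php
<?php
// Hypothetical miniature of the dispatch scheme described above. None of the
// names below are taken from ezcDocumentRstParser.
final class SketchNode
{
    public string $type;
    public array $children;

    public function __construct(string $type, array $children = [])
    {
        $this->type = $type;
        $this->children = $children;
    }
}

final class SketchParser
{
    /** @var SketchNode[] the document stack */
    public array $stack = [];

    /** Token type => shift callbacks, tried in the given order. */
    private array $shifts = [
        'text'    => ['shiftText'],
        'newline' => ['shiftParagraphEnd'],
    ];

    /** Node type => reduction callbacks. */
    private array $reductions = [
        'paragraph_end' => ['reduceParagraph'],
    ];

    public function parse(array $tokens): array
    {
        while ($tokens) {
            $token = array_shift($tokens);
            foreach ($this->shifts[$token['type']] ?? [] as $method) {
                $result = $this->$method($token, $tokens);
                if ($result === false) {
                    continue; // rule does not apply, try the next one
                }
                if ($result instanceof SketchNode) {
                    $this->reduce($result);
                }
                break; // handled: either true or a node
            }
        }
        return $this->stack;
    }

    private function reduce(SketchNode $node): void
    {
        foreach ($this->reductions[$node->type] ?? [] as $method) {
            $reduced = $this->$method($node);
            if ($reduced instanceof SketchNode) {
                $this->reduce($reduced); // re-enter with the new node
                return;
            }
        }
        $this->stack[] = $node; // no reduction produced a node: keep it
    }

    private function shiftText(array $token, array $tokens)
    {
        return new SketchNode('text', [$token['content']]);
    }

    private function shiftParagraphEnd(array $token, array $tokens)
    {
        return new SketchNode('paragraph_end');
    }

    /** Searches backwards on the stack and aggregates trailing text nodes. */
    private function reduceParagraph(SketchNode $node)
    {
        $children = [];
        while ($this->stack && end($this->stack)->type === 'text') {
            array_unshift($children, array_pop($this->stack));
        }
        return $children ? new SketchNode('paragraph', $children) : false;
    }
}
```

In this sketch a node is only pushed to the stack once no reduction returns a new node for it, which simplifies the order of operations described above.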
The state of the RST parser heavily depends on the current indentation level, which is stored in the class property $indentation, and mainly modified in the special shift method updateIndentation(), which is called on each line break token.
Some of the shift methods aggregate additional tokens from the token array, bypassing the main parser method. This should only be done if no common handling is required for the aggregated tokens.
Tables
------
The handling of RST tables is quite complex, and the affiliation of tokens to nodes depends on the line and character position of each token. In this case the tokens are first aggregated into their cell contexts and re-enter the parser afterwards.
For token lists that need to re-enter the parser independently of the current global parser state, the method reenterParser() takes the token list, removes the overall indentation, and returns a new document built from the provided token array.
Source for this file: /Document/src/document/rst/parser.php
ezcDocumentParser | --ezcDocumentRstParser
Version: | //autogen// |
REGEXP_INLINE_LINK
= '(
|
PCRE regular expression for detection of URLs in texts. |
protected array |
$blockNodes
= array(
List of node types which are valid block nodes, after which indentation changes may occur, or which can be aggregated into sections. |
protected ezcDocumentRstStack |
$documentStack
= null
Contains a list of detected syntax elements. At the end of a successful parsing process this should contain only one document syntax element. During the process it may contain a list of elements which are still up for reduction. Each element in the stack has to be an object extending ezcDocumentRstNode, which may in turn contain any number of such objects. This way an abstract syntax tree is constructed. |
protected int |
$indentation
= 0
Current indentation of a paragraph / list item. |
protected int |
$postIndentation
= null
For the special case of dense bullet lists we need to update the indentation right after creating a new paragraph, in one action. In this case the indentation to apply after the paragraph creation is stored in this variable. |
protected array |
$reductions
= array(
Array containing simplified reduce ruleset. We cannot express the RST syntax as a usual grammar using a BNF. This structure implements a pseudo grammar by assigning a number of callbacks for internal methods implementing reduction rules for a detected syntax element.
|
protected array |
$shifts
= array(
Array containing simplified shift ruleset. We cannot express the RST syntax as a usual grammar using a BNF. With the pumping lemma for context-free languages [1] you can easily prove that the language a^n b c^n d e^n is not context-free, and this is what the title definitions are. This structure contains an array with callbacks implementing the shift rules for all tokens. There may be multiple rules for a single token. The callbacks themselves create syntax elements and push them to the document stack. After each push the reduction callbacks are called for the pushed elements. The array should look like:
[1] http://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages |
protected array |
$shortDirectives
= array(
List of builtin directives which do not aggregate more text than the parameters and options. User-defined directives always aggregate following indented text. |
protected array |
$textNodes
= array(
List of node types, which can be considered as inline text nodes. |
protected array |
$titleLevels
= array()
Array with title levels used by the document author in their order. |
From ezcDocumentParser | |
---|---|
protected |
ezcDocumentParser::$options
|
protected |
ezcDocumentParser::$properties
|
public ezcDocumentRstParser |
__construct(
[ $options
= null] )
Construct new document |
protected void |
detectFootnoteType(
$name
)
Tries to detect footnote type |
protected void |
dumpStack(
)
Print a dump of the document stack |
protected ezcDocumentRstSubstitutionNode |
handleSpecialDirectives(
$substitution
, $node
)
Handle special directives |
protected void |
isEnumeratedList(
$tokens
, [ $token
= null] )
Is enumerated list? |
protected boolean |
isInlineEndToken(
$token
, $tokens
)
Check if token is an inline markup end token. |
protected boolean |
isInlineStartToken(
$token
, $tokens
)
Check if token is an inline markup start token. |
public void |
parse(
$tokens
)
Shift- / reduce-parser for RST token stack |
protected array |
readGridTableSpecification(
&$tokens
, $tokens
)
Read grid table specifications |
protected array |
readMutlipleIndentedLines(
$tokens
, [ $strict
= false] )
Read multiple lines |
protected array |
readSimpleCells(
$cellStarts
, &$tokens
, $tokens
)
Read simple cells |
protected array |
readSimpleTableSpecification(
&$tokens
, $tokens
)
Read simple table specifications |
protected array |
readUntil(
$tokens
, $until
)
Read all tokens until one of the given tokens occurs |
protected array |
realignTokens(
$tokens
)
Re-align tokens |
protected void |
reduceBlockquote(
$node
)
Reduce paragraph to blockquote |
protected void |
reduceBlockquoteAnnotation(
$node
)
Reduce blockquote annotation |
protected void |
reduceBlockquoteAnnotationParagraph(
$node
)
Reduce blockquote annotation content |
protected void |
reduceInternalTarget(
$node
)
Reduce internal target |
protected void |
reduceInterpretedText(
$node
)
Reduce interpreted text inline markup |
protected void |
reduceLink(
$node
)
Reduce link |
protected void |
reduceList(
$node
)
Reduce item to bullet list |
protected void |
reduceListItem(
$node
)
Reduce paragraph to bullet list |
protected void |
reduceMarkup(
$node
)
Reduce markup |
protected void |
reduceParagraph(
$node
)
Reduce paragraph |
protected void |
reduceReference(
$node
)
Reduce reference |
protected void |
reduceSection(
$node
)
Reduce prior sections, if a new section has been found. |
protected void |
reduceTitle(
$node
)
Reduce all elements to one document node. |
protected ezcDocumentRstDocumentNode |
reenterParser(
$tokens
, [ $reindent
= true] )
Reenter parser with a list of tokens |
protected ezcDocumentRstMarkupEmphasisNode |
shiftAnonymousHyperlinks(
$token
, $tokens
)
Detect inline markup |
protected ezcDocumentRstMarkupEmphasisNode |
shiftAnonymousReference(
$token
, $tokens
)
Shift anonymous reference target |
protected ezcDocumentRstTextLineNode |
shiftAsWhitespace(
$token
, $tokens
)
Keep the newline as a single whitespace to maintain readability in texts. |
protected ezcDocumentRstTitleNode |
shiftBackslash(
$token
, $tokens
)
Escaping of special characters |
protected ezcDocumentRstMarkupEmphasisNode |
shiftBlockquoteAnnotation(
$token
, $tokens
)
Blockquote annotations |
protected ezcDocumentRstMarkupEmphasisNode |
shiftBulletList(
$token
, $tokens
)
Bullet point lists |
protected ezcDocumentRstMarkupEmphasisNode |
shiftComment(
$token
, $tokens
)
Shift comment |
protected ezcDocumentRstMarkupEmphasisNode |
shiftDefinitionList(
$token
, $tokens
)
Shift definition lists |
protected ezcDocumentRstDirectiveNode |
shiftDirective(
$directive
, $tokens
)
Shift directives |
protected ezcDocumentRstDocumentNode |
shiftDocument(
$token
, $tokens
)
Create new document node |
protected ezcDocumentRstMarkupEmphasisNode |
shiftEnumeratedList(
$token
, $tokens
)
Enumerated lists |
protected ezcDocumentRstMarkupEmphasisNode |
shiftExternalReference(
$token
, $tokens
)
Detect inline markup |
protected ezcDocumentRstMarkupEmphasisNode |
shiftFieldList(
$token
, $tokens
)
Shift field lists |
protected ezcDocumentRstMarkupEmphasisNode |
shiftGridTable(
$token
, $tokens
)
Shift grid table |
protected ezcDocumentRstMarkupEmphasisNode |
shiftInlineLiteral(
$token
, $tokens
)
Detect inline literal |
protected ezcDocumentRstMarkupEmphasisNode |
shiftInlineMarkup(
$token
, $tokens
)
Detect inline markup |
protected ezcDocumentRstMarkupEmphasisNode |
shiftInterpretedTextMarkup(
$token
, $tokens
)
Detect interpreted text inline markup |
protected mixed |
shiftInterpretedTextRole(
$token
, $tokens
)
Try to shift an interpreted text role |
protected ezcDocumentRstTitleNode |
shiftLineBlock(
$token
, $tokens
)
Shift line blocks |
protected ezcDocumentRstMarkupEmphasisNode |
shiftLiteralBlock(
$token
, $tokens
)
Shift literal block |
protected ezcDocumentRstTitleNode |
shiftParagraph(
$token
, $tokens
)
Shift a paragraph node on two newlines |
protected ezcDocumentRstMarkupEmphasisNode |
shiftReference(
$token
, $tokens
)
Detect reference |
protected ezcDocumentRstMarkupEmphasisNode |
shiftSimpleTable(
$token
, $tokens
)
Shift simple table |
protected ezcDocumentRstTextLineNode |
shiftSpecialCharsAsText(
$token
, $tokens
)
Just keep text as text nodes |
protected ezcDocumentRstTitleNode |
shiftText(
$token
, $tokens
)
Just keep text as text nodes |
protected ezcDocumentRstTitleNode |
shiftTitle(
$token
, $tokens
)
Create new title node from titles with a top and bottom line |
protected ezcDocumentRstTitleNode |
shiftTransition(
$token
, $tokens
)
Shift transitions, which are separators in the document. |
protected ezcDocumentRstTextLineNode |
shiftWhitespaceAsText(
$token
, $tokens
)
Just keep text as text nodes |
protected bool |
updateIndentation(
$token
, $tokens
)
Update the current indentation after each newline. |
From ezcDocumentParser | |
---|---|
public ezcDocumentParser |
ezcDocumentParser::__construct()
Construct new document |
public array |
ezcDocumentParser::getErrors()
Return list of errors occurred during visiting the document. |
public void |
ezcDocumentParser::triggerError()
Trigger parser error |
Construct new document
Name | Type | Description |
---|---|---|
$options |
ezcDocumentParserOptions |
Method | Description |
---|---|
ezcDocumentParser::__construct() |
Construct new document |
Tries to detect footnote type
The type of the footnote
Name | Type | Description |
---|---|---|
$name |
array |
Print a dump of the document stack
This function is only for use during debugging of the document stack structure.
Handle special directives
Handle special directives like replace, which require reparsing of the directive's contents. This cannot be done during visiting and therefore has to happen inside the parser.
Name | Type | Description |
---|---|---|
$substitution |
ezcDocumentRstSubstitutionNode | |
$node |
ezcDocumentRstDirectiveNode |
Is enumerated list?
As defined at http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#bullet-lists
Checks if the current token, together with the following tokens, may be an enumerated list. Used by the respective shifting method and when checking for indentation updates.
Returns true, if the tokens may be an enumerated list, and false otherwise.
Name | Type | Description |
---|---|---|
$tokens |
ezcDocumentRstStack | |
$token |
mixed |
Check if token is an inline markup end token.
For a user readable list of the following rules, see: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Check if token is an inline markup start token.
For a user readable list of the following rules, see: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift- / reduce-parser for RST token stack
Name | Type | Description |
---|---|---|
$tokens |
array |
Read grid table specifications
Read the column specification headers of a grid table and return the sizes of the specified columns in an array.
Name | Type | Description |
---|---|---|
$tokens |
ezcDocumentRstStack | |
&$tokens |
Read multiple lines
Reads the content of multiple indented lines, where the indentation can be handled either strictly or loosely, when literal text is expected.
Returns an array with the collected tokens, until the indentation changes.
Name | Type | Description |
---|---|---|
$tokens |
ezcDocumentRstStack | |
$strict |
bool |
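The idea of collecting lines until the indentation changes can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it works on plain strings instead of tokens, the function name `sketchReadIndentedLines` is hypothetical, and the reading of "strict" (exactly the same indentation) versus "loose" (at least the same indentation) as well as the pass-through of blank lines are assumptions.

```php
<?php
// Hypothetical helper operating on plain text lines instead of RST tokens.
function sketchReadIndentedLines(array &$lines, bool $strict = false): array
{
    if (!$lines) {
        return [];
    }

    // The first line defines the reference indentation.
    $indent = strspn($lines[0], ' ');
    $read = [];

    while ($lines) {
        $line = $lines[0];
        $current = strspn($line, ' ');
        $blank = trim($line) === '';

        // Assumption: strict means "exactly the same indentation",
        // loose means "at least the same indentation".
        $changed = $strict ? $current !== $indent : $current < $indent;
        if (!$blank && $changed) {
            break;
        }
        $read[] = array_shift($lines); // consume the line
    }

    return $read;
}
```

The lines are removed from the input array as they are read, mirroring how the shift methods consume tokens from the token array.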
Read simple cells
Read cells as defined in simple tables. Cells are mainly structured by whitespace, but their content may also span more than one cell.
Returns an array with cells, ordered by their rows and columns.
Name | Type | Description |
---|---|---|
$cellStarts |
array | |
$tokens |
ezcDocumentRstStack | |
&$tokens |
Read simple table specifications
Read the column specification headers of a simple table and return the sizes of the specified columns in an array.
Name | Type | Description |
---|---|---|
$tokens |
ezcDocumentRstStack | |
&$tokens |
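To illustrate what a column specification looks like, the following sketch derives column offsets and widths from a simple table border line such as `=====  =====  ======`. It operates on a plain string rather than on tokens, and the function name `sketchSimpleTableSpec` and the returned array layout are assumptions made for this example.

```php
<?php
// Hypothetical helper: derive columns from a simple table border line.
function sketchSimpleTableSpec(string $line): array
{
    // Each run of "=" characters marks one column.
    preg_match_all('/=+/', $line, $matches, PREG_OFFSET_CAPTURE);

    $columns = [];
    foreach ($matches[0] as [$run, $offset]) {
        $columns[] = ['start' => $offset, 'width' => strlen($run)];
    }

    return $columns;
}
```

The start offsets are what later allow cell contents to be assigned to columns by character position.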
Read all tokens until one of the given tokens occurs
Reads all tokens which do not match any of the given tokens and removes them from the token stack. Escaping is maintained.
Name | Type | Description |
---|---|---|
$tokens |
ezcDocumentRstStack | |
$until |
array |
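A minimal sketch of this kind of read loop is given here. The token representation (arrays with `type` and `content` keys), the `backslash` token type and the function name `sketchReadUntil` are assumptions for the example and do not reflect the real ezcDocumentRstToken API.

```php
<?php
// Hypothetical helper: tokens are plain arrays with "type" and "content".
function sketchReadUntil(array &$tokens, array $until): array
{
    $read = [];

    while ($tokens) {
        $token = $tokens[0];

        // Keep an escape and the escaped token together, so an escaped
        // stop token does not end the read.
        if ($token['type'] === 'backslash' && isset($tokens[1])) {
            $read[] = array_shift($tokens);
            $read[] = array_shift($tokens);
            continue;
        }

        foreach ($until as $stop) {
            $sameType = $token['type'] === $stop['type'];
            $sameContent = !isset($stop['content'])
                || $token['content'] === $stop['content'];
            if ($sameType && $sameContent) {
                return $read; // the stop token stays in $tokens
            }
        }

        $read[] = array_shift($tokens);
    }

    return $read;
}
```

The stop token itself is left on the token array, so the caller can decide how to handle it.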
Re-align tokens
Realign tokens so that the line start positions start at 0 again, even if they were indented before.
Name | Type | Description |
---|---|---|
$tokens |
array |
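The realignment can be pictured as subtracting a common offset from every token position. The sketch below assumes tokens are arrays with a zero-based `position` key and uses the smallest position as the offset; both the layout and the function name `sketchRealign` are assumptions, and the real method may choose the offset differently.

```php
<?php
// Hypothetical helper: shift all token positions so the leftmost is 0.
function sketchRealign(array $tokens): array
{
    if (!$tokens) {
        return [];
    }

    // Assumption: the smallest position is the overall indentation.
    $offset = min(array_column($tokens, 'position'));

    foreach ($tokens as &$token) {
        $token['position'] -= $offset;
    }
    unset($token); // break the reference left by foreach

    return $tokens;
}
```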
Reduce paragraph to blockquote
Indented paragraphs are blockquotes, which should be wrapped in such a node.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce blockquote annotation
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce blockquote annotation content
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce internal target
Internal targets are listed before the literal markup block, so they may be found and reduced after a markup block has been found.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce interpreted text inline markup
Tries to find the opening tag for a markup definition.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce link
Uses the preceding element as the hyperlink content. This should be either a literal markup section, or just the last word.
As we do not get word content out of the tokenizer (too much overhead), we split the previous text node up, in case we got one.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
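Splitting the last word off a preceding text node could look like the sketch below. It works on a plain string, and the function name `sketchSplitLastWord` is hypothetical; the real reduction operates on nodes from the document stack.

```php
<?php
// Hypothetical helper: split "see the docs" into "see the " and "docs".
function sketchSplitLastWord(string $text): array
{
    $pos = strrpos($text, ' ');
    if ($pos === false) {
        // No whitespace: the whole text is the link content.
        return ['', $text];
    }

    // Keep the separating space with the leading text.
    return [substr($text, 0, $pos + 1), substr($text, $pos + 1)];
}
```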
Reduce item to bullet list
Called for all items, which may be part of bullet lists. Depending on the indentation level we reduce some amount of items to a bullet list.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce paragraph to bullet list
Indented paragraphs are bullet lists, if prefixed by a bullet list indicator.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce markup
Tries to find the opening tag for a markup definition.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce paragraph
Aggregates all nodes which are allowed as subnodes into a paragraph.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce reference
Reduce references as defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
Reduce prior sections, if a new section has been found.
If a new section has been found, all sections with a higher depth level can be closed, and all items fitting into sections may be aggregated by the respective sections as well.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstNode |
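The closing of deeper sections can be sketched as a backwards walk over the stack. This is only an illustration: nodes are plain arrays with `type`, `depth` and `children` keys, the function name `sketchReduceSection` is hypothetical, and treating sections of the same depth as closed is an assumption of the sketch.

```php
<?php
// Hypothetical helper: close finished sections when a new one arrives.
function sketchReduceSection(array &$stack, array $new): void
{
    $pending = []; // nodes collected while walking backwards

    while ($stack) {
        $top = array_pop($stack);

        if ($top['type'] !== 'section') {
            array_unshift($pending, $top);
            continue;
        }

        // A section aggregates everything that was found after it.
        $top['children'] = array_merge($top['children'], $pending);
        $pending = [];

        if ($top['depth'] < $new['depth']) {
            $stack[] = $top; // still open ancestor: keep it and stop
            break;
        }

        // Same or deeper level: closed, hand it on to its parent.
        $pending = [$top];
    }

    foreach ($pending as $node) {
        $stack[] = $node;
    }
    $stack[] = $new;
}
```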
Reduce all elements to one document node.
Name | Type | Description |
---|---|---|
$node |
ezcDocumentRstTitleNode |
Reenter parser with a list of tokens
Returns a parsed document created from the given tokens. By default the tokens are reindented relative to the first token; this can be disabled.
Name | Type | Description |
---|---|---|
$tokens |
array | |
$reindent |
bool |
Detect inline markup
As defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift anonymous reference target
Shift the short version of anonymous reference targets; the long version is handled in the shiftComment() method. Both are specified at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#anonymous-hyperlinks
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Keep the newline as a single whitespace to maintain readability in texts.
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Escaping of special characters
A backslash is used for character escaping, as defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#escaping-mechanism
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Blockquote annotations
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Bullet point lists
As defined at http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#bullet-lists
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift comment
Shift comments. Comments are introduced by '..' and just contain text. There are several other blocks which are introduced the same way, but where the first token determines the actual type.
This method implements the parsing and detection of those different items.
Comments are basically described here, but there are crosscutting concerns throughout the complete specification: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#comments
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift definition lists
Shift definition lists, which are introduced by an indentation change without separation by a paragraph. Because of this the method is called from the updateIndentation method, which handles such indentation changes.
The text for the definition and its classifiers is already on the document stack because of this.
Definition lists are specified at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift directives
Shift directives as a subaction of the shiftComment method, though the signature differs from the common shift methods.
This method aggregates options and parameters of directives, but leaves the content aggregation to the common comment aggregation.
Name | Type | Description |
---|---|---|
$directive |
ezcDocumentRstDirectiveNode | |
$tokens |
ezcDocumentRstStack |
Create new document node
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Enumerated lists
As defined at http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#bullet-lists
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Detect inline markup
As defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift field lists
Shift field lists, which are introduced by a term surrounded by colons and any text following. Field lists are specified at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#field-lists
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift grid table
In "Grid tables" the values are embedded in a complete grid visually describing a table using characters. http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#grid-tables
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Detect inline literal
As defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-literals
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Detect inline markup
As defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Detect interpreted text inline markup
As defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#interpreted-text
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Try to shift an interpreted text role
Text role shifting is only called directly from the shiftInterpretedTextMarkup() method and tries to find the associated role.
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift line blocks
Shift line blocks, which are specified at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#line-blocks
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift literal block
Shift a complete literal block into one node. The behaviour of literal blocks is defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#literal-blocks
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift a paragraph node on two newlines
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Detect reference
As defined at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Shift simple table
"Simple tables" are not defined by a complete grid, but only by top and bottom lines. Their exact specification can be found at: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#simple-tables
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Just keep text as text nodes
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Just keep text as text nodes
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Create new title node from titles with a top and bottom line
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
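The length dependency between the top line, the title text and the bottom line is what makes titles impossible to express in a context-free grammar. A check of that dependency can be sketched as below. It works on plain strings and counts bytes, so it is only reliable for single-byte text; the function name `sketchIsTitle` is hypothetical and the rules are a simplified reading of the RST specification, not the parser's actual logic.

```php
<?php
// Hypothetical helper: does "over / text / under" form a section title?
function sketchIsTitle(string $over, string $text, string $under): bool
{
    $n = strlen($text); // byte length; assumes single-byte title text

    return $n > 0
        && $over === $under                      // top and bottom line match
        && strlen($over) >= $n                   // at least as long as the title
        && strlen(count_chars($over, 3)) === 1   // one repeated character
        && !ctype_alnum($over[0]);               // adornment is punctuation
}
```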
Shift transitions, which are separators in the document.
Transitions are specified here: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#transitions
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Just keep text as text nodes
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |
Update the current indentation after each newline.
Name | Type | Description |
---|---|---|
$token |
ezcDocumentRstToken | |
$tokens |
ezcDocumentRstStack |