3 The plasTeX Document

The plasTeX document is very similar to an XML DOM structure. In fact, you can use XML DOM methods to create and populate nodes, delete or move nodes, etc. The biggest difference between the plasTeX document and an XML document is that in XML the attributes of an element are simply string values, whereas attributes in a plasTeX document are generally document fragments that contain the arguments of a macro. Attributes can also be configured to hold other Python objects such as lists, dictionaries, and strings (see section 4 for more information).
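As a reminder of what those DOM methods look like, here is a minimal sketch using Python's standard xml.dom.minidom rather than plasTeX itself. The element names (document, center, par) are chosen only for illustration, and the sketch does not claim that every minidom method is available on plasTeX nodes. It also shows the XML side of the difference described above: an attribute value is always a plain string.

```python
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement("document")
doc.appendChild(root)

# Create and populate a node.
par = doc.createElement("par")
par.appendChild(doc.createTextNode("Every good boy does fine."))
root.appendChild(par)

# In XML, an attribute value is always a plain string.
par.setAttribute("align", "center")

# Move the node: put a new element in its place, then re-attach it there.
center = doc.createElement("center")
root.replaceChild(center, par)  # par is now detached from root
center.appendChild(par)

xml_text = root.toxml()
```

In a plasTeX document the same kind of tree manipulation applies, but an attribute would typically hold a document fragment rather than the string "center".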

While XML document objects have a very strict syntax, LaTeX documents are a little more free-form. Because of this, the plasTeX framework does a lot of normalizing of the LaTeX document to make it conform to a set of rules. These rules guarantee a consistent output document, which is necessary for easy manipulation and programmability.

The overall document structure should not be surprising. There is a document element at the top level which corresponds to the XML Document node. The child nodes of the Document node begin with the preamble to the LaTeX document. This includes things like the \documentclass, \newcommands, \title, \author, counter settings, etc. For the most part, these nodes can be ignored. While they are a useful part of the document, they are generally only used by internal processes in plasTeX. The important node is the last one in the document, which corresponds to LaTeX’s document environment.

The document environment has a very simple structure. It consists solely of paragraphs (actually \pars in TeX’s terms) and sections [1]. In fact, all sections have this same format including parts, chapters, sections, subsections, subsubsections, paragraphs, and subparagraphs. plasTeX can tell which pieces of a document correspond to a sectioning element by looking at the level attribute of the Python class that corresponds to the given macro. The section levels in plasTeX are the same as those used by LaTeX: -1 for part, 0 for chapter, 1 for section, etc. You can create your own sectioning commands simply by subclassing an existing macro class, or by setting the level attribute to a value that corresponds to the level of section you want to mimic. All level values less than 100 are reserved for sectioning so you aren’t limited to LaTeX’s sectioning depth. Figure 3.1 below shows an example of the overall document structure.

\includegraphics[width=4in]{docstructure}
Figure 3.1: The overall plasTeX document structure

This document is constructed during the parsing process by calling the digest method on each node. The digest method is passed an iterator of document nodes that correspond to the nodes in the document that follow the current node. It is the responsibility of the current node to only absorb the nodes that belong to it during the digest process. Luckily, the default digest method will work in nearly all cases. See section 4 for more information on the digestion process.
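The idea of a node absorbing only the nodes that belong to it can be illustrated with a small toy model. This is not plasTeX's implementation: the Node and Pushback classes, the digest function, and the rule "a section stops at the next node whose level is the same or lower" are all assumptions made for this sketch. Only the level values (0 for chapter, 1 for section) and the "less than 100" convention come from the text above.

```python
from dataclasses import dataclass, field

NOT_A_SECTION = 100  # per the text, levels below 100 are reserved for sectioning


@dataclass
class Node:
    name: str
    level: int = NOT_A_SECTION
    children: list = field(default_factory=list)


class Pushback:
    """Iterator that lets a consumer hand back an item it should not absorb."""

    def __init__(self, items):
        self._items = iter(items)
        self._returned = []

    def __iter__(self):
        return self

    def __next__(self):
        if self._returned:
            return self._returned.pop()
        return next(self._items)

    def push(self, item):
        self._returned.append(item)


def digest(node, tokens):
    """Absorb the following nodes that belong to `node` (toy rule)."""
    if node.level >= NOT_A_SECTION:
        return  # non-section nodes absorb nothing in this toy
    for item in tokens:
        if item.level <= node.level:
            tokens.push(item)  # same or higher rank: leave it for an ancestor
            break
        digest(item, tokens)
        node.children.append(item)


flat = [
    Node("chapter: Intro", level=0),
    Node("par"),
    Node("section: A", level=1),
    Node("par"),
    Node("section: B", level=1),
    Node("par"),
    Node("chapter: Next", level=0),
    Node("par"),
]
root = Node("document", level=-1000)  # arbitrary value below every section level
digest(root, Pushback(flat))


def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling print("\n".join(outline(root))) shows the two chapters directly under the document node, with both sections nested inside the first chapter.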

Part of this digestion process is grouping nodes into paragraphs. This is done using the paragraphs method available in all Macro based classes. This method uses the same technique as TeX to group paragraphs of content. Section 3.2 has more information about the details of paragraph grouping.
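As a rough illustration of the grouping idea only, the sketch below splits a flat run of nodes at paragraph-break markers, in the way TeX ends a paragraph at each \par. The PAR marker and the group_paragraphs function are hypothetical names for this toy; the real paragraphs method has more rules than this.

```python
PAR = object()  # stands in for a \par token in this toy


def group_paragraphs(nodes):
    """Split a flat node list into paragraphs at each PAR marker."""
    paragraphs, current = [], []
    for node in nodes:
        if node is PAR:
            if current:  # consecutive \par tokens do not create empty paragraphs
                paragraphs.append(current)
                current = []
        else:
            current.append(node)
    if current:
        paragraphs.append(current)
    return paragraphs


flat = ["Every ", "good", " boy.", PAR, PAR, "Does ", "fine.", PAR]
grouped = group_paragraphs(flat)
```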

In addition to the level attribute of sections, there is also a mixin class that assists in generating the table of contents and navigation elements during rendering. If you create your own sectioning commands, you should include plasTeX.Base.LaTeX.Sectioning.SectionUtils as a base class as well. All of the standard LaTeX section commands already inherit from this class, so if you subclass one of those, you’ll get the helper methods for free. For more information on these helper methods see section 3.1.

The structure of the rest of the document is also fairly simple and well-defined. LaTeX commands are each converted into a document node with its arguments getting placed into the attributes dictionary. LaTeX environments also create a single node in the document, where the child nodes of the environment include everything between the \begin and \end commands. By default, the child nodes of an environment are simply inserted in the order that they appear in the document. However, there are some environments that require further processing due to their more complex structures. These structures include arrays and tabular environments, as well as itemized lists. For more information on these structures see sections 3.3.3 and 3.3.1, respectively. Figures 3.2 and 3.3 show a common LaTeX document fragment and the resulting plasTeX document node structure.

\begin{center}
Every \textbf{good} boy does \textit{fine}.
\end{center}
Figure 3.2: Sample LaTeX document fragment code
\includegraphics[width=3in]{docfrag}
Figure 3.3: Resulting plasTeX document node structure

You may have noticed that in the document structure in Figure 3.3 the text corresponding to the argument for \textbf and \textit is actually a child node and not an attribute. This is actually a convenience feature in plasTeX. For macros like this where there is only one argument and that argument corresponds to the content of the macro, it is common to put that content into the child nodes. This is done in the args attribute of the macro class by setting the argument’s name to “self”. This magical value will link the attribute called “self” to the child nodes array. For more information on the args attribute and how it populates the attributes dictionary see section 4.
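The effect of naming an argument “self” can be modelled in a few lines. The sketch below is a toy, not plasTeX code: ToyMacro, toylink, and the simplified space-separated args format are assumptions made for illustration, and the real args syntax described in section 4 is richer. What it demonstrates is the linking behavior described above: the value of the “self” argument becomes the child nodes, and the “self” attribute refers to that same list.

```python
class ToyMacro:
    """Toy stand-in for a macro class; `args` lists argument names only."""

    args = ""

    def __init__(self, *values):
        self.attributes = {}
        self.childNodes = []
        for name, value in zip(self.args.split(), values):
            if name == "self":
                # The argument's content becomes the child nodes, and the
                # "self" attribute is linked to that same list.
                self.childNodes.extend(value)
                self.attributes["self"] = self.childNodes
            else:
                self.attributes[name] = value


class textbf(ToyMacro):
    args = "self"


class toylink(ToyMacro):  # hypothetical two-argument macro
    args = "url self"


bold = textbf(["good"])
link = toylink("http://example.com/", ["a ", "link"])
```

Here bold.childNodes holds the text "good", as in Figure 3.3, while link keeps its url argument as an ordinary attribute and its second argument as child nodes.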

In the plasTeX framework, the input LaTeX document is parsed and digested until the document is finished. At this point, you should have an output document that conforms to the rules described above. The document should have a regular enough structure that working with it programmatically using DOM methods or Python practices should be fairly straightforward. The following sections give more detail on document structure elements that require extra processing beyond the standard parse-digest process.

Footnotes

  1. “sections” in this document is used loosely to mean any type of section: part, chapter, section, etc.