Andrew Stacey edited this page Jan 7, 2014 · 3 revisions

Abstract

This is a class designed for those that want to use LaTeX to publish material on the internet. As it becomes more common to publish material via some content-management system, so it becomes rarer to generate (X)HTML documents directly and content for the web is usually written in some limited markup language. This class is designed to make it possible to author material for such systems using the facility of LaTeX.

Introduction

It is important at the outset to know what this class is designed for and the easiest way to do that is to explain what it is not for. It is not intended as a way to take an already authored LaTeX document and convert it into a format suitable for putting on the internet. It is also not intended as a "plugin" for a content-management system so that authors can use LaTeX as the input format for their blog, forum, wiki, or whatever.

The first of these is both impossible and undesirable. It is impossible for the following simple reason. Because TeX controls the whole process of creating its output, it can do some amazing things. A web document, on the other hand, is extremely malleable and the reader can transform it considerably; therefore, not everything that TeX can do can be (easily) done on the web. It is possible, via a considerable amount of trickery, to get quite close, but only at the expense of this malleability. And this flexibility is a good thing, as it allows readers to make the document as easy for them to access as they can. This is why it is undesirable to support absolutely everything that LaTeX can do.

The second, the plugin, is also undesirable. There is a good reason that the current input formats were chosen: they are simple. They are simple to understand by a human: when writing on a wiki, it would be extremely inconvenient to have to learn the current page's set of macros so there is an advantage to having them consistent across the whole system. They are also simple to understand by a computer: parsing a file in, say, markdown is extremely quick and can be done in real time by scripting languages such as Perl and PHP. Parsing TeX is much harder due to its complexity, and so either the scripting language has to limit the syntax or it has to pass it to an external program, both of which can put severe limitations on the system.

So this package is designed for someone who wants to write a web page, not necessarily directly, using their LaTeX skills. They want the full flexibility of LaTeX together with its familiarity (presumably they write other documents in LaTeX already) but know at the outset that the document will end up on the web.

Usage

Usage of this package is extremely simple. It is designed as a class and so should be loaded with a simple:

\documentclass{internet}

There are various options available for the class. These are used to determine what type of output to produce.

The first pair of options switches between a normal LaTeX document and a "special" one, namely the output intended for the web. Besides making it easy to produce a PDF version of the final document, the normal mode often gives an easier workflow while writing, as the PDF is simple to view. The two options that control this output are:

  • doc This is the default and processes the document as if it were a LaTeX document.

  • text This produces the alternative output.

At the next level, one needs to select the desired output format. These are further split into two groups: a main format and a mathematical format. The main formats are as follows:

  • markdown The standard Markdown format.

  • markdownextra The Markdown Extra format, which extends the Markdown format.

  • maruku This is the Ruby implementation of Markdown, which extends Markdown Extra.

  • xhtml This produces XHTML code.

  • epub This produces an ePub3 document.

  • wordpress This is basically the markdownextra format, but with a couple of modifications for a WordPress blog.

  • instiki This is suitable for the Instiki wiki system. Note that this also selects the itex maths format.

The mathematical formats are currently quite limited:

  • itex This modifies the mathematics to be suitable for processing by the itextomml program.
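As a rough illustration of how the options above might be combined, here is a minimal sketch. It assumes the options are given as an ordinary comma-separated class option list; the document body is made up for the example.

```latex
% Assumption: options are passed as a normal comma-separated class option list.
% "text" selects the alternative (non-PDF-oriented) output,
% "markdownextra" selects the main format.
\documentclass[text,markdownextra]{internet}

\begin{document}

\section{A first section}

Some text with \emph{emphasis}.

\end{document}
```

Replacing text by doc (or leaving it out, since doc is the default) should process the same file as a normal LaTeX document.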

These can take options, which are passed on to the appropriate module (and where one module is built on top of another, they are passed down until one accepts the option). The current options are:

  • section level=integer For the Markdown and XHTML formats, this sets the level of the top-level section.

  • split at=section type For the ePub format, this sets the level at which the document will be split into separate files.

There is one further option. The use filename option means that TeX will try to determine the output format from the filename. It will split off the last part of the filename (without the extension) using an underscore as the separator. This part will then be passed in as if it were a class option. The expected use of this is to have symlinks to the main file with names such as file_text.tex and file_doc.tex, whereupon running pdflatex on one or the other produces the appropriate document.
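The symlink arrangement described above might look something like the following sketch. The file name article.tex is hypothetical, and writing the option as use filename in the class option list is an assumption based on the option name given here.

```latex
% Main file: article.tex (hypothetical name).
% Symlinks created outside TeX, for example with ln -s:
%   article_text.tex -> article.tex
%   article_doc.tex  -> article.tex
%
% Running pdflatex on article_text.tex should then behave as if "text"
% had been given as a class option; article_doc.tex likewise for "doc".
\documentclass[use filename,markdown]{internet}

\begin{document}

The same source, processed according to the name of the file.

\end{document}
```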

Requirements

The code is written in LaTeX3. My version dates from December 2012. It may be compatible with earlier versions, but I cannot say for sure. When producing a document with any kind of links in it, it is common to use the hyperref package. To use the hyperref package with this class, it needs to be at least version 6.83f (dated 2012-09-26), which introduced support for the custom driver option that this class uses.

One File, Many Outputs

Although the intention is that a document written using this system be written knowing the eventual output, it is certainly not unreasonable to use a single file for several outputs, or to use a single fragment in documents on different systems. Whilst the aim is for the same input to work for all outputs, there will be times when one output type requires slightly different input from another. For that situation, the \imode command is provided. It works in much the same fashion as the beamer command \mode. The syntax is either \imode<mode>{stuff} or just \imode<mode>; the latter must be by itself on a line. In both cases, mode is one of the possible outputs or one of the keywords doc, text, or all.

Let us take \imode<mode>{stuff} first. If mode matches what was specified as a class option, then the contents of the argument is executed. If not, it is thrown away.
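A small sketch of the argument form, assuming the class is loaded with the text and markdown options (the sentences themselves are only placeholders):

```latex
\documentclass[text,markdown]{internet}

\begin{document}

This sentence appears in every output.

% Mode "doc" does not match the class options, so this is thrown away.
\imode<doc>{This sentence is meant only for the PDF version.}

% Mode "markdown" matches, so the argument is executed.
\imode<markdown>{This sentence is meant only for the Markdown version.}

\end{document}
```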

The second use, \imode<mode>, is more complicated. If mode matches, then TeX carries on as normal. If not, it starts gobbling stuff until it reaches a line with just \imode or \begin{document} or \end{document} on it. In the first case, the next line should be of the form <mode> and TeX reevaluates the mode. In the second or third cases, it starts acting normally again.

Using the \imode command it is possible to specify material that is only processed in one mode.
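The line form might be used as in the sketch below. It follows the syntax given above, with each \imode<mode> alone on its line, and assumes the class was loaded with the text option; the exact line handling should be checked against the class itself.

```latex
\documentclass[text,markdown]{internet}

\begin{document}

Shared introduction, processed in every mode.

\imode<doc>
This paragraph is gobbled in a text build, since the mode does not match.

\imode<all>
From here on, the material is processed in every mode again.

\end{document}
```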

(There is not yet support for specifying multi-modes.)

Post-Processing

Whatever the output format, pdflatex produces a PDF. When a text format is desired, this needs to be converted to text. There are many excellent tools for doing that; the program pdftotext from the xpdf system works well. Use it as follows:

pdftotext -enc ASCII7 -layout -nopgbrk texfile.pdf

Since line breaks and indentation are often significant in text formats, the text modes define an extremely wide page in the hope that no paragraph will be quite that long. Whilst pdftotext is fairly good at preserving the layout, it is not reliable at preserving the exact number of spaces. For this reason, in a mode where indentation is significant, every line starts with a string of the form XXXY, where the number of Xs is the number of spaces to indent. A simple Perl script can convert those to spaces. A shell script, latex2txt.sh, is provided that automates the process of going from LaTeX document to text output with correct indentation.

Limitations

Too many to mention!

The major limitation is to do with external packages. Most external packages will not work directly with this class (at least, in a text mode). This harks back to the problem of taking an already-written document and converting it. LaTeX packages were written with the understanding that the final outcome would be a static document, not some text that will be further processed.

That is not to say that no packages will ever be supported. Packages could be supported by writing an alternative which translates the commands to a sensible output. However, this will need to be done on a case-by-case basis as the need arises.

Thus, for now, it is best to put all the \usepackage statements inside a \imode<doc>{...} command.
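For instance, a preamble along the following lines keeps external packages out of the text modes. The packages named are arbitrary examples, not ones the class is known to need or to conflict with.

```latex
\documentclass[text,markdown]{internet}

% Only load external packages when producing the normal LaTeX document.
\imode<doc>{%
  \usepackage{graphicx}
  \usepackage{booktabs}
}

\begin{document}

Body text that avoids commands from the packages above,
or wraps them in \verb|\imode<doc>{...}| as well.

\end{document}
```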

Obtaining the Code

Fork this repository!