catdvi
catdvi can be used to transform TeX DVI files into text, losing formatting in the process. Its main purpose on SBo is to be used by recoll when recoll cannot extract text from PDF files by other means.

catdvi is a program that translates TeX Device Independent (DVI) files into readable plain text. The program is under development. It produces satisfactory results in many cases, but still has some issues with complicated input.

Goals

Actually, "translate to plain text" can mean several different things, depending on the intended use:

1. Output formatted text that resembles the layout of the DVI file as closely as possible, suitable for e.g. preview on a character cell terminal or printing on a teletype style printer.
2. Output unformatted text in "read order" (rather than "print order", which makes quite a difference with e.g. multi-column page layouts). Useful for searching, indexing and other kinds of postprocessing, and maybe also for export to different text processors.
3. Output (not completely plain) text in read order with the formatting distilled into some kind of markup, so that paragraph breaks, subscripts, superscripts, etc. can still be recognized. This functionality is essentially a (La-)TeX decompiler, useful for recovery of lost or otherwise unavailable .tex files.

catdvi's principal target is to create human-readable text files from DVI input, and hence the first kind of translation. The second kind is supported as well, because one of the developers needed it and it could be obtained as an easy by-product (based on the mostly true assumption that read order = order in the source file = order in the DVI file).

The third kind of translation is the most difficult one to achieve, since a DVI file does not contain logical markup information. The structure of the text has to be guessed from heuristic principles and an analysis of certain characteristics of TeX's output. No attempt in this direction has been made so far. However, knowledge of some aspects of text structure would also help to improve the quality of layout in case 1. If it turns out that these can be guessed reliably, an option to show them as markup will probably follow. This feature has low priority at the moment, especially since nobody has expressed a need for it.
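The difference between the first two kinds of translation can be illustrated with a small Python sketch. It is not catdvi's actual code: the glyph record, the function names `layout_text` and `sequential_text`, and the use of character cells as coordinates are all assumptions made for illustration. It only shows why the same two-column page yields different text in "print order" and in "read order", given the assumption stated above that the order in the DVI file matches the read order.

```python
from typing import List, Tuple

# Hypothetical record: (column, row, text), measured in character cells.
# Real DVI files use much finer units; this is a simplification.
Glyph = Tuple[int, int, str]


def layout_text(glyphs: List[Glyph], width: int, height: int) -> str:
    """First kind: place each piece of text at its page position."""
    grid = [[" "] * width for _ in range(height)]
    for col, row, text in glyphs:
        for i, ch in enumerate(text):
            if 0 <= row < height and 0 <= col + i < width:
                grid[row][col + i] = ch
    # Trailing blanks carry no information, so strip them per line.
    return "\n".join("".join(line).rstrip() for line in grid)


def sequential_text(glyphs: List[Glyph]) -> str:
    """Second kind: emit text in file order and ignore positions."""
    return " ".join(text for _, _, text in glyphs)


# A two-column page in file order: the whole left column comes first,
# then the whole right column (assumed to match the source order).
page = [
    (0, 0, "left one"), (0, 1, "left two"),
    (12, 0, "right one"), (12, 1, "right two"),
]

print(layout_text(page, 24, 2))
print(sequential_text(page))
```

The layout version interleaves the two columns line by line, which looks like the printed page but breaks sentences apart for a search indexer. The sequential version keeps each column together, which is why it is the more useful form for searching and indexing.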