Formatting information

A beginner's introduction to typesetting with LATEX

Chapter 10 — Compatibility with other systems

Peter Flynn

Silmaril Consultants
Textual Therapy Division


v. 3.6 (March 2005)


This edition of Formatting Information was prompted by the generous help I have received from TEX users too numerous to mention individually. Shortly after TUGboat published the November 2003 edition, I was reminded by a spate of email of the fragility of documentation for a system like LATEX which is constantly under development. There have been revisions to packages; issues of new distributions, new tools, and new interfaces; new books and other new documents; corrections to my own errors; suggestions for rewording; and in one or two cases mild abuse for having omitted package X which the author felt to be indispensable to users.

I am grateful as always to the people who sent me corrections and suggestions for improvement. Please keep them coming: only this way can this book reflect what people want to learn. The same limitation still applies, however: no mathematics, as there are already a dozen or more excellent books on the market, as well as other online documents, dealing with mathematical typesetting in TEX and LATEX in finer and better detail than I am capable of.

The structure remains the same, but I have revised and rephrased a lot of material, especially in the earlier chapters where a new user cannot be expected yet to have acquired any depth of knowledge. Many of the screenshots have been updated, and most of the examples and code fragments have been retested.

As I was finishing this edition, I was asked to review an article for The PracTEX Journal, which grew out of the Practical TEX Conference in 2004. The author specifically took the writers of documentation to task for failing to explain things more clearly, and as I read more, I found myself agreeing, and resolving to clear up some specific problem areas as far as possible. It is very difficult for people who write technical documentation to remember how they struggled to learn what has now become a familiar system. So much of what we do is second nature, and a lot of it actually has nothing to do with the software, but more with the way in which we view and approach information, and the general level of knowledge of computing. If I have obscured something by making unreasonable assumptions about your knowledge, please let me know so that I can correct it.

Peter Flynn is author of The HTML Handbook and Understanding SGML and XML Tools, and editor of The XML FAQ.

CHAPTER 10

Compatibility with other systems

  1. Converting into LATEX
  2. Converting out of LATEX

As we saw in Chapter 2, LATEX uses plain-text files, so they can be read and written by any standard application that can open text files. This helps preserve your information over time, as the plain-text format cannot be obsoleted or hijacked by any manufacturer or sectoral interest, and it will always be readable on any computer, from your handheld (yes, LATEX is available for some PDAs, see Figure 10.1) to the biggest supercomputer.


Figure 10.1: LATEX editing and processing on the Sharp Zaurus 5500 PDA
   

However, LATEX is intended as the last stage of the editorial process: formatting for print or display. If you have a requirement to re-use the text in some other environment — a database perhaps, or on the Web or a CD-ROM or DVD, or in Braille or voice output — then it should probably be edited, stored, and maintained in something neutral like the Extensible Markup Language (XML), and only converted to LATEX when a typeset copy is needed.

Although LATEX has many structured-document features in common with SGML and XML, it can still only be processed by the LATEX and pdfLATEX programs. Because its macro features make it almost infinitely redefinable, processing it requires a program which can unravel arbitrarily complex macros, and LATEX and its siblings are the only programs which can do that effectively. Like other typesetters and formatters (Quark XPress, PageMaker, FrameMaker, Microsoft Publisher, 3B2 etc.), LATEX is largely a one-way street leading to typeset printing or display formatting.

Converting LATEX to some other format therefore means you will unavoidably lose some formatting, as LATEX has features that other systems simply don't possess, so they cannot be translated, although there are several ways to minimise this loss. Similarly, converting other formats into LATEX often means adding back the information the other formats omit because they only store appearances, not structure.

However, there are at least two excellent systems for converting LATEX directly to HyperText Markup Language (HTML) so you can publish it on the web, as we shall see in section 10.2.

10.1 Converting into LATEX

There are several systems which will save their text in LATEX format. The best known is probably the LYX editor (see Figure 2.1), which is a wordprocessor-like interface to LATEX for Windows and Unix. Both the AbiWord and KWord wordprocessors on Linux systems have a very good Save As... LATEX output, so they can be used to open Microsoft Word documents and convert them to LATEX. Several maths packages like the EuroMath editor, and the Mathematica and Maple analysis packages, can also save material in LATEX format.

In general, most other wordprocessors and DTP systems either don't have the level of internal markup sophistication needed to create a LATEX file, or they lack a suitable filter to enable them to output what they do have. Often they are incapable of outputting any kind of structured document, because they only store what the text looks like, not why it's there or what role it fulfills. There are two ways out of this:

There is of course a third way, suitable for large volumes only: send it off to the Pacific Rim to be retyped into XML or LATEX. There are hundreds of companies from India to Polynesia who do this at high speed and low cost with very high accuracy. It sounds crazy when the document is already in electronic form, but it's a good example of the problem of low quality of wordprocessor markup that this solution exists at all.

You will have noticed that most of the solutions lead to one place: SGML[1] or XML. As explained above and elsewhere, these formats are the only ones devised so far capable of storing sufficient information in machine-processable, publicly-accessible form to enable your document to be recreated in multiple output formats. Once your document is in XML, there is a large range of software available to turn it into other formats, including LATEX. Processors in any of the common SGML/XML processing languages like the Document Style Semantics and Specification Language (DSSSL), the Extensible Stylesheet Language [Transformations] (XSLT), Omnimark, Metamorphosis, Balise, etc. can easily be written to output LATEX, and this approach is extremely common.

Much of this will be simplified when wordprocessors support native, arbitrary XML/XSLT as a standard feature, because LATEX output will become much simpler to produce.

When these efforts coalesce into generalised support for arbitrary DTDs and Schemas, it will mean a wider choice of editing interfaces, and when they achieve the ability to run XSLT conversion into LATEX from within these editors, such as is done at the moment with Emacs or XML Spy, we will have full convertibility.

  1. The Standard Generalized Markup Language (SGML) itself is little used now for new projects, as the software support for its daughter XML is far greater, but hundreds of large SGML document repositories are still growing as documents are added to them.
  2. Which is silly, given that Microsoft used to make one of the best Word-to-SGML converters ever, which was bi-directional (yes, it could round-trip Word to SGML and back to Word and back into SGML). But they dropped it on the floor when XML arrived.

10.1.1 Getting LATEX out of XML

Assuming you can get your document out of its wordprocessor format into XML by some method, here is a very brief example of how to turn it into LATEX.

You can of course buy and install a fully-fledged commercial XML editor with XSLT support, and run this application within it. However, this is beyond the reach of many users, so to do this unaided you just need to install three pieces of software: Java, Saxon and the DocBook 4.2 DTD (URIs are correct at the time of writing). None of these has a visual interface: they are run from the command-line in the same way as is possible with LATEX.

As an example, let's take the above paragraph, as typed or imported into AbiWord (see Figure 10.1). This is stored as a single paragraph with highlighting on the product names (italics), and the names are also links to their Internet sources, just as they are in this document. This is a convenient way to store two pieces of information in the same place.


Figure 10.1: Sample paragraph in AbiWord converted to XML

AbiWord can export in DocBook format, which is an XML vocabulary for describing technical (computer) documents; it's what I use for this book. AbiWord can also export LATEX, but we're going to make our own version, working from the XML (Brownie points for the reader who can guess why I'm not just accepting the LATEX conversion output).

Although AbiWord's default is to output an XML book document type, we'll convert it to a LATEX article document class. Notice that AbiWord has correctly output the expected section and title markup empty, even though it's not used. Here's the XML output (I've changed the linebreaks to keep it within the bounds of this page size):

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" 
        "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<book>
<!-- ===================================================================== -->
<!-- This DocBook file was created by AbiWord.                    -->
<!-- AbiWord is a free, Open Source word processor.                -->
<!-- You may obtain more information about AbiWord at www.abisource.com -->
<!-- ===================================================================== -->
  <chapter>
    <title></title>
    <section role="unnumbered">
      <title></title>
      <para>You can of course buy and install a fully-fledged commercial XML 
editor with XSLT support, and run this application within it. However, this 
is beyond the reach of many users, so to do this unaided you just need to 
install three pieces of software: <ulink
url="http://java.sun.com/j2se/1.4.2/download.html"><emphasis>Java</emphasis></ulink>,
<ulink 
url="http://saxon.sourceforge.net"><emphasis>Saxon</emphasis></ulink>, and 
the <ulink url="http://www.docbook.org/xml/4.2/index.html">DocBook 4.2 DTD</ulink> 
(URIs are correct at the time of writing). None of these has a visual 
interface: they are run from the command-line in the same way as is possible 
with L<superscript>A</superscript>T<subscript>E</subscript>X.</para> 
    </section>
  </chapter>
</book>
	  

The XSLT language lets us create templates for each type of element in an XML document. In our example, there are only three which need handling, as we did not create chapter or section titles (DocBook requires them to be present, but they don't have to be used).

  • para, for the paragraph[s];

  • ulink, for the URIs;

  • emphasis, for the italicisation.

I'm going to cheat over the superscripting and subscripting of the letters in the LATEX logo, and use my editor to replace the whole thing with the \LaTeX command. In the other three cases, we already know how LATEX deals with these, so we can write our templates (see Figure 10.2).
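The logo replacement does not have to be done by hand in an editor. As a minimal sketch, assuming a Unix shell with sed, and assuming the logo markup sits on a single line as it does in the XML shown above, a small filter can make the substitution. The function name fix_logo and the output filename in the last comment are invented here for illustration.

```shell
# The markup AbiWord exported for the logo, exactly as it appears in the XML.
logo='L<superscript>A</superscript>T<subscript>E</subscript>X'

# Filter: replace every occurrence of that markup with the \LaTeX command.
# (fix_logo is a hypothetical helper name, not part of any package.)
fix_logo() {
  sed "s|$logo|\\\\LaTeX|g"
}

# Try it on the closing words of the sample paragraph.
printf '%s\n' "possible with $logo.</para>" | fix_logo
# prints: possible with \LaTeX.</para>

# On the real file it could be used as a filter, for example:
#   fix_logo < para.dbk > para-fixed.dbk
```

The vertical bar is used as the sed delimiter because the markup itself contains slashes.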


Figure 10.2: XSLT script to convert the paragraph
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>\documentclass{article}
\usepackage{url}</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="book">
    <xsl:text>\begin{document}</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>\end{document}</xsl:text>
  </xsl:template>

  <xsl:template match="para">
    <xsl:apply-templates/>
    <xsl:text>&#x000A;</xsl:text>
  </xsl:template>

  <xsl:template match="ulink">
    <xsl:apply-templates/>
    <xsl:text>\footnote{\url{</xsl:text>
    <xsl:value-of select="@url"/>
    <xsl:text>}}</xsl:text>
  </xsl:template>

  <xsl:template match="emphasis">
    <xsl:text>\emph{</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>}</xsl:text>
  </xsl:template>

</xsl:stylesheet>
	    

If you run this through Saxon, which is an XSLT processor, you can output a LATEX file which you can process and view (see Figure 10.3).

$ java -jar /usr/local/saxonb8-0/saxon8.jar -o para.ltx \
para.dbk para.xsl
$ latex para.ltx
This is TeX, Version 3.14159 (Web2C 7.3.7x)
(./para.ltx
LaTeX2e <2001/06/01>
Loading CZ hyphenation patterns: Pavel Sevecek, v3, 1995
Loading SK hyphenation patterns: Jana Chlebikova, 1992
Babel <v3.7h> and hyphenation patterns for english, 
dumylang, nohyphenation, czech, slovak, german, ngerman, 
danish, spanish, catalan, finnish, french, ukenglish, greek, 
croatian, hungarian, italian, latin, mongolian, dutch, 
norwegian, polish, portuguese, russian, ukrainian, 
serbocroat, swedish, loaded. 
(/usr/TeX/texmf/tex/latex/base/article.cls
Document Class: article 2001/04/21 v1.4e Standard LaTeX 
document class (/usr/TeX/texmf/tex/latex/base/size10.clo))
(/usr/TeX/texmf/tex/latex/ltxmisc/url.sty) (./para.aux) 
[1] (./para.aux) )
Output written on para.dvi (1 page, 1252 bytes).
Transcript written on para.log.
$ xdvi para &
	  

Figure 10.3: Displaying the typeset paragraph
\documentclass{article}\usepackage{url}\begin{document}
      
      You can of course buy and install a fully-fledged commercial 
XML editor with XSLT support, and run this application within it. 
However, this is beyond the reach of many users, so to do this 
unaided you just need to install three pieces of software: 
\emph{Java}\footnote{\url{http://java.sun.com/j2se/1.4.2/download.html}},
\emph{Saxon}\footnote{\url{http://saxon.sourceforge.net}}, and the 
DocBook 4.2 DTD\footnote{\url{http://www.docbook.org/xml/4.2/index.html}} 
(URIs are correct at the time of writing). None of these has a visual 
interface: they are run from the command-line in the same way as is 
possible with \LaTeX.
  
\end{document}
	    

Writing XSLT is not hard, but requires a little learning. The output method here is text, which is LATEX's file format (XSLT can also output HTML and other formats of XML).

  1. The first template matches ‘/’, which is the document root (before the book start-tag). At this stage we output the text \documentclass{article} and \usepackage{url}. The ‘apply-templates’ instruction tells the processor to carry on processing, looking for more matches. XML comments get ignored, and any elements which don't match a template simply have their contents passed through until the next match occurs.

  2. The book template outputs the \begin{document} and the \end{document} commands, and between them tells the processor to carry on processing.

  3. The para template just outputs its content, but follows it with a linebreak, using the hexadecimal character code x000A (see the ASCII chart in Table C.1).

  4. The ulink template outputs its content but follows it with a footnote using the \url command to output the value of the url attribute.

  5. The emphasis template surrounds its content with \emph{ and }.

This is a relatively trivial example, but it serves to show that it's not hard to output LATEX from XML. In fact there is a set of templates already written to produce LATEX from a DocBook file at http://www.dpawson.co.uk/docbook/tools.html#d4e2905

10.2 Converting out of LATEX

This is much harder to do comprehensively. As noted earlier, the LATEX file format really requires the LATEX program itself in order to process all the packages and macros, because there is no telling what complexities authors have added themselves (what a lot of this book is about!).
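To see why, consider a minimal sketch of a converter that works by pattern-matching alone. It assumes a Unix shell with sed; the function name naive_convert and the \product macro are hypothetical, standing in for whatever an author might have defined.

```shell
# Naive conversion by pattern: rewrite \emph{...} as HTML <em>...</em>.
# It has no knowledge of any macros the author has defined.
naive_convert() {
  sed 's|\\emph{\([^{}]*\)}|<em>\1</em>|g'
}

printf '%s\n' 'Install \emph{Saxon} and \product{Java}.' | naive_convert
# prints: Install <em>Saxon</em> and \product{Java}.
```

The built-in \emph is handled, but \product passes through untouched, because only LATEX itself knows how it was defined.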

Many authors and editors rely on custom-designed or homebrew converters, often written in the standard shell scripting languages (Unix shells, Perl, Python, Tcl, etc). Although some of the packages presented here are also written in the same languages, they have some advantages and restrictions compared with private conversions:

10.2.1 Conversion to Word

There are several programs on CTAN to do LATEX-to-Word and similar conversions, but they do not all handle everything LATEX can throw at them, and some only handle a subset of the built-in commands of default LATEX. Two in particular, however, have a good reputation, although I haven't used either of them (I stay as far away from Word as possible):

One easy route into wordprocessing, however, is the reverse of the procedures suggested in the preceding section: convert LATEX to HTML, which many wordprocessors read easily. The following sections cover two packages for this.

10.2.2 LATEX2HTML

As its name suggests, LATEX2HTML is a system to convert LATEX structured documents to HTML. Its main task is to reproduce the document structure as a set of interconnected HTML files. Despite using Perl, LATEX2HTML relies very heavily on standard Unix facilities like the NetPBM graphics package and the pipe syntax. Microsoft Windows is not well suited to this kind of composite processing, although all the required facilities are available for download in various forms and should in theory allow the package to run — but reports of problems are common.

  • The sectional structure is preserved, and navigational links are generated for the standard Next, Previous, and Up directions.

  • Links are also used for the cross-references, citations, footnotes, ToC, and lists of figures and tables.

  • Conversion is direct for common elements like lists, quotes, paragraph-breaks, type-styles, etc, where there is an obvious HTML equivalent.

  • Heavily formatted objects such as math and diagrams are converted to images.

  • There is no support for homebrew macros.

There is, however, support for arbitrary hypertext links, symbolic cross-references between ‘evolving remote documents’, conditional text, and the inclusion of raw HTML. These are extensions to LATEX, implemented as new commands and environments.

LATEX2HTML outputs a directory named after the input filename, and all the output files are put in that directory, so the output is self-contained and can be uploaded to a server as it stands.
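A typical run is a single command. The following is only a sketch, assuming latex2html is installed and on your path. The filename para.tex is an assumption (LATEX2HTML was not run on the sample paragraph in this chapter), and to_html is a hypothetical wrapper which just checks the prerequisites first.

```shell
# Hypothetical wrapper: check for the program and the file, then convert.
to_html() {
  command -v latex2html >/dev/null 2>&1 ||
    { echo "latex2html is not installed" >&2; return 1; }
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  latex2html "$1"
}

# The output goes into a directory named after the input file (para/ here).
if to_html para.tex; then
  ls para/
fi
```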

10.2.3 TEX4ht

TEX4ht operates differently from LATEX2HTML: it uses the TEX program to process the file, and handles conversion in a set of postprocessors for the common LATEX packages. It can also output to XML, including Text Encoding Initiative (TEI) and DocBook, and the OpenOffice and WordXML formats, and it can create Texinfo format manuals.

By default, documents retain the single-file structure implied by the original, but there is again a set of additional configuration directives to make use of the features of hypertext and navigation, and to split files for ease of use.
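TEX4ht is usually run through a wrapper script rather than directly. As a sketch, assuming the htlatex script distributed with TEX4ht is installed, and using para.tex as an assumed filename and to_html4ht as a hypothetical helper name:

```shell
# Hypothetical wrapper: check for the script and the file, then convert.
to_html4ht() {
  command -v htlatex >/dev/null 2>&1 ||
    { echo "htlatex is not installed" >&2; return 1; }
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  htlatex "$1"
}

# With the default settings the result is a single HTML file beside the source.
if to_html4ht para.tex; then
  ls para.html
fi
```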

10.2.4 Extraction from PS and PDF

If you have the full version of Adobe Acrobat, you can open a PDF file created by pdfLATEX, select and copy all the text, and paste it into Word and some other wordprocessors, and retain some common formatting of headings, paragraphs, and lists. Both solutions still require the wordprocessor text to be edited into shape, but they preserve enough of the formatting to make it worthwhile for short documents. Otherwise, use the pdftotext program to extract everything from the PDF file as plain (paragraph-formatted) text.

10.2.5 Last resort: strip the markup

At worst, the detex program on CTAN will strip a LATEX file of all markup and leave just the raw unformatted text, which can then be re-edited. There are also programs to extract the raw text from DVI and PostScript (PS) files.
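The plain-text routes from this section and the previous one can be wrapped in one small shell function. This is a sketch under assumptions: that detex and pdftotext are installed, that the file extension identifies the format, and that the name to_text (hypothetical) and the filename para.ltx (borrowed from the earlier example) suit your setup.

```shell
# Hypothetical helper: choose a plain-text extractor by file extension
# and write the extracted text to standard output.
to_text() {
  case "$1" in
    *.pdf)       tool=pdftotext ;;
    *.tex|*.ltx) tool=detex ;;
    *) echo "unsupported file type: $1" >&2; return 2 ;;
  esac
  command -v "$tool" >/dev/null 2>&1 ||
    { echo "$tool is not installed" >&2; return 1; }
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  if [ "$tool" = pdftotext ]; then
    pdftotext "$1" -    # "-" sends the text to standard output
  else
    detex "$1"
  fi
}

# Strip the markup from the sample file, if everything needed is present.
if text=$(to_text para.ltx); then
  printf '%s\n' "$text"
fi
```

Either way the result still needs re-editing, as only the words survive.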

