Three useful XML schemas in publishing

by Liza Daly

If I say that a document is in “XML”, I’m not really saying anything very specific. All I’ve told you is that the document has some text wrapped in various angle-brackets, and that those angle-brackets are “well-formed.” A well-formed XML document is simply one in which the angle-brackets open and close in a predictable way: every opening tag has a matching closing tag, and tags nest inside one another without overlapping.
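To make “well-formed” concrete, here is a minimal sketch in Python that uses the standard library’s XML parser. The helper name is_well_formed is made up for illustration; the parser only checks that tags open, close and nest properly, and knows nothing about what they mean.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for unclosed, mismatched or overlapping tags, among other things.
        return False


# Tags open and close in a predictable, nested order.
print(is_well_formed("<p>Some <i>text</i></p>"))

# The <i> element is closed after its parent <p>, so the tags overlap.
print(is_well_formed("<p>Some <i>text</p></i>"))
```

The first call prints True and the second prints False. Neither result says anything about whether <i> means italics.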

Being well-formed doesn’t tell you anything about the information encoded in those angle-brackets (really called elements). If an element is called <i>, does that mean “put this text in italics”? Or “indent”? Or even, “The following text is about me”?

To know what an XML document actually means, you need to know its schema. A schema is a kind of dictionary that defines all the names of the elements and, to some extent, what they mean. It also describes the grammar of the document: for example, we might say that a <chapter> can be inside a <book> but not the other way around.
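The grammar idea can be sketched with a toy example. Real schemas are written in dedicated languages such as DTD, RELAX NG or W3C XML Schema; the dictionary below is a made-up stand-in that only records which child elements each element may contain, and the names ALLOWED_CHILDREN and grammar_errors are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: element name -> names of the children it may contain.
ALLOWED_CHILDREN = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": set(),
}


def grammar_errors(text):
    """Return a list of messages for every nesting the toy grammar forbids."""
    root = ET.fromstring(text)
    errors = []
    for parent in root.iter():
        if parent.tag not in ALLOWED_CHILDREN:
            errors.append(f"unknown element <{parent.tag}>")
            continue
        for child in parent:
            if child.tag not in ALLOWED_CHILDREN[parent.tag]:
                errors.append(
                    f"<{child.tag}> is not allowed inside <{parent.tag}>"
                )
    return errors


# A chapter inside a book follows the grammar.
print(grammar_errors("<book><chapter><para>Hi</para></chapter></book>"))

# A book inside a chapter is well-formed XML, but the grammar rejects it.
print(grammar_errors("<chapter><book/></chapter>"))
```

Both documents are well-formed; only the grammar tells them apart.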

You can make up your own schema, and that’s often advisable when modeling a unique business practice. But books and other kinds of literature are well-understood, and there’s already been a huge amount of thought put into how to properly model them in XML. If you’re in digital publishing, these are the three schemas you’re most likely to come across when modeling written works:

DocBook

Originally designed for technical books, DocBook has emerged as an excellent general-purpose book schema. Because it’s in wide use, there are a lot of modern tools that understand it (including the excellent oXygen XML editor), and it’s straightforward to generate other formats, including PDF and HTML, from a DocBook source, for example with the DocBook XSL stylesheets.

Here’s a really simple DocBook document, in this case describing an article rather than a whole book:

<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">
  <title>Sample article</title>
  <para>This is a very short article.</para>
</article>
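As a rough illustration of generating another format from a DocBook source, the Python sketch below turns the sample article into a fragment of HTML. It is not how production toolchains work (those typically rely on XSLT stylesheets); the function name article_to_html is hypothetical, and it handles only a top-level title and plain-text para elements, ignoring any inline markup.

```python
import xml.etree.ElementTree as ET
from html import escape

# The DocBook 5 namespace, as declared on the <article> element above.
DOCBOOK_NS = {"db": "http://docbook.org/ns/docbook"}

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">
  <title>Sample article</title>
  <para>This is a very short article.</para>
</article>"""


def article_to_html(docbook_xml):
    """Convert a very simple DocBook article to an HTML fragment."""
    # Parse from bytes so the encoding declaration is honoured by the parser.
    root = ET.fromstring(docbook_xml.encode("utf-8"))
    title = root.findtext("db:title", default="", namespaces=DOCBOOK_NS)
    paragraphs = [p.text or "" for p in root.findall("db:para", DOCBOOK_NS)]
    body = "".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return f"<h1>{escape(title)}</h1>{body}"


print(article_to_html(SAMPLE))
```

Because the element names say what each piece of text is, the conversion is a simple mapping: title becomes a heading and each para becomes a paragraph.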

TEI

The Text Encoding Initiative (TEI) schema is also used to model textual works, but it additionally supports methods for encoding historical and academic texts. TEI allows document authors to include revision history, extensive footnoting and cross-references, and it provides a rich tagging mechanism for poetry, drama, and other forms of human literature.

TEI is frequently used in library digitization and archiving projects, and it can be used to encode texts that might seem otherwise impossible to render in XML. This project from the University of Maryland really shows off TEI’s power: In Transition: Selected Poems by the Baroness Elsa von Freytag-Loringhoven.

XHTML

In lots of ways, XHTML is wholly unsuited for use in book content. XHTML has almost no semantically meaningful elements as applied to literature: there’s no built-in way to indicate a chapter, or footnotes, or dialogue versus description.

The advantage it does have is that it’s ubiquitous — thanks to the web — and many people who otherwise have no experience in XML or text encoding know at least a little HTML. Because of the web there are probably more works written in HTML today than in any other form in history.

By supplementing it with other forms of XML that do provide semantic structure, as in ePub, XHTML is demonstrably a useful and important commercial format.
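One example of such a companion format is the NCX file used in ePub 2 packages, which lists the navigable sections of a book and points at the XHTML files that hold them. The Python sketch below reads a trimmed, illustrative NCX fragment; the file names and chapter titles are invented, a real NCX file carries additional required elements that are omitted here, and table_of_contents is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by NCX navigation files.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Trimmed, illustrative fragment; titles and file names are made up.
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="ch1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
    <navPoint id="ch2" playOrder="2">
      <navLabel><text>Chapter 2</text></navLabel>
      <content src="chapter2.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""


def table_of_contents(ncx_xml):
    """Return (label, XHTML file) pairs for each top-level navPoint."""
    root = ET.fromstring(ncx_xml)
    entries = []
    for point in root.findall("ncx:navMap/ncx:navPoint", NCX_NS):
        label = point.findtext(
            "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
        )
        content = point.find("ncx:content", NCX_NS)
        src = content.get("src") if content is not None else None
        entries.append((label, src))
    return entries


for label, src in table_of_contents(SAMPLE_NCX):
    print(label, "->", src)
```

The XHTML files hold the text, while the surrounding XML supplies the chapter structure that XHTML alone cannot express.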