Validating EPUB 3 experiments

by Keith Fahlgren

EPUB 3 is tricky to experiment with today. As with any brand-new specification, many of the resources we often take for granted are scarce, from books to software to validation tools. However, if you’re already comfortable getting your hands dirty, you can get meaningful validation for your EPUB 3 documents now. In the future, we’ll probably have a dedicated EPUB 3 validation tool (modeled somewhat on epubcheck, although with quite a few changes, I hope), but I’d like to start working today. This post outlines how.

Note: I’m going to give examples using a number of bare-metal tools available on Mac OS X. These are probably portable to Linux and even Windows if you are motivated, but I’m not going to explain how to install them or set them up (here or in the comments). Google is your friend.

To get started, download all of the EPUB 3 schemas (I put them in an epub30-schemas/ directory), install the absolute latest version of the RELAX NG validator Jing (jing-20091111/ for me), download the ISO Schematron tools (iso-schematron-xslt1.zip is the package for XSLT 1 processors; I unpacked it into iso-schematron/), and make sure you’ve got access to both xsltproc and java. Finally, save this as svrl_as_text.xsl:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
            version="1.0">
  <xsl:output method="text"/>

  <xsl:template match="*|node()">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="svrl:failed-assert|
                       svrl:successful-report">
    <xsl:text>FAILURE: </xsl:text>
    <xsl:value-of select="local-name(.)"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="normalize-space(svrl:text/text())"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

Layout of the EPUB 3 schemas

All of the schemas for EPUB 3 are available as RELAX NG, and some are also available as Schematron. Each schema language has specific strengths, so we use both whenever possible to get a complete list of all the validation issues. The EPUB 3 schemas are broken into separate files for each type of document inside an EPUB 3. You should notice that there is a RELAX NG Compact .rnc file for each type:

epub30-schemas/epub-nav-30.rnc        # EPUB Navigation Document
epub30-schemas/epub-svg-30.rnc        # SVG Content Documents
epub30-schemas/epub-xhtml-30.rnc      # XHTML Content Documents
epub30-schemas/media-overlay-30.rnc   # Media Overlay Documents
epub30-schemas/ocf-container-30.rnc   # META-INF/container.xml
epub30-schemas/ocf-encryption-30.rnc  # META-INF/encryption.xml
epub30-schemas/ocf-signatures-30.rnc  # META-INF/signatures.xml
epub30-schemas/package-30.rnc         # Package Documents

Unsurprisingly, you use a RELAX NG validator with epub30-schemas/media-overlay-30.rnc to validate a Media Overlay document.

A few of these documents also have a Schematron schema with the same prefix but ending with .sch, which is used to express other requirements that aren’t possible in RELAX NG:

epub30-schemas/epub-nav-30.sch
epub30-schemas/epub-svg-30.sch
epub30-schemas/epub-xhtml-30.sch
epub30-schemas/media-overlay-30.sch
epub30-schemas/package-30.sch
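
If you lose track of which document types have a Schematron schema as well as a RELAX NG one, a small shell function can check for you. This is only a sketch: schemas_for and SCHEMA_DIR are names I made up for illustration, and it assumes the schemas live in the epub30-schemas/ directory used in this post.

```shell
# Hypothetical helper: list the schema files that exist for one document type.
# SCHEMA_DIR is an assumed variable; it defaults to the directory used in this post.
SCHEMA_DIR=${SCHEMA_DIR:-epub30-schemas}

schemas_for() {
  # $1 is a schema prefix such as "epub-nav-30" or "ocf-container-30"
  for ext in rnc sch; do
    if [ -f "$SCHEMA_DIR/$1.$ext" ]; then
      printf '%s\n' "$SCHEMA_DIR/$1.$ext"
    fi
  done
}
```

With the files listed above, schemas_for epub-nav-30 should print both the .rnc and the .sch path, while schemas_for ocf-container-30 should print only the .rnc path.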

There are some standalone Schematron validators, but we’re actually going to roll our own tool for more human-readable output.

There’s a third file extension too, .nvdl, which is short for Namespace-based Validation Dispatching Language. These files are supposed to wrap these two schemas together for unified validation tools, but there isn’t good software support for NVDL today. Ignore the .nvdl files for now.

What to validate

I’m currently interested in the EPUB Navigation Document, a reformulation of EPUB’s NCX document as XHTML, so these are the examples we’ll use. However, this approach should work for any of the other document types if you go through the same setup.

Here is a valid, if short, EPUB Navigation Document:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops"
      profile="http://www.idpf.org/epub/30/profile/content/">
  <head>
    <title>EPUB Navigation Document Example (Good)</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <section class="frontmatter TableOfContents">
      <header>
        <h1>Contents</h1>
      </header>
      <nav epub:type="toc" id="toc">
        <ol>
          <li class="toc-prelin" id="toc-prelim">
            <a href="prelims.html">Introduction</a>
          </li>
          <li class="toc-ch01" id="toc-ch01">
            <a href="ch01.html">Chapter 1</a>
          </li>
          <li>
            <a href="copyright.html">Copyright Page</a>
          </li>
        </ol>
      </nav>
      <nav epub:type="landmarks" id="guide">
        <h2>Guide</h2>
        <ol>
          <li>
            <a epub:type="toc" href="#toc">Table of Contents</a>
          </li>
          <li>
            <a epub:type="bodymatter" href="chapter_001.xhtml">Begin Reading</a>
          </li>
          <li>
            <a epub:type="copyright-page" href="copyright.xhtml">Copyright Page</a>
          </li>
        </ol>
      </nav>
    </section>
  </body>
</html>

And here is one with a few errors that should be reported as invalid:


<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops"
      profile="http://www.idpf.org/epub/30/profile/content/">
  <head>
    <title>EPUB Navigation Document Example (Bad)</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <section class="frontmatter TableOfContents">
      <header>
        <h1>Contents</h1>
      </header>
      <nav>
        <!-- this is omitted, which is invalid: epub:type="toc" id="toc" -->
        <ol>
          <li class="toc-prelin" id="toc-prelim">
            <a href="prelims.html">Introduction</a>
          </li>
          <li class="toc-ch01" id="toc-ch01">
            <a href="ch01.html">Chapter 1</a>
            <!-- This is invalid -->
            <span/>
          </li>
          <li>
            <a href="copyright.html">Copyright Page</a>
          </li>
        </ol>
      </nav>
      <nav epub:type="landmarks" id="guide">
        <h2>Guide</h2>
        <ol>
          <li>
            <a epub:type="toc" href="#toc">Table of Contents</a>
          </li>
          <li>
            <a epub:type="bodymatter" href="chapter_001.xhtml">Begin Reading</a>
          </li>
          <li>
            <a epub:type="copyright-page" href="copyright.xhtml">Copyright Page</a>
          </li>
        </ol>
      </nav>
    </section>
  </body>
</html>

RELAX NG validation with Jing

Once you’ve got jing set up, it’s pretty straightforward to validate our files (above) against the appropriate .rnc. We’ll be using the epub30-schemas/epub-nav-30.rnc schema.

When you run jing against a file and it passes, you get no output (good) and an exit code of 0. I’m calling jing as java -jar jing-20091111/bin/jing.jar, passing the -c flag to tell it to expect a Compact version of RELAX NG, and then the schema filename followed by the filename of the document to validate:

$ java -jar jing-20091111/bin/jing.jar -c epub30-schemas/epub-nav-30.rnc good.nav.html
$ echo $? # What was the exit code?
0

Unlike earlier versions of jing, the latest versions have much clearer error reports on invalid documents (we also saw this improvement in epubcheck 1.2 thanks to George Bina from oXygen):

$ java -jar jing-20091111/bin/jing.jar -c epub30-schemas/epub-nav-30.rnc bad.nav.html
bad.nav.html:21:20: error: element "span" not allowed here; expected the element end-tag or element "ol"

…and the exit code is not 0, just as expected:

$ echo $?
1

We can take apart that first bit of output, bad.nav.html:21:20, to know which file had the error (we could run it on many at once) and also the line number (21) and character on that line (20). Line 21 has just what we would expect given the error message (a span instead of another ol or the end of this one), but for other errors looking at the line can be quite illuminating:

$ sed -n -e 21p bad.nav.html
            <span/>
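
Looking up lines by hand gets tedious when there are many errors. Here is a rough sketch of a filter that echoes each report line followed by the source line it points at. The name show_error_lines is hypothetical, and the function assumes that reports look like file:line:column: error: message, that filenames contain no colons, and that jing writes its reports to standard output (add 2>&1 if yours does not).

```shell
# Hypothetical helper: echo each jing report line, then the source line it names.
# Assumes reports look like "file:line:column: error: message" and that
# filenames contain no colons.
show_error_lines() {
  while IFS= read -r report; do
    printf '%s\n' "$report"
    file=${report%%:*}      # text before the first colon
    rest=${report#*:}       # everything after the first colon
    line=${rest%%:*}        # text before the next colon
    case $line in
      ''|*[!0-9]*) continue ;;   # not a line number, so nothing to look up
    esac
    if [ -r "$file" ]; then
      sed -n "${line}p" "$file"
    fi
  done
}
```

You would pipe jing into it, for example java -jar jing-20091111/bin/jing.jar -c epub30-schemas/epub-nav-30.rnc bad.nav.html | show_error_lines. Keep in mind that the pipeline's exit code is then the function's rather than jing's, so run jing on its own when you need the exit code.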

Note: For really large documents, you may get a java.lang.OutOfMemoryError or other exception. Find out how to give jing more “heap space”.
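
One common way to give a Java program more heap is the JVM's -Xmx option, which sets the maximum heap size and has to come before -jar. The wrapper below is a sketch rather than anything jing provides: jing_rnc, JING_JAR, JING_HEAP and JAVA are names I invented, and the 1g default is an arbitrary starting point.

```shell
# Hypothetical wrapper around the jing command used in this post.
# JING_JAR, JING_HEAP and JAVA are assumed names, not part of jing itself.
JING_JAR=${JING_JAR:-jing-20091111/bin/jing.jar}

jing_rnc() {
  # -Xmx sets the JVM's maximum heap size; it must come before -jar
  "${JAVA:-java}" "-Xmx${JING_HEAP:-1g}" -jar "$JING_JAR" -c "$@"
}
```

Calling jing_rnc epub30-schemas/epub-nav-30.rnc bad.nav.html runs the same validation as before with a 1 GB heap limit, and setting JING_HEAP=2g first raises the limit for really large documents.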

Schematron validation with XSLT

Validating against the Schematron schemas is a little more involved, but it catches some validation errors that jing and RELAX NG just cannot find. First we turn the .sch file into a re-usable XSLT stylesheet that produces Schematron Validation Report Language (SVRL). We can then run that stylesheet on any document of that type inside an EPUB 3 file to produce SVRL, which we then transform into something human-readable.

First we create our validation stylesheet, epub-nav-30.sch.xsl, from the epub30-schemas/epub-nav-30.sch Schematron schema:

$ xsltproc -o epub-nav-30.sch.xsl iso-schematron/iso_svrl_for_xslt1.xsl epub30-schemas/epub-nav-30.sch

Now we can use epub-nav-30.sch.xsl on any EPUB Navigation Document:

$ xsltproc epub-nav-30.sch.xsl bad.nav.html
<?xml version="1.0" standalone="yes"?>
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl" xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:schold="http://www.ascc.net/xml/schematron" xmlns:sch="http://www.ascc.net/xml/schematron"
xmlns:iso="http://purl.oclc.org/dsdl/schematron" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" title="" schemaVersion="">
  <!--   
		   
		   
		 -->
  <svrl:ns-prefix-in-attribute-values uri="http://www.w3.org/1999/xhtml" prefix="html"/>
  <svrl:ns-prefix-in-attribute-values uri="http://www.idpf.org/2007/ops" prefix="epub"/>
  <svrl:active-pattern id="nav-ocurrence" name="nav-ocurrence"/>
  <svrl:fired-rule context="html:body"/>
   ... many more lines ...

…but we rarely want to read the SVRL it outputs directly (although sometimes it is worth it for the extra detail it contains), so we need to send it through another stylesheet (svrl_as_text.xsl from above) to get a human-readable output:

$ xsltproc epub-nav-30.sch.xsl bad.nav.html | xsltproc svrl_as_text.xsl -
FAILURE: failed-assert: Exactly one 'toc' nav element must be present
FAILURE: failed-assert: Spans within nav elements must contain text
FAILURE: failed-assert: nav elements other than 'toc', 'page-list' and 'landmarks' must contain a heading as the first child

These are completely new issues that jing could not catch. Note that the issue about the span is actually distinct from the one above, which said it was in the wrong place, whereas this says that it has the wrong content (none at all, in fact).

Unlike with jing, we don’t get meaningful exit codes. Although that is not too hard to add, it’s slightly tricky to get all of the errors and an exit code rather than just exiting on the first one, which can make you think your document is less invalid than it really is. We still get no output for valid documents:

$ xsltproc epub-nav-30.sch.xsl good.nav.html | xsltproc svrl_as_text.xsl -
# no output
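
One possible way to add an exit code while still seeing every failure is to put a small filter at the end of the pipeline. This is a sketch under assumptions: svrl_status is a hypothetical name, and it relies on the FAILURE: prefix that svrl_as_text.xsl writes at the start of each reported line.

```shell
# Hypothetical filter: pass the text report through unchanged, then
# return 1 if it contained any FAILURE line and 0 otherwise.
svrl_status() {
  report=$(cat)                 # read the whole report before deciding
  if [ -z "$report" ]; then
    return 0                    # no output means no reported failures
  fi
  printf '%s\n' "$report"
  if printf '%s\n' "$report" | grep '^FAILURE: ' > /dev/null; then
    return 1
  fi
  return 0
}
```

Appending | svrl_status to either pipeline above and then running echo $? should give 1 when any failure was printed and 0 when there was none. Note that an empty report also yields 0, so if xsltproc itself fails (on a malformed document, say) this sketch will not notice; treat a 0 as "no Schematron failures reported" rather than proof of validity.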

I’m certain to have made lots of mistakes in the examples above. If you spot some, please let me know in the comments and I’ll correct the post.