Validating EPUB 3 experiments
by Keith Fahlgren
EPUB 3 is tricky to experiment with today. As with any brand-new specification, many of the resources we often take for granted don’t exist yet, from books to software to validation tools. However, if you’re already comfortable getting your hands dirty, you can get meaningful validation for your EPUB 3 documents now. In the future, we’ll probably have a dedicated EPUB 3 validation tool (modeled somewhat on epubcheck, although with quite a few changes, I hope), but I’d like to start working today. This post outlines how.
Note: I’m going to give examples using a number of bare-metal tools available on Mac OS X. These are probably portable to Linux and even Windows if you were motivated, but I’m not going to explain how to install them or set them up (here or in the comments). Google is your friend.
To get started, download all of the EPUB 3 schemas (I put them in an epub30-schemas/ directory), install the absolute latest version of the RELAX NG validator Jing (jing-20091111/ for me), download the Schematron tools for XSLT 1 processors, iso-schematron-xslt1.zip (unpacked into iso-schematron/ for me), and make sure you’ve got access to both xsltproc and java. Finally, save this as svrl_as_text.xsl:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
                version="1.0">
  <xsl:output method="text"/>

  <!-- By default, walk the SVRL tree and emit nothing -->
  <xsl:template match="*|node()">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Print one line per failed assertion or successful report -->
  <xsl:template match="svrl:failed-assert|svrl:successful-report">
    <xsl:text>FAILURE: </xsl:text>
    <xsl:value-of select="local-name(.)"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="normalize-space(svrl:text/text())"/>
    <!-- End each message with a newline -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
Layout of the EPUB 3 schemas
All of the schemas for EPUB 3 are available as RELAX NG and sometimes Schematron. Each one has specific strengths, so we use both schemas whenever possible to get a complete list of all the validation issues. The EPUB 3 schemas are broken into separate files for each type of document inside an EPUB 3. You should notice that there is a RELAX NG Compact .rnc file for each type:
epub30-schemas/epub-nav-30.rnc        # EPUB Navigation Document
epub30-schemas/epub-svg-30.rnc        # SVG Content Documents
epub30-schemas/epub-xhtml-30.rnc      # XHTML Content Documents
epub30-schemas/media-overlay-30.rnc   # Media Overlay Documents
epub30-schemas/ocf-container-30.rnc   # META-INF/container.xml
epub30-schemas/ocf-encryption-30.rnc  # META-INF/encryption.xml
epub30-schemas/ocf-signatures-30.rnc  # META-INF/signatures.xml
epub30-schemas/package-30.rnc         # Package Documents
Unsurprisingly, you use a RELAX NG validator with epub30-schemas/media-overlay-30.rnc to validate a Media Overlay document.
A few of these documents also have a Schematron schema with the same prefix but ending with .sch, which is used to express other requirements that aren’t possible in RELAX NG:
epub30-schemas/epub-nav-30.sch
epub30-schemas/epub-svg-30.sch
epub30-schemas/epub-xhtml-30.sch
epub30-schemas/media-overlay-30.sch
epub30-schemas/package-30.sch
There are some standalone Schematron validators, but we’re actually going to roll our own tool for more human-readable output.
There’s a third file extension too, .nvdl, which is short for Namespace-based Validation Dispatching Language. These files are supposed to wrap these two schemas together for unified validation tools, but there isn’t good software support for NVDL today. Ignore the .nvdl files for now.
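To see at a glance which document types get both kinds of validation, something like the following could help. It is only a sketch: the function name list_schema_pairs is made up for illustration, and the default directory assumes you unpacked the schemas into epub30-schemas/ as I did.

```shell
#!/bin/sh
# Sketch: for each RELAX NG Compact schema in a directory, report whether
# a Schematron schema with the same prefix sits next to it.
# list_schema_pairs is a hypothetical helper name; the default directory
# (epub30-schemas) is an assumption about your layout.
list_schema_pairs() {
  dir="${1:-epub30-schemas}"
  for rnc in "$dir"/*.rnc; do
    [ -e "$rnc" ] || continue          # the glob matched nothing
    base="${rnc%.rnc}"                 # strip the .rnc extension
    if [ -e "$base.sch" ]; then
      echo "$(basename "$rnc"): RELAX NG + Schematron"
    else
      echo "$(basename "$rnc"): RELAX NG only"
    fi
  done
}
```

Calling list_schema_pairs epub30-schemas prints one line per .rnc file, so the types that only get RELAX NG validation are easy to spot.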
What to validate
I’m currently interested in the EPUB Navigation Document, a reformulation of EPUB’s NCX document as XHTML, so these are the examples we’ll use. However, this approach should work for any of the other document types if you go through the same setup.
Here is a valid, if short, EPUB Navigation Document:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" profile="http://www.idpf.org/epub/30/profile/content/">
  <head>
    <title>EPUB Navigation Document Example (Good)</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <section class="frontmatter TableOfContents">
      <header>
        <h1>Contents</h1>
      </header>
      <nav epub:type="toc" id="toc">
        <ol>
          <li class="toc-prelin" id="toc-prelim">
            <a href="prelims.html">Introduction</a>
          </li>
          <li class="toc-ch01" id="toc-ch01">
            <a href="ch01.html">Chapter 1</a>
          </li>
          <li>
            <a href="copyright.html">Copyright Page</a>
          </li>
        </ol>
      </nav>
      <nav epub:type="landmarks" id="guide">
        <h2>Guide</h2>
        <ol>
          <li>
            <a epub:type="toc" href="#toc">Table of Contents</a>
          </li>
          <li>
            <a epub:type="bodymatter" href="chapter_001.xhtml">Begin Reading</a>
          </li>
          <li>
            <a epub:type="copyright-page" href="copyright.xhtml">Copyright Page</a>
          </li>
        </ol>
      </nav>
    </section>
  </body>
</html>
And here is one with a few errors that should be reported as invalid:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" profile="http://www.idpf.org/epub/30/profile/content/">
  <head>
    <title>EPUB Navigation Document Example (Bad)</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <section class="frontmatter TableOfContents">
      <header>
        <h1>Contents</h1>
      </header>
      <nav>
        <!-- this is omitted, which is invalid: epub:type="toc" id="toc" -->
        <ol>
          <li class="toc-prelin" id="toc-prelim">
            <a href="prelims.html">Introduction</a>
          </li>
          <li class="toc-ch01" id="toc-ch01">
            <a href="ch01.html">Chapter 1</a>
            <!-- This is invalid -->
            <span/>
          </li>
          <li>
            <a href="copyright.html">Copyright Page</a>
          </li>
        </ol>
      </nav>
      <nav epub:type="landmarks" id="guide">
        <h2>Guide</h2>
        <ol>
          <li>
            <a epub:type="toc" href="#toc">Table of Contents</a>
          </li>
          <li>
            <a epub:type="bodymatter" href="chapter_001.xhtml">Begin Reading</a>
          </li>
          <li>
            <a epub:type="copyright-page" href="copyright.xhtml">Copyright Page</a>
          </li>
        </ol>
      </nav>
    </section>
  </body>
</html>
RELAX NG validation with Jing
Once you’ve got jing set up, it’s pretty straightforward to validate our files (above) against the appropriate .rnc. We’ll be using the epub30-schemas/epub-nav-30.rnc schema. When you run jing against a file and it passes, you get no output (good) and an exit code of 0. I’m calling java -jar jing-20091111/bin/jing.jar, passing the -c flag to tell it to expect the Compact syntax of RELAX NG, and then the schema filename followed by the filename of the document to validate:
$ java -jar jing-20091111/bin/jing.jar -c epub30-schemas/epub-nav-30.rnc good.nav.html
$ echo $? # What was the exit code?
0
Running the same command against the invalid document reports an error:
$ java -jar jing-20091111/bin/jing.jar -c epub30-schemas/epub-nav-30.rnc bad.nav.html
bad.nav.html:21:20: error: element "span" not allowed here; expected the element end-tag or element "ol"
…and the exit code is not 0, just as expected:
$ echo $?
1
We can take apart that first bit of output, bad.nav.html:21:20, to know which file had the error (we could run it on many at once) and also the line number (21) and character on that line (20). Line 21 has just what we would expect given the error message (a span instead of another ol or the end of this one), but for other errors looking at the line can be quite illuminating:
$ sed -n -e 21p bad.nav.html
            <span/>
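If you are checking many files, looking up each line by hand gets old. The sketch below automates the sed step: it reads jing’s messages and prints the offending source line under each one. The function name show_jing_context is hypothetical, and the sketch assumes messages use the FILE:LINE:COLUMN: prefix shown above and that file names contain no colons.

```shell
#!/bin/sh
# Sketch: read jing messages on stdin and print each offending source line
# beneath its message. show_jing_context is a hypothetical helper name.
# Assumes the FILE:LINE:COLUMN: prefix and file names without colons.
show_jing_context() {
  while IFS= read -r msg; do
    printf '%s\n' "$msg"
    file=$(printf '%s\n' "$msg" | cut -d: -f1)
    line=$(printf '%s\n' "$msg" | cut -d: -f2)
    case "$line" in
      ''|*[!0-9]*) continue ;;         # not a FILE:LINE:... message
    esac
    if [ -f "$file" ]; then
      sed -n "${line}p" "$file"        # print just that line of the file
    fi
  done
  return 0
}
```

You would pipe jing into it, for example java -jar jing-20091111/bin/jing.jar -c epub30-schemas/epub-nav-30.rnc bad.nav.html | show_jing_context. I believe jing writes its messages to standard output; if your build does not, add 2>&1 before the pipe. Keep in mind that the exit code you see after a pipeline belongs to the last command, not to jing.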
Note: For really large documents, you may get a java.lang.OutOfMemoryError or other exception. Find out how to give jing more “heap space”.
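One common way to do that is the JVM’s -Xmx option, which sets the maximum heap size. The wrapper below is a sketch, not something I have tuned: the function name jing_big and the JING_HEAP and JING_JAR variables are invented for illustration, and 512m is an arbitrary starting value.

```shell
#!/bin/sh
# Sketch: run jing with a larger Java heap via the standard -Xmx JVM option.
# jing_big, JING_HEAP and JING_JAR are hypothetical names; the default jar
# path matches the layout used in this post and 512m is an arbitrary guess.
jing_big() {
  java -Xmx"${JING_HEAP:-512m}" -jar "${JING_JAR:-jing-20091111/bin/jing.jar}" "$@"
}
```

For example, JING_HEAP=1g jing_big -c epub30-schemas/epub-nav-30.rnc some-large-file.html asks for up to 1 GB of heap, where some-large-file.html stands in for your own document.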
Schematron validation with XSLT
Validating the Schematron schemas is a little more involved, but it catches some validation errors that jing and RELAX NG just cannot find. First we turn the .sch file into a re-usable XSLT stylesheet that produces Schematron Validation Report Language (SVRL). We can then run that stylesheet on any document of that type inside an EPUB 3 file to produce SVRL, which we then transform into something human-readable.
First we create our validation stylesheet, epub-nav-30.sch.xsl, from the epub30-schemas/epub-nav-30.sch Schematron schema:
$ xsltproc -o epub-nav-30.sch.xsl iso-schematron/iso_svrl_for_xslt1.xsl epub30-schemas/epub-nav-30.sch
Now we can use epub-nav-30.sch.xsl on any EPUB Navigation Document:
$ xsltproc epub-nav-30.sch.xsl bad.nav.html
<?xml version="1.0" standalone="yes"?>
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
                        xmlns:xs="http://www.w3.org/2001/XMLSchema"
                        xmlns:schold="http://www.ascc.net/xml/schematron"
                        xmlns:sch="http://www.ascc.net/xml/schematron"
                        xmlns:iso="http://purl.oclc.org/dsdl/schematron"
                        xmlns:html="http://www.w3.org/1999/xhtml"
                        xmlns:epub="http://www.idpf.org/2007/ops"
                        title="" schemaVersion="">
  <!-- -->
  <svrl:ns-prefix-in-attribute-values uri="http://www.w3.org/1999/xhtml" prefix="html"/>
  <svrl:ns-prefix-in-attribute-values uri="http://www.idpf.org/2007/ops" prefix="epub"/>
  <svrl:active-pattern id="nav-ocurrence" name="nav-ocurrence"/>
  <svrl:fired-rule context="html:body"/>
  ... many more lines ...
…but we rarely want to read the SVRL it outputs directly (although sometimes it is worth it for the extra detail it contains), so we need to send it through another stylesheet (svrl_as_text.xsl from above) to get a human-readable output:
$ xsltproc epub-nav-30.sch.xsl bad.nav.html | xsltproc svrl_as_text.xsl -
FAILURE: failed-assert: Exactly one 'toc' nav element must be present
FAILURE: failed-assert: Spans within nav elements must contain text
FAILURE: failed-assert: nav elements other than 'toc', 'page-list' and 'landmarks' must contain a heading as the first child
These are completely new issues that jing could not catch. Note that the issue about the span is actually distinct from the one above, which said it was in the wrong place, whereas this says that it has the wrong content (none at all, in fact).
Unlike jing, we don’t get meaningful exit codes. Although that is not too hard to add, it’s slightly tricky to get all of the errors and exit codes rather than just exiting on the first one, which can make you think your document is less invalid than it really is. We still get no output for valid documents:
$ xsltproc epub-nav-30.sch.xsl good.nav.html | xsltproc svrl_as_text.xsl -
# no output
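One way to bolt an exit code onto the end of that pipeline, while still seeing every failure, is a small filter like the sketch below. It is an illustration rather than a finished tool: the function name schematron_status is hypothetical, and it relies only on the "FAILURE: " prefix that svrl_as_text.xsl writes.

```shell
#!/bin/sh
# Sketch: give the Schematron pipeline an exit code without hiding errors.
# Reads the whole text report on stdin, echoes it unchanged, then returns 1
# if any line starts with "FAILURE: " and 0 otherwise.
# schematron_status is a hypothetical helper name.
schematron_status() {
  report=$(cat)                        # collect the full report first
  if [ -n "$report" ]; then
    printf '%s\n' "$report"
  fi
  if printf '%s\n' "$report" | grep -q '^FAILURE: '; then
    return 1
  fi
  return 0
}
```

Appended to the commands above, that would look like xsltproc epub-nav-30.sch.xsl bad.nav.html | xsltproc svrl_as_text.xsl - | schematron_status, followed by echo $? to see the result. Because the filter reads the complete report before deciding, every failure is still printed. It does not notice if xsltproc itself fails partway through, so treat a 0 as "no failures reported" rather than proof of validity.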
I’m certain to have made lots of mistakes in the examples above. If you spot some, please let me know in the comments and I’ll correct the post.