ePub tutorial on IBM DeveloperWorks

Monday, December 1st, 2008

My tutorial aimed at software developers is now available on IBM DeveloperWorks: Build a digital book with EPUB.

(Requires a free registration.)

Summary:

Need to distribute documentation, create an eBook, or just archive your favorite blog posts? EPUB is an open specification for digital books based on familiar technologies like XML, CSS, and XHTML, and EPUB files can be read on portable e-ink devices, mobile phones, and desktop computers. This tutorial explains the EPUB format in detail, demonstrates EPUB validation using Java™ technology, and moves step-by-step through automating EPUB creation using DocBook and Python.

Thanks to Keith Fahlgren at O’Reilly Media for some editorial and technical help.

epubcheck service updated to epubcheck 1.0.3

Wednesday, November 26th, 2008

This is a significant update to the threepress.org epubcheck validation service: flaws in previous versions of Adobe’s epubcheck were causing several critical file types to go completely unvalidated.

I recommend that anyone relying on epubcheck, whether as a standalone library or through the web form, revalidate at least a subset of their ePubs.

tei2epub 1.0 release candidate, and docbook2epub released

Monday, November 17th, 2008

Two updates to the epub-tools Python code libraries:

  1. tei2epub has been updated to version 1.0b2 as a release candidate.
  2. docbook2epub has been released.

The two libraries share common code which is automatically included in any ZIP bundle to handle general ePub tasks, including packing the ZIP file correctly. This required some significant updates to tei2epub, which no longer includes the epubcheck Java library. If the library is downloaded separately then both applications will perform validation after the ePub is built.
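
For readers wondering what “packing the ZIP file correctly” involves: the ePub container format expects the mimetype entry to come first in the archive and to be stored without compression. The following is a minimal sketch of that rule using Python's standard zipfile module. It is not the epub-tools code, and the pack_epub name and the shape of the files argument are assumptions made for illustration.

```python
import zipfile


def pack_epub(epub_path, files):
    """Write an ePub archive from a mapping of archive paths to content.

    `files` is a hypothetical dict such as
    {"META-INF/container.xml": b"...", "OEBPS/content.opf": b"..."}.
    """
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype entry is written first and stored uncompressed
        archive.writestr(
            "mimetype",
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for name, data in files.items():
            if name == "mimetype":
                continue  # already written above
            # Everything else can be compressed as usual
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Writing the mimetype entry separately, before the loop, keeps its position and compression independent of the order in which the other files happen to be supplied.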

docbook2epub is very similar to the db2epub Ruby application which is included with the DocBook XSL. It doesn’t offer any significant features over db2epub; it just happens to be in Python.

Threepress is now an IDPF member

Friday, November 14th, 2008

I’m very pleased to announce that Threepress Consulting Inc. is now an official member of the International Digital Publishing Forum.

I look forward to supporting and participating in the further development of the ePub standard.

ePub production growing fast

Friday, November 7th, 2008

More extrapolation from usage statistics on the threepress.org ePub validation service, which uses Adobe’s epubcheck:

This report, current as of today, tracks visits to the validator. Blue represents visits in the current month; green is the comparison with the previous month.

When I segment by country I get some interesting new results:

India: +115%
US: +66%
Russia: +2,500%
UK: +52%
Canada: +155%
Germany: -25%
New Zealand: +100%
Ukraine: +500%
France: -25%
Philippines: +100%

I’ve highlighted those countries which are essentially new to the report. (The absolute numbers here are still very low by web traffic standards, so a 2,500% gain is not as huge as it sounds.)
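
To make that caveat concrete, here is the arithmetic behind a month-over-month percentage. The visit counts below are invented for illustration (the report's absolute numbers are not given here); they were chosen only to reproduce two of the percentages in the list.

```python
def percent_change(previous, current):
    """Month-over-month change in visits, as a percentage."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous count is zero")
    return (current - previous) * 100 / previous


# Hypothetical visit counts, not taken from the actual report
print(percent_change(2, 52))     # 2500.0
print(percent_change(200, 332))  # 66.0
```

With a baseline of only two visits, fifty additional visits are enough to produce a 2,500% gain.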

India, of course, continues to be the major user of the validator. Nevertheless, it’s nice to see ePubs starting to come out of new places.

Bookworm now has full-text search and DTBook support

Wednesday, November 5th, 2008

ePubs added to Bookworm are now fully searchable.

When you add a book to your library, its text is automatically scanned and indexed in the correct language. You can search across all of your books from anywhere in the site.

Bookworm search

Results are returned in relevance order. Bookworm supports many advanced search features, such as stemming and boolean operators, through the use of the Xapian open-source search engine.

Bookworm results

More about Bookworm’s full-text ePub search.

DTBook support

ePubs that use DTBook rather than XHTML can now be viewed and searched just like XHTML ePubs. DTBook ePubs are automatically converted to XHTML using the DAISY pipeline. The original ePub can always be downloaded with its DTBook content intact.

More about Bookworm’s ePub support.

Some ebooks are buggy — report them

Wednesday, October 22nd, 2008

Many ebooks aren’t going through the same kind of quality control that regular books do.  That’s been my experience and that of other ebook consumers. I’m not talking about technical problems here as much as basic editorial ones.

Sometimes the issues are minor: occasional spacing errors, missing or overzealous capitalization.  Other times they can be more prevalent.  A friend recently purchased Sarah Vowell’s The Wordy Shipmates from the Kindle store and many of the quotation marks were mangled (it’s likely the wrong encoding was used).
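
I don't know which encodings were actually involved in that book, but one common way curly quotes get mangled is UTF-8 text being read as Windows-1252. A minimal Python sketch of that assumed scenario (the misdecode name is just for illustration):

```python
def misdecode(text):
    """Simulate UTF-8 bytes being read as Windows-1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


mangled = misdecode("It\u2019s")  # "It’s" with a curly apostrophe
print(mangled)                    # Itâ€™s

# When no bytes were lost, reversing the two steps recovers the original
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == "It\u2019s")    # True
```

Each curly quote is three bytes in UTF-8, so a single punctuation mark turns into three unrelated characters, which is why the damage is so visible on the page.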

Amazon responded to the customer complaint very quickly, saying that they would notify the publisher and my friend could re-download the corrected book when it was posted. They also gave him a credit for an additional purchase. That’s a good outcome, obviously, but you never have to return a printed book because the punctuation is wrong.

Clearly the quality control needs to be on the publisher end, as each individual bookseller can’t be responsible for checking all of the digital books they offer.  The recent survey conducted at the Frankfurt Book Fair found that 60% of the respondents did not have an ereader, and while I don’t think everyone involved in book publishing actually needs to own one, I’d hope that any group distributing ebooks would be able to review them in the same way that their customers are receiving them.  If you sell Kindle books, someone on your team should have a Kindle and should check at least a representative sample of your offerings, especially if your group is new to digital distribution.

The best thing readers can do to improve ebook quality is to complain.  For now I believe the focus should be on simple fidelity: does this ebook at least contain the same text as the printed version?  Eventually, though, expectations about digital books should rise to the point of considering design. This is especially true when the ePub format is capable of supporting embedded fonts and the same level of aesthetic sophistication that’s present on the web.  Books can be works of art, and ebooks can be beautiful too.

Where in India are the digitization vendors?

Friday, October 10th, 2008

Here’s a good guess:

This is the output from my Google Analytics web traffic report on the country which sends the most visits to threepress.org. 44% of the traffic to the entire site, which includes this blog, some public domain ebooks and my contact information, is to the ePub validation service, a wrapper around Adobe’s epubcheck. (Bookworm statistics are not included in this report.)

India sends three times as much traffic to the validation page as the second-place United States, but only one-third as much traffic to the home page.

It’s even more interesting to look at the “bounce rate” for the home page by country. The bounce rate is the percentage of visits in which a given page is the only one a user looks at before leaving the site, and it’s one of the most useful metrics in web analysis. The overall bounce rate for the threepress.org home page is 37%, meaning 37% of the people who arrived at that page didn’t have a reason to click on another link. For India, that figure is 5%, presumably because they are all clicking through to the validation service. (By contrast, 80% of South African visitors leave immediately, suggesting that some unrelated keyword searches or links are driving them there.)

So if you’re looking for vendors who can provide high-quality, valid ePubs, I’d suggest, in descending order of frequency, suppliers in these cities:

  1. Pune
  2. Delhi
  3. New Delhi
  4. Chennai
  5. Mahape

New release of Bookworm: improved user experience and public content

Thursday, October 9th, 2008

Bookworm’s public home page (the one you see if you’re not logged in) has a new look. This is just one of many changes in the largest update since the site launched in July 2008.

Much more public content and help

When I conceived of Bookworm it was largely a way for me and other developers to experiment with ePub books. ePub isn’t a difficult specification and I felt the best way for me to understand it was to implement it, leaving the ugly parts of rendering XHTML to the browser.

Since July, publishers have been accelerating their release of ePub books, and with more devices beginning to support ePub, it felt like time to re-focus Bookworm away from developers and towards readers and publishers.

To that end, Bookworm now includes a tour of the site, a completely new help page with some suggestions for common problems and a rewritten About page that describes the goal of the project.

Publisher-focused

There’s a need for more ePub information targeted at publishing technologists: people who are either actively converting to ePub or are still assessing whether the format is a match for their needs. Bookworm is ideally suited as a platform for publishers to test ePubs or to QA new workflows. Much of the new content is written with this audience in mind.

More advanced developer guidelines

Developers’ needs are still very important to me, especially as ePub evolves. Bookworm provides more visibility into how the site implements the ePub specification, and which features of the specification it does and doesn’t support. I’m hoping this can start a conversation among those organizations which already know that ePub is for them, and are moving to the next level to make full use of it.

User-interface enhancements

It’s now possible to add a book from any page on the site, with just one click: try hovering over the “Add a book” link in the upper right. There are other small details that should make the reading experience smoother, too.

Other code fixes and improvements

This release includes a large number of behind-the-scenes changes to expand the range of ePubs that are accepted. I’m especially grateful for a user’s assistance in fully supporting Chinese language content.

Still coming…

I’ve been promising the ability to search individual books or across one’s library for a long time. Putting that off was tough, but I felt it was more important to make Bookworm easier and friendlier to use. Now I’m going to focus on features that will really take advantage of Bookworm’s online nature in a way that standalone readers and devices just can’t do.

How good are your ePubs?

Wednesday, October 8th, 2008

Most of my work in maintaining the Bookworm ePub reader is keeping up with all of the variations of the format that people try to upload.  There are some consistent problems that I’m seeing “out in the wild,” some serious, some understandable.

Lots of these problems would be caught by epubcheck, which can be used via the threepress.org epubcheck service, but I imagine that many people are testing only by opening the ePub in Adobe Digital Editions. ADE is very forgiving. In the long run it’s best to validate all ePubs, as valid files are far more likely to work properly in future rendering systems that might not be so generous.

  1. Missing required attributes in the metadata. This is the one that’s most likely to get your ePub rejected by Bookworm, and the most common case is missing playOrder attributes in the NCX table of contents file.  The playOrder attribute specifies the order in which the table of contents should be laid out, and it’s easy to miss because its information is usually redundant — generally the navPoints are laid out in document order anyway.  A recent update of Bookworm will allow books that are missing their playOrder to be loaded (it then relies on document order), but strictly speaking, playOrder is required.

  2. Metadata that hasn’t been proofread. I’m not going to name names, but there’s a major publisher releasing ePubs with their own name misspelled in the dc:publisher field.  That’s not only embarrassing, it prevents web-aware ereaders like Bookworm from doing anything useful with that data, like automatically linking back to the publisher’s web site, or showing other books by that publisher.

  3. Improper nesting of the ePub zip file. The META-INF folder and mimetype file inside an ePub must be at the top level of the archive, not in a sub-folder.  Bookworm won’t accept these documents and epubcheck rejects them.   It’s a requirement I might loosen in the future but doing so is not a high priority for me.

  4. Items declared in the OPF file that are missing from the archive. I could “fix” this in Bookworm by ignoring any missing files, but I’ve been reluctant to do it because it could easily lead to Bookworm appearing to be buggy when it isn’t.  For example, I could remove any missing pages from the TOC, but internal document links will be broken, and if what’s missing is ‘Chapter 7’, I think most readers would want to know this. I feel this is a serious enough problem that I plan to continue to reject books that have this issue.

  5. Invalid XHTML. This is pretty common but not serious in the scope of things. A lot of “XHTML” in ePub is really HTML 4.01 or broken XHTML pretending to be valid. Bookworm does want to parse the content a bit (to do some pre-processing like rendering inline SVG as external links, and to extract just the <body> from the file), but if the content isn’t strictly XHTML it can still cope, just as a web browser does. Nevertheless, if your ePub content isn’t really XML, it limits the number of ways that it could be reused.

    The exceptional cases are ePubs which are themselves generated from web content, such as blogs or fanfic. Cleaning up real-world HTML is an art form and I don’t expect automated tools like BookGlutton’s HTML to ePub converter (which uses tidy) to be able to make it perfect.
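
Three of the structural problems above (1, 3 and 4) are easy to check mechanically. The sketch below uses only the Python standard library and is neither Bookworm's code nor a substitute for epubcheck. The find_problems name and the message strings are made up for illustration, and it assumes manifest hrefs are plain relative paths with no URL escaping.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "ocf": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
}


def find_problems(epub_file):
    """Return a list of human-readable problems found in an ePub archive."""
    problems = []
    with zipfile.ZipFile(epub_file) as archive:
        names = set(archive.namelist())

        # Problem 3: mimetype and META-INF must sit at the top level
        for required in ("mimetype", "META-INF/container.xml"):
            if required not in names:
                problems.append("missing at top level: " + required)
        if "META-INF/container.xml" not in names:
            return problems  # cannot locate the OPF without container.xml

        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(".//ocf:rootfile", NS)
        if rootfile is None or rootfile.get("full-path") not in names:
            problems.append("OPF file not found")
            return problems
        opf_path = rootfile.get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        opf = ET.fromstring(archive.read(opf_path))

        for item in opf.findall(".//opf:manifest/opf:item", NS):
            # Manifest hrefs are relative to the OPF file's folder
            path = posixpath.normpath(posixpath.join(opf_dir, item.get("href", "")))
            if path not in names:
                # Problem 4: declared in the OPF but absent from the archive
                problems.append("declared but missing: " + path)
            elif item.get("media-type") == "application/x-dtbncx+xml":
                # Problem 1: every navPoint in the NCX needs a playOrder
                ncx = ET.fromstring(archive.read(path))
                for nav in ncx.iter("{%s}navPoint" % NS["ncx"]):
                    if nav.get("playOrder") is None:
                        problems.append(
                            "navPoint missing playOrder: " + nav.get("id", "?")
                        )
    return problems
```

An empty list means only that these three checks passed; it says nothing about metadata quality or XHTML validity, which still need epubcheck and a human proofreader.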

One idiosyncrasy that isn’t technically a problem but has caused me no end of headaches is the issue of internal links within XHTML content.  For example, imagine you have all your content files in a sub-folder, so your ePub looks like:

META-INF/container.xml
mimetype
OEBPS/www/index1.html
OEBPS/www/image1.png
OEBPS/content.opf
OEBPS/toc.ncx

If index1.html uses image1.png as an inline image, what does the value of the src look like? src="www/image1.png" or src="image1.png"?

I see both forms. Bookworm will try to locate the full path first, and then fall back to just looking for the image name anywhere in the archive. This means it could potentially pull the wrong image if you have multiple images with the same name in different sub-folders, but I haven’t seen this happen. (If the src contains an absolute path, it will fail to find the image, a problem that epubcheck would flag.)

To be strictly accurate, any references inside an XHTML file should be relative to that file’s location in the archive. In the above example, the link should be src="image1.png".
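
Here is a rough sketch of that lookup order: first the strict reading (relative to the referring file), then a lenient fallback on the bare file name. This is not Bookworm's actual code; resolve_src and its arguments are hypothetical names, and it assumes the src has no URL escaping or fragment.

```python
import posixpath


def resolve_src(doc_path, src, archive_names):
    """Find the archive entry that `src` inside `doc_path` refers to."""
    if src.startswith("/"):
        # Absolute paths are not valid references inside the archive
        return None
    # Strict reading: the reference is relative to the referring file
    strict = posixpath.normpath(posixpath.join(posixpath.dirname(doc_path), src))
    if strict in archive_names:
        return strict
    # Lenient fallback: match on the bare file name anywhere in the archive
    wanted = posixpath.basename(src)
    for name in archive_names:
        if posixpath.basename(name) == wanted:
            return name
    return None


names = [
    "META-INF/container.xml",
    "mimetype",
    "OEBPS/www/index1.html",
    "OEBPS/www/image1.png",
    "OEBPS/content.opf",
    "OEBPS/toc.ncx",
]
print(resolve_src("OEBPS/www/index1.html", "image1.png", names))      # OEBPS/www/image1.png
print(resolve_src("OEBPS/www/index1.html", "www/image1.png", names))  # OEBPS/www/image1.png
```

The second call only succeeds because of the fallback, and the fallback returns the first matching name it sees, which is exactly where the wrong image could be picked up if two sub-folders contain files with the same name.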