Bookworm now has full-text search and DTBook support

Wednesday, November 5th, 2008

ePubs added to Bookworm are now fully searchable.

When you add a book to your library, its text is automatically scanned and indexed in the correct language. You can search across all of your books from anywhere in the site.

Bookworm search

Results are returned in relevance order. Bookworm supports many advanced search features, such as stemming and boolean operators, through the use of the Xapian open-source search engine.
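To make "stemming" and "boolean operators" concrete, here is a toy sketch in plain Python. It is not Xapian's API and not Bookworm's code: the `stem`, `build_index` and `search` helpers are hypothetical, the stemmer is deliberately naive, and relevance ranking is left out entirely.

```python
import re
from collections import defaultdict


def stem(word):
    """Very naive suffix stripper; real stemmers are far more careful."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(books):
    """Map each stemmed term to the set of book titles that contain it."""
    index = defaultdict(set)
    for title, text in books.items():
        for word in re.findall(r"[a-zA-Z]+", text):
            index[stem(word)].add(title)
    return index


def search(index, query):
    """Evaluate a flat, left-to-right query such as 'whales AND sailing'."""
    tokens = query.split()
    if not tokens:
        return set()
    results = set(index.get(stem(tokens[0]), set()))
    # Remaining tokens come in (operator, term) pairs.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        matches = index.get(stem(term), set())
        if op == "AND":
            results &= matches
        elif op == "OR":
            results |= matches
        elif op == "NOT":
            results -= matches
        else:
            raise ValueError(f"unknown operator: {op}")
    return results
```

Because both the indexed text and the query pass through the same `stem` function, a search for "whales" matches a book that only contains "whale". A real engine adds ranking, phrase queries and language-specific stemmers on top of this idea.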

Bookworm results

More about Bookworm’s full-text ePub search.

DTBook support

ePubs that use DTBook rather than XHTML can now be viewed and searched just like XHTML ePubs. DTBook ePubs are automatically converted to XHTML using the DAISY pipeline. The original ePub can always be downloaded with its DTBook content intact.

More about Bookworm’s ePub support.

New release of Bookworm: improved user experience and public content

Thursday, October 9th, 2008

Bookworm’s public home page (the one you see if you’re not logged in) has a new look. This is just one of many changes in the largest update since the site launched in July 2008.

Much more public content and help

When I conceived of Bookworm it was largely a way for me and other developers to experiment with ePub books. ePub isn’t a difficult specification and I felt the best way for me to understand it was to implement it, leaving the ugly parts of rendering XHTML to the browser.

Since July, publishers have been accelerating their release of ePub books, and with more devices beginning to support ePub, it felt like time to re-focus Bookworm away from developers and towards readers and publishers.

To that end, Bookworm now includes a tour of the site, a completely new help page with some suggestions for common problems and a rewritten About page that describes the goal of the project.

Publisher-focused

There’s a need for more ePub information targeted at publishing technologists: people who are either actively converting to ePub or are still assessing whether the format is a match for their needs. Bookworm is ideally suited as a platform for publishers to test ePubs or to QA new workflows. Much of the new content is written with this audience in mind.

More advanced developer guidelines

Developers’ needs are still very important to me, especially as ePub evolves. Bookworm provides more visibility into how the site implements the ePub specification, and which features of the specification it does and doesn’t support. I’m hoping this can start a conversation among those organizations which already know that ePub is for them, and are moving to the next level to make full use of it.

User-interface enhancements

It’s now possible to add a book from any page on the site, with just one click: try hovering over the “Add a book” link in the upper right. There are other small details that should make the reading experience smoother, too.

Other code fixes and improvements

This release includes a large number of behind-the-scenes changes to expand the range of ePubs that are accepted. I’m especially grateful for a user’s assistance in fully supporting Chinese language content.

Still coming…

I’ve been promising the ability to search individual books or across one’s library for a long time. Putting that off was tough, but I felt it was more important to make Bookworm easier and friendlier to use. Now I’m going to focus on features that will really take advantage of Bookworm’s online nature in a way that standalone readers and devices just can’t do.

How good are your ePubs?

Wednesday, October 8th, 2008

Most of my work in maintaining the Bookworm ePub reader is keeping up with all of the variations of the format that people try to upload.  There are some consistent problems that I’m seeing “out in the wild,” some serious, some understandable.

Lots of these problems would be caught by epubcheck, which can be used via the threepress.org epubcheck service, but I imagine that many people are testing only by opening the ePub in Adobe Digital Editions. ADE is very forgiving. In the long run it’s best to validate all ePubs, as that makes it far more likely they’ll work properly in future rendering systems that might not be so generous.

  1. Missing required attributes in the metadata. This is the one that’s most likely to get your ePub rejected by Bookworm, and the most common case is missing playOrder attributes in the NCX table of contents file.  The playOrder attribute specifies the order in which the table of contents should be laid out, and it’s easy to miss because its information is usually redundant — generally the navPoints are laid out in document order anyway.  A recent update of Bookworm will allow books that are missing their playOrder to be loaded (it then relies on document order), but strictly speaking, playOrder is required.

  2. Metadata that hasn’t been proofread. I’m not going to name names, but there’s a major publisher releasing ePubs with their own name misspelled in the dc:publisher field.  That’s not only embarrassing, it prevents web-aware ereaders like Bookworm from doing anything useful with that data, like automatically linking back to the publisher’s web site, or showing other books by that publisher.

  3. Improper nesting of the ePub zip file. The META-INF folder and mimetype file inside an ePub must be at the top level of the archive, not in a sub-folder.  Bookworm won’t accept these documents and epubcheck rejects them.   It’s a requirement I might loosen in the future but doing so is not a high priority for me.

  4. Items declared in the OPF file that are missing from the archive. I could “fix” this in Bookworm by ignoring any missing files, but I’ve been reluctant to do it because it could easily lead to Bookworm appearing to be buggy when it isn’t.  For example, I could remove any missing pages from the TOC, but internal document links would still be broken, and if what’s missing is ‘Chapter 7’, I think most readers would want to know this. I feel this is a serious enough problem that I plan to continue to reject books that have this issue.

  5. Invalid XHTML. This is pretty common but not serious in the scope of things. A lot of “XHTML” in ePub is really HTML 4.01 or broken XHTML pretending to be valid. Bookworm does want to parse the content a bit (to do some pre-processing like rendering inline SVG as external links, and to extract just the <body> from the file), but if the content isn’t strictly XHTML it can still cope, just as a web browser does. Nevertheless, if your ePub content isn’t really XML, it limits the number of ways that it could be reused.

    The exceptional cases are ePubs which are themselves generated from web content, such as blogs or fanfic. Cleaning up real-world HTML is an art form and I don’t expect automated tools like BookGlutton’s HTML to ePub converter (which uses tidy) to be able to make it perfect.
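Problems 3 and 4 in the list above are easy to test for mechanically. The following is a minimal sketch using only the Python standard library; it is not Bookworm's or epubcheck's implementation, the `check_epub` name is hypothetical, and it assumes manifest `href` values are plain relative paths with no URL escaping.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def check_epub(path_or_file):
    """Return a list of human-readable problems found in the archive."""
    problems = []
    with zipfile.ZipFile(path_or_file) as archive:
        names = set(archive.namelist())

        # Problem 3: mimetype and META-INF must sit at the top level.
        if "mimetype" not in names:
            problems.append("mimetype is missing from the top level")
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing from the top level")
            return problems  # the OPF cannot be located without it

        # container.xml names the OPF file via rootfile/@full-path.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path") if rootfile is not None else None
        if opf_path is None or opf_path not in names:
            problems.append("the OPF file named in container.xml was not found")
            return problems

        # Problem 4: every manifest item must exist in the archive.
        # Manifest hrefs are relative to the OPF file's own folder.
        opf_dir = posixpath.dirname(opf_path)
        opf = ET.fromstring(archive.read(opf_path))
        for item in opf.findall("opf:manifest/opf:item", OPF_NS):
            href = item.get("href", "")
            target = posixpath.normpath(posixpath.join(opf_dir, href))
            if target not in names:
                problems.append(f"manifest item {href!r} is missing from the archive")
    return problems
```

An empty list means the archive passed these two checks only. It says nothing about playOrder, metadata quality or XHTML validity, so it complements epubcheck rather than replacing it.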

One idiosyncrasy that isn’t technically a problem but has caused me no end of headaches is the issue of internal links within XHTML content.  For example, imagine you have all your content files in a sub-folder, so your ePub looks like:

META-INF/container.xml
mimetype
OEBPS/www/index1.html
OEBPS/www/image1.png
OEBPS/content.opf
OEBPS/toc.ncx

If index1.html uses image1.png as an inline image, what does the value of the src look like? src="www/image1.png" or src="image1.png"?

I see both forms. Bookworm will try to locate the full path first, and then fall back to just looking for the image name anywhere in the archive. This means it could potentially pull the wrong image if you have multiple images with the same name in different sub-folders, but I haven’t seen this happen. (If the src contains an absolute path, it will fail to find the image, a problem that epubcheck would flag.)

To be strictly accurate, any references inside an XHTML file should be relative to that file’s location in the archive. In the above example, the link should be src="image1.png".
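The lookup order described above (the strictly correct relative path first, then a lenient match on the bare file name) can be sketched roughly like this. It is an illustration under my own assumptions rather than Bookworm's actual code, and `resolve_src` is a hypothetical helper.

```python
import posixpath


def resolve_src(src, xhtml_path, archive_names):
    """Return the archive path for an image src, or None if nothing matches."""
    if src.startswith("/"):
        # An absolute path cannot be resolved inside the archive.
        return None

    # Strictly correct: resolve relative to the referencing XHTML file.
    base_dir = posixpath.dirname(xhtml_path)
    candidate = posixpath.normpath(posixpath.join(base_dir, src))
    if candidate in archive_names:
        return candidate

    # Lenient fallback: match the bare file name anywhere in the archive.
    # With duplicate names in different folders this may pick the wrong file.
    basename = posixpath.basename(src)
    for name in sorted(archive_names):
        if posixpath.basename(name) == basename:
            return name
    return None
```

With the example archive, both `src="image1.png"` and `src="www/image1.png"` referenced from `OEBPS/www/index1.html` end up at `OEBPS/www/image1.png`: the first through the strict path, the second only through the fallback.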

Bookworm library integration with Google Book Search

Wednesday, September 24th, 2008

On September 22nd Google Books announced its expanded Google Book Search API, which includes the ability to preview and search Google Books content from other web sites.

Bookworm now has integration with one part of this API. The Book Information page (available from the table of contents for each Bookworm book) displays results from the Google Book Search service for that title and author.

Anne of Green Gables results from Google Book Search

How good are the results?

Frankly I’m disappointed. The metadata is often sloppy: description fields are sometimes nonsensical, there are numerous spacing errors in which words run together, and there is much more data available when you click through to the Google Books page than was returned by the API.

Nevertheless, I have decided to include the data in this single place per book, to help Bookworm users find print editions of their ebooks (especially for public domain books).

The identifier problem

This latest API is not the first that Google Books released, but it is the first that allows arbitrary search queries (such as for title and author name). The previous version only allowed searches by ISBN.

The ePub standard requires that ebooks be tagged with a unique identifier but does not specify what kind of identifier that should be. Obviously public domain works and non-book content don’t have ISBNs. Some publishers are assigning an ISBN as the ePub identifier, but giving their digital editions ISBNs that differ from the print ones. It would be nice if I could uniquely tie the ePub version of a book on Bookworm to its print counterpart (and leverage powerful Google features like searching that book content), but that’s not going to be possible when the editions have different ISBNs. Similarly it would be difficult to encourage users to buy a print version from Amazon or other retailers without running the risk of pointing to an older edition or one by a different publisher.
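For illustration, here is one way to pull the unique identifier out of an OPF file and check whether it happens to be an ISBN-13. This is a sketch under assumptions: the helper names are hypothetical, it expects `dc:identifier` directly under `metadata` (older files may nest it differently), and it only recognizes bare or `urn:isbn:`-prefixed ISBN-13 values.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def unique_identifier(opf_xml):
    """Return the text of the dc:identifier named by package/@unique-identifier."""
    package = ET.fromstring(opf_xml)
    wanted_id = package.get("unique-identifier")
    for ident in package.findall("opf:metadata/dc:identifier", NS):
        if ident.get("id") == wanted_id:
            return (ident.text or "").strip()
    return None


def looks_like_isbn13(value):
    """True if value is a 13-digit ISBN with a valid check digit."""
    cleaned = value.strip().lower()
    if cleaned.startswith("urn:isbn:"):
        cleaned = cleaned[len("urn:isbn:"):]
    cleaned = cleaned.replace("-", "").replace(" ", "")
    if not re.fullmatch(r"[0-9]{13}", cleaned):
        return False
    # ISBN-13 checksum: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(cleaned))
    return total % 10 == 0
```

Even when the check succeeds, the ISBN identifies the digital edition only, so it cannot be assumed to match the print edition that a service like Google Book Search knows about.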

Bookworm feature update: remember where I left off

Thursday, August 14th, 2008

Bookworm will now remember and display the last-read chapter of each book, allowing you to jump right to where you finished reading. This feature applies to both the web and mobile versions of the site.

In addition, a new setting in the Profile page allows you to configure the site to always link the book’s title in this list to the last-read page.  This is especially useful when using the mobile version.


Note that, per the ePub specification, opening a new book will always go to the initial page as defined in the ebook’s OPF file.

Comments or suggestions for improvements on this and other Bookworm features are always welcome.

Recent posts to the O’Reilly TOC blog

Wednesday, August 13th, 2008

Bookworm on the Kindle browser

On the O’Reilly Tools of Change blog recently:

  1. Processing the deep backlist at the New York Times, a report from OSCON
  2. Optimizing web content for the Kindle, using Bookworm screenshots

The latter is part of a series of Kindle articles that I’ll be putting out in the coming weeks, including those on getting inside the device’s operating system (based on Igor Skochinsky’s amazing work).

(You can also read earlier posts by me on TOC.)

I’m also happy to announce that I will be on this year’s TOC Conference program committee.  Proposals for the 2009 conference are due August 25th.

Bookworm feature updates: sorting and pagination

Wednesday, July 30th, 2008

It is now possible to re-sort books in your library by title, first author or creation date, and to re-order those in ascending or descending order.

If the number of books in your library exceeds 20, you will be presented with next/previous pagination controls.
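As a rough sketch of how sorting and 20-per-page pagination fit together (this is not Bookworm's code; the `Book` class, its field names and the `library_page` function are made up for illustration):

```python
from dataclasses import dataclass
from datetime import date

PAGE_SIZE = 20  # the library paginates once it holds more than 20 books


@dataclass
class Book:
    title: str
    first_author: str
    created: date


def library_page(books, sort_by="title", descending=False, page=1):
    """Return (books on this page, has_previous, has_next)."""
    # sort_by is one of the Book field names: title, first_author, created.
    ordered = sorted(books, key=lambda b: getattr(b, sort_by), reverse=descending)
    start = (page - 1) * PAGE_SIZE
    chunk = ordered[start:start + PAGE_SIZE]
    has_previous = page > 1
    has_next = start + PAGE_SIZE < len(ordered)
    return chunk, has_previous, has_next
```

Sorting happens before slicing, so the next/previous controls always walk through the library in the order the reader chose.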

In an earlier post I listed several features that I planned to add shortly. Two of them, mobile layouts and library sorting, are now complete:

  1. Optimized layouts for mobile readers (including the iPhone)
  2. Search within book content
  3. Methods for sorting and managing one’s library
  4. 100% compliance with the IDPF guidelines for ePub reading systems (in regards to XHTML 1.1 content)

Looks like search is up next!

Bookworm mobile screenshots / OSCON

Monday, July 21st, 2008

I’ll be in Portland, OR this weekend for the O’Reilly Open Source Convention, talking with people about future directions for Bookworm and other threepress projects. If you’ll be there and would like to get in touch, the best way to contact me is by email at liza@threepress.org.

Some samples of the current version of Bookworm Mobile running on an iPhone:

(more…)

Mobile Bookworm launched

Friday, July 18th, 2008

A mobile web-optimized version of the Bookworm ePub ebook reader is now available at http://mobile.threepress.org/.

Bookworm Mobile has been specifically customized for the iPhone. Other improvements for different mobile web browsers will be rolled out over time, starting with Opera Mobile.

I welcome input on how to improve the reading experience on small devices, so please comment or send email to info@threepress.org. (I’m especially curious about the Kindle over Whispernet, since I believe Bookworm is the only way to read ePub books on the Kindle at this time.)

Additionally, small user interface enhancements and bug fixes have been released on the main Bookworm site as well.

Bookworm: an online ePub reader

Tuesday, July 15th, 2008

To coincide with the first launch of ePub books by a major publisher, I’m happy to announce the open beta of Bookworm, a web-based reader for the ePub ebook format.

Unlike most other ePub readers, Bookworm allows for full use of stylesheets and images, which is especially critical for technical books which include HTML tables and code samples.

Bookworm is free to use and open source under the BSD license; the code is part of the threepress project and is available here on Google Code.  Currently it should be considered beta software, especially as new publishers begin to release ePub-formatted books in varying ways.

Please be patient if you encounter errors — detailed error reports are automatically emailed to me, but user bug reports are always helpful too.

There will be several major updates to Bookworm in the coming weeks, including:

  1. Optimized layouts for mobile readers (including the iPhone)
  2. Search within book content
  3. Methods for sorting and managing one’s library
  4. 100% compliance with the IDPF guidelines for ePub reading systems (in regards to XHTML 1.1 content)

For more information on Bookworm, please see our About page.