Python and XML (and Google!) in publishing applications

Tuesday, October 28th, 2008

IBM DeveloperWorks has just released an article of mine on High-Performance XML Parsing in Python.  Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.

Massive XML files are uncommon in the general web development world, where the primary roles of XML are either as configuration files, read only infrequently, or for interchange across the web, in which case the files are necessarily small. It’s rare to encounter XML measured in gigabytes or more; data at that level is usually stored in a relational database.

For that reason I find myself frustrated with many XML tools, even those ostensibly designed to handle large amounts of data.  Too often they don’t scale well, or at least not easily.  I don’t believe that scaling should be a black art that each individual developer needs to solve independently.  Unfortunately, in commercial products ease-of-use is a key bullet point and computationally difficult problems are hard to summarize in a user’s guide.
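As one illustration of a technique that does scale, here is a minimal sketch of incremental parsing with the standard library's ElementTree. It is not necessarily the approach taken in the article mentioned above; the function name count_records and the sample data are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag):
    """Count <tag> elements without keeping their contents in memory."""
    count = 0
    # iterparse yields each element as its end tag is read, so the
    # document is processed as a stream rather than loaded up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Discard the element's children and text once handled.
            elem.clear()
    return count


# Tiny stand-in for a very large file; a real call would pass a filename.
sample = io.BytesIO(
    b"<articles><article><title>A</title></article>"
    b"<article><title>B</title></article></articles>"
)
print(count_records(sample, "article"))
```

Note that the cleared elements remain attached to the root as empty shells, so memory still grows slowly with the number of records; for truly huge files those shells would need to be removed as well.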

I tend to recommend open-source software most strongly in two scenarios: for small projects with limited budgets and for large projects with unique challenges.  There simply isn’t going to be a one-size-fits-all application for most interesting publishing work.

This is one of many reasons I’m excited by Google’s willingness to open its Google Books archive to researchers:  Python is a first-class programming language in the Google ecosystem, and Google has a good track record of open-sourcing those internal tools with limited commercial value.  I expect a lot of interesting work to come out of that archive once it’s available.

The analog hole, and a seminar on digitization

Thursday, October 23rd, 2008

Over on Tools of Change there’s a post of mine discussing the so-called “analog hole” as it applies to digital books.  It was a fun article to write, especially the hands-on part.  I used Google’s OCRopus open-source OCR software, which was a little impenetrable to someone outside of the machine-learning community but did a good job once I fumbled around with it for a while.

Also on that page at the moment is a giant photo of my head advertising What Publishers Need to Know About Digitization, a web seminar I’ll be hosting with O’Reilly Media on November 12. It will be a very high-level, introductory overview aimed at non-technical staff in publishing who are considering a digitization project.

Going full-circle, I wonder if there would be interest in a simple web-based OCR service where publishers could upload a scanned document to see how well bare-bones OCR performed on an image-only PDF or JPEG scan. I imagine it might help them predict the complexity of a digitization project and understand some of the challenges inherent in the process.

The real Internet Archive

Wednesday, October 22nd, 2008

My attention was caught by this quote from Clay Shirky on the excellent ReadWriteWeb blog:

“Back in 1974, when the Internet was a fraction of what it is now, the acorn to an oak, there were really only two applications,” said Shirky, “Telnet, and FTP.”

Surely he’s wrong, I thought.  Those protocols aren’t that old.

But I was wrong. FTP was invented in 1971, and telnet was developed in 1969.

(Telnet is a way to connect interactively with another computer. In practice it’s been replaced by the more secure ssh, but vestigial copies remain on all modern computers.)

What really threw me wasn’t that telnet was from 1969 as much as that it was RFC 15.  In the networked world, Requests for Comments are documents which define the standards that computers use when communicating with each other.  To understand how old RFC 15 is, consider that the venerable FTP is RFC 114, while email as we know it is RFC 821 (1982), and HTTP is RFC 1945 (1996, although obviously it had been in use for years). As of this writing, the most recent RFC is 5382. RFC 15 is ancient history.

Because I am a nerd I spent some time browsing the early RFCs, and I was struck by how charmingly antique they are. RFC 16 says that M.I.T. should receive copies of RFCs. RFC 6 begins, “I talked with Bob Kahn at BB&N yesterday.” RFC 14 never existed.

RFC 7 (“Host-IMP Interface”) includes a prefatory note:

The original of RFC 7 was hand-written, and only partially illegible [sic]
copies exist.

Indeed, the actual RFC begins:

This paper is concerned with the preliminary software design of the
Host IMP interface. Its main purpose is on the one hand to define
functions that will be implemented, and on the other hand to provide
a base for discussions and …(unreadable).

I’m on the mailing list for users of the Text Encoding Initiative (TEI), an XML schema used primarily for encoding historical texts. The schema is equipped with tags for tracking everything about a document, including changes that occur over centuries of time. On the TEI list, people ask questions like, “How do I represent a medieval manuscript and also indicate which passages were underlined by an 18th century owner?” or “What tag should I use for a poem title that was handwritten vertically in the left margin?” (Promptly followed by vigorous scholarly debates over the “correct” answers.)

There’s something charming about how early internet history, just 40 years old, is almost as poorly documented as those manuscripts and just as much in need of careful archivists.

Some ebooks are buggy — report them

Wednesday, October 22nd, 2008

Many ebooks aren’t going through the same kind of quality control that regular books do.  That’s been my experience and that of other ebook consumers. I’m not talking about technical problems here as much as basic editorial ones.

Sometimes the issues are minor: occasional spacing errors, missing or overzealous capitalization.  Other times they can be more prevalent.  A friend recently purchased Sarah Vowell’s The Wordy Shipmates from the Kindle store and many of the quotation marks were mangled (it’s likely the wrong encoding was used).
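To show how a wrong encoding can mangle quotation marks, here is a small sketch. It assumes UTF-8 text that was read back as Windows-1252, which is only one plausible cause and not a diagnosis of that particular book; the sample string is invented.

```python
# UTF-8 text decoded with the wrong codec (Windows-1252 here, chosen
# only for illustration) turns each curly quote into three stray characters.
original = "Vowell\u2019s \u201cShipmates"
mangled = original.encode("utf-8").decode("cp1252")
print(mangled)  # prints: Vowellâ€™s â€œShipmates

# The damage is reversible when the wrong codec is known.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == original)  # prints: True
```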

Amazon responded to the customer complaint very quickly, saying that they would notify the publisher and my friend could re-download the corrected book when it was posted. They also gave him a credit for an additional purchase. That’s a good outcome, obviously, but you never have to return a printed book because the punctuation is wrong.

Clearly the quality control needs to be on the publisher end, as each individual bookseller can’t be responsible for checking all of the digital books they offer.  The recent survey conducted at the Frankfurt Book Fair found that 60% of the respondents did not have an ereader, and while I don’t think everyone involved in book publishing actually needs to own one, I’d hope that any group distributing ebooks would be able to review them in the same way that their customers are receiving them.  If you sell Kindle books, someone on your team should have a Kindle and should check at least a representative sample of your offerings, especially if your group is new to digital distribution.

The best thing readers can do to improve ebook quality is to complain.  For now I believe the focus should be on simple fidelity: does this ebook at least contain the same text as the printed version?  Eventually, though, expectations about digital books should rise to the point of considering design. This is especially true when the ePub format is capable of supporting embedded fonts and the same level of aesthetic sophistication that’s present on the web.  Books can be works of art, and ebooks can be beautiful too.

TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum

Tuesday, October 14th, 2008

I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews. I can’t read Dutch, but Google’s automated translation tells me that the review of Hamlet mentions a “long modern clown,” which sounds disturbing enough that I’ll leave the actual reading to someone else.

The source documents are encoded in TEI XML and rendered to the browser using Python and lxml, three of my favorite technologies.

There are a few take-aways from this project that might benefit anyone working in a similar area and scale:

  • Use a standard encoding format (in this case TEI, but choose an appropriate one based on the source content)
  • Use a modern programming language, even in a humanities context (e.g. Python)
  • Use modern XML parsing tools (e.g. lxml + XPath + XSLT)

The key advantage of libraries such as lxml in publishing and digitization projects is that they allow the developer to freely mix XML-native languages like XPath and XSLT with the expressive, procedural programming style of Python. I’m still amazed by how many people are “parsing” XML using regular expressions (or worse), or using plain CGI/Perl scripts to serve up content. There are easier ways!
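A minimal sketch of that mix of path queries and ordinary Python follows. So that it runs without extra installs it uses the standard library's ElementTree, whose limited path syntax stands in for the full XPath and XSLT support that lxml provides through a largely compatible API. The TEI-like fragment and the function name titles_by_author are invented for the example and are not taken from the project.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented TEI-like fragment; a real corpus would be parsed from files.
SAMPLE = """
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc><titleStmt>
    <title>Hamlet</title><author>Critic A</author>
  </titleStmt></fileDesc></teiHeader></TEI>
  <TEI><teiHeader><fileDesc><titleStmt>
    <title>Faust</title><author>Critic B</author>
  </titleStmt></fileDesc></teiHeader></TEI>
  <TEI><teiHeader><fileDesc><titleStmt>
    <title>Macbeth</title><author>Critic A</author>
  </titleStmt></fileDesc></teiHeader></TEI>
</teiCorpus>
"""


def titles_by_author(xml_text):
    """Group titles by author: path queries select, Python aggregates."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    # The path expression finds every titleStmt at any depth.
    for stmt in root.findall(".//tei:titleStmt", TEI_NS):
        author = stmt.findtext("tei:author", default="unknown", namespaces=TEI_NS)
        title = stmt.findtext("tei:title", default="", namespaces=TEI_NS)
        grouped[author].append(title)
    return dict(grouped)


print(titles_by_author(SAMPLE))
```

The selection is declarative and the grouping is plain Python, which is hard to do robustly with regular expressions once namespaces and nesting are involved.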

“Free” doesn’t have to mean primitive. In fact I would argue that projects like Pinax can jump-start library or digital archive sites into the 21st century with less work than a grad student will spend crafting a bespoke Perl script.

Congratulations to Thomas Crombez and his team!

Where in India are the digitization vendors?

Friday, October 10th, 2008

Here’s a good guess:

This is the output from my Google Analytics web traffic report on the country which sends the most visits to threepress.org. 44% of the traffic to the entire site, which includes this blog, some public domain ebooks and my contact information, is to the ePub validation service, a wrapper around Adobe’s epubcheck. (Bookworm statistics are not included in this report.)

India sends three times as much traffic to the validation page as the second-place United States, but only one-third as much traffic to the home page.

It’s even more interesting to look at the “bounce rate” for the home page by country. The “bounce rate” is the percentage of visits that begin on a given page and end there, with the visitor leaving the site without viewing any other page, and it’s one of the most useful metrics in web analysis. The overall bounce rate for the threepress.org home page is 37%, meaning 37% of the people who landed on that page didn’t have a reason to click on another link. For India, that figure is 5%, presumably because they are all clicking through to the validation service. (By contrast, 80% of South African visitors leave immediately, suggesting that some unrelated keyword searches or links are driving them there.)

So if you’re looking for vendors who can provide high-quality, valid ePubs, I’d suggest, in descending order of frequency, suppliers in these cities:

  1. Pune
  2. Delhi
  3. New Delhi
  4. Chennai
  5. Mahape

New release of Bookworm: improved user experience and public content

Thursday, October 9th, 2008

Bookworm’s public home page (the one you see if you’re not logged in) has a new look. This is just one of many changes in the largest update since the site launched in July 2008.

Much more public content and help

When I conceived of Bookworm it was largely a way for me and other developers to experiment with ePub books. ePub isn’t a difficult specification and I felt the best way for me to understand it was to implement it, leaving the ugly parts of rendering XHTML to the browser.

Since July, publishers have been accelerating their release of ePub books, and with more devices beginning to support ePub, it felt like time to re-focus Bookworm away from developers and towards readers and publishers.

To that end, Bookworm now includes a tour of the site, a completely new help page with some suggestions for common problems and a rewritten About page that describes the goal of the project.

Publisher-focused

There’s a need for more ePub information targeted at publishing technologists: people who are either actively converting to ePub or are still assessing whether the format is a match for their needs. Bookworm is ideally suited as a platform for publishers to test ePubs or to QA new workflows. Much of the new content is written with this audience in mind.

More advanced developer guidelines

Developers’ needs are still very important to me, especially as ePub evolves. Bookworm provides more visibility into how the site implements the ePub specification, and which features of the specification it does and doesn’t support. I’m hoping this can start a conversation among those organizations which already know that ePub is for them, and are moving to the next level to make full use of it.

User-interface enhancements

It’s now possible to add a book from any page on the site, with just one click: try hovering over the “Add a book” link in the upper right. There are other small details that should make the reading experience smoother, too.

Other code fixes and improvements

This release includes a large number of behind-the-scenes changes to expand the range of ePubs that are accepted. I’m especially grateful for a user’s assistance in fully supporting Chinese language content.

Still coming…

I’ve been promising the ability to search individual books or across one’s library for a long time. Putting that off was tough, but I felt it was more important to make Bookworm easier and friendlier to use. Now I’m going to focus on features that will really take advantage of Bookworm’s online nature in a way that standalone readers and devices just can’t do.

How good are your ePubs?

Wednesday, October 8th, 2008

Most of my work in maintaining the Bookworm ePub reader is keeping up with all of the variations of the format that people try to upload.  There are some consistent problems that I’m seeing “out in the wild,” some serious, some understandable.

Lots of these problems would be caught by epubcheck, which can be used via the threepress.org epubcheck service, but I imagine that many people are testing only by opening the ePub in Adobe Digital Editions. ADE is very forgiving.  In the long run it’s best to validate all ePubs, as that makes it far more likely they’ll work properly in future rendering systems that might not be so generous.

  1. Missing required attributes in the metadata. This is the one that’s most likely to get your ePub rejected by Bookworm, and the most common case is missing playOrder attributes in the NCX table of contents file.  The playOrder attribute specifies the order in which the table of contents should be laid out, and it’s easy to miss because its information is usually redundant — generally the navPoints are laid out in document order anyway.  A recent update of Bookworm will allow books that are missing their playOrder to be loaded (it then relies on document order), but strictly speaking, playOrder is required.

  2. Metadata that hasn’t been proofread. I’m not going to name names, but there’s a major publisher releasing ePubs with their own name misspelled in the dc:publisher field.  That’s not only embarrassing, it prevents web-aware ereaders like Bookworm from doing anything useful with that data, like automatically linking back to the publisher’s web site, or showing other books by that publisher.

  3. Improper nesting of the ePub zip file. The META-INF folder and mimetype file inside an ePub must be at the top level of the archive, not in a sub-folder.  Bookworm won’t accept these documents and epubcheck rejects them.   It’s a requirement I might loosen in the future but doing so is not a high priority for me.

  4. Items declared in the OPF file that are missing from the archive. I could “fix” this in Bookworm by ignoring any missing files, but I’ve been reluctant to do it because it could easily lead to Bookworm appearing to be buggy when it isn’t.  For example, I could remove any missing pages from the TOC, but internal document links will be broken, and if what’s missing is ‘Chapter 7’, I think most readers would want to know this. I feel this is a serious enough problem that I plan to continue to reject books that have this issue.

  5. Invalid XHTML. This is pretty common but not serious in the scope of things. A lot of “XHTML” in ePub is really HTML 4.01 or broken XHTML pretending to be valid. Bookworm does want to parse the content a bit (to do some pre-processing like rendering inline SVG as external links, and to extract just the <body> from the file), but if the content isn’t strictly XHTML it can still cope, just as a web browser does. Nevertheless, if your ePub content isn’t really XML, it limits the number of ways that it could be reused.

    The exceptional cases are ePubs which are themselves generated from web content, such as blogs or fanfic. Cleaning up real-world HTML is an art form and I don’t expect automated tools like BookGlutton’s HTML to ePub converter (which uses tidy) to be able to make it perfect.
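Three of the problems in the list above (items 1, 3 and 4) are mechanical enough to check with a short script. The following is a minimal sketch using only Python's standard library. It is not how Bookworm or epubcheck is implemented, the function name find_epub_problems is hypothetical, and it ignores details such as URL-escaped hrefs.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
}


def find_epub_problems(epub_file):
    """Return a list of human-readable problems found in an ePub archive."""
    problems = []
    with zipfile.ZipFile(epub_file) as archive:
        names = set(archive.namelist())

        # Item 3: mimetype and META-INF must sit at the top level.
        if "mimetype" not in names or "META-INF/container.xml" not in names:
            problems.append("mimetype or META-INF/container.xml not at top level")
            return problems  # nothing else can be located reliably

        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(".//c:rootfile", NS)
        if rootfile is None or rootfile.get("full-path") not in names:
            problems.append("OPF file named in container.xml not found")
            return problems
        opf_path = rootfile.get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        opf = ET.fromstring(archive.read(opf_path))

        # Item 4: every manifest item must exist in the archive.
        for item in opf.findall(".//opf:manifest/opf:item", NS):
            href = posixpath.normpath(posixpath.join(opf_dir, item.get("href", "")))
            if href not in names:
                problems.append(f"manifest item missing from archive: {href}")
                continue
            # Item 1: every navPoint in the NCX needs a playOrder.
            if item.get("media-type") == "application/x-dtbncx+xml":
                ncx = ET.fromstring(archive.read(href))
                for nav_point in ncx.iter("{%s}navPoint" % NS["ncx"]):
                    if nav_point.get("playOrder") is None:
                        problems.append(
                            f"navPoint {nav_point.get('id')} has no playOrder")
    return problems
```

Calling find_epub_problems with a path or an open file object returns an empty list when none of these three problems is present. A check like this complements epubcheck and does not replace it.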

One idiosyncrasy that isn’t technically a problem but has caused me no end of headaches is the issue of internal links within XHTML content.  For example, imagine you have all your content files in a sub-folder, so your ePub looks like:

META-INF/container.xml
mimetype
OEBPS/www/index1.html
OEBPS/www/image1.png
OEBPS/content.opf
OEBPS/toc.ncx

If index1.html uses image1.png as an inline image, what does the value of the src look like? src="www/image1.png" or src="image1.png"?

I see both forms. Bookworm will try to locate the full path first, and then fall back to just looking for the image name anywhere in the archive. This means it could potentially pull the wrong image if you have multiple images with the same name in different sub-folders, but I haven’t seen this happen. (If the src contains an absolute path, it will fail to find the image, a problem that epubcheck would flag.)

To be strictly accurate, any references inside an XHTML file should be relative to that file’s location in the archive. In the above example, the link should be src="image1.png".
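The lookup order described above (relative path first, then a search by bare file name) can be sketched as follows. This is an illustration rather than Bookworm's actual code, and resolve_src is a hypothetical name.

```python
import posixpath


def resolve_src(xhtml_path, src, archive_names):
    """Map a src attribute to an archive member, or None if nothing matches."""
    # Absolute paths have no meaning inside the archive, so give up early.
    if src.startswith("/"):
        return None
    # Preferred: treat src as relative to the XHTML file's own folder.
    base = posixpath.dirname(xhtml_path)
    candidate = posixpath.normpath(posixpath.join(base, src))
    if candidate in archive_names:
        return candidate
    # Fallback: match on the bare file name anywhere in the archive.
    wanted = posixpath.basename(src)
    for name in sorted(archive_names):
        if posixpath.basename(name) == wanted:
            return name
    return None


names = {
    "META-INF/container.xml", "mimetype",
    "OEBPS/www/index1.html", "OEBPS/www/image1.png",
    "OEBPS/content.opf", "OEBPS/toc.ncx",
}
# Both forms seen in the wild end up at the same archive member.
print(resolve_src("OEBPS/www/index1.html", "image1.png", names))
print(resolve_src("OEBPS/www/index1.html", "www/image1.png", names))
```

Only the first call relies on the strictly correct relative form; the second succeeds through the fallback, which is also where a same-named image in another sub-folder could be picked up by mistake.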

Free, public domain ePub logos available for use

Tuesday, October 7th, 2008

Six styles of unofficial, public domain ePub logos are now offered by threepress.org for use: ePub logos.

These logos, created by illustrator John McCoy, are being made available to help spread awareness and adoption of the ePub standard. Publishers and booksellers may use them to indicate that they offer ebooks in ePub format; others may use them to illustrate articles or blog posts on ePub. As the logos are public domain there are no restrictions on how they may be used or modified.

David Rothman of TeleRead has previously called for and distributed one unofficial ePub logo.  Please use whichever suits your application or taste.

View ePub logos.

Call me “ePub”

Friday, October 3rd, 2008

It’s fantastic to see more and more publishers beginning to distribute books in ePub format, but call the format by its real name!

I’ve seen “ePub in disguise” in a few places, most recently this release from Pan Macmillan:

If you click on the arrow, the site brings up a very long page explaining what all the various file formats are.  Discussing “Adobe Digital Edition” format:

ADE uses a format based on the Open Publishing Standard with the extension .epub, and so ADE files are also known as .epub or ePub. ADE will also display your PDF files in a double-page, single page, or fit-to-width view — or you can specify your own custom fit.

(Although it’s not mentioned on the book page, on the digitalist blog it was stated that this book is DRM-free.  I’m assuming, therefore, that this is truly just plain ePub, although there’s no way for me to be sure other than buying a £9.99 ebook.)

Now I worry a lot about making ebook technology comprehensible to the average person, so I sympathize with the urge to simplify. But:

  1. “ePub” is a pretty good label (other than that no one agrees on how to capitalize it). It’s short and evocative.
  2. Nowhere on the ebook help page does it actually say which format you need for what device.  If I bought a shiny new Sony PRS-505 in the UK, which format do I want?  What about on my iPhone? My Kindle? (It’s a UK site, but it’s also an ebook. There’s no reason why an American couldn’t buy it.)

The whole value of the ePub format is that it isn’t vendor-specific.  Disguising it under Adobe’s name just makes it harder for buyers to know they can read it on their Sony Reader or Stanza/iPhone, and that the book isn’t suddenly going to be useless when some proprietary device finally gives out.