Seven new books added

Monday, May 12th, 2008

The last set of Gutenberg HTML books planned for demonstration on threepress has been added. As usual, data loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible. This set includes one non-fiction book (On the Origin of Species) and one with verse components (The Jungle Book); both required significant updates to the XSLT that converts the Gutenberg DTD to TEI.

To expand the project in useful ways I’d like to be able to add:

  1. Other content types besides novels, especially reference
  2. Content from other document formats, such as DocBook
  3. Native, highly-tagged TEI documents

Wikipedia and its cohorts are by far the largest source of freely licensed content on the web now, but they aren’t encoded in XML. Publishers are unlikely to use wiki formatting to mark up their content, so developing a workflow to convert from wiki markup to TEI doesn’t seem productive.

XML data welcome!

New books added: A Tale of Two Cities and The Cask of Amontillado

Monday, May 5th, 2008

Two books that should’ve been in the initial release were added today: A Tale of Two Cities by Charles Dickens and The Cask of Amontillado by Edgar Allan Poe.

Tale was challenging because of the way the “books” were organized (they’re called parts in threepress).  This book exposed a bug in the way I was handling chapter ordering, which I’ve fixed.
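
The details of the bug aren't worth recording here, but the general shape of the problem is easy to illustrate. In Tale, chapter numbers restart in each book, so sorting on the chapter number alone interleaves the parts. A minimal sketch of part-aware ordering follows; the ChapterRef type and reading_order function are hypothetical names for illustration, not the actual threepress code.

```python
from typing import List, NamedTuple


class ChapterRef(NamedTuple):
    """Hypothetical record: a chapter's position within a part."""
    part: int      # "Book the First" -> 1, "Book the Second" -> 2, ...
    chapter: int   # chapter number, which restarts at 1 in each part
    title: str


def reading_order(chapters: List[ChapterRef]) -> List[ChapterRef]:
    """Sort by part first, then by chapter number within the part."""
    return sorted(chapters, key=lambda c: (c.part, c.chapter))


# Chapters as they might arrive from a loader, in no particular order.
loaded = [
    ChapterRef(2, 1, "Five Years Later"),
    ChapterRef(1, 2, "The Mail"),
    ChapterRef(1, 1, "The Period"),
]
ordered_titles = [c.title for c in reading_order(loaded)]
```

Sorting on a (part, chapter) pair keeps every chapter of the first part ahead of the second part, regardless of how the chapter numbers repeat.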

Cask is my only content with no chapters, as it’s a short story. I could make that more transparent to the user than the current implementation (right now the content is assigned to a pseudo-chapter called “Complete story”), but whether I do that will depend on which turns out to be the outlying case: books or single-chapter works. Right now it’s mostly books, so that feels like the natural way to organize the site.
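
For the curious, the pseudo-chapter approach can be sketched roughly as follows. This is an illustration under assumptions rather than the threepress implementation: the Chapter and Work classes and the load_work and is_single_chapter helpers are hypothetical names, and only the “Complete story” title comes from the site itself.

```python
from dataclasses import dataclass, field
from typing import List

# The title threepress currently gives the pseudo-chapter.
PSEUDO_CHAPTER_TITLE = "Complete story"


@dataclass
class Chapter:
    title: str
    text: str


@dataclass
class Work:
    title: str
    chapters: List[Chapter] = field(default_factory=list)


def load_work(title: str, chapters: List[Chapter], body: str = "") -> Work:
    """Build a Work; a chapterless text is wrapped in one pseudo-chapter."""
    if not chapters:
        chapters = [Chapter(PSEUDO_CHAPTER_TITLE, body)]
    return Work(title, list(chapters))


def is_single_chapter(work: Work) -> bool:
    """True when the display layer could skip chapter navigation entirely."""
    return len(work.chapters) == 1


cask = load_work("The Cask of Amontillado", [], body="(full text of the story)")
```

The advantage of this shape is that the rest of the site only ever deals with works that have at least one chapter; making single-chapter works more transparent to the user would then be a display decision keyed on something like is_single_chapter.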