Threepress Consulting blog

Threepress creates software for publishers, educators and authors.

Month: October, 2008

Python and XML (and Google!) in publishing applications

IBM DeveloperWorks has just released an article of mine on High-Performance XML Parsing in Python.  Although there is nothing publishing-centric about the article itself, it is based on my own experience dealing with large XML datasets in academic publishing.

Massive XML files are uncommon in the general web development world, where the primary roles of [...]

The analog hole, and a seminar on digitization

Over on Tools of Change there’s a post of mine discussing the so-called “analog hole” as it applies to digital books.  It was a fun article to write, especially the hands-on part.  I used Google’s OCRopus open-source OCR software, which was a little impenetrable to someone outside of the machine-learning community but did a good [...]

The real Internet Archive

My attention was caught by this quote from Clay Shirky on the excellent ReadWriteWeb blog:
“Back in 1974, when the Internet was a fraction of what it is now, the acorn to an oak, there were really only two applications,” said Shirky, “Telnet, and FTP.”
Surely he’s wrong, I thought.  Those protocols aren’t that old.
But I was [...]

Some ebooks are buggy — report them

Many ebooks don’t go through the same kind of quality control that regular books do.  That’s been my experience and that of other ebook consumers. I’m not talking about technical problems here so much as basic editorial ones.
Sometimes the issues are minor: occasional spacing errors, missing or overzealous capitalization.  Other times they can be more [...]

TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum

I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews. I can’t read Dutch, but Google’s automated translation tells me that the review of Hamlet mentions a “long modern clown,” which sounds disturbing enough that I’ll leave the [...]

Where in India are the digitization vendors?

Here’s a good guess:

This is the output from my Google Analytics web traffic report on the country which sends the most visits to threepress.org. 44% of the traffic to the entire site, which includes this blog, some public domain ebooks and my contact information, is to the ePub validation service, a wrapper around Adobe’s [...]

New release of Bookworm: improved user experience and public content

Bookworm’s public home page (the one you see if you’re not logged in) has a new look. This is just one of many changes in the largest update since the site launched in July 2008.

Much more public content and help

When I conceived of Bookworm it was largely a way for me and other developers to [...]

How good are your ePubs?

Most of my work in maintaining the Bookworm ePub reader is keeping up with all of the variations of the format that people try to upload.  There are some consistent problems that I’m seeing “out in the wild,” some serious, some understandable.

Lots of these problems would be caught by epubcheck, which can be used via [...]

Free, public domain ePub logos available for use

Six styles of unofficial, public domain ePub logos are now offered by threepress.org for use: ePub logos.
These logos, created by illustrator John McCoy, are being made available to help spread awareness and adoption of the ePub standard. Publishers and booksellers may use them to indicate that they offer ebooks in ePub format; others may [...]

Call me “ePub”

It’s fantastic to see more and more publishers beginning to distribute books in ePub format, but call the format by its real name!
I’ve seen “ePub in disguise” in a few places, most recently this release from Pan Macmillan:

If you click on the arrow, the site brings up a very long page explaining what all the [...]