Threepress Consulting blog

Threepress creates software for publishers, educators and authors.

Tag: digitization

A case study in converting image-based ebooks into XML

There’s a great deal of valuable information in this recently released white paper by the American Council of Learned Societies: ACLS Humanities E-Book XML Conversion Experiment: Report on Workflow, Costs, and User Preferences. Although the study was based on scholarly books, its findings would apply to many other digitization projects.
The Humanities E-Book (HEB) project took [...]

Slides from “What publishers need to know about digitization”

O’Reilly Media will be posting a complete recording of the presentation, but in the meantime I’ve posted the slides from the webcast, “What publishers need to know about digitization,” on SlideShare.
Thanks to everyone who attended and especially to those who asked so many excellent questions.

Python and XML (and Google!) in publishing applications

IBM DeveloperWorks has just released an article of mine on High-Performance XML Parsing in Python. Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.

Massive XML files are uncommon in the general web development world, where the primary roles of [...]

The analog hole, and a seminar on digitization

Over on Tools of Change there’s a post of mine discussing the so-called “analog hole” as it applies to digital books. It was a fun article to write, especially the hands-on part. I used Google’s OCRopus open-source OCR software, which was a little impenetrable to someone outside of the machine-learning community but did a good [...]

TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum

I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews. I can’t read Dutch, but Google’s automated translation tells me that the review of Hamlet mentions a “long modern clown,” which sounds disturbing enough that I’ll leave the [...]

Where in India are the digitization vendors?

Here’s a good guess:

This is the output from my Google Analytics web traffic report showing the country that sends the most visits to threepress.org. 44% of the traffic to the entire site, which includes this blog, some public domain ebooks and my contact information, is to the ePub validation service, a wrapper around Adobe’s [...]