<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Threepress Consulting blog &#187; digitization</title>
	<atom:link href="http://blog.threepress.org/tag/digitization/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.threepress.org</link>
	<description>Threepress creates software for publishers, educators and authors.</description>
	<lastBuildDate>Mon, 09 Jan 2012 13:02:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>A case study in converting image-based ebooks into XML</title>
		<link>http://blog.threepress.org/2009/02/21/a-case-study-in-converting-image-based-ebooks-into-xml/</link>
		<comments>http://blog.threepress.org/2009/02/21/a-case-study-in-converting-image-based-ebooks-into-xml/#comments</comments>
		<pubDate>Sat, 21 Feb 2009 22:01:18 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[book design]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[ebooks]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[images]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=369</guid>
		<description><![CDATA[There&#8217;s a great deal of valuable information in this recently-released white paper by The American Council of Learned Societies: ACLS Humanities E-Book XML Conversion Experiment: Report on Workflow, Costs, and User Preferences.  Although the study was based on scholarly books, their findings would apply to many other digitization projects.
The Humanities E-Book (HEB) project took [...]]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s a great deal of valuable information in this recently-released white paper by The American Council of Learned Societies: <a href="http://www.humanitiesebook.org/HEBWhitePaper2.pdf">ACLS Humanities E-Book XML Conversion Experiment: Report on Workflow, Costs, and User Preferences</a>.  Although the study was based on scholarly books, their findings would apply to many other digitization projects.</p>
<p><a href="http://www.humanitiesebook.org/">The Humanities E-Book</a> (HEB) project took 20 books (as scanned page images + uncorrected OCR) and converted them to an in-house XML format.  They compared the workflow impact, costs and user experience of the final XML product versus that of the page-image ebooks.</p>
<p>Many of their findings were consistent with my own experience in this area:</p>
<ol>
<li>The quality of the OCRed text was worse than expected: good enough for search, but not always suitable for reading.  However, the cost of double-keying the text from scratch was prohibitive.</li>
<li>The encoding vendor, while skilled and diligent, nevertheless produced output that would require a trained editor to correct properly.  HEB spent 4-8 hours hand-correcting each book in the sample set.</li>
<li>The average cost for conversion to XML was approximately 3X greater than for scans + OCR only.  This did not include in-house correction and review.</li>
</ol>
<h3>User survey results</h3>
<p>After the 20 sample books were made available to their community, users were polled for their reactions.  I feel these are worth mentioning at length.</p>
<p>69% of readers preferred the XML-encoded books (presented as HTML in a browser).</p>
<p>Reasons for preferring the XML versions included:</p>
<ol>
<li>Readability (despite the fact that not all books were completely proofed)</li>
<li>Usability (e.g. cut and paste, ability to use screen readers)</li>
<li>Layout (the HTML presentation had few distracting elements on the pages, and more content was available per web page than in the page-based scans)</li>
</ol>
<p>Interestingly, of those readers who preferred the image scans, one of the primary reasons cited was the more book-like paginated layout. I&#8217;m very conscious of this tension: many Bookworm users complain about the chapter-at-a-time, scrolling layout of the pages, while others absolutely hate arbitrary emulation of the printed work. It seems to be a strong personal preference that runs in one direction or the other.</p>
<h3>Ebook interface considerations</h3>
<p>Although not directly related to the study at hand, I found some of the publisher-imposed constraints on their user interface illustrative.  I feel these would best be avoided when designing an ebook reading site:</p>
<blockquote><p>Foremost among user requests was a desire for better printing options. Printing of HEB titles has always been restricted to fair-use provisions, and for this reason there had never been any immediate way of printing out pages without prior browser adjustment to
accommodate frames—the intention being to discourage printing out long sections of copyrighted text at once.</p></blockquote>
<p>The ability to print text at length is critical for any serious work.  I&#8217;m always unhappy when a site prevents me from doing an ordinary task like printing or downloading.  I hope that publishers reconsider these types of restrictions.</p>
<p>Similarly, revenue models should not constrain the ways in which licensed users can access content:</p>
<blockquote><p>XML titles normally suppress all higher-level “container” sections, so that users always access only the smallest available text chunk in each overarching section. [...]</p>
<p>&#8230;for this set of titles, we would simply make all section levels accessible.  This would affect the process of tallying hits for these titles—something needed in order to calculate royalties  for publishers and usage statistics for libraries—as users could now potentially read an entire book by accessing only a small number of chapter-level sections (which in turn would generate fewer hits than reading the page-image version).</p></blockquote>
<p>As a reader, I should absolutely be able to read content &#8212; especially XML-based content &#8212; in as fluid a manner as possible.  Generating accurate accounting is a programming problem, and not one that should drive decisions about the reading interface.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2009/02/21/a-case-study-in-converting-image-based-ebooks-into-xml/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Slides from &#8220;What publishers need to know about digitization&#8221;</title>
		<link>http://blog.threepress.org/2008/11/13/slides-from-what-publishers-need-to-know-about-digitization/</link>
		<comments>http://blog.threepress.org/2008/11/13/slides-from-what-publishers-need-to-know-about-digitization/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 17:17:32 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[toc]]></category>
		<category><![CDATA[ebooks]]></category>
		<category><![CDATA[epub]]></category>
		<category><![CDATA[publishing]]></category>
		<category><![CDATA[schemas]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=193</guid>
		<description><![CDATA[O&#8217;Reilly Media will be posting a complete recording of the presentation, but in the meantime I&#8217;ve posted the slides from the webcast, &#8220;What publishers need to know about digitization&#8221; on Slideshare.
Thanks to everyone who attended and especially to those who asked so many excellent questions.
What publishers need to know about digitization
View SlideShare presentation or Upload [...]]]></description>
			<content:encoded><![CDATA[<p>O&#8217;Reilly Media will be posting a complete recording of the presentation, but in the meantime I&#8217;ve posted the slides from the webcast, &#8220;<a href="http://toc.oreilly.com/2008/11/toc-webcast-tomorrow-what-publ.html">What publishers need to know about digitization</a>&#8221; on Slideshare.</p>
<p>Thanks to everyone who attended and especially to those who asked so many excellent questions.</p>
<div style="width:425px;text-align:left" id="__ss_749916"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/lizadaly/what-publishers-need-to-know-about-digitization-presentation?type=powerpoint" title="What publishers need to know about digitization">What publishers need to know about digitization</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=digitizationwebinar-1226595850439471-9&#038;stripped_title=what-publishers-need-to-know-about-digitization-presentation" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=digitizationwebinar-1226595850439471-9&#038;stripped_title=what-publishers-need-to-know-about-digitization-presentation" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View SlideShare <a style="text-decoration:underline;" href="http://www.slideshare.net/lizadaly/what-publishers-need-to-know-about-digitization-presentation?type=powerpoint" title="View What publishers need to know about digitization on SlideShare">presentation</a> or <a style="text-decoration:underline;" href="http://www.slideshare.net/upload?type=powerpoint">Upload</a> your own. (tags: <a style="text-decoration:underline;" href="http://slideshare.net/tag/schema">schema</a> <a style="text-decoration:underline;" href="http://slideshare.net/tag/epub">epub</a>)</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/11/13/slides-from-what-publishers-need-to-know-about-digitization/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Python and XML (and Google!) in publishing applications</title>
		<link>http://blog.threepress.org/2008/10/28/python-and-xml-and-google-in-publishing-applications/</link>
		<comments>http://blog.threepress.org/2008/10/28/python-and-xml-and-google-in-publishing-applications/#comments</comments>
		<pubDate>Tue, 28 Oct 2008 21:38:24 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[tools]]></category>
		<category><![CDATA[article]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=156</guid>
		<description><![CDATA[IBM DeveloperWorks has just released an article of mine on High-Performance XML Parsing in Python.  Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.



Massive XML files are uncommon in the general web development world, where the primary roles of [...]]]></description>
			<content:encoded><![CDATA[<p>IBM DeveloperWorks has just released an article of mine on <a href="http://www.ibm.com/developerworks/library/x-hiperfparse/">High-Performance XML Parsing in Python</a>.  Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.</p>
<div align="center">
<a href="http://www.ibm.com/developerworks/library/x-hiperfparse/"><img class="alignnone size-medium wp-image-157" title="lxml article screenshot" src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-12-300x226.png" alt="" width="300" height="226" style="float:none"/></a>
</div>
<p>Massive XML files are uncommon in the general web development world, where the primary roles of XML are either as configuration files, read only infrequently, or for interchange across the web, in which case the files are necessarily small.  It&#8217;s rare to encounter XML measured in gigabytes or more; data at that level is usually stored in a relational database.</p>
<p>For that reason I find myself frustrated with many XML tools, even those ostensibly designed to handle large amounts of data.  Too often they don&#8217;t scale well, or at least not easily.  I don&#8217;t believe that scaling should be a black art that each individual developer needs to solve independently.  Unfortunately, in commercial products ease-of-use is a key bullet point and computationally-difficult problems are hard to summarize in a user&#8217;s guide.</p>
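<p>To make the scaling point concrete, here is a minimal sketch of streaming a large file one record at a time rather than loading the whole tree. It uses the standard library's ElementTree instead of lxml so that it runs anywhere. The record and title element names and the build_sample helper are hypothetical stand-ins, not taken from the article or from any real dataset.</p>

```python
import io
import xml.etree.ElementTree as ET


def build_sample(n_records):
    """Build a small in-memory XML file standing in for a very large one.

    Hypothetical helper: real data would be read from disk.
    """
    root = ET.Element("records")
    for i in range(n_records):
        rec = ET.SubElement(root, "record", id=str(i))
        ET.SubElement(rec, "title").text = "Title %d" % i
    return io.BytesIO(ET.tostring(root))


def iter_titles(source):
    """Yield (id, title) pairs, handling one record at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            yield elem.get("id"), elem.findtext("title")
            # Drop the record's children, text and attributes once it has
            # been handled, so finished records do not pile up in memory.
            elem.clear()


titles = list(iter_titles(build_sample(3)))
print(titles)
```

<p>The design choice is that memory use depends on the size of a single record rather than the size of the file. The emptied record elements do remain attached to the root, so a truly enormous file would need more aggressive cleanup than this sketch shows.</p>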
<p>I tend to recommend open-source software most strongly in two scenarios: for <a href="http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/">small projects with limited budgets</a> and for large projects with unique challenges.  There simply isn&#8217;t going to be a one-size-fits-all application for most interesting publishing work.</p>
<p>This is one of many reasons I&#8217;m excited by Google&#8217;s willingness to <a href="http://tinyurl.com/6kc6hx">open its Google Books archive to researchers</a>:  Python is a first-class programming language in the Google ecosystem, and Google has a good track record of open-sourcing those internal tools with limited commercial value.  I expect a lot of interesting work to come out of that archive once it&#8217;s available.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/28/python-and-xml-and-google-in-publishing-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The analog hole, and a seminar on digitization</title>
		<link>http://blog.threepress.org/2008/10/23/the-analog-hole-and-a-seminar-on-digitization/</link>
		<comments>http://blog.threepress.org/2008/10/23/the-analog-hole-and-a-seminar-on-digitization/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 14:01:01 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[toc]]></category>
		<category><![CDATA[drm]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[webinar]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=154</guid>
		<description><![CDATA[Over on Tools of Change there&#8217;s a post of mine discussing the so-called &#8220;analog hole&#8221; as it applies to digital  books.  It was a fun article to write, especially the hands-on part.  I used Google&#8217;s OCRopus open-source OCR software, which was a little impenetrable to someone outside of the machine-learning community but did a good [...]]]></description>
			<content:encoded><![CDATA[<p>Over on Tools of Change there&#8217;s a post of mine discussing the so-called <a href="http://toc.oreilly.com/2008/10/the-analog-hole-in-digital-boo.html">&#8220;analog hole&#8221; as it applies to digital books</a>.  It was a fun article to write, especially the hands-on part.  I used Google&#8217;s <a href="http://google-code-updates.blogspot.com/2007/04/announcing-ocropus-open-source-ocr.html">OCRopus open-source OCR</a> software, which was a little impenetrable to someone outside of the machine-learning community but did a good job once I fumbled around with it for a while.</p>
<p>Also on that page at the moment is a giant photo of my head advertising <a href="https://oreilly.webex.com/mw0305l/mywebex/default.do?nomenu=true&amp;siteurl=oreilly&amp;service=6&amp;main_url=https%3A%2F%2Foreilly.webex.com%2Fec0600l%2Feventcenter%2Fevent%2FeventAction.do%3FtheAction%3Ddetail%26confViewID%3D278119650%26siteurl%3Doreilly%26%26%26">What Publishers Need to Know About Digitization</a>, a web seminar I&#8217;ll be hosting with O&#8217;Reilly Media on November 12. It will be a very high-level, introductory overview aimed at non-technical staff in publishing who are considering a digitization project.</p>
<p>Coming full circle, I wonder if there would be interest in a simple web-based OCR service where publishers could upload a scanned document to see how well bare-bones OCR performs on an image-only PDF or JPEG scan. I imagine it might help them predict the complexity of a digitization project and understand some of the challenges inherent in the process.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/23/the-analog-hole-and-a-seminar-on-digitization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum</title>
		<link>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/</link>
		<comments>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 02:42:29 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[clowns]]></category>
		<category><![CDATA[dutch]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xpath]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=103</guid>
		<description><![CDATA[I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of Hamlet mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the [...]]]></description>
			<content:encoded><![CDATA[<p>I was pleased to be able to assist with the <a href="http://webh01.ua.ac.be/theso/cti/index.html">Corpus Toneelkritiek Interbellum</a> project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of <a href="http://webh01.ua.ac.be/theso/cti/1926-05-30_putman70.html">Hamlet</a> mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the actual reading to someone else.
</p>
<div style="text-align:center;margin:auto;float:none">
<a href="http://webh01.ua.ac.be/theso/cti/index.html"><img style="float:none" src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-6-300x253.png" alt="" title="picture-6" width="300" height="253"  align="right" /></a>
</div>
<p style="clear:both">
The source documents are encoded in <a href="http://www.tei-c.org/index.xml">TEI</a> XML and rendered to the browser using Python and <a href="http://codespeak.net/lxml/">lxml</a>, three of my favorite technologies.</p>
<p>
There are a few take-aways from this project that might benefit anyone working in a similar area and scale: </p>
<ul>
<li> Use a standard encoding format (in this case TEI, but choose an appropriate one based on the source content)</li>
<li> Use a modern programming language, even in a humanities context (e.g. Python)</li>
<li> Use modern XML parsing tools (e.g. lxml + XPath + XSLT)</li>
</ul>
<p>
The key advantage of libraries such as lxml in publishing and digitization projects is that they allow the developer to freely mix XML-native languages like XPath and XSLT with the expressive, procedural programming style of Python.  I&#8217;m still amazed by how many people are &#8220;parsing&#8221; XML using regular expressions (or worse), or using plain CGI/Perl scripts to serve up content. There are easier ways!</p>
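<p>As a rough illustration of mixing an XPath-style query with ordinary Python, here is a small sketch. It is not the project's actual code. It uses the standard library's ElementTree, which supports only a subset of XPath (lxml offers full XPath and XSLT), and the sample document, its type values and the review_titles function are invented for the example. Only the TEI namespace URI is real.</p>

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI-like document.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="review"><head>Hamlet</head><p>First review.</p></div>
    <div type="review"><head>Faust</head><p>Second review.</p></div>
    <div type="notes"><head>Editorial notes</head></div>
  </body></text>
</TEI>
"""


def review_titles(xml_text):
    """Return the upper-cased headings of all review divisions, sorted."""
    root = ET.fromstring(xml_text)
    # The path expression selects the nodes declaratively...
    heads = root.findall(".//tei:div[@type='review']/tei:head", TEI_NS)
    # ...and plain Python does the post-processing.
    return sorted(h.text.upper() for h in heads)


result = review_titles(SAMPLE)
print(result)
```

<p>The selection logic lives in one declarative path expression, and everything after that is regular Python, which is the division of labor I find so much easier than hand-rolled string matching.</p>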
<p> &#8220;Free&#8221; doesn&#8217;t have to mean primitive. In fact I would argue that projects like <a href="http://pinaxproject.com/">Pinax</a> can jump-start library or digital archive sites into the 21st century with less work than a grad student will spend crafting a bespoke Perl script.
</p>
<p> Congratulations to Thomas Crombez and his team!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Where in India are the digitization vendors?</title>
		<link>http://blog.threepress.org/2008/10/10/where-in-india-are-the-digitization-vendors/</link>
		<comments>http://blog.threepress.org/2008/10/10/where-in-india-are-the-digitization-vendors/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 18:21:33 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[epub]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[ebook]]></category>
		<category><![CDATA[india]]></category>
		<category><![CDATA[threepress]]></category>
		<category><![CDATA[vendor]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=95</guid>
		<description><![CDATA[Here&#8217;s a good guess:


This is the output from my Google Analytics web traffic report on the country which sends the most visits to threepress.org.  44% of the traffic to the entire site, which includes this blog, some public domain ebooks and my contact information, is to the ePub validation service, a wrapper around Adobe&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a good guess:</p>
<p><a href="http://blog.threepress.org/wp-content/uploads/2008/10/picture-2.png"><img src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-2-300x288.png" alt="" title="picture-2" width="200" class="aligncenter size-medium wp-image-96" /></a></p>
<p>
This is the output from my <a href="http://analytics.google.com/">Google Analytics</a> web traffic report on the country which sends the most visits to <a href="http://www.threepress.org/">threepress.org</a>.  44% of the traffic to the entire site, which includes this blog, some public domain ebooks and my contact information, is to the <a href="http://www.threepress.org/document/epub-validate/">ePub validation service</a>, a wrapper around Adobe&#8217;s <a href="http://code.google.com/p/epubcheck/">epubcheck</a>.  (<a href="http://bookworm.threepress.org">Bookworm</a> statistics are not included in this report.)
</p>
<p>India sends three times as much traffic to the validation page as the second-place United States, but only <em>one-third</em> as much traffic to the home page.
</p>
<p>  It&#8217;s even more interesting to look at the &#8220;bounce rate&#8221; for the home page by country. The &#8220;bounce rate&#8221; is the percentage of visits in which a given page is the only one a user views before leaving the site, and it&#8217;s one of the most useful metrics in web analysis. The overall bounce rate for the threepress.org home page is 37%, meaning 37% of the people who landed on that page didn&#8217;t have a reason to click on another link.  For India, that figure is 5%, presumably because they are all clicking through to the validation service. (By contrast, 80% of South African visitors leave immediately, suggesting that some unrelated keyword searches or links are driving them there.)
</p>
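<p>For anyone who wants to see the arithmetic, here is a small sketch of how a bounce rate could be computed from raw session data. It assumes the common definition (the share of visits that start on a page and view nothing else). The session lists and the bounce_rate function are made up for illustration and are not Google Analytics data or its API.</p>

```python
def bounce_rate(sessions, page):
    """Percentage of sessions that start on `page` and view nothing else.

    `sessions` is a list of sessions, each a list of page paths in the
    order they were viewed (a hypothetical format for this sketch).
    """
    entries = [s for s in sessions if s and s[0] == page]
    if not entries:
        return 0.0
    bounces = sum(1 for s in entries if len(s) == 1)
    return 100.0 * bounces / len(entries)


# Invented sample: four visits land on the home page, one of them bounces.
sessions = [
    ["/"],
    ["/", "/document/epub-validate/"],
    ["/", "/blog/"],
    ["/", "/document/epub-validate/"],
    ["/blog/"],
]

rate = bounce_rate(sessions, "/")
print(rate)
```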
<p>
So if you&#8217;re looking for vendors who can provide high-quality, valid ePubs, I&#8217;d suggest, in descending order of frequency, suppliers in these cities:
</p>
<ol>
<li>Pune</li>
<li>Delhi</li>
<li>New Delhi</li>
<li>Chennai</li>
<li>Mahape </li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/10/where-in-india-are-the-digitization-vendors/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

