<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Threepress Consulting blog &#187; xml</title>
	<atom:link href="http://blog.threepress.org/tag/xml/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.threepress.org</link>
	<description>Threepress creates software for publishers, educators and authors.</description>
	<lastBuildDate>Tue, 27 Jul 2010 16:34:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>A case study in converting image-based ebooks into XML</title>
		<link>http://blog.threepress.org/2009/02/21/a-case-study-in-converting-image-based-ebooks-into-xml/</link>
		<comments>http://blog.threepress.org/2009/02/21/a-case-study-in-converting-image-based-ebooks-into-xml/#comments</comments>
		<pubDate>Sat, 21 Feb 2009 22:01:18 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[book design]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[ebooks]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[images]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=369</guid>
		<description><![CDATA[There&#8217;s a great deal of valuable information in this recently released white paper by The American Council of Learned Societies: ACLS Humanities E-Book XML Conversion Experiment: Report on Workflow, Costs, and User Preferences.  Although the study was based on scholarly books, their findings would apply to many other digitization projects.
The Humanities E-Book (HEB) project took [...]]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s a great deal of valuable information in this recently released white paper by The American Council of Learned Societies: <a href="http://www.humanitiesebook.org/HEBWhitePaper2.pdf">ACLS Humanities E-Book XML Conversion Experiment: Report on Workflow, Costs, and User Preferences</a>.  Although the study was based on scholarly books, their findings would apply to many other digitization projects.</p>
<p><a href="http://www.humanitiesebook.org/">The Humanities E-Book</a> (HEB) project took 20 books (as scanned page images + uncorrected OCR) and converted them to an in-house XML format.  They compared the workflow impact, costs and user experience of the final XML product versus that of the page-image ebooks.</p>
<p>Many of their findings matched my own experience in this area:</p>
<ol>
<li>The quality of the OCRed text was worse than expected: good enough for search, but not always suitable for reading.  However, the cost of double-keying the text from scratch was prohibitive.</li>
<li>The encoding vendor, while skilled and diligent, nevertheless produced output that would require a trained editor to correct properly.  HEB spent 4-8 hours hand-correcting each book in the sample set.</li>
<li>The average cost for conversion to XML was approximately 3X greater than for scans + OCR only.  This did not include in-house correction and review.</li>
</ol>
<h3>User survey results</h3>
<p>After the 20 sample books were made available to their community, users were polled for their reactions.  I feel these are worth mentioning at length.</p>
<p>69% of readers preferred the XML-encoded books (presented as HTML in a browser).</p>
<p>Reasons for preferring the XML-encoded books included:</p>
<ol>
<li>Readability (despite the fact that not all books were completely proofed)</li>
<li>Usability (e.g. cut and paste, ability to use screen readers)</li>
<li>Layout (the HTML presentation had few distracting elements on the pages, and more content was available per web page than in the page-based scans)</li>
</ol>
<p>Interestingly, of those readers who preferred the image scans, one of the primary reasons cited was the more book-like paginated layout. I&#8217;m very conscious of this tension: many Bookworm users complain about the chapter-at-a-time, scrolling layout of the pages, while others absolutely hate arbitrary emulation of the printed work. It seems to be a strong personal preference that runs in one direction or the other.</p>
<h3>Ebook interface considerations</h3>
<p>Although not directly related to the study at hand, I found some of the publisher-imposed constraints on their user interface illustrative.  I feel these would best be avoided when designing an ebook reading site:</p>
<blockquote><p>Foremost among user requests was a desire for better printing options. Printing of HEB titles has always been restricted to fair-use provisions, and for this reason there had never been any immediate way of printing out pages without prior browser adjustment to accommodate frames—the intention being to discourage printing out long sections of copyrighted text at once.</p></blockquote>
<p>The ability to print text at length is critical for any serious work.  I&#8217;m always unhappy when a site prevents me from doing an ordinary task like printing or downloading.  I hope that publishers reconsider these types of restrictions.</p>
<p>Similarly, revenue models should not constrain the ways in which licensed users can access content:</p>
<blockquote><p>XML titles normally suppress all higher-level “container” sections, so that users always access only the smallest available text chunk in each overarching section. [...]</p>
<p>&#8230;for this set of titles, we would simply make all section levels accessible.  This would affect the process of tallying hits for these titles—something needed in order to calculate royalties  for publishers and usage statistics for libraries—as users could now potentially read an entire book by accessing only a small number of chapter-level sections (which in turn would generate fewer hits than reading the page-image version).</p></blockquote>
<p>As a reader, I should absolutely be able to read content &#8212; especially XML-based content &#8212; in as fluid a manner as possible.  Generating accurate accounting is a programming problem, and not one that should drive decisions about the reading interface.</p>
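<p>As a rough illustration of why this is a programming problem, here is a minimal Python sketch in which a request for a container section is credited to every leaf section it contains, so the tally comes out the same whether a reader opens a whole chapter or the smallest chunks one at a time. The section ids, the <code>BOOK_SECTIONS</code> structure and both function names are hypothetical, and this is not a description of how HEB actually counts usage.</p>

```python
from collections import Counter

# Hypothetical book structure: each container section maps to the
# sections it contains. All section ids are made up for illustration.
BOOK_SECTIONS = {
    "chapter-1": ["ch1-sec1", "ch1-sec2", "ch1-sec3"],
    "chapter-2": ["ch2-sec1", "ch2-sec2"],
}


def leaf_sections(section_id, structure):
    """Return the leaf sections covered by a single request."""
    children = structure.get(section_id)
    if children is None:
        # Not a container, so the request covers exactly one leaf section.
        return [section_id]
    leaves = []
    for child in children:
        leaves.extend(leaf_sections(child, structure))
    return leaves


def tally_hits(requests, structure):
    """Count one hit per leaf section covered by each request."""
    hits = Counter()
    for section_id in requests:
        hits.update(leaf_sections(section_id, structure))
    return hits


# One reader opens a whole chapter; another reads it section by section.
chapter_reader = tally_hits(["chapter-1"], BOOK_SECTIONS)
section_reader = tally_hits(["ch1-sec1", "ch1-sec2", "ch1-sec3"], BOOK_SECTIONS)
print(chapter_reader == section_reader)
```

<p>Under these assumptions the royalty and usage numbers no longer depend on which section levels the interface exposes, which leaves the reading experience free to be designed for readers.</p>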
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2009/02/21/a-case-study-in-converting-image-based-ebooks-into-xml/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Slides from &#8220;What publishers need to know about digitization&#8221;</title>
		<link>http://blog.threepress.org/2008/11/13/slides-from-what-publishers-need-to-know-about-digitization/</link>
		<comments>http://blog.threepress.org/2008/11/13/slides-from-what-publishers-need-to-know-about-digitization/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 17:17:32 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[toc]]></category>
		<category><![CDATA[ebooks]]></category>
		<category><![CDATA[epub]]></category>
		<category><![CDATA[publishing]]></category>
		<category><![CDATA[schemas]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=193</guid>
		<description><![CDATA[O&#8217;Reilly Media will be posting a complete recording of the presentation, but in the meantime I&#8217;ve posted the slides from the webcast, &#8220;What publishers need to know about digitization&#8221; on Slideshare.
Thanks to everyone who attended and especially to those who asked so many excellent questions.
]]></description>
			<content:encoded><![CDATA[<p>O&#8217;Reilly Media will be posting a complete recording of the presentation, but in the meantime I&#8217;ve posted the slides from the webcast, &#8220;<a href="http://toc.oreilly.com/2008/11/toc-webcast-tomorrow-what-publ.html">What publishers need to know about digitization</a>&#8221; on Slideshare.</p>
<p>Thanks to everyone who attended and especially to those who asked so many excellent questions.</p>
<div style="width:425px;text-align:left" id="__ss_749916"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/lizadaly/what-publishers-need-to-know-about-digitization-presentation?type=powerpoint" title="What publishers need to know about digitization">What publishers need to know about digitization</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=digitizationwebinar-1226595850439471-9&#038;stripped_title=what-publishers-need-to-know-about-digitization-presentation" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=digitizationwebinar-1226595850439471-9&#038;stripped_title=what-publishers-need-to-know-about-digitization-presentation" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View SlideShare <a style="text-decoration:underline;" href="http://www.slideshare.net/lizadaly/what-publishers-need-to-know-about-digitization-presentation?type=powerpoint" title="View What publishers need to know about digitization on SlideShare">presentation</a> or <a style="text-decoration:underline;" href="http://www.slideshare.net/upload?type=powerpoint">Upload</a> your own. (tags: <a style="text-decoration:underline;" href="http://slideshare.net/tag/schema">schema</a> <a style="text-decoration:underline;" href="http://slideshare.net/tag/epub">epub</a>)</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/11/13/slides-from-what-publishers-need-to-know-about-digitization/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Seven new books added</title>
		<link>http://blog.threepress.org/2008/05/12/seven-new-books-added/</link>
		<comments>http://blog.threepress.org/2008/05/12/seven-new-books-added/#comments</comments>
		<pubDate>Tue, 13 May 2008 01:06:11 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[content]]></category>
		<category><![CDATA[project gutenberg]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[wiki]]></category>
		<category><![CDATA[wikipedia]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=8</guid>
		<description><![CDATA[The last set of Gutenberg HTML books that were planned for demonstration on threepress has been added.  As usual, data-loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible.  This set includes one non-fiction book (On the Origin of Species) and one [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">The last set of <a href="http://gutenberg.hwg.org/checkdoc1.html">Gutenberg HTML</a> books that were planned for demonstration on threepress has been added.  As usual, data-loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible.  This set includes one non-fiction book (<a href="http://www.threepress.org/document/On-the-Origin-of-Species-by-Means-of-Natural-Selection_Charles-Darwin/">On the Origin of Species</a>) and one with verse components (<a href="http://www.threepress.org/document/The-Jungle-Book_Rudyard-Kipling/">The Jungle Book</a>); both required significant updates to the XSLT that converts the Gutenberg DTD to TEI.</p>
<p style="text-align: left;">To expand the project in useful ways I&#8217;d like to be able to add:</p>
<ol>
<li>Other content types besides novels, especially reference</li>
<li>Content from other document formats, such as DocBook</li>
<li>Native, highly-tagged TEI documents</li>
</ol>
<p>Wikipedia and its sister projects are by far the largest source of freely licensed content on the web now, but they aren&#8217;t encoded in XML. Publishers are unlikely to use wiki formatting to mark up their content, and thus developing a workflow to convert from wiki to TEI doesn&#8217;t seem productive.</p>
<p>XML data welcome!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/05/12/seven-new-books-added/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
