<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Threepress Consulting blog &#187; xslt</title>
	<atom:link href="http://blog.threepress.org/tag/xslt/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.threepress.org</link>
	<description>Threepress creates software for publishers, educators and authors.</description>
	<lastBuildDate>Tue, 27 Jul 2010 16:34:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum</title>
		<link>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/</link>
		<comments>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 02:42:29 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[clowns]]></category>
		<category><![CDATA[dutch]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xpath]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=103</guid>
		<description><![CDATA[I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of Hamlet mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the [...]]]></description>
			<content:encoded><![CDATA[<p>I was pleased to be able to assist with the <a href="http://webh01.ua.ac.be/theso/cti/index.html">Corpus Toneelkritiek Interbellum</a> project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of <a href="http://webh01.ua.ac.be/theso/cti/1926-05-30_putman70.html">Hamlet</a> mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the actual reading to someone else.
</p>
<div style="text-align:center;margin:auto;float:none">
<a href="http://webh01.ua.ac.be/theso/cti/index.html"><img style="float:none" src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-6-300x253.png" alt="" title="picture-6" width="300" height="253"  align="right" /></a>
</div>
<p style="clear:both">
The source documents are encoded in <a href="http://www.tei-c.org/index.xml">TEI</a> XML and rendered to the browser using Python and <a href="http://codespeak.net/lxml/">lxml</a>, three of my favorite technologies.</p>
<p>
There are a few take-aways from this project that might benefit anyone working in a similar area and scale: </p>
<ul>
<li> Use a standard encoding format (in this case TEI, but choose an appropriate one based on the source content)</li>
<li> Use a modern programming language, even in a humanities context (e.g. Python)</li>
<li> Use modern XML parsing tools (e.g. lxml + XPath + XSLT)</li>
</ul>
<p>
The key advantage of libraries such as lxml in publishing and digitization projects is that it allows the developer to freely mix XML-native languages like XPath and XSLT with the expressive, procedural programming style of Python.  I&#8217;m still amazed by how many people are &#8220;parsing&#8221; XML using regular expressions (or worse), or using plain CGI/Perl scripts to serve up content. There are easier ways!</p>
<p> &#8220;Free&#8221; doesn&#8217;t have to mean primitive. In fact I would argue that projects like <a href="http://pinaxproject.com/">Pinax</a> can jump-start library or digital archive sites into the 21st century with less work than a grad student will spend crafting a bespoke Perl script.
</p>
<p> Congratulations to Thomas Crombez and his team!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Seven new books added</title>
		<link>http://blog.threepress.org/2008/05/12/seven-new-books-added/</link>
		<comments>http://blog.threepress.org/2008/05/12/seven-new-books-added/#comments</comments>
		<pubDate>Tue, 13 May 2008 01:06:11 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[content]]></category>
		<category><![CDATA[project gutenberg]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[wiki]]></category>
		<category><![CDATA[wikipedia]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=8</guid>
		<description><![CDATA[The last set of Gutenberg HTML books that were planned for demonstration on threepress have been added.  As usual, data-loading took more time and uncovered up more problems than expected, which is always a reason to add as many samples as possible.  This set includes one non-fiction book (On the Origin of Species) and one [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">The last set of <a href="http://gutenberg.hwg.org/checkdoc1.html">Gutenberg HTML</a> books that were planned for demonstration on threepress have been added.  As usual, data-loading took more time and uncovered up more problems than expected, which is always a reason to add as many samples as possible.  This set includes one non-fiction book (<a href="http://www.threepress.org/document/On-the-Origin-of-Species-by-Means-of-Natural-Selection_Charles-Darwin/">On the Origin of Species</a>) and one with verse components (<a href="http://www.threepress.org/document/The-Jungle-Book_Rudyard-Kipling/">The Jungle Book</a>); both required significant updates to the XSLT that converts the Gutenberg DTD to TEI.</p>
<p style="text-align: left;">To expand the project in useful ways I&#8217;d like to be able to add:</p>
<ol>
<li>Other content types besides novels, especially reference</li>
<li>Content from other document formats, such as DocBook</li>
<li>Native, highly-tagged TEI documents</li>
</ol>
<p>Wikipedia and its cohorts are by far the largest source of public domain data on the web now, but they aren&#8217;t encoded in XML. Publishers are unlikely to use wiki formatting to mark up their content and thus developing a workflow to convert from wiki to TEI doesn&#8217;t seem productive.</p>
<p>XML data welcome!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/05/12/seven-new-books-added/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
