<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Threepress Consulting blog &#187; lxml</title>
	<atom:link href="http://blog.threepress.org/tag/lxml/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.threepress.org</link>
	<description>Threepress creates software for publishers, educators and authors.</description>
	<lastBuildDate>Mon, 09 Jan 2012 13:02:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Python and XML (and Google!) in publishing applications</title>
		<link>http://blog.threepress.org/2008/10/28/python-and-xml-and-google-in-publishing-applications/</link>
		<comments>http://blog.threepress.org/2008/10/28/python-and-xml-and-google-in-publishing-applications/#comments</comments>
		<pubDate>Tue, 28 Oct 2008 21:38:24 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[tools]]></category>
		<category><![CDATA[article]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=156</guid>
		<description><![CDATA[IBM DeveloperWorks has just released an article of mine on High-Performance XML Parsing in Python.  Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.



Massive XML files are uncommon in the general web development world, where the primary roles of [...]]]></description>
			<content:encoded><![CDATA[<p>IBM DeveloperWorks has just released an article of mine on <a href="http://www.ibm.com/developerworks/library/x-hiperfparse/">High-Performance XML Parsing in Python</a>.  Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.</p>
<div align="center">
<a href="http://www.ibm.com/developerworks/library/x-hiperfparse/"><img class="alignnone size-medium wp-image-157" title="lxml article screenshot" src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-12-300x226.png" alt="" width="300" height="226" style="float:none"/></a>
</div>
<p>Massive XML files are uncommon in the general web development world, where the primary roles of XML are either as configuration files, read only infrequently, or for interchange across the web, in which case the files are necessarily small.  It&#8217;s rare to encounter XML measured in gigabytes or more; data at that level is usually stored in a relational database.</p>
<p>For that reason I find myself frustrated with many XML tools, even those ostensibly designed to handle large amounts of data.  Too often they don&#8217;t scale well, or at least not easily.  I don&#8217;t believe that scaling should be a black art that each individual developer has to master independently.  Unfortunately, in commercial products ease of use is a key bullet point, and computationally difficult problems are hard to summarize in a user&#8217;s guide.</p>
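<p>As a rough illustration of what scaling can look like in practice, here is a minimal sketch of streaming a document one element at a time instead of loading it whole. It is not taken from the article. It assumes the standard library&#8217;s <code>xml.etree.ElementTree.iterparse</code> (lxml offers a similarly named <code>iterparse</code>), and the <code>record</code> and <code>title</code> element names and both helper functions are hypothetical.</p>

```python
import io
import xml.etree.ElementTree as ET


def build_sample(n):
    """Build a small in-memory XML document with n hypothetical 'record' elements."""
    root = ET.Element("records")
    for i in range(n):
        rec = ET.SubElement(root, "record", id=str(i))
        ET.SubElement(rec, "title").text = "Title %d" % i
    # ET.tostring returns bytes; wrap them so they can be read like a file.
    return io.BytesIO(ET.tostring(root))


def count_titles(source):
    """Stream through the document, handling one record at a time."""
    seen = 0
    # With events=("end",), each element is yielded once its closing tag is read.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            if elem.findtext("title"):
                seen += 1
            # Discard the record's children and text once it has been handled.
            # The emptied element itself stays attached to the root.
            elem.clear()
    return seen


total = count_titles(build_sample(1000))
print(total)
```

<p>The sample here is tiny and built in memory so that the sketch runs on its own; with a real file, the same loop would take a path or an open file object in place of the <code>BytesIO</code> buffer.</p>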
<p>I tend to recommend open-source software most strongly in two scenarios: for <a href="http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/">small projects with limited budgets</a> and for large projects with unique challenges.  There simply isn&#8217;t going to be a one-size-fits-all application for most interesting publishing work.</p>
<p>This is one of many reasons I&#8217;m excited by Google&#8217;s willingness to <a href="http://tinyurl.com/6kc6hx">open its Google Books archive to researchers</a>:  Python is a first-class programming language in the Google ecosystem, and Google has a good track record of open-sourcing those internal tools with limited commercial value.  I expect a lot of interesting work to come out of that archive once it&#8217;s available.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/28/python-and-xml-and-google-in-publishing-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum</title>
		<link>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/</link>
		<comments>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 02:42:29 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[clowns]]></category>
		<category><![CDATA[dutch]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xpath]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=103</guid>
		<description><![CDATA[I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of Hamlet mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the [...]]]></description>
			<content:encoded><![CDATA[<p>I was pleased to be able to assist with the <a href="http://webh01.ua.ac.be/theso/cti/index.html">Corpus Toneelkritiek Interbellum</a> project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of <a href="http://webh01.ua.ac.be/theso/cti/1926-05-30_putman70.html">Hamlet</a> mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the actual reading to someone else.
</p>
<div style="text-align:center;margin:auto;float:none">
<a href="http://webh01.ua.ac.be/theso/cti/index.html"><img style="float:none" src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-6-300x253.png" alt="" title="picture-6" width="300" height="253"  align="right" /></a>
</div>
<p style="clear:both">
The source documents are encoded in <a href="http://www.tei-c.org/index.xml">TEI</a> XML and rendered to the browser using Python and <a href="http://codespeak.net/lxml/">lxml</a>, three of my favorite technologies.</p>
<p>
There are a few take-aways from this project that might benefit anyone working in a similar area and scale: </p>
<ul>
<li> Use a standard encoding format (in this case TEI, but choose an appropriate one based on the source content)</li>
<li> Use a modern programming language, even in a humanities context (e.g. Python)</li>
<li> Use modern XML parsing tools (e.g. lxml + XPath + XSLT)</li>
</ul>
<p>
The key advantage of libraries such as lxml in publishing and digitization projects is that they allow the developer to freely mix XML-native languages like XPath and XSLT with the expressive, procedural programming style of Python.  I&#8217;m still amazed by how many people are &#8220;parsing&#8221; XML using regular expressions (or worse), or using plain CGI/Perl scripts to serve up content. There are easier ways!</p>
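<p>A minimal sketch of that mix, under some loud assumptions: it is not the project&#8217;s code, the <code>corpus</code>, <code>review</code>, <code>play</code> and <code>body</code> names are made up rather than real TEI, and it swaps in the standard library&#8217;s ElementTree, which supports only a subset of XPath, so that it runs without extra dependencies. lxml itself provides a fuller <code>xpath()</code> method on parsed trees and an <code>etree.XSLT</code> class for stylesheets.</p>

```python
import xml.etree.ElementTree as ET

# Build a tiny, made-up corpus; element and attribute names are hypothetical.
corpus = ET.Element("corpus")
for year, play, body in [
    ("1926", "Hamlet", "A long evening."),
    ("1926", "Faust", "Short and sharp."),
    ("1931", "Hamlet", "A second look."),
]:
    review = ET.SubElement(corpus, "review", year=year)
    ET.SubElement(review, "play").text = play
    ET.SubElement(review, "body").text = body


def reviews_of(root, play):
    """Select reviews with a path expression, then post-process them in Python."""
    # The path expression picks every review whose play child has the given text.
    matches = root.findall(".//review[play='%s']" % play)
    # Ordinary Python handles the sorting and reshaping.
    return sorted((r.get("year"), r.findtext("body")) for r in matches)


hamlet = reviews_of(corpus, "Hamlet")
print(hamlet)
```

<p>The selection is stated declaratively in the path expression, while the sorting and tuple-building stay in plain Python; neither half has to be forced into a regular expression.</p>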
<p> &#8220;Free&#8221; doesn&#8217;t have to mean primitive. In fact I would argue that projects like <a href="http://pinaxproject.com/">Pinax</a> can jump-start library or digital archive sites into the 21st century with less work than a grad student will spend crafting a bespoke Perl script.
</p>
<p> Congratulations to Thomas Crombez and his team!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

