<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Threepress Consulting blog &#187; tei</title>
	<atom:link href="http://blog.threepress.org/tag/tei/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.threepress.org</link>
	<description>Threepress creates software for publishers, educators and authors.</description>
	<lastBuildDate>Tue, 27 Jul 2010 16:34:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The real Internet Archive</title>
		<link>http://blog.threepress.org/2008/10/22/the-real-internet-archive/</link>
		<comments>http://blog.threepress.org/2008/10/22/the-real-internet-archive/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 01:44:21 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[content]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[archiving]]></category>
		<category><![CDATA[ftp]]></category>
		<category><![CDATA[rfc]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[telnet]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=135</guid>
		<description><![CDATA[My attention was caught by this quote from Clay Shirky on the excellent ReadWriteWeb blog:
&#8220;Back in 1974, when the Internet was a fraction of what it is now, the acorn to an oak, there were really only two applications,&#8221; said Shirky, &#8220;Telnet, and FTP.&#8221;
Surely he&#8217;s wrong, I thought.  Those protocols aren&#8217;t that old.
But I was [...]]]></description>
			<content:encoded><![CDATA[<p>My attention was caught by <a href="http://www.readwriteweb.com/archives/health_20_economics_of_aggregation.php">this quote from Clay Shirky</a> on the excellent <a href="http://www.readwriteweb.com/">ReadWriteWeb blog</a>:</p>
<blockquote><p>&#8220;Back in 1974, when the Internet was a fraction of what it is now, the acorn to an oak, there were really only two applications,&#8221; said Shirky, &#8220;Telnet, and FTP.&#8221;</p></blockquote>
<p><em>Surely he&#8217;s wrong</em>, I thought.  <em>Those protocols aren&#8217;t that old.</em></p>
<p>But I was wrong. FTP was invented in 1971, and telnet was developed in 1969.</p>
<p>(Telnet is a way to connect interactively with another computer. In practice it&#8217;s been replaced by the more secure <a href="http://en.wikipedia.org/wiki/Ssh">ssh</a>, but vestigial copies remain on all modern computers.)</p>
<p>What really threw me wasn&#8217;t that telnet was from 1969 as much as that it was <a href="http://tools.ietf.org/html/rfc15">RFC 15</a>.  In the networked world, Requests for Comments are documents which define the standards that computers use when communicating with each other.  To understand how old RFC 15 is, consider that the venerable FTP is <a href="http://www.rfc-editor.org/rfc/rfc114.txt">RFC 114</a>,  while email as we know it is <a href="http://www.faqs.org/rfcs/rfc821.html">RFC 821</a> (1982), and HTTP is <a href="http://www.faqs.org/rfcs/rfc1945.html">RFC 1945</a> (1996, although obviously it had been in use for years). The most recent RFC is <a href="http://tools.ietf.org/html/rfc5382">5382</a>. RFC 15 is <em><span style="text-decoration: underline;">ancient history</span></em>.</p>
<p>Because I am a nerd, I spent some time browsing the early RFCs, and I was struck by how charmingly antique they are. <a href="http://tools.ietf.org/html/rfc16">RFC 16</a> says that M.I.T. should receive copies of RFCs. <a href="http://tools.ietf.org/html/rfc6">RFC 6</a> begins, &#8220;I talked with Bob Kahn at BB&amp;N yesterday.&#8221; RFC 14 <a href="http://tools.ietf.org/html/rfc14">never existed</a>.</p>
<p>RFC 7 (&#8220;Host-IMP Interface&#8221;) includes a prefatory note:</p>
<blockquote><p>The original of <a href="http://tools.ietf.org/html/rfc7">RFC 7</a> was hand-written, and only partially illegible [sic]<br />
copies exist.</p></blockquote>
<p>Indeed, the actual RFC begins:</p>
<blockquote><p>This paper is concerned with the preliminary software design of the<br />
Host IMP interface.  Its main purpose is on the one hand to define<br />
functions that will be implemented, and on the other hand to provide<br />
a base for discussions and &#8230;(unreadable).</p></blockquote>
<p>I&#8217;m on the mailing list for users of the <a href="http://www.tei-c.org/index.xml">Text Encoding Initiative</a> (TEI), an XML schema used primarily for encoding historical texts. The schema is equipped with tags for tracking everything about a document, including changes that occur over centuries. On the TEI list, people ask questions like, &#8220;How do I represent a medieval manuscript and also indicate which passages were underlined by an 18th-century owner?&#8221; or &#8220;What tag should I use for a poem title that was handwritten vertically in the left margin?&#8221; (Promptly followed by vigorous scholarly debates over the &#8220;correct&#8221; answers.)</p>
<p>There&#8217;s something charming about how early internet history, just 40 years old, is almost as poorly documented as those manuscripts, and just as much in need of careful archivists.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/22/the-real-internet-archive/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum</title>
		<link>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/</link>
		<comments>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 02:42:29 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[clowns]]></category>
		<category><![CDATA[dutch]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[xpath]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=103</guid>
		<description><![CDATA[I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of Hamlet mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the [...]]]></description>
			<content:encoded><![CDATA[<p>I was pleased to be able to assist with the <a href="http://webh01.ua.ac.be/theso/cti/index.html">Corpus Toneelkritiek Interbellum</a> project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews.  I can&#8217;t read Dutch, but Google&#8217;s automated translation tells me that the review of <a href="http://webh01.ua.ac.be/theso/cti/1926-05-30_putman70.html">Hamlet</a> mentions a &#8220;long modern clown,&#8221; which sounds disturbing enough that I&#8217;ll leave the actual reading to someone else.
</p>
<div style="text-align:center;margin:auto;float:none">
<a href="http://webh01.ua.ac.be/theso/cti/index.html"><img style="float:none" src="http://blog.threepress.org/wp-content/uploads/2008/10/picture-6-300x253.png" alt="" title="picture-6" width="300" height="253"  align="right" /></a>
</div>
<p style="clear:both">
The source documents are encoded in <a href="http://www.tei-c.org/index.xml">TEI</a> XML and rendered to the browser using Python and <a href="http://codespeak.net/lxml/">lxml</a>, three of my favorite technologies.</p>
<p>
There are a few take-aways from this project that might benefit anyone working in a similar area and scale: </p>
<ul>
<li> Use a standard encoding format (in this case TEI, but choose an appropriate one based on the source content)</li>
<li> Use a modern programming language, even in a humanities context (e.g. Python)</li>
<li> Use modern XML parsing tools (e.g. lxml + XPath + XSLT)</li>
</ul>
<p>
The key advantage of libraries such as lxml in publishing and digitization projects is that they allow the developer to freely mix XML-native languages like XPath and XSLT with the expressive, procedural programming style of Python.  I&#8217;m still amazed by how many people are &#8220;parsing&#8221; XML using regular expressions (or worse), or using plain CGI/Perl scripts to serve up content. There are easier ways!</p>
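<p>A minimal sketch of that mix, not the project&#8217;s actual code: the TEI fragment, the stylesheet and every variable name below are invented for illustration. XPath pulls values out of the document, XSLT handles the document-shaped transformation, and plain Python makes decisions about the results.</p>

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
NAMESPACES = {"tei": TEI_NS}

# Hypothetical TEI fragment, invented for this example.
SOURCE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="review">
        <head>Hamlet</head>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""

# Hypothetical stylesheet: turns each review into a small HTML fragment.
STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">
  <xsl:template match="/">
    <div class="reviews">
      <xsl:apply-templates select="//tei:div[@type='review']"/>
    </div>
  </xsl:template>
  <xsl:template match="tei:div">
    <div class="review">
      <xsl:apply-templates select="tei:head|tei:p"/>
    </div>
  </xsl:template>
  <xsl:template match="tei:head">
    <h2><xsl:value-of select="."/></h2>
  </xsl:template>
  <xsl:template match="tei:p">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>"""

doc = etree.fromstring(SOURCE)

# XPath: one expression each, with no hand-written tree walking.
titles = doc.xpath("//tei:div[@type='review']/tei:head/text()",
                   namespaces=NAMESPACES)
paragraph_count = int(doc.xpath("count(//tei:p)", namespaces=NAMESPACES))

# XSLT: compile the stylesheet once, then apply it to the parsed document.
transform = etree.XSLT(etree.fromstring(STYLESHEET))
html = transform(doc)

# Python: ordinary procedural logic over what XPath and XSLT produced.
for title in titles:
    print("%s: %d paragraphs" % (title, paragraph_count))
print(str(html))
```

<p>The same parsed tree feeds both the XPath queries and the XSLT transform, so there is no need to re-read the file or fall back on string matching.</p>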
<p> &#8220;Free&#8221; doesn&#8217;t have to mean primitive. In fact I would argue that projects like <a href="http://pinaxproject.com/">Pinax</a> can jump-start library or digital archive sites into the 21st century with less work than a grad student will spend crafting a bespoke Perl script.
</p>
<p> Congratulations to Thomas Crombez and his team!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/10/14/corpus-toneelkritiek-interbellum/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Seven new books added</title>
		<link>http://blog.threepress.org/2008/05/12/seven-new-books-added/</link>
		<comments>http://blog.threepress.org/2008/05/12/seven-new-books-added/#comments</comments>
		<pubDate>Tue, 13 May 2008 01:06:11 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[content]]></category>
		<category><![CDATA[project gutenberg]]></category>
		<category><![CDATA[tei]]></category>
		<category><![CDATA[wiki]]></category>
		<category><![CDATA[wikipedia]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=8</guid>
		<description><![CDATA[The last set of Gutenberg HTML books that were planned for demonstration on threepress has been added.  As usual, data-loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible.  This set includes one non-fiction book (On the Origin of Species) and one [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">The last set of <a href="http://gutenberg.hwg.org/checkdoc1.html">Gutenberg HTML</a> books that were planned for demonstration on threepress has been added.  As usual, data-loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible.  This set includes one non-fiction book (<a href="http://www.threepress.org/document/On-the-Origin-of-Species-by-Means-of-Natural-Selection_Charles-Darwin/">On the Origin of Species</a>) and one with verse components (<a href="http://www.threepress.org/document/The-Jungle-Book_Rudyard-Kipling/">The Jungle Book</a>); both required significant updates to the XSLT that converts the Gutenberg DTD to TEI.</p>
<p style="text-align: left;">To expand the project in useful ways I&#8217;d like to be able to add:</p>
<ol>
<li>Other content types besides novels, especially reference</li>
<li>Content from other document formats, such as DocBook</li>
<li>Native, highly-tagged TEI documents</li>
</ol>
<p>Wikipedia and its cohorts are by far the largest source of freely licensed data on the web now, but they aren&#8217;t encoded in XML. Publishers are unlikely to use wiki formatting to mark up their content, and thus developing a workflow to convert from wiki to TEI doesn&#8217;t seem productive.</p>
<p>XML data welcome!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/05/12/seven-new-books-added/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Convert TEI to epub</title>
		<link>http://blog.threepress.org/2008/05/12/convert-tei-to-epub/</link>
		<comments>http://blog.threepress.org/2008/05/12/convert-tei-to-epub/#comments</comments>
		<pubDate>Mon, 12 May 2008 14:39:55 +0000</pubDate>
		<dc:creator>Liza Daly</dc:creator>
				<category><![CDATA[tools]]></category>
		<category><![CDATA[e-books]]></category>
		<category><![CDATA[epub]]></category>
		<category><![CDATA[idpf]]></category>
		<category><![CDATA[tei]]></category>

		<guid isPermaLink="false">http://blog.threepress.org/?p=7</guid>
		<description><![CDATA[The most useful standalone tool in threepress right now is tei2epub, which the system uses to convert its internal source XML to the emerging e-book standard format epub.
TEI is the Text Encoding Initiative, and is one of the most popular markup formats for printed works (especially in academia).  All of the content on threepress [...]]]></description>
			<content:encoded><![CDATA[<p>The most useful standalone tool in threepress right now is <a href="http://code.google.com/p/epub-tools/">tei2epub</a>, which the system uses to convert its internal source XML to the emerging e-book standard format epub.</p>
<p>TEI is the <a href="http://www.tei-c.org/index.xml">Text Encoding Initiative</a>, and is one of the most popular markup formats for printed works (especially in academia).  All of the content on threepress has been converted from the Gutenberg format to TEI upon ingestion into the site.</p>
<p>epub is the shorthand for the e-book format proposed by the <a title="IDPF consortium" href="http://www.idpf.org/">International Digital Publishing Forum</a> (IDPF), which uses XHTML and custom metadata formats.  An e-book bundle is distributed in ZIP file format with its text and supplementary media &#8220;bound&#8221; together.</p>
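<p>To make the bundle idea concrete, here is a rough sketch using only Python&#8217;s standard library. It is not how tei2epub works internally: the function name <code>build_epub</code>, the <code>OEBPS/</code> directory and the file names are illustrative assumptions. Two details come from the epub container format rather than from this post: the <code>mimetype</code> entry goes first and is stored uncompressed, and <code>META-INF/container.xml</code> points at the package metadata file.</p>

```python
import io
import zipfile

# Tells a reading system where the package metadata lives.
# The OEBPS/content.opf path is a common convention, assumed here.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Hypothetical one-page book body in XHTML.
CHAPTER = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><p>Sample text.</p></body>
</html>
"""


def build_epub(files):
    """Bundle a dict of {archive path: text} into epub-style ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # The mimetype entry must come first and must not be compressed.
        bundle.writestr("mimetype", "application/epub+zip",
                        compress_type=zipfile.ZIP_STORED)
        bundle.writestr("META-INF/container.xml", CONTAINER_XML,
                        compress_type=zipfile.ZIP_DEFLATED)
        # Text and supplementary files are "bound" into the same archive.
        for path, text in files.items():
            bundle.writestr(path, text, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


epub_bytes = build_epub({"OEBPS/chapter1.xhtml": CHAPTER})

# List what ended up in the bundle, in archive order.
with zipfile.ZipFile(io.BytesIO(epub_bytes)) as bundle:
    for name in bundle.namelist():
        print(name)
```

<p>A bundle built this way is only a skeleton. A real book also needs the package file itself (metadata, manifest and spine), so anything produced along these lines should still be run through epubcheck.</p>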
<p><a href="http://code.google.com/p/epub-tools/">tei2epub</a> is written in Python with XSLT.  It also comes bundled with the latest version of <a href="http://code.google.com/p/epubcheck/">epubcheck</a>, for validating the output.  It is meant to be used by developers rather than end-users (unlike the recent <a href="http://blog.bookglutton.com/?p=71">BookGlutton epub converter</a>) and, as most of the functionality is in the XSLT, it should be easy to port to other languages.  Like all threepress tools it is released under the <a href="http://www.opensource.org/licenses/bsd-license.php">BSD license</a>, which means it is free for all commercial and non-commercial use.  You may <a href="http://code.google.com/p/epub-tools/downloads/list">download the ZIP</a> version of the current release or get the latest version from svn at <code><tt>http://epub-tools.googlecode.com/svn/trunk/</tt></code></p>
<p>Current limitations:</p>
<ol>
<li>tei2epub has not been tested on extensively marked-up TEI.  It leverages the standard TEI to XHTML stylesheets distributed by TEI, but it is unknown whether epub readers will support all of the resulting markup</li>
<li>It accepts only a single source document (i.e. an entire TEI book)</li>
<li>It does not handle images or other kinds of media</li>
</ol>
<p>Any of the above can be addressed with the addition of more complex TEI source books.</p>
<p><em>Edited May 22, 2008 to point resources at a new standalone repository.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.threepress.org/2008/05/12/convert-tei-to-epub/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
