The real Internet Archive

Wednesday, October 22nd, 2008

My attention was caught by this quote from Clay Shirky on the excellent ReadWriteWeb blog:

“Back in 1974, when the Internet was a fraction of what it is now, the acorn to an oak, there were really only two applications,” said Shirky, “Telnet, and FTP.”

Surely he’s wrong, I thought.  Those protocols aren’t that old.

But I was wrong. FTP was invented in 1971, and telnet was developed in 1969.

(Telnet is a way to connect interactively with another computer. In practice it’s been replaced by the more secure ssh, but vestigial copies remain on all modern computers.)

What really threw me wasn’t that telnet was from 1969 as much as that it was RFC 15.  In the networked world, Requests for Comments are documents which define the standards that computers use when communicating with each other.  To understand how old RFC 15 is, consider that the venerable FTP is RFC 114,  while email as we know it is RFC 821 (1982), and HTTP is RFC 1945 (1996, although obviously it had been in use for years). The most recent RFC is 5382. RFC 15 is ancient history.

Because I am a nerd I spent some time browsing the early RFCs, and I was struck by how charmingly antique they are. RFC 16 says that M.I.T. should receive copies of RFCs. RFC 6 begins, “I talked with Bob Kahn at BB&N yesterday.” RFC 14 never existed.

RFC 7 (“Host-IMP Interface”) includes a prefatory note:

The original of RFC 7 was hand-written, and only partially illegible [sic]
copies exist.

Indeed, the actual RFC begins:

This paper is concerned with the preliminary software design of the
Host IMP interface. Its main purpose is on the one hand to define
functions that will be implemented, and on the other hand to provide
a base for discussions and …(unreadable).

I’m on the mailing list for users of the Text Encoding Initiative (TEI), an XML schema used primarily for encoding historical texts. The schema is equipped with tags for tracking everything about a document, including changes that occur over centuries of time. On the TEI list, people ask questions like, “How do I represent a medieval manuscript and also indicate which passages were underlined by an 18th century owner?” or “What tag should I use for a poem title that was handwritten vertically in the left margin?” (Promptly followed by vigorous scholarly debates over the “correct” answers.)

There’s something charming about how early internet history, just 40 years old, is almost as poorly documented as those centuries-old texts, and just as much in need of careful archivists.

TEI + Python + lxml + Dutch = Corpus Toneelkritiek Interbellum

Tuesday, October 14th, 2008

I was pleased to be able to assist with the Corpus Toneelkritiek Interbellum project, which allows reading, browsing and searching of early 20th-century Dutch theater reviews. I can’t read Dutch, but Google’s automated translation tells me that the review of Hamlet mentions a “long modern clown,” which sounds disturbing enough that I’ll leave the actual reading to someone else.

The source documents are encoded in TEI XML and rendered to the browser using Python and lxml, three of my favorite technologies.

There are a few take-aways from this project that might benefit anyone working in a similar area and scale:

  • Use a standard encoding format (in this case TEI, but choose an appropriate one based on the source content)
  • Use a modern programming language, even in a humanities context (e.g. Python)
  • Use modern XML parsing tools (e.g. lxml + XPath + XSLT)

The key advantage of libraries such as lxml in publishing and digitization projects is that they allow the developer to freely mix XML-native languages like XPath and XSLT with the expressive, procedural programming style of Python. I’m still amazed by how many people are “parsing” XML using regular expressions (or worse), or using plain CGI/Perl scripts to serve up content. There are easier ways!
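As a rough illustration of that mix, here is a minimal sketch that pulls a title and the paragraph text out of a TEI-style document with XPath-like queries, then does the counting in ordinary Python. It is not the project's code. The sample document and the `summarize` function are made up for this example, and it uses the standard library's ElementTree (which supports only a subset of XPath) so that it runs without installing anything. lxml offers a largely compatible API with full XPath and XSLT support.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace, bound to a prefix for use in path expressions.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI fragment; not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hamlet</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph of the review.</p>
      <p>Second paragraph, with a <hi rend="italic">highlighted</hi> phrase.</p>
    </body>
  </text>
</TEI>"""


def summarize(xml_text):
    """Return the title, paragraph texts and word count of a TEI document."""
    root = ET.fromstring(xml_text)
    # Path expressions locate the elements...
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    paragraphs = [
        # itertext() flattens inline markup such as <hi> into plain text.
        "".join(p.itertext()).strip()
        for p in root.findall(".//tei:body/tei:p", TEI_NS)
    ]
    # ...and plain Python handles the rest.
    words = sum(len(p.split()) for p in paragraphs)
    return {"title": title, "paragraphs": paragraphs, "words": words}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Note that no regular expression ever touches the markup: the parser deals with namespaces and nested inline tags, and the Python code only sees clean strings.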

“Free” doesn’t have to mean primitive. In fact I would argue that projects like Pinax can jump-start library or digital archive sites into the 21st century with less work than a grad student will spend crafting a bespoke Perl script.

Congratulations to Thomas Crombez and his team!

ALA 2008: Technical solutions to increasing the visibility of libraries

Tuesday, July 8th, 2008

I had a great time meeting people and attending talks at this year’s ALA conference in Anaheim.  Although I’ve so far focused on software development for publishers, there’s a lot of need for innovation in library software as well, and it’s something I’m interested in exploring.

User-generated content

Tim Spalding from LibraryThing convincingly demonstrated that ordinary readers can, in aggregate, contribute accurate metadata and even take part in scholarly initiatives.  UGC initiatives don’t replace professional cataloging or research, but they can galvanize interest in a subject by using tools “where people are” on the net, whether it’s LibraryThing, Facebook or Amazon.

Once a resource’s content and metadata are available on the net, the library can broaden the scope of what its “local community” means. It may no longer serve just people living in its immediate vicinity but anyone who has an interest in the library’s holdings, e.g. retirees who grew up in that area, or individuals with historical interest in the location.

This Flickr photostream from the Library of Congress allows anyone to add historical notes and corrections. Ironically, this project also validates the need for editorial control, as some popular photos are overloaded with inane comments. A sensible moderation policy admits potentially-useful information while deleting random valueless statements (“nice hat!”).

For better or worse, most archival library holdings will draw less attention, and thus UGC is likely to be of higher quality.  Without UGC many collections might languish unseen for decades because the resources don’t exist to professionally catalog them.

Software services and discovery

I’m reading about a book on the net, and decide I’m interested in it — but not to buy.  Perhaps it’s out of print, or extremely expensive, or I’m only mildly curious about the title.  I should be a maximum of one or two clicks away from finding out that it’s available via my local library and ordering it.

(I don’t especially care where the book is or how the library acquires it, although I do need to know an estimated time of arrival in case that’s important to my use case. One ALA speaker suggested the unorthodox practice of buying used books online and mailing them directly to patrons, simply because it can be cheaper than old-fashioned inter-library loan.)

Right now, my local library catalog accepts only inbound requests.  I have to go to the site and initiate a search for the title of interest (assuming I even know what I want). My library network (a consortium of many city libraries in a well-off, highly-educated region) isn’t part of WorldCat and certainly doesn’t provide any advanced discovery tools of its own.

Libraries need to move in the true Web 2.0 direction of providing outbound services.  They should broadcast their catalogs using a simple REST-like API.  It could be as simple as asking for http://mylibrary.org/isbn/123456789 and getting a brief XML response back: the book is available via loan and will take 3-5 days to arrive at the local branch. (An authenticated POST request could then reserve it.) There are already good models for these services in the form of the Google Books and Amazon APIs and there is nothing technically infeasible about it.
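To make the idea concrete, here is a minimal sketch of what the lookup side of such a service might do, using only the Python standard library. Everything in it is an assumption made for illustration: the in-memory `CATALOG`, the XML element and attribute names, and the two functions are invented here and do not describe any existing library API. One function turns a request path like `/isbn/123456789` into a status code and a brief XML body, and the other shows how a client could read that body.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory catalog keyed by ISBN; a real service would
# query the library's own catalog system instead.
CATALOG = {
    "123456789": {
        "title": "Example Title",
        "status": "loan",
        "min_days": 3,
        "max_days": 5,
    },
}


def availability_xml(path, catalog=CATALOG):
    """Return (HTTP status, XML body) for a path such as /isbn/123456789."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "isbn":
        return 400, "<error>expected /isbn/NUMBER</error>"
    record = catalog.get(parts[1])
    if record is None:
        return 404, "<error>not in catalog</error>"
    # Build the brief response: how the book is available and when it arrives.
    book = ET.Element("book", isbn=parts[1])
    ET.SubElement(book, "title").text = record["title"]
    ET.SubElement(book, "availability", status=record["status"])
    ET.SubElement(
        book,
        "arrival",
        minDays=str(record["min_days"]),
        maxDays=str(record["max_days"]),
    )
    return 200, ET.tostring(book, encoding="unicode")


def parse_availability(xml_body):
    """Client side: pull the useful facts back out of the XML response."""
    book = ET.fromstring(xml_body)
    arrival = book.find("arrival")
    return {
        "isbn": book.get("isbn"),
        "status": book.find("availability").get("status"),
        "days": (int(arrival.get("minDays")), int(arrival.get("maxDays"))),
    }


if __name__ == "__main__":
    status, body = availability_xml("/isbn/123456789")
    print(status, body)
    print(parse_availability(body))
```

Wiring `availability_xml` into a web framework or Python's built-in `http.server` would be a small additional step. The authenticated POST for reservations is left out of this sketch.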

The regional library of the future should not be just a physical building to store books but a public service for getting books into its community.

Inspired by the conference, I did come back and make my first online request to my local library. It wasn’t difficult, and this morning I got an email notice that the book is waiting for me at the regional branch a couple blocks away. But it could be even easier, and I’d love to help build out that infrastructure.