Untangling Text Encoding
Over the last few months, I’ve been spending a lot of my time focused on a fairly technical topic: text encoding.
Basically, text encoding is a method for representing text in a digital form. It allows you to record information about text -- for example, whether it is handwritten, or mentions someone’s name, or is the salutation of a letter -- right alongside the text itself.
And it’s a key component of our new Greenfield digital history project, part of a larger effort funded by the Albert M. Greenfield Foundation.
As I’ve described in previous blog posts, we are digitizing, transcribing, and annotating approximately 300 primary source documents from the Albert M. Greenfield papers (collection 1959) and other collections to tell the story of Bankers Trust Company, a large Philadelphia bank that failed in December 1930.
We will be coding each document in XML following the encoding guidelines set out by the international Text Encoding Initiative (TEI). We will present these documents, contextual essays, and teacher resources online in the fall of 2012.
So what does text encoding look like?
To give one small example, if I were transcribing a letter from 1928 in which the author wrote “Meet me on June 13!”, my encoded transcription would look like this:
<p>Meet me on <date when="1928-06-13">June 13</date>!</p>
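The added coding allows a computer to recognize that the string of characters “June 13” represents a specific date, because the when attribute records that date in a standard year-month-day form (1928-06-13). As long as the encoder supplies the same value, a computer can recognize the same date whether it was written as “June thirteenth” or “6/13/28” or even “today.”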
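To make that concrete, here is a minimal sketch in Python of how a program might read the encoded date. It is only an illustration, not the software our project uses: the function name encoded_dates and the sample sentences are my own inventions, and the fragments leave out the XML namespace that a full TEI document would declare.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical transcriptions: the same date written four different ways,
# each tagged with the same normalized value in the "when" attribute.
SAMPLES = [
    '<p>Meet me on <date when="1928-06-13">June 13</date>!</p>',
    '<p>Meet me on <date when="1928-06-13">June thirteenth</date>!</p>',
    '<p>Meet me on <date when="1928-06-13">6/13/28</date>!</p>',
    '<p>Meet me <date when="1928-06-13">today</date>!</p>',
]


def encoded_dates(xml_fragment):
    """Return (text as written, normalized date) pairs for each <date> element."""
    root = ET.fromstring(xml_fragment)
    pairs = []
    for element in root.iter("date"):
        # The program never interprets the written text; it reads only
        # the year-month-day value that the encoder supplied.
        normalized = date.fromisoformat(element.get("when"))
        pairs.append((element.text, normalized))
    return pairs


for sample in SAMPLES:
    for written, normalized in encoded_dates(sample):
        print(f"{written!r} is encoded as {normalized.isoformat()}")
```

The point of the sketch is that the computer does no guessing. The person doing the encoding decides what “today” means, and the program simply reads that decision back.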
We will be encoding much more information than just dates in our Bankers Trust documents, from the physical appearance of the text to its structure to its intellectual content. Once we have completed the encoding, web users will be able to run sophisticated searches and analyses across the digital documents. I'll keep you posted on our progress.