Getting From Paper Pages To Digital Texts

2011-06-15 12:12

Now that we’re elbow-deep in encoding the 300 or so documents for the Greenfield Digital Project, my colleague Faith Charlton and I are spending a lot of time at the keyboard.

As I've explained in past posts, we are digitizing, transcribing, and annotating primary source documents to tell the story of Bankers Trust Company, a large Philadelphia bank that failed in December 1930. This project is part of a larger effort funded by the Albert M. Greenfield Foundation.

We've selected our documents and passed them to HSP's digital imaging team, and we are now focused on creating our XML text-encoded files.

Step one is getting the text from paper to a digital file.

One obvious method is simply to retype the text on a computer. That works great for brief documents and is essentially the only option for handwritten documents, where each author's writing must be carefully deciphered. But we have another tool in our toolbox for longer typescript and print documents: optical character recognition, or OCR.

Software programs like ABBYY FineReader or OmniPage can perform sophisticated transformations of digital images into editable text files. For a variety of reasons, I decided it didn't make sense to invest in that type of software for our project.

Instead, when we encounter longer typescript or printed documents in our project, we are using a free tool: Google Documents, or Google docs for short. You can upload .jpg, .gif, .png, or .pdf files to a private storage area, and the system will do its best to translate that digital image file into an editable text file. You can then copy and paste the text into any other software you'd like, including the software we're using for text encoding, oXygen XML Editor.

[Image: screenshot of the Google docs upload menu]

The crucial check-box in the Google docs uploading process is near the bottom: "Convert text from PDF or image files to Google Docs documents."

Depending on your perspective, the Google results are either amazingly accurate or frustratingly imperfect. (I fall into the "amazingly accurate" camp.)

For example, the following four-page letter has about 1,300 words in total. I could type it from scratch in 15 to 20 minutes; Google docs does the same work in less than a minute.

[Image: first page of a 1927 letter from Samuel Barker to Albert Greenfield proposing creation of Bankers Securities]

The first page of the digitized letter, in which Bankers Trust Co. President Samuel Barker proposed a new business venture, Bankers Securities Corp.

Here is how Google docs transcribed the first page of the letter:

June 11, 1927

Mr. Albert Pi. Greenfield, Bankers Trust Building, Philadelphia.

Dear A 0

Let me give you more concretely than I did in brief conversation a few days ago my thoughts cmcerning a Bankers Securities Corporation and the much that can be accomplished through such an organization. I s11:';ll try to put the proposition, as I vision it, with the strong conviction that the time has come to act in the matter.

When I first proposed it, immediately upon organization of Bankers Trust Company you and others thought the time premature. You were right. Since then the way has cleared and been opened in many and important ways for Bankers Securities Corporation to be brought into life. There is ready for it a largely advance prepared and very profitable field, with real things at hand for it to do-—things of creative as well as money-making character. What I see is this:

1. Bankers Trust Company is now safely and surely established. Already it holds recognized position in Philadelphia. Important financial interests elsewhere, as in New York, Boston, Baltimore and Pittsburgh, are glad to do business with it. From now forward there is pretty sure promise of earnings which will give increasing net income for the stock. There are about 500 stockholders. Without support the stock is much above both issue price and book value. The premium is one measure of the belief which exists that the Company has a large future.

2. Resources of Bankers Trust Company have more than doubled; its deposits have increased 60% in five months. Directly, some 12,000 people are banking with it. Already, with a securities department only in swaddling clothes, Bankers Trust Company has been welcomed to the table with grown-ups and taken into the inner circle by big financial groups. I fill in this picture as follows:

a. It enjoys full syndicate position with Kuhn, Loeb and Company, so getting securities which that banking house issues at bottom issue price.

If you compare the original and the transcription carefully, you'll see that Google skipped the printed letterhead at the top of the page and had problems with the recipient address, salutation, and first paragraph. But overall, it made relatively few mistakes.
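A comparison like this can also be scripted once a corrected version of the text exists. The following is a minimal Python sketch, not a tool described in this post: it uses the standard-library difflib module, the function name ocr_differences is made up for illustration, and the "corrected" sentence is an assumed reading of the typescript.

```python
import difflib

def ocr_differences(ocr_text, corrected_text):
    """Return (ocr_words, corrected_words) pairs wherever the two texts differ."""
    ocr_words = ocr_text.split()
    good_words = corrected_text.split()
    # Compare word by word rather than character by character.
    matcher = difflib.SequenceMatcher(None, ocr_words, good_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2])))
    return diffs

# OCR output from the letter above, and an assumed corrected reading.
ocr = "my thoughts cmcerning a Bankers Securities Corporation"
fixed = "my thoughts concerning a Bankers Securities Corporation"

for wrong, right in ocr_differences(ocr, fixed):
    print(f"{wrong!r} -> {right!r}")
```

Comparing whole words keeps the report short: each misread word appears once, next to its correction.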

Of course, both my typing and the Google docs transformation require careful proofreading. Google often skips over text that confuses it, and it seems to do worse if there are multiple styles of text on a page (like the letterhead and typescript above). It also has a hard time distinguishing between typescript 3s and 5s, among other issues. But for our purposes, these shortcomings are a fair tradeoff for the price.
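Some of the proofreading can be narrowed down before a human reads the text. The sketch below is one possible heuristic, offered as an assumption rather than anything the project uses: it flags words that mix letters with digits, or that have stray punctuation between letters. The function name suspicious_tokens and both patterns are hypothetical.

```python
import re

# Hypothetical heuristics: a word containing both a letter and a digit,
# or punctuation (other than apostrophes and hyphens) between two letters.
MIXED = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")
INNER_PUNCT = re.compile(r"[A-Za-z][^\w\s'’\-]+[A-Za-z]")

def suspicious_tokens(text):
    """Return the whitespace-separated tokens that look like OCR garbage."""
    flagged = []
    for token in text.split():
        # Ignore ordinary punctuation at the start or end of a word.
        core = token.strip(".,;:!?()\"")
        if MIXED.search(core) or INNER_PUNCT.search(core):
            flagged.append(token)
    return flagged

# A sentence from the Google docs transcription above.
sample = "I s11:';ll try to put the proposition, as I vision it"
print(suspicious_tokens(sample))
```

A filter like this only points at visibly mangled words. It will not notice a plausible-looking error such as "cmcerning", a 3 read as a 5, or text that was skipped entirely, and it will also flag legitimate tokens such as "1st", so careful proofreading against the image is still required.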

