I’ve had a longstanding, friendly debate with a colleague about whether
it is sufficient to provide page images of books, or whether text
should be converted to a machine- and human-readable format such as
XML. She argues that converting scanned books to text is expensive and
that the primary goal should be to provide access to more material.
True, but converting books into a textual format makes them much more
accessible, allowing users to search, manipulate, organize, and analyze
them. Here’s my summary of what you can do with an electronic text.
Most of these advantages are pretty obvious, but worth articulating.
It’s not digital text if it’s an image file. It’s just an image that might contain anything at all. Vannevar Bush’s Memex was an idea for a text storage-and-retrieval system that worked by storing and linking microfilm images of pages of text, but his vision was purely analog. Page images do provide a certain amount of information, and today it’s not too hard to find tools that convert page images to text, but an archival project is incomplete if the digitization process stops at simply supplying images of the material to be archived.