The Archive, Digital Tools, and Copyrighted Texts

hartness.thompson's picture

To discuss the role of the digital humanities (DH) in the future of the archive we must consider what “the archive” is. Defining archive proves challenging because it is a “slippery”[1] term. Archive tends to take on the definition that suits a given scholar’s needs. We suggest that digital humanists embrace this fluidity of definition. As a nascent field, the digital humanities can afford to be creative in moving toward the future of the archive. Rather than pigeonholing archivists and scholars into one standard definition, let’s embrace what Marlene Manoff calls the ambiguity and complexity of the archive.[2] Accepting archive as an umbrella term democratizes both the process and the product. We can re-evaluate who can archive and what gets archived. Derrida warns, “There is no political power without control of the archive, if not memory. Effective democratization can always be measured by this essential criterion: the participation in and access to the archive, its constitution, and its interpretation.”[3] DH has the opportunity, at its start, to be inclusive—invite and involve the tinkerer, the homebrewer, the marginalized, the novice, the expert.

Advances in Optical Character Recognition and Creating an Archive of Modern Texts

The growth of digital tools, and wider access to them, will allow us to reevaluate not only who can make archives but also what can be archived and analyzed. As humanists and scholars, we want to be able to do topical studies of works not in the public domain, allowing individuals to research texts that are not open-access. With the improvement of optical character recognition (OCR), for example, preparing an archive of non-public-domain documents for an individual scholar to study has become more feasible. Depending on the size of the study, the task may not be overwhelming. In the 1990s, OCR was new, and though it saved hours of typing and provided a searchable record, documents scanned using OCR still required extensive editing to ensure the new document matched the original. Now, OCR programs search for whole words rather than individual characters, which reduces these character-level errors. In addition, newer documents have clearer text. Editing is still required for OCR texts; however, with fewer errors, the task becomes quicker, making the work more accessible for the novice and the professional. With tools like OCR, digital humanists have the opportunity to create “private archives” in which scholars can collect and archive modern, copyrighted texts. In this way, archives can be closed spaces for individual study.
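To give a sense of what the remaining editing work can look like, here is a minimal sketch in Python of one way a scholar might flag likely OCR errors for manual review. It is an illustration under our own assumptions, not a description of how any particular OCR program works: the function name `flag_suspect_tokens` and the tiny vocabulary are hypothetical, and a real project would substitute a full word list.

```python
import re

# Letters, digits, and apostrophes count as part of a token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def flag_suspect_tokens(text, vocabulary):
    """Return tokens, in order, that are not in the known vocabulary.

    `vocabulary` is a set of lowercase words assumed to be correct.
    Purely numeric tokens (years, page numbers) are not flagged.
    """
    suspects = []
    for token in TOKEN_PATTERN.findall(text):
        if token.isdigit():
            continue  # numbers are left for the editor to judge in context
        if token.lower() not in vocabulary:
            suspects.append(token)
    return suspects


if __name__ == "__main__":
    # Hypothetical word list; a real study would load a dictionary file.
    known_words = {"the", "archive", "is", "modern", "in"}
    scanned = "Tlie archive is m0dern in 1995"
    print(flag_suspect_tokens(scanned, known_words))
```

A list like this does not correct anything on its own. It only points the editor to the words most worth checking against the original page.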

Available Technologies for Distant Reading and Topic Modeling

Within these private archives, scholars will be able to use digital humanities tools that extract data from texts and archives. Voyant, R and R-Studio, Python, nGram, and MONK are open-access tools individual scholars can use to explore texts and archives, using what Matthew Jockers calls macroanalysis. Jockers, Franco Moretti, and Tanya E. Clement, among others, have used these tools to reveal characteristics and information that were previously inaccessible because the size of the text or corpus made an analysis of the works nearly, if not completely, impossible. In the beginning, the information revealed using macroanalysis and topic modeling tools provided us with new knowledge of documents within the public domain. Moving forward, these tools for distant reading can lead scholars to new and meaningful discoveries in modern texts.
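As a small illustration of the kind of data such tools extract, the following Python sketch compares how often chosen terms appear across the texts in a private archive. It uses only the standard library and is far simpler than topic modeling or the analyses Jockers describes. The function names and the two-document corpus are our own hypothetical examples, not part of any of the tools named above.

```python
import re
from collections import Counter

# Lowercase words, allowing internal apostrophes such as "scholar's".
TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def relative_frequencies(corpus, terms):
    """Count each term per 1,000 tokens in every document.

    `corpus` maps a document title to its full text.
    Returns a dict mapping each title to {term: rate per 1,000 tokens}.
    """
    results = {}
    for title, text in corpus.items():
        tokens = tokenize(text)
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty texts
        results[title] = {term: 1000 * counts[term] / total for term in terms}
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins for OCR'd texts in a private archive.
    corpus = {
        "Novel A": "The archive remembers. The archive forgets nothing.",
        "Novel B": "Memory fades, and the city forgets.",
    }
    for title, rates in relative_frequencies(corpus, ["archive", "forgets"]).items():
        print(title, rates)
```

Normalizing by document length matters here because the texts in an archive rarely have the same size, and raw counts would favor the longest work.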

[1] Clement, Tanya, et al. “Toward a Notion of the Archive of the Future: Impressions of Practice by Librarians, Archivists, and Digital Humanities Scholars,” The Library Quarterly: Information, Community, and Policy, vol. 83, no. 2, 2013, pp. 112-30.

[2] Manoff, Marlene. “Theories of the Archive from Across the Disciplines,” Libraries and the Academy, vol. 4, no. 1, 2004, pp. 9-25.

[3] Derrida, Jacques. “Archive Fever: A Freudian Impression,” Diacritics, vol. 25, no. 2, 1995, pp. 9-63.

Image from http://brunellocreative.com/blog/captcha-spam-and-digital-books/