Below is the text of my prepared remarks for a roundtable at the Modern Language Association convention in January 2013. (This version is slightly longer than the one I will deliver live, because of the session’s time constraints.) Here’s the blurb for the session from the program:
Session 22/“Expanding Access: Building Bridges within Digital Humanities.” Thursday, 12:00 noon–1:15 p.m., 205, Hynes. A special session. Presiding: Trent M. Kays, Univ. of Minnesota, Twin Cities; Lee Skallerup Bessette, Morehead State Univ. Speakers: Marc Fortin, Queen’s Univ.; Alexander Gil, Univ. of Virginia; Brian Larson, Univ. of Minnesota, Twin Cities; Sophie Marcotte, Concordia Univ.; Ernesto Priego, London, England
Digital humanities are often seen to be a monolith, as shown in recent publications that focus almost exclusively on the United States and English-language projects. This roundtable will bring together digital humanities scholars from seemingly disparate disciplines to show how bridges can be built among languages, cultures, and geographic regions in and through digital humanities.
This roundtable arose in part out of a blog post and then article by Domenico Fiormonte (2012) proposing a “cultural critique of the digital humanities.” Fiormonte offered two major concerns about Anglo-American DH. First, he identified
the composition of the government organs, institutions etc., inspiring and managing the processes, strategies and ultimately the research methodologies (thus affecting also the visibility of the results) (p. 62)
I’m not prepared or qualified to take up this question, and I’ll leave it for others.
Second, he exposed questions about “the cultural-linguistic nuances and features of the tools” used in DH (p. 62). He broke this issue into two components. He addressed one of them at length, what he called “the cultural-semiotic problem of the different tools of representation” arising from DH’s Anglo-American centrism. I’m going to refer you to his treatment of that issue, which is provocative, though incomplete. The other branch of the cultural-linguistic issue, what he described as “the cultural and political problem of software and platform[s] almost exclusively produced in the Anglo-American environment,” he left for another time. It is this last issue that I’d like to address. In fact, I’d specifically like to point out ways that the largely Anglo-American researchers who have created the software and platforms used for natural language processing have worked to make their efforts workable for scholars elsewhere in the world.
First, I’d like to provide a little personal background, just so you can be well warned about my theoretical and epistemological commitments. I’m a PhD student in Rhetoric and Scientific and Technical Communication at the University of Minnesota, minoring in cognitive science. I use computational methods to explore writing. However, I don’t think computational methods alone are sufficient for examining meaning in the texts that I study. Instead, I’m convinced that we need to use both traditional close reading and what Burdick et al. (2012) call “distant reading”; that we need to consider the ways that “big data” and computational methods can permit what anthropologist Clifford Geertz (1975) called “a continuous dialectical tacking between the most local of local detail and the most global of global structure in such a way as to bring both into view simultaneously” (p. 52). For me, this means looking at trends and metrics across large corpora of data while also giving selected samples from such corpora close readings and using methods such as interviews and observation to enrich a contextual understanding of them.
I think I am NOT, however, a digital humanist, and thanks to Burdick, Drucker, Lunenfeld, Presner, and Schnapp’s Digital_Humanities (2012), I finally know why. I had struggled with the name digital humanities, working to decide whether they were the application of digital tools to traditional objects of study in the humanities or the application of the methodologies of the humanities to digital artifacts. Burdick et al. say that DH is neither; rather, they say:
Digital Humanities refers to new modes of scholarship and institutional units for collaborative, trans-disciplinary, and computationally engaged research, teaching, and publication. Digital Humanities is less a unified field than an array of convergent practices that explore a universe in which print is no longer the primary medium in which knowledge is produced and disseminated. (p. 122)
I read them as saying that digital humanities describes the outputs of these new modes of scholarship, rather than the inputs and the tools. In contrast, I merely use computational tools to look at rhetorical performances; and I’ll likely publish my results in the same old journals as my forebears. Nevertheless, I’m grateful that others have developed the technologies that I use, particularly in natural language processing.
I argue that natural language processing (or “NLP”) researchers have already done much to make their research accessible and useful to a global audience, which benefits researchers in the Anglo-American center of DH but also makes these important tools useful throughout the developed and developing world. I’d like to discuss three developments: First, much of the computational linguistics literature is open access. Second, many or most of the tools are open source, or at least open access. Finally, NLP experts appear to be well aware of the cross-linguistic issues inherent in their tools and research and appear prepared to attempt to bridge the gaps. We’ll begin with the openness of research.
The leading international association for NLP researchers is the Association for Computational Linguistics. This organization is committed to open access to its publications. So, for example, its flagship journal, Computational Linguistics, published by MIT Press, has been open access since 2009 (see “Computational Linguistics,” n.d.). What is even more useful and important, however, is the ACL Anthology, which archives nearly 22,000 papers from more than 30 years of NLP conferences, including all the back issues of Computational Linguistics (“ACL Anthology,” n.d.).
This is important in light of the fact that presenters at the NLP community’s conferences generally must submit completed papers, rather than just presentation proposals, before being accepted to speak. Thus, the ACL Anthology provides almost immediate access to a rich cross-section of state-of-the-art NLP research, free of charge. ACL also makes the effort to take its annual international conference outside North America (this year, it was in South Korea).
Open and cross-linguistic tools
But doing NLP requires software tools. Fortunately, these too are widely available and many already take account of cross-linguistic issues. I’ll describe just two of the ones that I’ve used in my own work: the General Architecture for Text Engineering or “GATE” (Cunningham et al., 2012); and Stanford CoreNLP (“Stanford CoreNLP,” n.d.). But first, I’d like to mention a few of the functions that NLP software is called on to perform in the analysis of text corpora.
Making effective use of a text corpus usually requires several pre-processing steps. These include “tokenizing” the text, identifying word boundaries and punctuation marks; sentence-splitting, identifying boundaries between sentences; lemmatization, identifying variant forms of the same underlying lexical item, as for example when recognizing that “posso” and “pode” are both forms of the verb “poder” in Portuguese or that “women” is the plural form of “woman” in English; named-entity recognition, identifying proper names of people, places, and things; part-of-speech tagging, which entails assigning a probable part of speech to each token; and dependency parsing, among others. After these steps, others, such as topic and sentiment analysis and the application of machine learning algorithms, generally become possible. For a relatively accessible introduction to the steps required for NLP, see Jurafsky and Martin (2009).
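To make the first few of these steps concrete, here is a deliberately naive sketch in Python of sentence-splitting, tokenizing, and lemmatizing. It is an illustration only, not how GATE or Stanford CoreNLP work internally: the function names and the tiny `LEMMAS` table are my own assumptions for the example, and real tools rely on far larger language-specific resources.

```python
import re

# Hypothetical toy lemma table for illustration; real lemmatizers use
# much larger language-specific dictionaries or rule sets.
LEMMAS = {"women": "woman", "posso": "poder", "pode": "poder"}


def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatize(token):
    """Look the lowercased token up in the toy table; fall back to the token itself."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


if __name__ == "__main__":
    text = "The women arrived. Eu posso ir!"
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        print([(token, lemmatize(token)) for token in tokens])
```

Even this toy version shows why the steps are ordered as they are: lemmatization needs tokens, and tokenization is easier once sentence boundaries are known. A splitter this simple would also stumble on abbreviations such as “Univ.”, which is one reason researchers reach for mature toolsets.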
Many of these NLP tools are language-dependent. For example, a part-of-speech tagger usually relies on a dictionary of words from the language it is processing; it often has to be “trained” on a corpus of text that has already been tagged for part of speech by linguists. Lemmatizing is quite different for a language like Portuguese, with numerous verb endings, than for English. And it’s complicated in an entirely different way for Arabic, where hundreds of verb forms can be generated from a single three- or four-character root. Each of the two toolsets I mentioned earlier makes some effort to address these concerns.
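The dependence on training data can be seen in a minimal sketch of a “most frequent tag” part-of-speech tagger, again in Python. This is a simplified stand-in that I am assuming for illustration, not the method used by any particular toolset; the three-sentence `TRAINING` corpus, the tag names, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training corpus; real taggers are trained on
# corpora containing many thousands of sentences tagged by linguists.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN"), ("ended", "VERB")],
    [("she", "PRON"), ("runs", "VERB")],
]


def train_unigram_tagger(tagged_sentences):
    """Record, for each word, the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default="UNK"):
    """Assign each token its most frequent training tag, or `default` if unseen."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


if __name__ == "__main__":
    model = train_unigram_tagger(TRAINING)
    print(tag(["The", "dog", "runs"], model))   # English words seen in training
    print(tag(["o", "gato", "corre"], model))   # Portuguese words never seen
```

A model trained only on English text has nothing to say about the Portuguese tokens and falls back to the default tag for every one of them, which is the language-dependence problem in miniature.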
GATE’s home is at the University of Sheffield in the UK, where its development has received funding from a variety of public sources. It is distributed under a GNU general public license (“How to use GNU licenses,” n.d.) that makes it largely free for research, non-profit, and academic uses. The application runs in Java on Windows, Mac, and UNIX/Linux machines, meaning that it can be deployed cheaply by researchers working even with relatively dated and inexpensive equipment. Its development community has deployed plugins for non-Western languages, including Arabic, Cebuano, Chinese (Mandarin), and Hindi. GATE actually includes plugins that embody another useful toolset, Stanford’s NLP tools.
Stanford’s CoreNLP also runs in Java environments and is subject to the GNU general public license. Though this toolset is overtly dedicated to processing text in English, its sponsors have also deployed rich resources for processing Chinese and Arabic (“Chinese Natural Language Processing,” n.d.; “Arabic Natural Language Processing,” n.d.).
There are other widely available toolsets, including NLTK: The Natural Language Toolkit (Bird et al., 2009), which consists of software written in Python (another open source tool) and a good, free guidebook; and WEKA (“WEKA 3,” n.d.), a suite of machine learning tools maintained by the University of Waikato in New Zealand.
The availability of NLP research and software tools does not address all of the problems Fiormonte identifies. There is clearly more work to be done, and I’m grateful for Fiormonte’s articulation of some of those problems. So, for example, the coding languages of these applications (Java and Python) were developed by English speakers and probably embody cultural norms from the Anglo-American world. There are also examples of uses of these technologies that betray a lack of critical awareness of their implications. (I’ve referenced a couple on the works cited list, and I’ll be happy to discuss them later in the roundtable session.) Nevertheless, natural language processing is seeing substantial growth around the world, and the researchers at the Anglo-American center of the field appear to be making yeoman efforts to make their work available to their peers elsewhere.
Works cited or of interest
I refer to some of these references in the remarks above; others may be interesting to audience members; and still others may come up during discussion.
ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics. (n.d.). Retrieved January 1, 2013, from http://www.aclweb.org/anthology-new/
Arabic Natural Language Processing. (n.d.). The Stanford Natural Language Processing Group. Retrieved January 1, 2013, from http://nlp.stanford.edu/projects/arabic.shtml
Argamon, S., Koppel, M., Fine, J., & Shimoni, A. R. (2003). Gender, genre, and writing style in formal written texts. Text, 23(3), 321–346.
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python (1st ed.). O’Reilly Media.
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27–40.
Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2012). Digital_Humanities. Cambridge, MA: The MIT Press.
Chinese Natural Language Processing and Speech Processing. (n.d.). The Stanford Natural Language Processing Group. Retrieved January 1, 2013, from http://nlp.stanford.edu/projects/chinese-nlp.shtml
Computational Linguistics. (n.d.). Retrieved January 1, 2013, from http://cljournal.org/
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., … Peters, W. (2012, December 28). Developing Language Processing Components with GATE Version 7 (a User Guide). GATE: General Architecture for Text Engineering. Retrieved January 1, 2013, from http://gate.ac.uk/sale/tao/split.html
Eisenstein, J., Smith, N. A., & Xing, E. P. (2011). Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 1365–1374). Association for Computational Linguistics.
Fiormonte, D. (2012). Towards a cultural critique of the digital humanities. Historical Social Research, 37(3), 59–76.
Gadamer, H.-G. (1989). Elements of a theory of hermeneutic experience. In Truth and Method (2nd ed., pp. 265–307). New York: Continuum.
Geertz, C. (1975). On the nature of anthropological understanding: Not extraordinary empathy but readily observable symbolic forms enable the anthropologist to grasp the unarticulated concepts that inform the lives and cultures of other peoples. American Scientist, 63(1), 47–53.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61–83. [See responses from other researchers published in same issue.]
How to use GNU licenses for your own software. (n.d.). GNU Operating System. Retrieved January 1, 2013, from https://www.gnu.org/licenses/gpl-howto.html
Journal of Cognition and Culture. Leiden: Brill. [Helpful articles examining effects of culture on cognition.]
Juarrero, A. (1999). Dynamics in Action: Intentional Behavior as a Complex System. Cambridge, MA: The MIT Press.
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). Upper Saddle River, NJ: Pearson Education, Inc.
Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.
Stanford CoreNLP: A suite of core NLP tools. (n.d.). The Stanford Natural Language Processing Group. Retrieved January 1, 2013, from http://nlp.stanford.edu/software/corenlp.shtml
Weka 3: Data Mining Software in Java. (n.d.). WEKA: The University of Waikato. Retrieved January 1, 2013, from http://www.cs.waikato.ac.nz/ml/weka/
Zimmer, B. (2012, October 18). Bigger, better Google ngrams: Brace yourself for the power of grammar. The Atlantic. Retrieved from http://www.theatlantic.com/technology/archive/2012/10/bigger-better-google-ngrams-brace-yourself-for-the-power-of-grammar/263487/