Digital Discourse Analysis (DDA)
Digital Discourse Analysis (DDA) is related to Critical Discourse Analysis (CDA) (for an overview of CDA, see Fairclough 2008; Wodak 2009) through their shared interest in the correlation between ideology (in the sense of a belief system) and language use. In contrast to CDA, however, DDA makes extensive use of large corpora of digitized or digitally available texts and combines quantitative and qualitative approaches. DDA draws on corpus linguistics, linguistics, and philology in general, thus creating a framework that is well suited to any textual analysis that is primarily concerned with content and involves a large number of texts. Corpus linguistics provides the means to handle large amounts of textual data, while linguistics and philology in general provide the theoretical frameworks and models against which the textual data are interpreted.
More on Discourse Analysis / Discourse Studies:
- T. van Dijk’s website
Gathering and providing data for analysis: Corpus Linguistics in DDA
Both discourse analysis and corpus linguistics deal with natural language (as opposed to introspectively generated examples) above the sentence level. The opening statement of the volume on corpus linguistics in the Edinburgh Textbooks in Empirical Linguistics series states that Corpus Linguistics is “the study of language based on examples of real life use” (McEnery & Wilson 1996: 1), which is quite similar to the first sentence of the Routledge Handbook of Discourse Analysis: “Discourse analysis is the study of language in use” (Gee & Handford 2012: 1).
How the “language in use” is analysed, however, differs quite drastically. Corpus-linguistic analyses make full use of a large number of texts, be it the complete corpus or a subcorpus built from texts that contain certain features. The hope is that this allows “researchers to uncover linguistic evidence for prevailing/majority and resistant/minority discourses as a large corpus is likely to show a range of ideological positions – something which an analysis of a single text may be less likely to reveal.” (Baker, Paul: Using Corpora in Discourse Analysis. [Homepage; retrieved: 2014-02-24]). Because of the large number of texts and other linguistic units, corpus-linguistic analyses “are always based on the evaluation of some kind of frequencies” (Gries 2009: 2). Only by counting is it possible to determine how often, or by how many different discourse actors, a certain word or phrase is used. It is one of the basic assumptions of corpus linguistics and discourse analysis that “repeated patterns show that evaluative meanings are not merely personal or idiosyncratic, but widely shared in a discourse community.” (Stubbs 2001: 215)
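The difference between counting how often a word is used and counting how many different discourse actors use it can be sketched in a few lines of Python. This is only a minimal illustration: the three mini-texts and the actor labels are invented, and the tokeniser is a deliberately naive regular expression rather than one of the tools discussed in this section.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (discourse actor, text) pairs invented for illustration.
texts = [
    ("actor_a", "The reform is a threat. The threat is real."),
    ("actor_b", "We see no threat in the reform."),
    ("actor_c", "The reform is an opportunity."),
]

def tokenize(text):
    """Naive tokeniser: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

token_freq = Counter()   # how often a word form occurs overall
actors_using = {}        # which actors use a word form at least once

for actor, text in texts:
    tokens = tokenize(text)
    token_freq.update(tokens)
    for tok in set(tokens):
        actors_using.setdefault(tok, set()).add(actor)

def actor_count(word):
    """Number of distinct actors who use the word form."""
    return len(actors_using.get(word, set()))

for word in ("threat", "reform"):
    print(word, token_freq[word], "occurrences,", actor_count(word), "actors")
```

In this toy data both words occur equally often, but "threat" is concentrated in fewer actors than "reform", which is the kind of distinction that raw frequency alone would hide.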
There are numerous tools available which provide the means to build, preprocess and analyse corpora (see Anthony 2013; McEnery & Hardie 2012; Weisser 2016). Which of them is best suited for a specific research project depends mainly on the research interest and on which linguistic units need to be available for analysis.
More on corpus linguistics (and also its combination with discourse analysis):
- Sabine Bartsch’s linguisticsweb
- Noah Bubenhofer’s introduction to corpus linguistics (unfortunately, only available in German)
- Paul Baker’s “Using corpora in discourse analysis”
- Stefan Th. Gries’s “Quantitative Corpus Linguistics with R”
Gathering texts can be as easy as downloading hundreds of texts with one click or as complicated as painstakingly digitizing fragile historic documents and using optical character recognition (OCR) to convert the images into searchable text.
One of the most convenient ways to gather a large number of (media) texts is to use a media database like Nexis (cf. http://www.lexisnexis.com/en-us/products/nexis-ab.page). Access to resources of this kind is usually restricted. University libraries, however, are often licensed to use them.
Preprocessing should achieve at least one goal: it should make available those linguistic units which are crucial for answering the research question. These might be morphemes, word forms, phrases etc. Word forms (tokens) are a particularly important linguistic unit since they provide the basis for various kinds of annotation. The following preprocessing steps can be considered something of a standard in DDA (the list of tools mentioned here is far from complete):
- Tokenisation: splitting text into single word forms
  - Tools: TreeTagger (Schmid 1996)
- Lemmatisation: adding the base form to a word form
  - Tools: TreeTagger (Schmid 1996)
- Part-of-speech tagging: assigning a label to each token that identifies the word class it belongs to
  - Tools: TreeTagger (Schmid 1996)
- Named Entity Recognition: assigning a label to each token that annotates proper names, geographical names and names of organisations
  - Tools: Stanford NLP toolkit (Manning et al. 2014)
- Morphological tagging (in addition to POS): assigning additional morpho-syntactic information to each token; for example tense and aspect
  - Tools: RFTagger (Schmid & Laws 2008)
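What the result of such preprocessing can look like is sketched below: each token ends up on its own line, with its annotation layers (here word form, part of speech and lemma) in tab-separated columns. The sketch does not call any of the tools listed above. A hypothetical mini-lexicon stands in for a real tagger and lemmatiser, and the tag names are simplified and invented for illustration.

```python
import re

# Hypothetical mini-lexicon standing in for a real tagger/lemmatiser;
# entries and tag names are simplified and invented for illustration.
TOY_LEXICON = {
    "the": ("DET", "the"),
    "ministers": ("NOUN", "minister"),
    "rejected": ("VERB", "reject"),
    "proposals": ("NOUN", "proposal"),
    ".": ("PUNCT", "."),
}

def tokenize(text):
    """Split text into word forms and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def annotate(text):
    """Return one (word form, POS, lemma) triple per token."""
    rows = []
    for token in tokenize(text):
        # Unknown word forms get a placeholder tag and their own form as lemma.
        pos, lemma = TOY_LEXICON.get(token.lower(), ("UNK", token.lower()))
        rows.append((token, pos, lemma))
    return rows

def to_vertical(rows):
    """One token per line, annotation layers separated by tabs."""
    return "\n".join("\t".join(row) for row in rows)

rows = annotate("The ministers rejected the proposals.")
print(to_vertical(rows))
```

Keeping every layer aligned to the token is what makes later queries across layers possible, regardless of which tool actually produced the annotation.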
Compiling preprocessed texts into a corpus
Technically, any collection of texts can be considered a corpus. Accessing texts that are stored in the directories of a computer and using UNIX tools to extract and count specific portions of them is a perfectly fine way to do corpus analysis. It is, however, not a very popular one, at least not among scholars whose background is firmly rooted in the linguistic or philological side of DDA.
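The same idea, counting matches of a pattern across all plain-text files in a directory, can be sketched in Python instead of UNIX tools. The file names, file contents and search pattern below are invented for the demonstration, and the helper `count_matches` is a hypothetical name, not part of any corpus software.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def count_matches(directory, pattern):
    """Count case-insensitive regex matches per .txt file in a directory."""
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        counts[path.name] = len(regex.findall(text))
    return counts

# Demo with two throwaway files; names and contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text(
        "Crisis talks continue. The crisis deepens.", encoding="utf-8")
    Path(tmp, "b.txt").write_text(
        "No crises were reported.", encoding="utf-8")
    counts = count_matches(tmp, r"\bcris[ie]s\b")

print(dict(counts))
print("total:", sum(counts.values()))
```

For a real project, the temporary directory would be replaced by the folder that holds the gathered texts.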
Ideally, software that is used to manage corpora should be able to handle multiple layers of annotation in a way that enables cross-queries of all layers.
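What a cross-query over several annotation layers means can be illustrated with a small sketch in which every token carries a word form, a part-of-speech tag and a lemma, and a query may constrain any combination of them. The data, the tag names and the `query` helper are assumptions made for this illustration and do not reproduce the query language of any particular corpus tool.

```python
import re

# Hypothetical annotated tokens; the tag names are simplified for illustration.
corpus = [
    {"word": "They", "pos": "PRON", "lemma": "they"},
    {"word": "rejected", "pos": "VERB", "lemma": "reject"},
    {"word": "the", "pos": "DET", "lemma": "the"},
    {"word": "rejects", "pos": "NOUN", "lemma": "reject"},
    {"word": "and", "pos": "CONJ", "lemma": "and"},
    {"word": "reject", "pos": "VERB", "lemma": "reject"},
    {"word": "nothing", "pos": "PRON", "lemma": "nothing"},
]

def query(tokens, **constraints):
    """Return positions of tokens whose layers all fully match the given regexes."""
    hits = []
    for i, token in enumerate(tokens):
        if all(re.fullmatch(rx, token[layer]) for layer, rx in constraints.items()):
            hits.append(i)
    return hits

# Cross-query: the lemma "reject", but only where it is tagged as a verb.
verb_hits = query(corpus, lemma="reject", pos="VERB")
print([corpus[i]["word"] for i in verb_hits])

# Single-layer query for comparison: every token with the lemma "reject".
all_hits = query(corpus, lemma="reject")
print([corpus[i]["word"] for i in all_hits])
```

The cross-query filters out the noun reading that a lemma-only search would include, which is exactly the kind of distinction that requires several layers to be queryable together.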
For more information on specific corpus software, and to gain first experience with it, check out:
- Stefan Evert’s Corpus Workbench. The Corpus Workbench itself is a command line tool. It supports multiple layers of annotation and uses very effective compression, which makes it possible to work with even very large corpora (up to 2 bn tokens).
- Andrew Hardie’s CQPweb. CQPweb is a browser-based frontend for the Corpus Workbench. It is best suited for a multiple-user scenario. For local use, CQPwebInABox is probably the better choice since it is much easier to install.
- Laurence Anthony’s AntConc. AntConc does not natively support multiple layers. You can, however, analyse annotated text if you import it as plain text. This might, however, render queries and the resulting frequency statements quite complex.
The standard methods of corpus linguistics (cf. Wynne 2008) are generating frequency lists, concordancing, calculating co-occurrences, and calculating keywords. Frequency lists for all the available annotation layers can be the first steps in the vertical (or distant) reading of the texts in the corpus. Hermeneutically interesting terms can be inspected more closely by generating their concordances, i.e. the representation of a search term within its context, which can be sorted to reveal patterns of usage (KWIC, short for Key-Word-in-Context). The calculation of co-occurring terms (co-occurrence in corpora is a non-trivial matter, as evidenced in Bartsch 2004; Bartsch & Evert 2014; Evert 2005, 2009) is a very effective way to help with the inspection of the co-text surrounding specific terms. Finally, keyword analysis can give insight into which specific linguistic units are typical of a corpus. Calculating the ‘keyness’ of a linguistic unit (tokens, POS, lemmas etc.) means comparing the relative frequency of this unit in one corpus to its relative frequency in a reference corpus and computing whether the difference is statistically significant (see Demmen & Culpeper 2015 for a good overview of keyness).
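Three of these methods can be sketched compactly in Python: KWIC concordancing, counting co-occurring word forms in a window around a node word, and a keyness score. The two mini-corpora are invented, and the window sizes are arbitrary. The keyness function uses the log-likelihood statistic, which is one commonly used measure and is chosen here as an assumption, since the text above does not prescribe a particular one. Association measures for co-occurrences are left out entirely; the sketch only counts raw neighbours.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokeniser: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, node, width=2):
    """Key-Word-in-Context lines as (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def cooccurrences(tokens, node, window=1):
    """Count word forms within `window` tokens left or right of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one unit, given its frequency in two corpora."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Invented mini-corpora, far too small for meaningful statistics.
study = tokenize("The crisis deepens as the crisis talks fail and the crisis spreads")
reference = tokenize("The talks continue and the weather is fine and the market is calm")

lines = kwic(study, "crisis")
neighbours = cooccurrences(study, "crisis")
score = log_likelihood(study.count("crisis"), len(study),
                       reference.count("crisis"), len(reference))

for left, node, right in lines:
    print(f"{left:>12} | {node} | {right}")
print(neighbours.most_common(3))
print(round(score, 2))
```

A log-likelihood score above 3.84 corresponds to p < 0.05 at one degree of freedom, but with corpora this small the value only illustrates the arithmetic and says nothing about actual usage.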
References
Anthony, Laurence (2013): “A critical look at software tools in corpus linguistics”. In: Linguistic Research 30(2), pp. 141–161.
Bartsch, Sabine (2004): Structural and functional properties of collocations in English: a corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen. (Also: doctoral dissertation, Technische Universität Darmstadt, 2003.)
Bartsch, Sabine; Evert, Stefan (2014): “Towards a Firthian notion of collocation”. In: Abel, Andrea; Lemnitzer, Lothar (eds.): Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern. (= OPAL – Online publizierte Arbeiten zur Linguistik 2/2014) Mannheim: Institut für Deutsche Sprache, pp. 48–61.
Demmen, Jane Elizabeth; Culpeper, Jonathan Vaughan (2015): “Keywords”. In: Biber, Douglas; Reppen, Randi (eds.): The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press, pp. 90–105.
Evert, Stefan (2005): The statistics of word cooccurrences: word pairs and collocations. Stuttgart.
Evert, Stefan (2009): “Corpora and Collocations”. In: Lüdeling, Anke; Kytö, Merja (eds.): Corpus linguistics 2. (= Handbücher zur Sprach- und Kommunikationswissenschaft 29,2) Berlin [et al.]: de Gruyter, pp. 1212–1248.
Fairclough, Norman (2008): Critical discourse analysis. The critical study of language. (= Language in social life series) [Reprint], Harlow, England [et al.]: Longman.
Gee, James Paul; Handford, Michael (2012): “Introduction”. In: Gee, James Paul (ed.): The Routledge handbook of discourse analysis. (= Routledge handbooks in applied linguistics) 1st publ., London [et al.]: Routledge, pp. 1–6.
Gries, Stefan Thomas (2009): “What is Corpus Linguistics?”. In: Language and Linguistics Compass 3(5), pp. 1–17.
Gries, Stefan Thomas (2017): Quantitative corpus linguistics with R: a practical introduction. 2nd ed., New York, NY: Routledge.
Manning, Christopher D.; Surdeanu, Mihai; Bauer, John; et al. (2014): “The Stanford CoreNLP Natural Language Processing Toolkit”. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60. Retrieved from http://www.aclweb.org/anthology/P/P14/P14-5010.
McEnery, Tony; Hardie, Andrew (2012): Corpus linguistics: method, theory and practice. (= Cambridge textbooks in linguistics) 1st publ., Cambridge [et al.]: Cambridge University Press.
McEnery, Tony; Wilson, Andrew (1996): Corpus linguistics. (= Edinburgh textbooks in empirical linguistics) Edinburgh: Edinburgh University Press.
Schmid, Helmut (1996): “TreeTagger”. TreeTagger – a language independent part-of-speech tagger. Retrieved 2011-08-04 from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
Schmid, Helmut; Laws, Florian (2008): “Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging”. In: Proceedings of the 22nd International Conference on Computational Linguistics. Manchester, pp. 777–784.
Stubbs, Michael (2001): Words and phrases: corpus studies of lexical semantics. Oxford [et al.]: Blackwell.
Weisser, Martin (2016): Practical corpus linguistics: an introduction to corpus-based language analysis. Chichester: Wiley Blackwell.
Wodak, Ruth (ed.) (2009): Methods of critical discourse analysis. (= Introducing qualitative methods) 2nd ed., Los Angeles [et al.]: SAGE.