CQPweb

This tutorial aims to show how to make use of the main functions of CQPweb. It is best read completely; if you are looking for a cheat sheet on how to formulate queries, use the built-in one in CQPweb. From within CQPweb, click on the link titled “Simple query language syntax” on the starting page to open it.

screenshot: link to cheat sheet on starting page of CQPweb

In a nutshell

Basically, there are two ways of dealing with the texts comprising a corpus: corpus-based and corpus-driven. The corpus-driven approach assumes no prior knowledge about the contents of the corpus. The first step in this approach is an exploratory look at frequency lists, which tell you which words are most frequent and possibly most important within the corpus. Knowing that, you choose to find out more about a certain word or phrase, which means you are now taking the corpus-based approach. CQPweb allows for both.

In corpus-driven mode, you just tell CQPweb what kind of frequency list you are interested in (read more on how to generate frequency lists).

Corpus-based queries work much like a search engine on the internet. The most basic query is a single word (token), to find out in which documents and sentences it is used. For more precise results, you can of course use search terms consisting of parts of a word or of multiple words. To refine your search even further, CQPweb allows you to query not only tokens but also annotation tags such as part of speech or lemma. Read more on complex search queries.

The result will tell you a) how often your search term was found (frequency of occurrence) and b) in how many different texts (dispersion). Additionally, the hits will be displayed as concordances for further inspection. Read more on frequency, dispersion and concordances.

Discourse analysts are not only interested in how frequently a specific word or multiword unit occurs within the corpus. They are even more interested in the cotext of search terms, since it is the cotext that enriches the (functional) meaning of words. It is therefore crucial for discourse analysts to categorise hits further according to their research interest. The CQPweb interface lets you define sets of categories tailored to your research question and apply them to concordances of hits. After categorising your findings, CQPweb displays a frequency list of the categories.

Using words as search terms presupposes that you know which words to look for. If you don’t know those words yet, or if you would like to get an idea which words are used in the corpus in the first place, you can query the corpus for a list of all the word types used. If you would like an even better idea which of those words might be considered significant for this specific corpus, you can query the corpus for a list of keywords. Both queries lead you to possible single-word or multiword search terms, which in turn lead to insight into the dispersion of these terms within the corpus and the sentences in which they are used.

In more detail

CQPweb offers two different ways of querying a corpus:

  1. The Corpus Elementary Query Language (CEQL)
  2. The Corpus Query Processor (CQP) syntax

The CEQL is less formally complex than the CQP syntax but similarly powerful. This tutorial covers both. Keep in mind, however, that its purpose is to help you use the CQPweb interface for discourse analysis with specialised corpora. Queries that might be helpful for other research interests may not be discussed at all, even if they are well within the capabilities of the CQPweb interface.

One of the fundamental differences between the two is that CEQL allows bare words as search terms, whereas the CQP syntax does not.
Treibhausgas is a valid search term in CEQL; to get the same result in CQP syntax, you need to input the following string, including the brackets and the semicolon:

[word="Treibhausgas"];

Select a corpus for analysis

Your first step is to select a corpus for analysis from the CQPweb start page:

screenshot: corpus selection

Create subcorpora for comparative analysis

One part of corpus analysis is to compare and contrast one portion of a corpus (= subcorpus) with another portion. CQPweb allows for the creation of subcorpora by several criteria, the most notable of which are corpus metadata and the result of a query. To create a subcorpus, click on “create/edit subcorpora” in the left-hand menu.

screenshot: create subcorpora

To create a subcorpus by corpus metadata, choose “corpus metadata” from the dropdown menu after “Define new subcorpus via” and click “Go”.

screenshot: select metadata for subcorpus creation

On the next screen, enter a name for the subcorpus and use the checkboxes to select the metadata by which your subcorpus should be created. Click “Create subcorpus from selected categories”. The next screen displays an overview of all existing subcorpora. Click “Compile” in the field labeled “Frequency list” to make the subcorpus available for all uses, including, for example, the computation of keywords.

screenshot: compilation of frequency list for subcorpus

Frequency, distribution and concordances of specific search terms

After you have selected a corpus, the standard query page opens:
screenshot: input search term

There, you can choose which query language (= query mode in the screenshot) to use. The default selection is the “Simple query language”, CEQL (Corpus Elementary Query Language). For the most basic query, type a word into the text box and click “Start Query”. If subcorpora are available, you can select one from the restriction menu.

The CEQL makes use of simpler wildcards than the CQP syntax: an asterisk, for example, is interpreted as “any number of additional characters” if you choose CEQL. Within the CEQL implementation, the asterisk is transformed into a full regular expression; the CQP syntax operates on regular expressions directly, so there an asterisk is interpreted as “zero or more of the preceding unit”. To find any number of (additional) characters using CQP syntax, you must use the combination “.” (for any character) followed by “*”. Refer to the section “More complex queries …” below for details.
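The mapping between the two wildcard conventions can be sketched in Python’s regular expressions; the function name below is ours for illustration, not part of CQPweb:

```python
import re

# Hedged sketch: how CEQL's "*" wildcard corresponds to the regular
# expression ".*" that CQP operates on ('.' = any character,
# '*' = zero or more of the preceding unit).
def ceql_prefix_to_regex(term):
    """Turn a CEQL prefix query like 'Treibhausgas*' into a
    CQP-style regular expression."""
    return re.escape(term.rstrip("*")) + ".*"

pattern = ceql_prefix_to_regex("Treibhausgas*")  # 'Treibhausgas.*'
print(bool(re.fullmatch(pattern, "Treibhausgase")))           # True
print(bool(re.fullmatch(pattern, "Treibhausgasemissionen")))  # True
print(bool(re.fullmatch(pattern, "Klimagas")))                # False
```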

The query in the screenshot above finds all instances of words beginning with “Treibhausgas”, i. e. any inflected word forms like “Treibhausgase” or “Treibhausgases” but also any compounds like “Treibhausgasemissionen”. The result shows the number of hits, the number of documents with hits, the total number of words, the total number of documents, and the relative frequency of the hits in instances per million words:

screenshot: query result
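The per-million-words figure in the result is plain normalisation: the number of hits divided by the corpus size, scaled to one million tokens. A minimal sketch with invented counts:

```python
# Relative frequency in instances per million words;
# the counts below are invented for illustration.
hits = 10_363
corpus_tokens = 42_000_000
per_million = hits / corpus_tokens * 1_000_000
print(round(per_million, 2))  # about 246.74
```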

If you want to find all inflected forms of a word but no compounds you can make use of the lemma search (also called ‘headword search’ in CEQL):

  • In CEQL: Put curly brackets around your one word search term. {Treibhausgas} will find any instance of “Treibhausgas”, “Treibhausgase”, “Treibhausgases”, “Treibhausgasen”
  • In CQP syntax: state the annotation layer explicitly, followed by an equals sign and your search term in quotes. Don’t forget to put it all in square brackets and, equally important, to finish with a semicolon. [lemma="Treibhausgas"]; will output the same result as {Treibhausgas} in CEQL.

Restricting the search by categories like text source

If you want to use only a part of the corpus, you can restrict the search by categories like text source or publication date by clicking on “Restricted query” in the left menu. The screenshot below shows how to restrict the same query to only two of the available sources (“Der Spiegel” and “Wirtschaftsblatt”).

screenshot: set restrictions for query

Note that the total frequency of words and texts is given in accordance with the restriction in the result:

screenshot: result for restricted query

Frequency breakdown of queries resulting in more than one match

In order to get a list of how often each of the matching wordforms occurs in the corpus, choose “Frequency breakdown” from the drop-down menu in the upper right of the concordance screen.

screenshot: frequency breakdown of queries resulting in more than one match

Distribution of matches within the corpus

In order to find out how often, and in which documents, a search term occurs, select “Distribution” from the menu in the upper right of the concordance screen. Note in the result below that every element of the chosen category (here: text_date) is listed, even those with 0 hits. Also note that besides the number of files with matches for the chosen category, the frequency of the matches per million words in that category is given:

screenshot: distribution of matching word within the corpus

You can also put cross constraints on the distribution query and get, for example, a list of matches per year and source. To achieve this, use both the “Categories” menu and the “Category for crosstabs” menu in the upper left.

screenshot: distribution of matching word within the corpus

Categorising concordances

To make use of the categorisation feature of CQPweb, choose “Categorise” from the drop-down menu on the concordance screen and click “Go”. You are presented with a form in which you can create a customised categorisation schema.

screenshot: custom categories

After you have filled in the form, click “Submit” to go to the categorisation screen. There you can annotate individual concordances with any category you defined. Do not forget to save your categorisations by clicking either “Save values for this page and go to next” or “Save values and leave categorisation mode” at the bottom of the page.

screenshot: custom categories

CQPweb saves the set of categories together with the concordances so you can continue categorising at any time. To do this, choose “Categorised queries” from the left-hand menu on the standard screen. Clicking on the name of a set of categories takes you back to categorisation mode. The right-hand drop-down menu lets you add new categories to a set or delete the set together with the concordances it was created for. It also lets you separate the concordances you annotated by annotation, which in turn enables you to make statements about the specific use of terms within your corpus.

Collocations

To get a list of words that occur in the neighbourhood of a search term, first carry out a search, for example _ADJA Treibhausgas* in CEQL mode, which finds all occurrences of words starting with “Treibhausgas” preceded by an adjective. On the screen showing the resulting concordances, choose “Collocations” from the drop-down menu in the upper right. On the next page, you can choose

  • the span of words around the search term to be considered for the computation of collocate frequency
  • and whether lemma and/or pos tag information should be considered for the computation of collocates.

Click “Create collocation database” to start the computation.

screenshot: choose settings for collocations

CQPweb will create a collocation database consisting of all the words that occur together with the search term within the span you specified. The frequency with which each of those words occurs within the span (called “observed frequency”) is compared to the frequency with which that word would be expected to occur. Finally, a test for statistical significance is applied to the difference between observed and expected frequencies for each word that occurs within the span. (Read more on this in Stefan Evert’s paper “Corpora and collocations”.)
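The comparison of observed and expected frequencies can be sketched as a log-likelihood ratio over the usual two-by-two contingency table, following the approach described in Evert’s paper; CQPweb’s exact computation may differ, and all counts below are invented for illustration:

```python
import math

# Hedged sketch: log-likelihood ratio for one collocate candidate.
def log_likelihood(o11, f_node, f_collocate, corpus_size):
    """o11: observed co-occurrences within the span,
    f_node / f_collocate: marginal frequencies,
    corpus_size: total number of tokens."""
    n = corpus_size
    o12 = f_node - o11                       # node without collocate
    o21 = f_collocate - o11                  # collocate without node
    o22 = n - f_node - f_collocate + o11     # neither
    ll = 0.0
    for o, row, col in ((o11, f_node, f_collocate),
                        (o12, f_node, n - f_collocate),
                        (o21, n - f_node, f_collocate),
                        (o22, n - f_node, n - f_collocate)):
        expected = row * col / n   # expected frequency under independence
        if o > 0:
            ll += o * math.log(o / expected)
    return 2 * ll

print(round(log_likelihood(o11=30, f_node=500, f_collocate=1200,
                           corpus_size=1_000_000), 2))
```

The larger the score, the less plausible it is that node and collocate co-occur by chance alone.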

screenshot: choose settings for collocations

More complex queries in CEQL and CQP syntax

Cross querying various annotation layers

For example: adjective or adverb + specific noun or adjective

⇒ “gefährliche Treibhausgase”

⇒ “möglichst treibhausgasneutrale”

Finding specific words or word combinations is only a first step. The CEQL query Treibhausgas* mentioned above yields 10363 hits. After going through some of the results, you find that for your research interest occurrences like “gefährliche Treibhausgase” or “möglichst treibhausgasneutrale” are the most interesting. From these instances you infer that what you are looking for is not every occurrence of Treibhausgas* but occurrences of one of the following patterns: “a word tagged as an adjective” + “Treibhausgas*” or “a word tagged as an adverb” + “Treibhausgas*”.

To accomplish this, you need to cross query the annotation layers “word” and “pos”, i. e. find any words that start with “Treibhausgas” and are preceded by an adjective or adverb. In other words: instances of tokens containing the character combination “Treibhausgas” preceded by a token annotated with a pos class containing the character combination “AD”. The CEQL syntax for this is _AD* Treibhausgas*. Using CQP syntax, you need to specify the annotation layers and the relation between the two queries, resulting in the query [pos="AD.*"] [word="Treibhausgas.*"%c];. Both queries yield the same 1590 hits. Provided that this combination is indeed helpful with regard to your research interest, you should be much better off with this result set.

screenshot: matches for the search query adjective or adverb followed by any word starting with treibhausgas

Matching any or a specific number of something

Both query languages enable you to look for any or a specific number of something, be it a character, a combination of characters or a match from a character class. As you have seen above, the asterisk means “zero or more characters” in CEQL, while in CQP syntax it means “zero or more of the preceding”, i. e. in CQP syntax the asterisk is purely a repetition operator; it does not also stand for any character. In CQP syntax you specify “any character” with “.”, the regular-expression character representing any character. Therefore, “.*” in CQP syntax means the same as “*” in CEQL.

The following overview illustrates the difference between wildcards and repetition operators in CEQL and CQP syntax:

  • “.”: in CEQL, matches the full stop “.” literally; in CQP syntax, any ASCII character (matches for example “a” and “z”, but not “ö”).
  • “?”: in CEQL, any ASCII character; in CQP syntax, zero or one of the preceding element: “bb?” matches “b” and “bb”.
  • “*”: in CEQL, zero or more characters (including UTF-8): “*rger” matches “Bürger”, “Ärger”, “Wittenberger” and would also match “rger”; in CQP syntax, zero or more of the preceding element.
  • “+”: in CEQL, one or more characters (including UTF-8): “+rger” matches “Bürger”, “Ärger”, “Wittenberger” but not “rger”; in CQP syntax, one or more of the preceding element.
  • “{2,3}”: two or three of the preceding element in both languages. In CEQL, (_NE){2,3} matches two to three words tagged “NE” (named entity) by the TreeTagger; [pos="NE"]{2,3} is the CQP syntax equivalent. Both find, for example, “Manfred Stolpe”, “St. Pauli”, “Los Angeles”.

Grouping

Parentheses are used to group things together. Suppose, for example, that you chose the CEQL query _ADJA Treibhausgas* to find all occurrences of words starting with “Treibhausgas” preceded by an adjective, and by reading the resulting concordances you realise that you are most interested in occurrences dealing with producers of greenhouse gas. Patterns like “Produzenten des gefährlichen Treibhausgases” and “Produzenten gefährlicher Treibhausgase” are especially interesting. You therefore need to search for “Produzent*” followed by an optional article, followed by an adjective, followed by “Treibhausgas*”. Since an asterisk normally represents “zero or more characters” in CEQL, the following query will not yield the result you are looking for: “Produzent* _ART* _ADJA Treibhausgas*”. You need to group “_ART” by putting parentheses around it, which changes CEQL’s interpretation of the asterisk from “zero or more characters” to “zero or more of the preceding”. The complete query reads: Produz* (_ART)* _ADJA Treibhausgas*.

Distance Operators: Matching search terms with something on the side or in between

Distance operators allow for defining a span in which something must occur to be considered a match. In contrast to generating a list of collocations, search queries using distance operators generate concordances of items matching the complete search term consisting of (parts of) words (pos tags or lemmas) and constraints in regard to the distance between them.

Example: Treibhausgas* <<5>> Klima* finds all instances of words starting with “Treibhausgas” occurring within 5 tokens of a word starting with “Klima”. Remember that search queries are case insensitive. The above query will find “klimaschädliche Treibhausgase” as well as “Klimaschutz und Treibhausgase” and “(Gegenstand der Tagung ist der Abbau von) Treibhausgasen, um den Klimawandel (zu bremsen)”.

You can choose a distance

  • surrounding the search term by specifying the number of tokens within angled brackets (see example above)
  • to the left of the search term by specifying the number of tokens within angled brackets pointing to the left (for example Treibhausgas* <<5<< Klima*)
  • to the right of the search term by specifying the number of tokens within angled brackets pointing to the right (for example Treibhausgas* >>5>> Klima*)

This also works across annotation layers: Treibhausgas* <<5>> _VVFIN finds finite verbs co-occurring with words starting with “Treibhausgas” within a range of 5 tokens.

Distance operators do not allow you to choose what kind of tokens may occur in between. If you want to control this, you can formulate your query in a more elaborate fashion:
{Treibhausgas} (_A* | _C* | _F* | _I* | _K* | _N* | _P* | _T* | _VVI* | _VVP* | _VA* | _VM* ){0,} _VVFIN
The above query matches word forms belonging to the lemma “Treibhausgas”, followed by any number of words (but not punctuation characters), followed by a finite verb. Example matches include:

  • “Treibhausgas durch Photosynthese aufnehmen”
  • “Treibhausgasen vorschreibt”
  • “Treibhausgasen und Ausbildung bereits verlässliche Daten vorliegen”

Explanation:

  1. {Treibhausgas}: Putting curly brackets around a word tells CQPweb to treat the word as lemma. It will match any word form which was tagged with this lemma.
  2. (_A* | _C* | _F* | _I* | _K* | _N* | _P* | _T* | _VVI* | _VVP* | _VA* | _VM* ): The vertical line “|” serves as OR operator. The round brackets “(” and “)” are used for grouping. The terms between the vertical lines cover all POS tags with the exception of punctuation characters and finite verbs.
  3. {0,}: Two numbers separated by a comma in curly brackets define a range; omitting the second number removes the upper bound. “{0,}” thus means “zero or more of the preceding element”. This is why the preceding part needs to be grouped.
  4. _VVFIN: matches any word that has been tagged as a finite verb.

Generating lists of frequency and distribution without a search term

Words, POS, lemmas

screenshot: frequency lists options

On the query page, choose “Frequency Lists”. On the next page, choose whether you want the frequencies of words, POS or lemmas displayed. If subcorpora are available, you can also choose a specific subcorpus for which to generate the list. Click “Show frequency list”.

For a more specific query you can add several filters:

  • Set a pattern the words, POS or lemmas should begin with, end with, contain or match exactly
  • Set the number of times matches must occur to be displayed
  • Set the number of items shown per page
  • Set the criterion by which to sort the results: most frequent, least frequent, alphabetically

Keywords

To generate a list of keywords in CQPweb, start by clicking on “Keywords” from the menu on the standard screen. There you can choose

  • which frequency lists (= corpora) to compare (make sure to choose two different lists here)
  • whether to compare words, pos tags, or lemmas
  • the minimum frequency a word needs in order to be considered in the comparison
  • the test for statistical significance (Log-likelihood is the default and should be fine)
  • the significance threshold (0.01% is the default and should be fine for most queries)

You can also simply compare which words, pos tags, or lemmas occur in only one of the two lists, by choosing the list you are interested in and clicking “Compare lists” at the bottom of the screen.

screenshot: keywords options

After you have made your choices, click “Calculate keywords” to display the result of the comparison. It is a good idea to choose “Show only positive keywords” from the drop-down menu if the second list is not of interest in itself, i. e. if you only want to get an idea which words, pos tags, or lemmas are typical of a whole corpus. However, if you compare one part of a corpus with another part of the same corpus (by date, for example), you will want to know which words, pos tags, or lemmas are key in either list.

The display of the results gives information on

  • the word (or pos tag or lemma, depending on your choice)
  • the frequency of this word in list 1 and in list 2
  • whether this word is key for list 1 or list 2 (“+” signifies that it is key for list 1, “-” signifies that it is key for list 2)
  • the log-likelihood value
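The log-likelihood score reported here can be sketched as the common two-list keyness calculation, comparing a word’s frequency in each list with what would be expected if the lists did not differ; CQPweb’s exact implementation may vary, and the function name and counts below are ours for illustration:

```python
import math

# Hedged sketch: log-likelihood keyness for one word across two lists.
def keyness_ll(freq1, size1, freq2, size2):
    """freq1/freq2: word frequency in list 1 / list 2;
    size1/size2: total token count of each list."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

# A word that is relatively much more frequent in list 1 scores high:
print(round(keyness_ll(120, 500_000, 40, 1_000_000), 2))  # about 116.16
```

If the relative frequencies in the two lists are identical, the score is 0; the “+”/“-” sign in the result table then simply records which list the word is more frequent in.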

screenshot: display positive keywords

Available Tags | POS, NER etc.

STTS Tagset

The TreeTagger (Schmid 1995) is used for annotating German texts with pos and lemma information. The underlying tagset is the Stuttgart-Tübingen Tag-Set (STTS). The tags and their corresponding parts of speech are briefly explained in the following table (taken from the Institute for Natural Language Processing (Institut für Maschinelle Sprachverarbeitung), Stuttgart). For an English translation go to the ISOcat Data Category Registry.

POS Description Examples
ADJA attributive adjective [das] große [Haus]
ADJD adverbial or predicative adjective [er fährt] schnell, [er ist] schnell
ADV adverb schon, bald, doch
APPR preposition; left part of circumposition in [der Stadt], ohne [mich]
APPRART preposition with article im [Haus], zur [Sache]
APPO postposition [ihm] zufolge, [der Sache] wegen
APZR right part of circumposition [von jetzt] an
ART definite or indefinite article der, die, das, ein, eine
CARD cardinal number zwei [Männer], [im Jahre] 1994
FM foreign-language material [Er hat das mit “] A big fish [” übersetzt]
ITJ interjection mhm, ach, tja
KOUI subordinating conjunction with “zu” and infinitive um [zu leben], anstatt [zu fragen]
KOUS subordinating conjunction with clause weil, dass, damit, wenn, ob
KON coordinating conjunction und, oder, aber
KOKOM comparative conjunction als, wie
NN common noun Tisch, Herr, [das] Reisen
NE proper noun Hans, Hamburg, HSV
PDS substituting demonstrative pronoun dieser, jener
PDAT attributive demonstrative pronoun jener [Mensch]
PIS substituting indefinite pronoun keiner, viele, man, niemand
PIAT attributive indefinite pronoun without determiner kein [Mensch], irgendein [Glas]
PIDAT attributive indefinite pronoun with determiner [ein] wenig [Wasser], [die] beiden [Brüder]
PPER irreflexive personal pronoun ich, er, ihm, mich, dir
PPOSS substituting possessive pronoun meins, deiner
PPOSAT attributive possessive pronoun mein [Buch], deine [Mutter]
PRELS substituting relative pronoun [der Hund ,] der
PRELAT attributive relative pronoun [der Mann ,] dessen [Hund]
PRF reflexive personal pronoun sich, einander, dich, mir
PWS substituting interrogative pronoun wer, was
PWAT attributive interrogative pronoun welche [Farbe], wessen [Hut]
PWAV adverbial interrogative or relative pronoun warum, wo, wann, worüber, wobei
PAV pronominal adverb dafür, dabei, deswegen, trotzdem
PTKZU “zu” before infinitive zu [gehen]
PTKNEG negation particle nicht
PTKVZ separated verb particle [er kommt] an, [er fährt] rad
PTKANT answer particle ja, nein, danke, bitte
PTKA particle with adjective or adverb am [schönsten], zu [schnell]
TRUNC first element of a truncated compound An- [und Abreise]
VVFIN finite verb, full [du] gehst, [wir] kommen [an]
VVIMP imperative, full komm [!]
VVINF infinitive, full gehen, ankommen
VVIZU infinitive with “zu”, full anzukommen, loszulassen
VVPP past participle, full gegangen, angekommen
VAFIN finite verb, auxiliary [du] bist, [wir] werden
VAIMP imperative, auxiliary sei [ruhig !]
VAINF infinitive, auxiliary werden, sein
VAPP past participle, auxiliary gewesen
VMFIN finite verb, modal dürfen
VMINF infinitive, modal wollen
VMPP past participle, modal gekonnt, [er hat gehen] können
XY non-word containing special characters 3:7, H2O, D2XW3
$, comma ,
$. sentence-final punctuation . ? ! ; :
$( other punctuation; sentence-internal – [,]()

Tagset RFTagger (German)

Abbreviations:
Degree: Comp = Comparative; Pos = Positive; Sup = Superlative
Case: Nom = Nominative; Gen = Genitive; Dat = Dative; Acc = Accusative
Number: Sg = Singular; Pl = Plural
Gender: Fem = Feminine; Masc = Masculine; Neut = Neuter

Each entry below gives the POS tag, its possible attribute values, and an example annotation.

ADJA [= attributive adjective]

Degree: Comp Pos Sup

Case: Nom Gen Dat Acc * (unspecified)

Number: Sg Pl *

Gender: Fem Masc Neut *

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.” (Alexis, Willibald: Herr von Sacken. In: Deutsches Taschenbuch auf das Jahr 1837. Berlin 1837)

Word to annotate: “große”

Annotation: große_ADJA.Pos.Acc.Sg.Fem

ADJD [= adjective with predicative or adverbial usage]

Degree: Comp Pos Sup

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Words to annotate: “Vermuthlich” and “absichtlich”

Annotation: Vermuthlich_ADJD.Pos

absichtlich_ADJD.Pos

ADV [= adverb]

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Word to annotate: “wieder”

Annotation: wieder_ADV

APPO [= postposition]

Case: Gen Dat Acc *

APPR [= preposition]

Case: Gen Dat Acc *

APPRART [= preposition with incorporated article]

Case: Nom Gen Dat Acc *

Number: Sg Pl *

Gender: Fem Masc Neut *

APZR [= circumposition (right part)]

ART [= article]

Type: Def (definite) Indef (indefinite)

Case: Nom Gen Dat Acc *

Number: Sg Pl *

Gender: Fem Masc Neut *

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Word to annotate: “die”

Annotation: die_ART.Def.Acc.Sg.Fem

CARD [= cardinal number]

CONJ [= conjunction]

Type: Comp (comparative) Coord (coordinating) SubFin (subordinating with finite clause) SubInf (subordinating with infinitive)

FM [= foreign word]

ITJ [= interjection]

N [= noun]

Type: Name (proper name) Reg (regular noun)

Case: Nom Gen Dat Acc *

Number: Sg Pl *

Gender: Fem Masc Neut *

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Word to annotate: “Straße”

Annotation: Straße_N.Reg.Acc.Sg.Fem

PART [= particle]

Type: Ans (answer) Deg (degree) Neg (negation) Zu (“zu” particle) Verb (separated verb particle)

PRO [= pronoun]

Type: Dem (demonstrative) Indef (indefinite) Inter (interrogative) Pers (personal) Refl (reflexive) Rel (relative)

Usage: Attr (attributive) Subst (substituting) Poss (possessive)

Person: 1 2 3 * – (not applicable)

Case: Nom Gen Dat Acc *

Number: Sg Pl *

Gender: Fem Masc Neut *

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Word to annotate: “man”

Annotation:

man_PRO.Indef.Subst.Nom.Sg.*

PROADV [= pronominal adverb]

Type: Dem (regular: dabei damit) Inter (interrogative or relative: wie wozu)

SYM [= symbol]

Type: Pun (punctuation) Quot (quotation) Paren (parenthesis) Other

Subtype: Left Right Colon Comma Sent (sentential punctuation) Hyph (hyphen) Slash Aster (asterisk) Cont (continuation periods) Auth (author) XY

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Word to annotate: “.”

Annotation:

._SYM.Pun.Sent

TRUNC [= truncated word form]

POS: Adj Noun Verb –

VFIN [= finite verb]

Type: Aux (auxiliary) Mod (modal) Full (full verb)

Person: 1 2 3

Number: Pl Sg

Tense: Past Pres

Mood: Ind (indicative) Subj (subjunctive)

Sentence: “Vermuthlich mied man absichtlich wieder die große Straße.”

Word to annotate: “mied”

Annotation:

mied_VFIN.Full.3.Sg.Past.Ind

VIMP [= imperative verb]

Type: Aux (auxiliary) Mod (modal) Full (full verb)

Person: 1 2 3

Number: Pl Sg

VINF [= infinitival verb]

Type: Aux (auxiliary) Mod (modal) Full (full verb)

Subtype: zu (with “zu” infix) – (no infix)

VPP [= participle verb]

Type: Aux (auxiliary) Mod (modal) Full (full verb)

Subtype: Prp (present participle) Psp (past participle)

Tagset for English texts (Penn Treebank Tagset)

English texts are annotated with pos and lemma information by the POS tagger from the Stanford CoreNLP Natural Language Processing Toolkit (Manning et al. 2014), which uses the Penn Treebank Tagset. An overview of the available tags and their corresponding parts of speech is available at Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM).

Stanford NER (German)

We use the 4-class model of the Stanford NER, which annotates Location, Person, Organization and Misc. The following tags are available:

  • I-LOC: Location
  • I-PER: Person
  • I-ORG: Organisation
  • I-MISC: Miscellaneous

Stanford NER (English | 7 class)

English texts are usually tagged with the 7-class model of the Stanford NER, which provides the following tags:

  • LOCATION
  • ORGANIZATION
  • DATE
  • MISC
  • NUMBER
  • PERSON
  • DURATION
  • PERCENT
  • ORDINAL
  • MONEY
  • SET
  • TIME

Permanent link to this article: https://discourselab.de/tutorials/corpusanalysis/cqpweb/