- Tools for Text Gathering
- Tools for Preprocessing
- Tools for Corpus Management and Corpus Analysis
While most corpus analysis tools provide access to texts through some kind of frequency analysis, they differ in features and scope. Not every tool is suitable for every task, which is why we test the available tools and software for their suitability for discourse analysis and provide access to those we deem best.
The categories “Web Service” and “Local Use” reflect whether users have to install and/or configure a tool before they can use it. Tools in the category “Web Service” are ready for use over the internet without any installation. Any tool that is not ready for use over the internet is considered a tool for local use, even though it might run as a (local) web service.
Tools for Text Gathering
Wget is a command-line tool for retrieving files from the internet via HTTP, HTTPS, or FTP. It was developed for Linux, but Windows and Mac versions are also available. Check the homepage for details on download and installation.
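To illustrate how Wget might be used for text gathering, the following minimal sketch assembles a recursive download command in Python. The function name, the output directory `corpus_raw`, the default depth, and the accepted file suffixes are illustrative assumptions; only the Wget options themselves (`--recursive`, `--level`, `--no-parent`, `--accept`, `--wait`, `--directory-prefix`) are standard.

```python
from typing import List


def build_wget_command(url: str, out_dir: str = "corpus_raw",
                       depth: int = 2, wait_seconds: int = 1) -> List[str]:
    """Assemble a Wget call that mirrors a small part of a website.

    All defaults are illustrative assumptions, not recommendations.
    """
    if depth < 0:
        raise ValueError("depth must not be negative")
    return [
        "wget",
        "--recursive",                    # follow links found in retrieved pages
        f"--level={depth}",               # limit the recursion depth
        "--no-parent",                    # never ascend above the start directory
        "--accept=html,htm,txt",          # keep only files with these suffixes
        f"--wait={wait_seconds}",         # pause between requests
        f"--directory-prefix={out_dir}",  # store everything below this folder
        url,
    ]


command = build_wget_command("https://example.org/articles/")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where Wget is installed; the sketch itself only builds the command and does not touch the network.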
Tools for Preprocessing
WebLicht | Preprocessing: Tokenization, POS-tagging, Lemmatisation, NER-Tagging, Parsing etc.
Get started on the WebLicht homepage
TreeTagger | Tokenization, POS-tagging, Lemmatisation
TreeTagger can handle many languages, from Bulgarian to Swahili. A complete list is provided on the TreeTagger homepage. Apart from POS-tagging and lemmatisation, it also does chunking, but only for German, French, Spanish, and English. It is a command-line tool best suited for Linux. There is, however, a Windows version (with GUI) available via the homepage.
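Tagger output usually has to be read back in before it can be analysed. The sketch below assumes the tagger was run so that each output line holds three tab-separated columns (token, POS tag, lemma), as is the case with the `-token` and `-lemma` options; the names `Token` and `parse_treetagger_output` and the sample lines are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_treetagger_output(text: str) -> List[Token]:
    """Turn three-column tagger output into a list of Token tuples.

    Lines that do not have exactly three tab-separated fields
    (blank lines, passed-through markup tags) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Illustrative sample, not actual tagger output from a real corpus.
sample = "The\tDT\tthe\ncats\tNNS\tcat\n<s>\nsleep\tVVP\tsleep\n"
for token in parse_treetagger_output(sample):
    print(token.word, token.pos, token.lemma)
```

Keeping the three columns in a named tuple makes later steps, such as counting lemmas per POS tag, easier to read than working with raw index positions.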
Stanford CoreNLP | Tokenization, POS-tagging, Lemmatisation, NER-Tagging, Parsing etc.
Stanford CoreNLP handles multiple languages (check the homepage for a complete list). It is a command-line tool written in Java that can also be configured as a web service.
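When CoreNLP runs as a web service, it can be queried from a script. The following sketch assumes a CoreNLP server is already running locally on port 9000 (its usual default) and returns JSON; the function names and the chosen annotators are illustrative assumptions, and the sample response in the test data is hand-written rather than real server output.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(text: str, server: str = "http://localhost:9000",
                  annotators: str = "tokenize,ssplit,pos,lemma") -> Request:
    """Build a POST request carrying the text and the annotation properties."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    url = server + "/?" + urlencode({"properties": json.dumps(properties)})
    return Request(url, data=text.encode("utf-8"), method="POST")


def extract_tokens(response: Dict) -> List[Tuple[str, str, str]]:
    """Flatten the JSON response into (word, POS tag, lemma) triples."""
    return [
        (token["word"], token["pos"], token["lemma"])
        for sentence in response["sentences"]
        for token in sentence["tokens"]
    ]


def annotate(text: str, **kwargs) -> List[Tuple[str, str, str]]:
    """Send the text to the server and return the flattened tokens.

    Requires a running server; nothing here starts one.
    """
    with urlopen(build_request(text, **kwargs)) as reply:
        return extract_tokens(json.load(reply))
```

Calling `annotate("Discourse analysis needs good tools.")` only works while the server is up; `build_request` and `extract_tokens` can be tried without one.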
WebAnno | Manual and automatic annotation
Get started on one of your projects (you need to be a member of a study or research group to get access).
WebAnno is an annotation tool that allows for a wide range of annotations. It also handles project management, user management, and inter-annotator agreement, and is capable of managing rather complex annotation projects. It is available as a standalone version for local use and as a web-based system for use on a server.
Tools for Corpus Management and Corpus Analysis
Voyant Tools | Word clouds and visualisation, simple queries
CATMA | Manual annotation, simple and complex queries
Get started on the CATMA homepage
CWB & CQPweb | Complex queries utilizing metadata on document and token level
The IMS Corpus Workbench (CWB) is a very capable corpus management tool that runs on Linux without a GUI. CQPweb turns it into a web-based system that can be used and (for the most part) managed through any web browser. For easy local use, the combination of CWB and CQPweb is now also available as a virtual machine. Check out the homepage for details on download and installation. For using CWB/CQPweb in discourse analysis, check out our tutorial.
Get started on one of our pre-built corpora.
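As a rough illustration of a query that combines token-level and document-level information, the sketch below builds a CQP query string that searches for a lemma and restricts the matches by a text attribute. It assumes a corpus encoded in CWB with a token attribute `lemma` and a structural attribute such as `text_year`; both attribute names, like the function name, are hypothetical and depend entirely on how a given corpus was encoded.

```python
def build_cqp_query(lemma: str, attribute: str, value: str) -> str:
    """Build a CQP query: one token matched by lemma, restricted by a
    global constraint on a structural attribute of the match.

    Note that CQP treats the quoted lemma as a regular expression.
    """
    for part in (lemma, value):
        # Keep the sketch simple: refuse values that would need escaping.
        if '"' in part or "\\" in part:
            raise ValueError("quotes and backslashes are not supported here")
    if not attribute.replace("_", "").isalnum():
        raise ValueError("attribute names may only contain letters, digits, _")
    return f'[lemma="{lemma}"] :: match.{attribute}="{value}";'


# Hypothetical attribute name; check the corpus documentation for real ones.
print(build_cqp_query("migration", "text_year", "2015"))
```

The part before `::` describes the token pattern, and the part after it is a global constraint evaluated on the region that contains the match. Whether such a constraint is available depends on the corpus setup.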