Corpus

Thanks to digitisation and worldwide networking over the Internet, electronic corpora have emerged, and continue to emerge, for a wide variety of languages. The British National Corpus (BNC), containing about 100 million words and constructed between 1991 and 1994, can be seen as a model for many corpora. The BNC is a balanced corpus that is annotated, lemmatised and part-of-speech tagged.

Since then, many digitisation and corpus projects have emerged, and new ones continue to appear. For German, two projects in particular deserve mention. On the one hand there is COSMAS, the huge corpus of the Institut für deutsche Sprache (IdS) in Mannheim, comprising billions of words (some of them from Switzerland). On the other hand there is the German part of Project Gutenberg, a community effort to collect literary texts by more than 800 authors whose works are no longer covered by copyright.

The Mannheim Corpus is the largest collection of German texts in the world. However, it consists mainly of newspaper texts and is therefore not balanced enough for many lexicographical and other linguistic purposes. The Gutenberg database contains only literary texts, and exclusively older ones (by authors who died more than 70 years ago).

Most other digital corpora of German have been built by research groups whose main focus is computational linguistics. They often contain only contemporary texts, drawn from newspaper archives or the Internet.

Before the Swiss Text Corpus there was no digital corpus of German texts from Switzerland. The Swiss Text Corpus has closed this gap and offers a balanced empirical data resource for lexicographical and other linguistic research.