
Text digitisation is a field that has developed dynamically over the last few years. Big commercial projects like Google Books set the pace of technological, legal and political change in the field; copyright issues in particular are contested. Libraries and private companies nowadays invest heavily in the retro-digitisation of their documents. Text digitisation plays a central role here, and it is important that universities and the research community stand up for quality assurance in the field.

The expertise the team of the Swiss Text Corpus has gathered, especially in the domains of text recognition and annotation, can also prove useful to third parties.

The Swiss Text Corpus relies mainly on collaboration and open standards in order to use the best technology available for its purposes.

Open standards in text and corpus technology are indispensable for the provision of sustainable digital resources. Like many other corpus projects, the Swiss Text Corpus annotates the XML versions of its documents according to the guidelines of the Text Encoding Initiative (TEI). The scans with the underlying OCR text are stored as archivable PDFs (PDF/A according to ISO 19005-1).
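A TEI-conformant corpus document follows a common overall skeleton. The following minimal example is purely illustrative; the title and content are invented and do not come from the actual corpus files:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <!-- invented example title -->
        <title>Beispieltext, 1923</title>
      </titleStmt>
      <publicationStmt>
        <p>Digitised for a text corpus.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Bibliographic details of the printed original.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The running text of the document goes here.</p>
    </body>
  </text>
</TEI>
```

The teiHeader records the metadata of the source document, while the text element holds the transcribed content; project-specific annotation is layered on top of this skeleton.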

We mainly use open-source software and/or software from our partner projects for the processing and publication of the corpus on the Internet. The search interface of the Swiss Text Corpus is based on the web framework Django. The linguistic search engine that indexes the corpus texts in the background is DDC, developed by our partner project DWDS in Berlin.

We are very open to any kind of know-how exchange in the field of corpus technology.

From the very beginning, the structure of the Swiss Text Corpus was designed to cover the vocabulary of 20th-century Standard German in Switzerland as broadly as possible. The corpus consists of printed and typewritten texts from very different forms of production and publication. It is balanced according to time, form and content criteria:

  • Text class: formal criterion
  • Quarter of century: time criterion
  • Domain: content criterion

With this structure, the Swiss Text Corpus is a balanced data resource for all kinds of linguistic research questions.

The Swiss Text Corpus contains the following amounts of text (according to the criteria mentioned above):

                         1900-1924   1925-1949   1950-1974   1975-1999   2000-2018        total
  functional texts   d       1'042       1'465         969       1'417       1'238        6'131
                     w   1'122'547   1'235'998   1'165'808   1'036'198     944'778    5'505'329
  factual texts      d         167         433         804         276         898        2'578
                     w   1'447'644   2'043'191   1'943'462   1'846'198     985'400    8'265'832
  journalistic texts d         833       1'107         993       1'929       1'267        6'129
                     w     501'527   1'006'662     970'560   1'117'639     973'282    4'569'670
  fiction            d         188          50         159          59          40          496
                     w   1'116'823   1'248'864   1'122'446   1'147'943     942'760    5'578'836
  total              d       2'230       3'055       2'925       3'681       3'443       15'334
                     w   4'188'541   5'534'715   5'202'276   5'147'978   3'845'700   23'919'667

d = documents
w = words (tokens minus punctuation characters)
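The internal consistency of the document counts above can be cross-checked with a few lines of code. The figures below are transcribed from the table; this is a small sketch, not project code:

```python
# Document counts (d) per text class and quarter-century,
# transcribed from the table above.
docs = {
    "functional texts":   [1042, 1465, 969, 1417, 1238],
    "factual texts":      [167, 433, 804, 276, 898],
    "journalistic texts": [833, 1107, 993, 1929, 1267],
    "fiction":            [188, 50, 159, 59, 40],
}

# Stated row totals from the table's "total" column.
row_totals = {
    "functional texts": 6131,
    "factual texts": 2578,
    "journalistic texts": 6129,
    "fiction": 496,
}

# Each row sums to its stated total ...
for name, counts in docs.items():
    assert sum(counts) == row_totals[name], name

# ... and the column sums give the per-period document totals.
period_totals = [sum(col) for col in zip(*docs.values())]
print(period_totals)       # → [2230, 3055, 2925, 3681, 3443]
print(sum(period_totals))  # → 15334, the grand total of documents
```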

Thanks to digitisation and worldwide networking over the Internet, we have seen, and still see, the emergence of electronic corpora for a wide variety of languages. The British National Corpus (BNC), containing about 100 million words and constructed between 1991 and 1994, can be seen as a model for many corpora. The BNC is a balanced corpus, annotated, lemmatised and part-of-speech tagged.

Since then a lot of digitisation and corpus projects have emerged, and new projects are still emerging. For German, two projects in particular have to be mentioned. On the one hand there is COSMAS, the huge corpus of the Institut für deutsche Sprache (IdS) in Mannheim, with billions of words (some also from Switzerland). On the other hand there is the German part of Project Gutenberg, a community effort collecting literary texts by more than 800 authors whose works are no longer covered by copyright.

The Mannheim corpus is the biggest collection of German texts in the world. However, it mainly contains newspaper texts and is therefore not balanced enough for many lexicographical and other linguistic purposes. The Gutenberg database contains literary texts only, and exclusively older ones (by authors who died more than 70 years ago).

Most other digital corpora of German have been built by research groups with a main focus on computational linguistics. They often contain contemporary texts only, coming from newspaper archives or the Internet.

Before the Swiss Text Corpus there was no digital corpus of German texts from Switzerland. The Swiss Text Corpus has closed this gap and offers a balanced empirical data resource for lexicographical and other linguistic research.
