From the very beginning, the structure of the Swiss Text Corpus was designed to cover the vocabulary of 20th century standard German in Switzerland as widely as possible. The corpus consists of printed and typewritten texts that vary widely in how they were produced and published. It is balanced according to criteria of form, time and content:

  • Text class: formal criterion
  • Quarter century: time criterion
  • Domain: content criterion

With this structure, the Swiss Text Corpus is a balanced data resource for all kinds of linguistic research questions.

The Swiss Text Corpus contains the following amounts of text (according to the criteria mentioned above):

                            1900-1924          1925-1949          1950-1974          1975-1999          2000-2018              total
                         d          w       d          w       d          w       d          w       d          w       d          w
functional texts     1'042  1'122'547   1'465  1'235'998     969  1'165'808   1'417  1'036'198   1'238    944'778   6'131  5'505'329
factual texts          167  1'447'644     433  2'043'191     804  1'943'462     276  1'846'198     898    985'400   2'578  8'265'832
journalistic texts     833    501'527   1'107  1'006'662     993    970'560   1'929  1'117'639   1'267    973'282   6'129  4'569'670
fiction                188  1'116'823      50  1'248'864     159  1'122'446      59  1'147'943      40    942'760     496  5'578'836
total                2'230  4'188'541   3'055  5'534'715   2'925  5'202'276   3'681  5'147'978   3'443  3'845'700  15'334 23'919'667

d = documents
w = words (tokens minus punctuation characters)
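
The figures in the table use the apostrophe as a thousands separator, which most parsers will not accept as a number. The following is a minimal Python sketch, not part of the corpus tooling, of how the per-class word totals could be read and turned into shares of the whole corpus. The names parse_swiss_int, WORD_TOTALS and word_shares are hypothetical and chosen only for this illustration; the values are copied from the "total" column above.

```python
from __future__ import annotations


def parse_swiss_int(text: str) -> int:
    """Convert an integer written with apostrophe separators, e.g. "1'122'547"."""
    return int(text.replace("'", ""))


# Word totals per text class, copied verbatim from the "total" column of the table.
WORD_TOTALS = {
    "functional texts": "5'505'329",
    "factual texts": "8'265'832",
    "journalistic texts": "4'569'670",
    "fiction": "5'578'836",
}


def word_shares(totals: dict[str, str]) -> dict[str, float]:
    """Return each text class's fraction of the summed word count."""
    counts = {name: parse_swiss_int(value) for name, value in totals.items()}
    corpus_size = sum(counts.values())
    return {name: count / corpus_size for name, count in counts.items()}


if __name__ == "__main__":
    # Print one line per text class with its share of all words.
    for name, share in word_shares(WORD_TOTALS).items():
        print(f"{name}: {share:.1%}")
```

The same helper can be applied to any other cell of the table, for example to compare the word counts of a single text class across the five periods.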