Schweizer Textkorpus - Corpus structure

From the very beginning, the Swiss Text Corpus was designed to cover the vocabulary of 20th-century Standard German in Switzerland as broadly as possible. The corpus consists of printed and typewritten texts that differ widely in how they were produced and published. It is balanced according to formal, temporal and content criteria:

Text class: formal criterion
Quarter of century: time criterion
Domain: content criterion

With this structure, the Swiss Text Corpus is a balanced data resource for all kinds of linguistic research questions.

The Swiss Text Corpus contains the following amounts of text (according to the criteria mentioned above):

                        1900-1924   1925-1949   1950-1974   1975-1999   2000-2018       total
functional texts    d       1'042       1'465         969       1'417       1'238       6'131
                    w   1'122'547   1'235'998   1'165'808   1'036'198     944'778   5'505'329
factual texts       d         167         433         804         276         898       2'578
                    w   1'447'644   2'043'191   1'943'462   1'846'198     985'400   8'265'832
journalistic texts  d         833       1'107         993       1'929       1'267       6'129
                    w     501'527   1'006'662     970'560   1'117'639     973'282   4'569'670
fiction             d         188          50         159          59          40         496
                    w   1'116'823   1'248'864   1'122'446   1'147'943     942'760   5'578'836
total               d       2'230       3'055       2'925       3'681       3'443      15'334
                    w   4'188'541   5'534'715   5'202'276   5'147'978   3'845'700  23'919'667

d = documents
w = words (tokens minus punctuation characters)
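
The figures above use the apostrophe as a thousands separator, so they need a small conversion step before any arithmetic. The following is a minimal Python sketch, not part of the corpus tooling, that parses such numbers and derives the average document length per text class from the "total" column. The names `TOTALS`, `parse_swiss_int` and `mean_words_per_document` are hypothetical and chosen for illustration only.

```python
# Totals per text class, copied from the "total" column of the table above
# as (documents, words) strings in Swiss number formatting.
TOTALS = {
    "functional texts": ("6'131", "5'505'329"),
    "factual texts": ("2'578", "8'265'832"),
    "journalistic texts": ("6'129", "4'569'670"),
    "fiction": ("496", "5'578'836"),
}


def parse_swiss_int(text: str) -> int:
    """Convert a string such as "1'122'547" to the integer 1122547."""
    return int(text.replace("'", ""))


def mean_words_per_document(totals):
    """Return {text class: words / documents} (hypothetical helper)."""
    return {
        name: parse_swiss_int(w) / parse_swiss_int(d)
        for name, (d, w) in totals.items()
    }


if __name__ == "__main__":
    for name, mean in mean_words_per_document(TOTALS).items():
        print(f"{name}: {mean:.0f} words per document")
```

Read this way, the balance between text classes is closer in word counts than in document counts: fiction contributes few but long documents, whereas journalistic and functional texts contribute many short ones.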
