
Text digitisation is a field that has developed dynamically over the last few years. Big commercial projects like Google Books set the pace of technological, legal and political change in the field; copyright issues in particular are contested. Libraries and private companies nowadays invest heavily in the retro-digitisation of their documents. Text digitisation plays a central role here, and it is important that universities and the research community stand up for quality assurance in the field.

The expertise the team of the Swiss Text Corpus has gathered, especially in the domains of text recognition and annotation, can also prove useful to third parties.

The Swiss Text Corpus relies mainly on collaboration and open standards in order to use the best technology available for its purposes.

Open standards in text and corpus technology are indispensable for the provision of sustainable digital resources. Like many other corpus projects, the Swiss Text Corpus annotates the XML versions of its documents according to the guidelines of the Text Encoding Initiative (TEI). The scans with the underlying OCR text are stored as archivable PDFs (PDF/A according to ISO 19005-1).
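A TEI-conformant corpus document follows a common overall skeleton. The following minimal example is purely illustrative; the title and content are invented and do not come from the actual corpus files:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <!-- invented example title -->
        <title>Beispieltext, 1923</title>
      </titleStmt>
      <publicationStmt>
        <p>Digitised for a text corpus.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Bibliographic details of the printed original.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The running text of the document goes here.</p>
    </body>
  </text>
</TEI>
```

The teiHeader records the metadata of the source document, while the text element holds the transcribed content; project-specific annotation is layered on top of this skeleton.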

We mainly use open-source software and/or software from our partner projects for the processing and publication of the corpus on the Internet. The search interface of the Swiss Text Corpus is based on the web framework Django. The linguistic search engine that indexes the corpus texts in the background is DDC, developed by our partner project DWDS in Berlin.

We are very open to any kind of know-how exchange in the field of corpus technology.

From the very beginning, the structure of the Swiss Text Corpus was designed to cover the vocabulary of 20th-century Standard German in Switzerland as broadly as possible. The corpus consists of printed and typewritten texts from very different forms of production and publication. It is balanced according to time, form and content criteria:

  • Text class: formal criterion
  • Quarter of century: time criterion
  • Domain: content criterion

With this structure, the Swiss Text Corpus is a balanced data resource for all kinds of linguistic research questions.

The Swiss Text Corpus contains the following amounts of text (according to the criteria mentioned above):

                         1900-1924   1925-1949   1950-1974   1975-1999   2000-2018        total
  functional texts   d       1'042       1'465         969       1'417       1'238        6'131
                     w   1'122'547   1'235'998   1'165'808   1'036'198     944'778    5'505'329
  factual texts      d         167         433         804         276         898        2'578
                     w   1'447'644   2'043'191   1'943'462   1'846'198     985'400    8'265'832
  journalistic texts d         833       1'107         993       1'929       1'267        6'129
                     w     501'527   1'006'662     970'560   1'117'639     973'282    4'569'670
  fiction            d         188          50         159          59          40          496
                     w   1'116'823   1'248'864   1'122'446   1'147'943     942'760    5'578'836
  total              d       2'230       3'055       2'925       3'681       3'443       15'334
                     w   4'188'541   5'534'715   5'202'276   5'147'978   3'845'700   23'919'667

d = documents
w = words (tokens minus punctuation characters)
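The internal consistency of the document counts above can be cross-checked with a few lines of code. The figures below are transcribed from the table; this is a small sketch, not project code:

```python
# Document counts (d) per text class and quarter-century,
# transcribed from the table above.
docs = {
    "functional texts":   [1042, 1465, 969, 1417, 1238],
    "factual texts":      [167, 433, 804, 276, 898],
    "journalistic texts": [833, 1107, 993, 1929, 1267],
    "fiction":            [188, 50, 159, 59, 40],
}

# Stated row totals from the table's "total" column.
row_totals = {
    "functional texts": 6131,
    "factual texts": 2578,
    "journalistic texts": 6129,
    "fiction": 496,
}

# Each row sums to its stated total ...
for name, counts in docs.items():
    assert sum(counts) == row_totals[name], name

# ... and the column sums give the per-period document totals.
period_totals = [sum(col) for col in zip(*docs.values())]
print(period_totals)       # → [2230, 3055, 2925, 3681, 3443]
print(sum(period_totals))  # → 15334, the grand total of documents
```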

Thanks to digitisation and worldwide networking over the Internet, we have seen, and still see, the emergence of electronic corpora for a wide variety of languages. The British National Corpus (BNC), containing about 100 million words and constructed between 1991 and 1994, can be seen as a model for many corpora. The BNC is a balanced corpus, annotated, lemmatised and part-of-speech tagged.

Since then a lot of digitisation and corpus projects have emerged, and new projects are still emerging. For German, two projects in particular have to be mentioned. On the one hand there is COSMAS, the huge corpus of the Institut für deutsche Sprache (IdS) in Mannheim, with billions of words (some also from Switzerland). On the other hand there is the German part of Project Gutenberg, a community effort collecting literary texts by more than 800 authors whose works are no longer covered by copyright.

The Mannheim corpus is the biggest collection of German texts in the world. However, it mainly contains newspaper texts and is therefore not balanced enough for many lexicographical and other linguistic purposes. The Gutenberg database contains literary texts only, and exclusively older ones (by authors who died more than 70 years ago).

Most other digital corpora of German have been built by research groups with a main focus on computational linguistics. They often contain contemporary texts only, coming from newspaper archives or the Internet.

Before the Swiss Text Corpus there was no digital corpus of German texts from Switzerland. The Swiss Text Corpus has closed this gap and offers a balanced empirical data resource for lexicographical and other linguistic research.
