
The Swiss Text Corpus also contains little-noticed texts from archives, as long as they are printed or typewritten. Handwritten texts are not considered for the time being.

The example shown opposite is a patent document from the 1930s.

 

The Swiss Text Corpus contains texts from Swiss newspapers of the 20th century.

The example shown opposite is from the Walliser Bote of 1919. The main challenge in digitising newspaper texts is the often poor paper quality of old newspapers, which makes OCR very difficult.

The individual articles of a newspaper are treated as separate documents in the Swiss Text Corpus.

The Swiss Text Corpus contains advertising texts as well. These contain characteristic text blocks that are not found in other text categories.

Graphical elements make such texts difficult to digitise. Other corpora have very often left advertising texts out.

The image shown opposite is a page from Rudolf Zäch: Die neuzeitliche Küche. Wallisellen, 1931.

Like this one, many texts from the first half of the 20th century were printed in Gothic type, a typeface that younger readers in particular sometimes have trouble reading. Digitising these texts makes them more accessible again.

After a text has been acquired, all relevant bibliographical data is entered into a database.
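As a rough illustration of this step, the sketch below stores one bibliographical record in an SQLite database. The table layout, the field names and the helper `add_record` are assumptions made for this example; the actual database used for the corpus is not described here.

```python
import sqlite3

# Hypothetical schema: the table and field names are illustrative
# assumptions, not the corpus project's actual database layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bibliography (
    id     INTEGER PRIMARY KEY,
    author TEXT,
    title  TEXT NOT NULL,
    place  TEXT,
    year   INTEGER
)
"""


def add_record(conn, author, title, place, year):
    """Insert one bibliographical record and return its row id."""
    cur = conn.execute(
        "INSERT INTO bibliography (author, title, place, year) "
        "VALUES (?, ?, ?, ?)",
        (author, title, place, year),
    )
    conn.commit()
    return cur.lastrowid


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_id = add_record(
    conn, "Rudolf Zäch", "Die neuzeitliche Küche", "Wallisellen", 1931
)
```

Keeping the bibliographical data separate from the text itself means that every digitised page can later be linked back to its source through the record id.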

In many cases, the original book is cut into separate pages for digitisation. These are scanned and prepared for OCR (optical character recognition).

The page shown here poses several challenges for digitisation, such as the recognition of Gothic type and of spaced-out words. The processing of special text elements such as marginal notes, enumerations and footnotes has to be defined.
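To make the problem of spaced-out words concrete: OCR tends to return such a word as a sequence of single letters. The following is a minimal sketch of one possible post-processing heuristic, not the method used for the corpus. The function name `collapse_spaced_words` and the threshold of three letters are assumptions chosen for this example.

```python
import re

# A run of at least three single letters, each separated by exactly one
# space, e.g. "K ü c h e". [^\W\d_] matches any Unicode letter.
# The threshold of three letters is an assumption that leaves isolated
# single-letter tokens such as "A" untouched.
SPACED_RUN = re.compile(r"(?<!\w)(?:[^\W\d_] ){2,}[^\W\d_](?!\w)")


def collapse_spaced_words(line: str) -> str:
    """Join letter-spaced runs in an OCR line into ordinary words."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), line)


cleaned = collapse_spaced_words("Die n e u z e i t l i c h e Küche")
```

The heuristic has clear limits: two spaced-out words that follow each other with only a single space between them would be merged into one, so a real pipeline would also need layout information or a dictionary check.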