CMC

Computer-mediated communication (CMC) is human communication that occurs through the use of two or more electronic devices.

CMC Corpora

A corpus is a large, structured set of texts compiled for linguistic research. Ideally, it also includes metadata and annotations.

This page aims to list all CMC corpora that are freely available for (linguistic) research and that either contain texts from Switzerland or were compiled by Swiss researchers. It was created and is maintained in cooperation between Elisabeth Stark, Simone Ueberwasser and the Zurich Center for Linguistics. Would you like to tell us about your CMC corpus? Send us an email.

For CMC corpora without a link to Switzerland, please check the CLARIN page on CMC corpora.

sms4science

Data Text messages (SMS)
Size 0.5 Mio tokens
Languages Swiss German, German, French, Italian, Romansh
Collected 2009
Availability Freely available for linguistic research, no access for commercial use.

In 2009, around 25'000 SMS were collected from the Swiss population under the lead of Elisabeth Stark. From 2011 to 2015, the data were investigated within the framework of the SNSF Sinergia project sms4science. For the French data, sister projects exist in Belgium, France and Canada (cf. www.sms4science.org).

What's up, Switzerland?

Data WhatsApp messages
Size 5 Mio tokens
Languages Swiss German, German, French, Italian, Romansh
Collected 2014
Availability Available upon request until 31.12.2018. After that: freely available.

As a follow-up project to sms4science, Elisabeth Stark and her team collected 216 WhatsApp chats in 2014. They are currently being investigated in the SNSF Sinergia project "What's up, Switzerland?".

SB-10k: German Sentiment Corpus

Data Tweets
Size 5 Mio tokens
Languages German
Collected ?
Availability Creative Commons License CC BY 4.0 from SpinningBytes.

SB-10k is a publicly available corpus that contains 9738 German tweets, each labeled by 3 annotators with “positive”, “negative”, “neutral”, “mixed”, or “unknown”. It was created by SpinningBytes in collaboration with the Zurich University of Applied Sciences (ZHAW).

SB-CH: A Swiss German Corpus with Sentiment Annotations

Data Facebook, chats
Size 203,242 Swiss German phrases with 981,247 tokens
Languages Swiss German
Collected 2010-2017
Availability Creative Commons License CC BY 4.0 from SpinningBytes.

For more information, see Ralf Grubenmann, Don Tuggener, Pius von Daniken, Jan Deriu, Mark Cieliebak (2018): SB-CH: A Swiss German Corpus with Sentiment Annotations.