TweetS Corpus

In the last decade, social media has become a part of our daily lives. It captures authentic language from a wide range of people, and this diversity makes it a very valuable linguistic resource. Moreover, unlike many other sources such as newspapers, magazines, or books, social media content does not pass through an editorial process.

TweetS Corpus uses a unique part-of-speech tag set for Turkish, including YY (misspelling), intAbbr (internet abbreviations), Emoticons (smileys), intEmphasis (internet emphasis), and intSlang (internet slang). A list of internet slang terms harvested from TweetS Corpus can be found at this link.

Million Tokens
1 Million TweetS

