A corpus is a collection of texts from written or spoken language. Generally, these texts are put together according to predefined criteria to fit intended aims. Building a corpus is a hard, tedious and time-consuming task, and the data must be processed carefully.

This project started with the idea of “building a part-of-speech tagged Turkish corpus that is available online”, which did not exist at the time. To do this, we focused on existing NLP tools (tokenizers, part-of-speech taggers, morphological analyzers, etc.) and intended to use them as they were. However, at every step of text processing and corpus building we had to modify this software or, in most cases, create our own scripts or tools.

In 2011, we published our very first corpus, TS Corpus v2, the first Turkish corpus available online with part-of-speech and morphological tagging. It was a general-purpose corpus. Since then we have released 7 different corpora under our project, and TS Corpus has become a productive, growing and well-known project.

If you’re not familiar with corpora and CQP queries, please visit our documentation pages for query tips.
You may also find quick answers to frequently asked questions in the FAQ pages.