A corpus is a collection of texts drawn from written or spoken language. These texts are generally compiled according to predefined criteria to serve a specific purpose. Building a corpus is a hard, tedious, and time-consuming task, and the data must be processed carefully.
This project started with the idea of building a part-of-speech tagged Turkish corpus that would be available online; no such corpus existed at the time. To do this, we first turned to the existing NLP tools (tokenizers, part-of-speech taggers, morphological analyzers, etc.) and intended to use them. However, at every step of text processing and corpus building we had to modify this software or, in most cases, create our own scripts and tools.
In 2011, we published our very first release, TS Corpus v2, the first Turkish corpus available online with part-of-speech and morphological tagging. It was a general-purpose corpus. Since then, we have released 7 different corpora under the project, and TS Corpus has become a productive, growing, and well-known project.