TS TimeLine Corpus
This is a “general purpose” corpus containing over 700 million tokens harvested from online sources. The data comprises over 2.2 million news items and articles spanning 19 years. We may also call this corpus “the contemporary news/columns corpus of Turkish.”
This is also the very first corpus for which we used machine learning models. We built two models while working with the TimeLine Corpus data.
The first model predicts whether a given text is Turkish or English, because the crawled data contains many news items and articles in English. This was the easy step; we trained the model using the “TS English-Turkish Parallel Corpus” and the “TS Turkish-English Parallel Corpus”.
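As a rough illustration of how such a language filter can work, here is a minimal sketch of a character-trigram naive Bayes classifier. It is not the model we actually trained: the class name NgramLanguageId, the labels "tr" and "en", and the six toy training sentences (standing in for the parallel corpora) are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageId:
    """Tiny multinomial naive Bayes over character trigrams (illustrative only)."""

    def __init__(self):
        self.counts = {}    # language label -> Counter of trigram counts
        self.totals = {}    # language label -> total number of trigrams seen
        self.vocab = set()  # every trigram seen in any language

    def fit(self, samples):
        # samples: iterable of (text, language_label) pairs
        for text, lang in samples:
            grams = char_ngrams(text)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)

        def score(lang):
            counts, total = self.counts[lang], self.totals[lang]
            # Add-one smoothing keeps unseen trigrams from zeroing out the score.
            return sum(
                math.log((counts[g] + 1) / (total + vocab_size))
                for g in char_ngrams(text)
            )

        # No class prior is used; the toy data has equally many texts per language.
        return max(self.counts, key=score)


# Hypothetical toy data; a real model would be trained on far more text.
train = [
    ("Bugün hava çok güzel ve güneşli", "tr"),
    ("Hükümet yeni ekonomi paketini açıkladı", "tr"),
    ("Türkiye'de seçim sonuçları belli oldu", "tr"),
    ("The weather is very nice and sunny today", "en"),
    ("The government announced the new economic package", "en"),
    ("The election results were announced in the country", "en"),
]
model = NgramLanguageId().fit(train)
turkish_guess = model.predict("Yeni paketin sonuçları açıklandı")
english_guess = model.predict("The new results were announced today")
print(turkish_guess, english_guess)
```

Character n-grams are a common choice for this task because Turkish and English differ sharply in their letter sequences, so even short texts tend to separate well.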
The second, more challenging task was building a model to classify texts. Using 12 predefined categories, we built a machine learning model that achieves over 90% accuracy.
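To make the accuracy figure concrete, the sketch below shows one common way to score a classifier on held-out texts: the share of predictions that match the gold category, overall and per category. The function name accuracy_report, the four category names, and the ten labels are hypothetical; they are not our actual categories or results.

```python
from collections import defaultdict


def accuracy_report(gold, predicted):
    """Return (overall accuracy, accuracy per gold category)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one labelled text is required")

    correct = 0
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for g, p in zip(gold, predicted):
        per_category[g][1] += 1
        if g == p:
            correct += 1
            per_category[g][0] += 1

    overall = correct / len(gold)
    by_category = {cat: c / t for cat, (c, t) in per_category.items()}
    return overall, by_category


# Hypothetical gold labels and model predictions for ten held-out texts.
gold = ["sports", "sports", "economy", "economy", "politics",
        "politics", "politics", "culture", "culture", "culture"]
pred = ["sports", "sports", "economy", "politics", "politics",
        "politics", "politics", "culture", "culture", "culture"]

overall, by_category = accuracy_report(gold, pred)
print(f"overall accuracy: {overall:.0%}")
for category, acc in sorted(by_category.items()):
    print(f"{category}: {acc:.0%}")
```

Reporting accuracy per category alongside the overall figure helps show whether a high overall score hides a weak category.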
Please note that the corpus is still in beta, so queries may run slowly (or crash in some cases) and the data may contain inconsistencies.