TS Corpus

A free and independent project that aims to build Turkish corpora, lexical tools, and linguistic datasets.

A central hub for Turkish corpora and linguistic research
TS Corpus

The TS Corpus Project is a free and independent initiative dedicated to building Turkish language corpora, developing natural language processing (NLP) tools, and compiling linguistic datasets. The project began in 2011 and released its first corpus in March 2012, a significant milestone as the first publicly available, part-of-speech-tagged Turkish online corpus.

Since then, the project has continued to grow, releasing new corpora, tools, and datasets. Today, TS Corpus includes over 25 corpora comprising more than 1.8 billion tokens, sourced from a wide variety of domains such as online newspapers, news, forums, social media, textbooks, and academic texts. All resources are provided openly, without restrictions, for academic study and research. Users are free to run queries, save their results, and download datasets for their own analyses.

At its core, TS Corpus is guided by the belief that linguistic resources and knowledge should be shared freely. For this reason, the project is built upon free software and continues to expand with contributions to Turkish computational linguistics and language technology.

Corpora

Access diverse Turkish corpora across multiple genres, designed for linguistic research, computational analysis, and academic study.

LexiTR

LexiTR is a specialized platform offering advanced lexical tools built on large-scale Turkish corpora, designed to support linguistic research and analysis.

TS Tools

Explore tokenizers, frequency analyzers, and other practical NLP tools created specifically for processing and studying Turkish.

Contact

Have questions or collaboration ideas? Reach out anytime.