General Purpose Corpora

TS Timeline Corpus
+702 Million Tokens

The TS Timeline Corpus contains over 700 million tokens drawn from 2.2 million news articles published between 1998 and 2016. The collection reflects nearly two decades of Turkish news language, making it a valuable resource for studying linguistic, cultural, and social change across time.
All articles in the corpus are automatically classified using a custom AI model, ensuring structured access to topics and trends within the dataset.
A direct link to each entry in the corpus is also provided.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TS Wikipedia
35 Million Tokens

A large-scale corpus built from 2015 Turkish Wikipedia articles, ideal for general-purpose NLP tasks. Further details are available in the LDC Catalog.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TS Corpus V2
491 Million Tokens

The updated second version of the TS Corpus, with over 490 million tokens, has been widely used in Turkish NLP research since 2012.
This corpus is built on the BOUN Web Corpus, which is compiled from various internet sources such as online newspapers, forums, and blogs.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TweetS Corpus
12.5 Million Tokens

A corpus of 1 million tweets. This corpus features several novel tags, such as "YY" (misspellings), "intAbbr" (internet abbreviation), "intSlang" (internet slang), and "intEmphasis" (internet emphasis). It also presents definitions of smileys in query results. For details, please refer to this article.

POS Lemma Morph Metadata
✔️ ✔️ ✔️

Specialized Corpora

Dictionary Corpus V2
~1 Million Tokens

The Dictionary Corpus V2 is built from the headwords and explanations of the entries in the TDK Turkish Contemporary Dictionary. Queries are processed over the explanations, while the headwords serve as the primary metadata target.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TS Syllable Corpus
5.7 Million Tokens

The Syllable Corpus is a specialized corpus that features syllable tagging for Turkish. The corpus includes 5,714,422 unique words. Each word is hyphenated, and each syllable is tagged with a special tag set developed by TS Corpus.
The corpus provides "Status" and "Tag" annotations. Status indicates whether a syllable is "valid" or "invalid", and Tag gives the consonant and vowel pattern of the word.
The main purpose of the corpus is to calculate syllable frequencies and to build an index of valid Turkish syllables.
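
The idea of consonant-vowel tagging and syllable frequency counting can be illustrated with a minimal Python sketch. It is not the TS Corpus implementation: the function names, the "C"/"V" notation, and the set of patterns treated as valid are all assumptions made for illustration, since the actual tag set and validity rules are not described here.

```python
from collections import Counter

# Turkish vowels, including circumflexed variants found in loanwords.
TURKISH_VOWELS = set("aeıioöuüâîû")

# ASSUMPTION: syllable shapes treated as "valid" in this sketch only.
# The corpus's real validity criteria may differ.
ASSUMED_VALID_PATTERNS = {"V", "VC", "CV", "CVC", "VCC", "CVCC"}


def turkish_lower(text):
    """Lowercase with Turkish dotted/dotless i handled explicitly."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def cv_pattern(syllable):
    """Return the consonant/vowel pattern of a syllable, e.g. 'CVC'."""
    return "".join(
        "V" if ch in TURKISH_VOWELS else "C"
        for ch in turkish_lower(syllable)
        if ch.isalpha()
    )


def syllable_status(syllable):
    """Label a syllable 'valid' or 'invalid' under the assumed patterns."""
    return "valid" if cv_pattern(syllable) in ASSUMED_VALID_PATTERNS else "invalid"


def syllable_frequencies(hyphenated_words):
    """Count syllables across words whose syllables are separated by '-'."""
    counts = Counter()
    for word in hyphenated_words:
        counts.update(turkish_lower(part) for part in word.split("-") if part)
    return counts


if __name__ == "__main__":
    words = ["ki-tap", "ki-tap-lar", "a-ra-ba"]  # hypothetical sample input
    for syllable, count in syllable_frequencies(words).most_common():
        print(syllable, cv_pattern(syllable), syllable_status(syllable), count)
```

Counting over hyphenated forms keeps the sketch independent of any particular syllabification algorithm; it only assumes that syllable boundaries are already marked with hyphens, as the corpus description states.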

POS Lemma Morph Metadata
Abstract Corpus V2
1 Million Tokens

Abstract Corpus V2 contains the abstracts of 6,234 academic papers from various disciplines. The corpus metadata covers the main field, scientific discipline, and sub-discipline.
For details of the corpus data, please see this article.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TS Idioms & Proverbs Corpus
27k Tokens

A corpus consisting of more than 9,000 Turkish proverbs and idioms.

POS Lemma Morph Metadata
✔️ ✔️
Turkish Constitutions Corpus
32k Tokens

A corpus of the 1924, 1961, and 1982 Turkish Constitutions.
Please refer to the following thesis for more information.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
Evrim Ağacı
4.4 Million Tokens

This corpus consists of 7,287 articles published on the popular science platform Evrim Ağacı between 2011 and 2019. The corpus covers a broad range of topics, from biology and physics to psychology and philosophy, reflecting the language of contemporary science communication in Turkish.
The corpus offers a valuable resource for analyzing popular science discourse, terminology development, and lexical variation across nearly a decade of public engagement with science.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️

Social Media Corpora

Covid 19 TweetS
7.5 Million Tokens

A corpus of Turkish tweets harvested during the first two months of the Covid-19 pandemic. The corpus metadata includes information such as Follower Count, Sentiment, and Account Creation Date.

The Covid-19 corpus was compiled within the scope of TÜBİTAK SOBAG Project No: 120K634: Kriz İletişimi: Covid-19 Salgınından Çıkarılan Dersler Işığında Önleyici İletişimsel Yaklaşımlar.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
6 Şubat Tweets
4.9 Million Tokens

This corpus consists of tweets posted after the February 6 earthquakes. It was compiled as part of the ongoing TÜBİTAK 1001 Project No. 123K778, titled “Determining the Attitudes and Behaviors of Citizens Affected Directly or Indirectly by the Kahramanmaraş Earthquakes: An Analysis of Twitter Data.”
For the first publication based on this dataset, please refer here.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
Confess Corpus
450k Tokens

The Confess Corpus consists of posts shared by anonymous users on a popular women’s forum. It comprises 15,053 unique entries collected between 2015 and 2019, offering valuable insights into everyday discourse and online expression.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️

News & Newspaper Corpora

Columns Corpus V2
28 Million Tokens

The Columns Corpus V2 contains 25,915 newspaper columns, equally distributed between female and male writers. The collection spans the years 2006 to 2017, offering a balanced resource for studying gender, discourse, and style in contemporary Turkish journalism. For detailed information, please refer to this article.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
Türk Medyasında Göç
640k Tokens

This corpus is a specialized collection of news articles on migration events across different historical periods between 1950 and 2017, capturing the language, narratives, and discourses surrounding migration in Turkey and beyond.
The collection reflects shifts in public debate, political framing, and social perception of migration, refugee flows, and mobility.
This makes it a valuable resource for researchers studying historical change, media representation, and linguistic patterns related to migration. Please refer to this thesis for further information.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TRT 2015
1.6 Million Tokens

This corpus was compiled from news texts broadcast in the TRT Main News Bulletin in 2015.
For detailed information, please refer to the related doctoral dissertation.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️

Social Sciences Corpora

Communication Sciences Abstracts
1.5 Million Tokens

This corpus consists of 6,573 Master’s and Doctoral theses submitted to the Turkish Council of Higher Education (YÖK) Thesis Center between 1986 and 2023, focusing on communication-related disciplines such as journalism, media studies, and public relations.
The theses included in this collection were selected based on the index terms assigned by YÖK, ensuring thematic relevance and disciplinary diversity.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️

Educational Sciences Corpora

The Reading Teacher Corpus
3.5 Million Tokens

This corpus is composed of academic papers published in The Reading Teacher Journal between 2012 and 2021, covering a ten-year period. It traces the evolution of scholarly research on reading, capturing key developments in pedagogy, methodology, and literacy studies. The corpus metadata includes the top two keywords from each article, enabling focused lexical and thematic analyses. For detailed information, please refer to this article.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
Educational Sciences Abstracts
9.2 Million Tokens

This corpus consists of abstracts from 30,515 Master’s and Doctoral theses submitted to the Turkish Council of Higher Education (YÖK) Thesis Center between 1980 and 2020.
The theses included in this collection were selected based on the index terms assigned by YÖK, ensuring thematic relevance and disciplinary diversity.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
Second Language Teaching Corpus
1.3 Million Tokens

This corpus consists of 44 textbooks commonly used in teaching Turkish as a Second Language.
It covers proficiency levels A1, A2, B1, B2, and C1, providing a representative sample of instructional language across beginner to advanced stages of learning.
Please refer to the following doctoral dissertation for more details.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
TaSL Corpus
121k Tokens

The TaSL (Turkish as a Second Language) corpus contains 14 coursebooks at the A1 and A2 levels that are commonly used in teaching Turkish to foreign students in higher education.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️

Parallel Corpus

Turkish-English Parallel Corpus
647 Tokens

A Turkish-English parallel corpus of translated sentences.
The corpus works in parallel with the English-Turkish Parallel Corpus.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
English-Turkish Parallel Corpus
837k Tokens

An English-Turkish parallel corpus of translated sentences.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️
Turkish Movie Subtitles
266 Million Tokens

A corpus of movie subtitles translated by Open Subtitles and presented by the Opus Project.
The corpus works in parallel with the English Movie Subtitles corpus.

POS Lemma Morph Metadata
✔️ ✔️ ✔️
English Movie Subtitles
353 Million Tokens

A corpus of movie subtitles translated by Open Subtitles and presented by the Opus Project.
The corpus works in parallel with the Turkish Movie Subtitles corpus.

POS Lemma Morph Metadata
✔️ ✔️ ✔️

Other

Dystopian Movies Corpus
281k Tokens

This corpus consists of scripts from English-language dystopian films, providing linguistic data that capture the themes, dialogues, and narrative structures characteristic of dystopian cinema.

POS Lemma Morph Metadata
✔️ ✔️ ✔️
Yaşar Kemal 19
2.4 Million Tokens

This corpus consists of 19 books written by Yaşar Kemal.

POS Lemma Morph Metadata
✔️ ✔️ ✔️ ✔️