Corpora

General Purpose Corpora

TS Timeline Corpus

Over 702 Million Tokens

The TS Timeline Corpus contains over 700 million tokens drawn from 2.2 million news articles published between 1998 and 2016. The collection reflects nearly two decades of Turkish news language, making it a valuable resource for studying linguistic, cultural, and social change across time.
All articles in the corpus are automatically classified using a custom AI model, ensuring structured access to topics and trends within the dataset.
A direct link to each entry in the corpus is also provided.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TS Wikipedia

35 Million Tokens

A large-scale corpus built from 2015 Turkish Wikipedia articles, ideal for general-purpose NLP tasks. Further details are available in the LDC Catalog.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TS Corpus V2

491 Million Tokens

The updated second version of the corpus, with over 490 million tokens, has been widely used in Turkish NLP research since 2012.
It is built on the BOUN Web Corpus, which is compiled from various internet sources such as online newspapers, forums, and blogs.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TweetS Corpus

12.5 Million Tokens

A corpus of 1 million tweets. This corpus features several novel tags such as "YY" (misspellings), "intAbbr" (internet abbreviation), "intSlang" (internet slang), and "intEmphasis" (internet emphasis). It also presents definitions of smileys in query results. For details, please refer to this article.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	❌

Go to Corpus

Specialized Corpora

Dictionary Corpus V2

~1 Million Tokens

The Dictionary Corpus V2 is built from the headwords and explanations of the entries in the TDK Turkish Contemporary Dictionary. Queries are processed over the explanations, while the headwords serve as the primary metadata target.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TS Syllable Corpus

5.7 Million Tokens

The Syllable Corpus is a distinctive corpus that features syllable tagging for Turkish. The corpus includes 5,714,422 unique words. Each word is hyphenated, and each syllable is tagged with a special tag set developed by TS Corpus.
The corpus provides "Status" and "Tag" as annotations. Status indicates whether a syllable is "valid" or "invalid", and Tag presents the consonant and vowel pattern of the word.
The main idea behind the corpus is to calculate syllable frequencies and build an index of the valid syllables of Turkish.
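As a rough illustration of the two ideas above (consonant-vowel patterns and syllable frequency), the following Python sketch derives a C/V pattern for a hyphenated word and counts syllables across a small word list. It is not the corpus's own tag set or tooling: the hyphenated input format ("ki-tap"), the function names, and the vowel inventory used here are assumptions made for the example, and the corpus's "valid"/"invalid" Status rules are not modeled.

```python
from collections import Counter

# Assumed vowel inventory: Turkish vowels in both cases, plus circumflexed forms.
# Every other alphabetic character is treated as a consonant.
TURKISH_VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


def cv_pattern(hyphenated_word: str) -> str:
    """Map each letter to C or V, keeping hyphens as syllable boundaries."""
    return "".join(
        ("V" if ch in TURKISH_VOWELS else "C") if ch.isalpha() else ch
        for ch in hyphenated_word
    )


def syllable_frequencies(hyphenated_words):
    """Count how often each syllable occurs across hyphenated words."""
    counts = Counter()
    for word in hyphenated_words:
        counts.update(s for s in word.split("-") if s)
    return counts


# Hypothetical sample in an assumed "syl-la-ble" input format.
sample = ["ki-tap", "ki-lim", "türk-çe"]
print(cv_pattern("türk-çe"))                        # CVCC-CV
print(syllable_frequencies(sample).most_common(1))  # [('ki', 2)]
```

A frequency table built this way could serve as the starting point for an index of observed syllables; deciding which of them count as valid would still require the corpus's own criteria.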

POS	Lemma	Morph	Metadata
❌	❌	❌	❌

Go to Corpus

Abstract Corpus V2

1 Million Tokens

Abstract Corpus V2 contains abstracts of 6,234 academic papers from various disciplines. Corpus metadata covers the main field, scientific discipline, and sub-discipline.
For details on the corpus data, please see this article.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TS Idioms & Proverbs Corpus

27k Tokens

A corpus consisting of over 9,000 Turkish proverbs and idioms.

POS	Lemma	Morph	Metadata
✔️	✔️	❌	❌

Go to Corpus

Turkish Constitutions Corpus

32k Tokens

A corpus of the 1924, 1961, and 1982 Turkish Constitutions.
Please refer to the following thesis for more information.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Evrim Ağacı

4.4 Million Tokens

This corpus consists of 7,287 articles published on the popular science platform Evrim Ağacı between 2011 and 2019. The corpus covers a broad range of topics from biology and physics to psychology and philosophy, reflecting the language of contemporary science communication in Turkish.
The corpus offers a valuable resource for analyzing popular science discourse, terminology development, and lexical variation across nearly a decade of public engagement with science.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Social Media Corpora

Covid 19 TweetS

7.5 Million Tokens

A corpus of Turkish tweets harvested during the first two months of the Covid-19 pandemic. The metadata of the corpus includes information such as Follower Count, Sentiment, and Account Creation Date.

The Covid-19 corpus was compiled within the scope of TÜBİTAK SOBAG Project No: 120K634: Kriz İletişimi: Covid-19 Salgınından Çıkarılan Dersler Işığında Önleyici İletişimsel Yaklaşımlar.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

6 Şubat Tweets

4.9 Million Tokens

This corpus consists of tweets posted after the February 6 earthquakes. It was compiled as part of the ongoing TÜBİTAK 1001 Project No. 123K778, titled “Determining the Attitudes and Behaviors of Citizens Affected Directly or Indirectly by the Kahramanmaraş Earthquakes: An Analysis of Twitter Data.”
For the first publication based on this dataset, please refer here.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Confess Corpus

450k Tokens

The Confess Corpus consists of posts shared by anonymous users on a popular women’s forum. It comprises 15,053 unique entries collected between 2015 and 2019, offering valuable insights into everyday discourse and online expression.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

News & Newspaper Corpora

Columns Corpus V2

28 Million Tokens

The Columns Corpus V2 contains 25,915 newspaper columns, equally distributed between female and male writers. The collection spans the years 2006 to 2017, offering a balanced resource for studying gender, discourse, and style in contemporary Turkish journalism. For detailed information please refer to this article.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Türk Medyasında Göç

640k Tokens

This corpus is a specialized collection of news articles on migration events across different historical periods between 1950 and 2017, capturing the language, narratives, and discourses surrounding migration in Turkey and beyond.
The collection reflects shifts in public debate, political framing, and social perception of migration, refugee flows, and mobility.
This makes it a valuable resource for researchers studying historical change, media representation, and linguistic patterns related to migration. Please refer to this thesis for further information.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TRT 2015

1.6 Million Tokens

This corpus was compiled from news texts broadcast in the TRT Main News Bulletin in 2015.
For detailed information, please refer to the related doctoral dissertation.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Social Sciences Corpora

Communication Sciences Abstracts

1.5 Million Tokens

This corpus consists of abstracts from 6,573 Master’s and Doctoral theses submitted to the Turkish Council of Higher Education (YÖK) Thesis Center between 1986 and 2023, focusing on communication-related disciplines such as journalism, media studies, and public relations.
The theses included in this collection were selected based on the index terms assigned by YÖK, ensuring thematic relevance and disciplinary diversity.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Educational Sciences Corpora

The Reading Teacher Corpus

3.5 Million Tokens

This corpus is composed of academic papers published in The Reading Teacher Journal between 2012 and 2021, covering a ten-year period. It traces the evolution of scholarly research on reading, capturing key developments in pedagogy, methodology, and literacy studies. The corpus metadata includes the top two keywords from each article, enabling focused lexical and thematic analyses. For detailed information, please refer to this article.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Educational Sciences Abstracts

9.2 Million Tokens

This corpus consists of abstracts from 30,515 Master’s and Doctoral theses submitted to the Turkish Council of Higher Education (YÖK) Thesis Center between 1980 and 2020.
The theses included in this collection were selected based on the index terms assigned by YÖK, ensuring thematic relevance and disciplinary diversity.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Second Language Teaching Corpus

1.3 Million Tokens

This corpus consists of 44 textbooks commonly used in teaching Turkish as a Second Language.
It covers proficiency levels A1, A2, B1, B2, and C1, providing a representative sample of instructional language across beginner to advanced stages of learning.
Please refer to the following doctoral dissertation for more details.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

TaSL Corpus

121k Tokens

The TaSL (Turkish as Second Language) corpus comprises 14 A1- and A2-level coursebooks commonly used in teaching Turkish to foreign students in higher education.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Parallel Corpora

Turkish-English Parallel Corpus

647 Tokens

A Turkish-English parallel corpus of translated sentences.
The corpus works in parallel with the English-Turkish Parallel Corpus.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

English-Turkish Parallel Corpus

837k Tokens

An English-Turkish parallel corpus of translated sentences.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus

Turkish Movie Subtitles

266 Million Tokens

A corpus of movie subtitles translated by Open Subtitles and presented by the Opus Project.
The corpus works in parallel with the English Movie Subtitles corpus.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	❌

Go to Corpus

English Movie Subtitles

353 Million Tokens

A corpus of movie subtitles translated by Open Subtitles and presented by the Opus Project.
The corpus works in parallel with the Turkish Movie Subtitles corpus.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	❌

Go to Corpus

Other

Dystopian Movies Corpus

281k Tokens

This corpus consists of scripts from English-language dystopian films, providing linguistic data that capture the themes, dialogues, and narrative structures characteristic of dystopian cinema.

POS	Lemma	Morph	Metadata
✔️	✔️	❌	✔️

Go to Corpus

Yaşar Kemal 19

2.4 Million Tokens

This corpus consists of 19 books written by Yaşar Kemal.

POS	Lemma	Morph	Metadata
✔️	✔️	✔️	✔️

Go to Corpus