General Purpose Corpora
TS Timeline Corpus
+702 Million Tokens
The TS Timeline Corpus contains over 700 million tokens drawn from 2.2 million news articles published between 1998 and 2016. The collection reflects nearly two decades of Turkish news language, making it a valuable resource for studying linguistic, cultural, and social change across time.
All articles in the corpus are automatically classified using a custom AI model, ensuring structured access to topics and trends within the dataset.
A direct link to each entry in the corpus is also provided.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TS Wikipedia
35 Million Tokens
A large-scale corpus built from 2015 Turkish Wikipedia articles, ideal for general-purpose NLP tasks. Further details are available in the LDC Catalog.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TS Corpus V2
491 Million Tokens
The updated second version of the TS Corpus, with over 490 million tokens, has been widely used in Turkish NLP research since 2012.
This corpus uses the BOUN Web Corpus as its source, which is compiled from various internet sources such as online newspapers, forums, and blogs.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TweetS Corpus
12.5 Million Tokens
A corpus of 1 million tweets. This corpus features various novel tags such as "YY" (misspellings), "intAbbr" (internet abbreviation), "intSlang" (internet slang), and "intEmphasis" (internet emphasis). It also presents definitions of smileys in query results. For details, please refer to this article.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ❌ |
Specialized Corpora
Dictionary Corpus V2
~1 Million Tokens
The Dictionary Corpus V2 is built from the headwords and explanations of the entries in the TDK Turkish Contemporary Dictionary. Queries are processed over the explanations, while the headwords serve as the primary metadata target.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TS Syllable Corpus
5.7 Million Tokens
The Syllable Corpus is a distinctive corpus that features syllable tagging for Turkish. The corpus includes 5,714,422 unique words. Each word is hyphenated, and each syllable is tagged with a special tag set developed by TS Corpus.
The corpus provides "Status" and "Tag" as annotations.
Status indicates whether a syllable is "valid" or "invalid", and Tag presents the consonant and vowel pattern of the word.
The main aim of the corpus is to calculate syllable frequencies and to build an index of the valid syllables of Turkish.
POS | Lemma | Morph | Metadata |
---|---|---|---|
❌ | ❌ | ❌ | ❌ |
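The consonant and vowel pattern and the syllable frequency count described for the Syllable Corpus can be illustrated with a minimal Python sketch. It is not the TS Corpus tag set or tooling: the "C"/"V" notation, the function names `cv_pattern` and `syllable_frequencies`, and the sample words are all assumptions made for illustration, and the corpus's own "valid"/"invalid" rules are not reproduced here.

```python
from collections import Counter

# Turkish vowels in both cases (assumed inventory, including circumflexed forms).
# Listing both cases avoids locale-sensitive lowercasing of I/İ.
TURKISH_VOWELS = set("aeıioöuüâîûAEIİOÖUÜÂÎÛ")


def cv_pattern(text: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (consonant); skip non-letters."""
    return "".join(
        "V" if ch in TURKISH_VOWELS else "C"
        for ch in text
        if ch.isalpha()
    )


def syllable_frequencies(hyphenated_words):
    """Count syllables across words whose syllables are separated by '-'."""
    counts = Counter()
    for word in hyphenated_words:
        counts.update(s for s in word.split("-") if s)
    return counts


# Hypothetical sample input: words already hyphenated into syllables.
words = ["ki-tap", "ki-tap-lar", "o-kul"]
freq = syllable_frequencies(words)

for syllable, count in freq.most_common():
    print(syllable, cv_pattern(syllable), count)
```

Run on a full hyphenated word list, the same counter would give the raw syllable frequencies from which an index of attested syllables could be built.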
Abstract Corpus V2
1 Million Tokens
Abstract Corpus V2 contains abstracts of 6,234 academic papers from various disciplines.
Corpus metadata covers the main field, scientific discipline, and sub-discipline.
For details of the corpus data, please check this article.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TS Idioms & Proverbs Corpus
27k Tokens
A corpus consisting of more than 9,000 Turkish proverbs and idioms.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ❌ | ❌ |
Turkish Constitutions Corpus
32k Tokens
A corpus of the 1924, 1961, and 1982 Turkish Constitutions.
Please refer to the following thesis for more information.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Evrim Ağacı
4.4 Million Tokens
This corpus consists of 7,287 articles published on the popular science platform Evrim Ağacı between 2011 and 2019.
The corpus covers a broad range of topics from biology and physics to psychology and philosophy, reflecting the language of contemporary science communication in Turkish.
The corpus offers a valuable resource for analyzing popular science discourse, terminology development, and lexical variation across nearly a decade of public engagement with science.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Social Media Corpora
Covid 19 TweetS
7.5 Million Tokens
A corpus of Turkish tweets harvested during the first two months of the Covid-19 pandemic. The metadata of the corpus includes information such as Follower Count, Sentiment, and Account Creation Date.
The Covid-19 corpus was compiled within the scope of TÜBİTAK SOBAG Project No. 120K634, "Kriz İletişimi: Covid-19 Salgınından Çıkarılan Dersler Işığında Önleyici İletişimsel Yaklaşımlar" (Crisis Communication: Preventive Communicative Approaches in Light of Lessons Learned from the Covid-19 Pandemic).
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
6 Şubat Tweets
4.9 Million Tokens
This corpus consists of tweets posted after the February 6 earthquakes. It was compiled as part of the ongoing TÜBİTAK 1001 Project No. 123K778, titled “Determining the Attitudes and Behaviors of Citizens Affected Directly or Indirectly by the Kahramanmaraş Earthquakes: An Analysis of Twitter Data.”
For the first publication based on this dataset, please refer here.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Confess Corpus
450k Tokens
The Confess Corpus consists of posts shared by anonymous users on a popular women’s forum. It comprises 15,053 unique entries collected between 2015 and 2019, offering valuable insights into everyday discourse and online expression.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
News & Newspaper Corpora
Columns Corpus V2
28 Million Tokens
The Columns Corpus V2 contains 25,915 newspaper columns, equally distributed between female and male writers. The collection spans the years 2006 to 2017, offering a balanced resource for studying gender, discourse, and style in contemporary Turkish journalism. For detailed information, please refer to this article.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Türk Medyasında Göç
640k Tokens
This corpus is a specialized collection of news articles on migration events across different historical periods between 1950 and 2017, capturing the language, narratives, and discourses surrounding migration in Turkey and beyond.
The collection reflects shifts in public debate, political framing, and social perception of migration, refugee flows, and mobility.
This makes it a valuable resource for researchers studying historical change, media representation, and linguistic patterns related to migration.
Please refer to this thesis for further information.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TRT 2015
1.6 Million Tokens
This corpus was compiled from news texts broadcast in the TRT Main News Bulletin in 2015.
For detailed information, please refer to the related doctoral dissertation.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Social Sciences Corpora
Communication Sciences Abstracts
1.5 Million Tokens
This corpus consists of 6,573 Master’s and Doctoral theses submitted to the Turkish Council of Higher Education (YÖK) Thesis Center between 1986 and 2023, focusing on communication-related disciplines such as journalism, media studies, and public relations.
The theses included in this collection were selected based on the index terms assigned by YÖK, ensuring thematic relevance and disciplinary diversity.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Educational Sciences Corpora
The Reading Teacher Corpus
3.5 Million Tokens
This corpus is composed of academic papers published in The Reading Teacher Journal between 2012 and 2021, covering a ten-year period. It traces the evolution of scholarly research on reading, capturing key developments in pedagogy, methodology, and literacy studies. The corpus metadata includes the top two keywords from each article, enabling focused lexical and thematic analyses. For detailed information, please refer to this article.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Educational Sciences Abstracts
9.2 Million Tokens
This corpus consists of abstracts from 30,515 Master’s and Doctoral theses submitted to the Turkish Council of Higher Education (YÖK) Thesis Center between 1980 and 2020.
The theses included in this collection were selected based on the index terms assigned by YÖK, ensuring thematic relevance and disciplinary diversity.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Second Language Teaching Corpus
1.3 Million Tokens
This corpus consists of 44 textbooks commonly used in teaching Turkish as a Second Language.
It covers proficiency levels A1, A2, B1, B2, and C1, providing a representative sample of instructional language across beginner to advanced stages of learning.
Please refer to the following doctoral dissertation for more details.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
TaSL Corpus
121k Tokens
The TaSL (Turkish as a Second Language) corpus presents 14 coursebooks for the A1 and A2 levels that are commonly used in teaching Turkish to foreign students in higher education.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Parallel Corpus
Turkish-English Parallel Corpus
647 Tokens
A Turkish-English parallel corpus of translated sentences.
The corpus works in parallel with the English-Turkish Parallel Corpus.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
English-Turkish Parallel Corpus
837k Tokens
An English-Turkish parallel corpus of translated sentences.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
Turkish Movie Subtitles
266 Million Tokens
A corpus of movie subtitles translated by Open Subtitles and presented by the Opus Project.
The corpus works in parallel with the English Movie Subtitles corpus.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ❌ |
English Movie Subtitles
353 Million Tokens
A corpus of movie subtitles translated by Open Subtitles and presented by the Opus Project.
The corpus works in parallel with the Turkish Movie Subtitles corpus.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ❌ |
Other
Dystopian Movies Corpus
281k Tokens
This corpus consists of scripts from English-language dystopian films, providing linguistic data that capture the themes, dialogues, and narrative structures characteristic of dystopian cinema.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ❌ | ✔️ |
Yaşar Kemal 19
2.4 Million Tokens
This corpus consists of 19 books written by Yaşar Kemal.
POS | Lemma | Morph | Metadata |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |