TS Wikipedia Corpus

The TS Wikipedia Corpus is compiled from the July 2013 dump of Turkish Wikipedia and includes 215,068 entries. Wikipedia is a useful source for building a general-purpose corpus because its texts cover a wide variety of subjects. The corpus contains 1,779,228 word types.

The source data was first preprocessed to eliminate auto-generated empty entries. External URLs, images, tables, and other non-text content were then removed.
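The cleaning steps above can be illustrated with a minimal sketch. It is not the pipeline actually used to build the corpus, which is not documented here. It assumes the entries are raw MediaWiki markup strings, and the function names (clean_entry, preprocess) and the regular expressions are hypothetical choices made only for illustration.

```python
import re

# Hypothetical patterns for common MediaWiki markup (assumptions, not the
# corpus authors' rules). "Dosya" and "Resim" are the Turkish namespace
# names for "File" and "Image".
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
FILE_LINK_RE = re.compile(
    r"\[\[(?:File|Image|Dosya|Resim):[^\[\]]*\]\]", re.IGNORECASE
)
EXTERNAL_LINK_RE = re.compile(r"\[https?://[^\s\]]+(?:\s+([^\]]*))?\]")
BARE_URL_RE = re.compile(r"https?://\S+")


def clean_entry(wikitext):
    """Strip tables, image links, and external URLs from one entry."""
    text = TABLE_RE.sub(" ", wikitext)
    # Simple image links only; captions containing nested [[links]] are
    # not handled by this sketch.
    text = FILE_LINK_RE.sub(" ", text)
    # Keep the visible label of a bracketed external link, drop the URL.
    text = EXTERNAL_LINK_RE.sub(lambda m: m.group(1) or " ", text)
    text = BARE_URL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def preprocess(entries):
    """Clean every entry and drop those that end up empty."""
    cleaned = (clean_entry(entry) for entry in entries)
    return [text for text in cleaned if text]
```

Calling preprocess on a list of raw entry strings returns only the entries that still contain text after cleaning. This sketch filters empty entries after cleaning, so entries consisting solely of markup are dropped as well. A real pipeline would more likely rely on a dedicated wikitext parser than on regular expressions.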

Like other corpora, the TS Wikipedia Corpus includes part-of-speech tagging and morphological annotation.

