TS Wikipedia Corpus
TS Wikipedia Corpus is compiled from the July 2013 dump of Turkish Wikipedia pages. The corpus includes 215,068 entries from Wikipedia. Wikipedia is a useful source for building a general-purpose corpus because its texts cover a wide variety of subjects. The corpus contains 1,779,228 word types.
The source data was first preprocessed to eliminate auto-generated empty entries. Then external URLs, images, tables, and other non-text content were removed.
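A cleanup of this kind could be sketched roughly as follows. This is only an illustration, not the pipeline actually used for the corpus: it assumes the entries are available as (title, wikitext) pairs in MediaWiki markup, treats whitespace-only entries as "empty", and uses hypothetical function names (`clean_entry`, `preprocess`). The regular expressions are deliberately simple and would miss nested markup such as links inside image captions.

```python
import re

# Assumption: MediaWiki markup. Tables are delimited by "{|" ... "|}".
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# Assumption: image links use the File/Image namespaces or their
# Turkish equivalents (Dosya/Resim). Nested brackets are not handled.
IMAGE_RE = re.compile(r"\[\[(?:File|Image|Dosya|Resim):[^\]]*\]\]", re.IGNORECASE)
# Bracketed external links and bare URLs.
EXTERNAL_LINK_RE = re.compile(r"\[https?://[^\]]*\]|https?://\S+")


def clean_entry(text):
    """Strip tables, images, and external URLs, then normalize whitespace."""
    text = TABLE_RE.sub(" ", text)
    text = IMAGE_RE.sub(" ", text)
    text = EXTERNAL_LINK_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(entries):
    """Drop empty entries first, then clean the remaining ones."""
    cleaned = []
    for title, text in entries:
        if not text.strip():
            continue  # empty entry: nothing to keep
        body = clean_entry(text)
        if body:  # skip entries that contained only non-text content
            cleaned.append((title, body))
    return cleaned
```

Filtering empty entries before cleaning mirrors the order described above; the second emptiness check is an extra assumption that entries consisting solely of tables or images should also be dropped.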
Like the other corpora, TS Wikipedia Corpus includes part-of-speech tagging and morphological annotation.