Deniz Yuret's Homepage: Turkish Language Resources

January 27, 2014

Turkish Language Resources

This post contains links to various Turkish language resources that I have collected. Please send a comment if you find Turkish resources that you would like to see on this page.

Turkish Data Depository (TDD): Our new website that we hope will act as a central location for all computational resources and models for Turkish. Most resources listed below and more will be found there.
ITU Turkish Natural Language Processing Pipeline: This webpage provides the Turkish NLP tools and API developed at Istanbul Technical University by the Natural Language Processing group led by Asst. Prof. Dr. Gülşen Eryiğit, including a Tokenizer, Normalization tools, Deasciifier, Vocalizer, Spelling Corrector, Turkish word detector, Morphological Analyzer, Morphological Disambiguator, Named Entity Recognizer, and Dependency Parser.
TS Corpus: Taner Sezer's TS Corpus is a 491M-token general-purpose Turkish corpus. See comments below for details.
BounWebCorpus: Hasim Sak's page contains some useful Turkish language resources and code in addition to a large web corpus. (2021-08-26: link broken)
Bibliography: Özgür Yılmazel's Bibliography on Turkish Information Retrieval and Natural Language Processing.
tr-disamb.tgz: Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.
correctparses_03.txt.gz, train.merge.gz: Turkish morphology training files. They were tagged semi-automatically and have limited accuracy. The two files contain the same data, except that the second file also includes the ambiguous parses (the first parse on each line is the correct one). A newer version can be found in TrMor2006.
test.1.2.dis.gz, test.merge.gz: Turkish morphology test files; the second one includes ambiguous parses (the first parse on each line is the correct one). The data is hand-tagged and has good accuracy. A newer version can be found in TrMor2006.
tr-tagger.tgz: Turkish morphological tagger, includes Oflazer's finite state machines for Turkish. From Kemal Oflazer. Please use with permission. Requires the publicly available Xerox Finite State software. For a better tagger, use Morse.
turklex.tgz, pc_kimmo.tgz: Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu
Milliyet.clean.bz2: Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 (ISO-8859-9) encoded; originally split into three 11MB parts. From Kemal Oflazer. Please use with permission.
Turkish wordnet: From Kemal Oflazer. Please use with permission. (2021-08-26: link broken)
METU-Sabanci Turkish Treebank: Turkish treebank with dependency annotations. Please use with permission.
sozluk.txt.gz: English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.
sozluk-boun.txt.gz: Turkish word list (25822 words, 73K) Originally from www.cmpe.boun.edu.tr/courses/cmpe230
Avrupa Birliği Temel Terimler Sözlüğü: (Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)
BilisimSozlugu.zip: Bilişim Sözlüğü by Bülent Sankur (Originally from: www.bilisimsozlugu.com, Oct 9, 2006)
turkish.el: Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.
en-tr.zip, lm.tr.gz: Turkish-English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 -- Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to UTF-8). The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems and some are not. lm.tr.gz is the Turkish text used to develop the language model.
Zemberek (blog) (github): Zemberek natural language processing library.
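
For readers who want to load some of the files listed above, here is a minimal Python sketch; it is not part of the original resources. The first function reads a bz2-compressed, Latin-5 encoded file with one token per line, as described for Milliyet.clean.bz2. The second splits a line in which the first parse is the correct one, as described for train.merge.gz and test.merge.gz. The function names are hypothetical, and the line layout assumed by `split_parses` (whitespace-separated fields, word first, then its parses) is an assumption that should be checked against the actual files.

```python
import bz2
from typing import Iterator, List, Tuple


def read_tokens(path: str, encoding: str = "iso8859_9") -> Iterator[str]:
    """Yield tokens from a bz2-compressed file with one token per line.

    Latin-5 is ISO-8859-9, which Python exposes as the "iso8859_9" codec.
    Empty lines are skipped.
    """
    with bz2.open(path, mode="rt", encoding=encoding) as f:
        for line in f:
            token = line.strip()
            if token:
                yield token


def split_parses(line: str) -> Tuple[str, str, List[str]]:
    """Split a line into (word, correct parse, remaining ambiguous parses).

    ASSUMPTION: fields are separated by whitespace, the surface word comes
    first, and the parses follow it, with the first parse being the correct
    one. Verify this against the real files before relying on it.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected a word and at least one parse: {line!r}")
    word, correct, *others = fields
    return word, correct, others
```

Under these assumptions, `sum(1 for _ in read_tokens("Milliyet.clean.bz2"))` would count the tokens in the corpus without loading it into memory, and `split_parses` can be applied to each line of a decompressed merge file to separate the gold parse from the alternatives.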

14 comments:

Steve Neufeld said...: Great list of resources. I am only just now beginning to dabble in parallel corpora. Do you know this online parallel corpus search engine at http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=OpenSubtitles;lang=en ?; April 12, 2012
Deniz Yuret said...: From: Taner Sezer
Subject: TS Corpus

TS Corpus is a general-purpose Turkish corpus of 491 million tokens in total (491,360,398 tokens), fully annotated for part of speech and morphology. The project aims to build a modern corpus as a usable product by bringing together earlier computational linguistics work on Turkish and corpus linguistics work done around the world. TS Corpus is a "tagged" corpus built for the Turkish of Turkey. The first version of TS Corpus was released on March 1, 2012. The second version was released and made available on August 30, 2012. Work on version 2.1 of the corpus is ongoing.
Detailed information about the corpus and the registration form are available at http://tscorpus.com.
TS Corpus presents its data annotated on two separate planes, linguistic and formal. The linguistic annotation is offered at three levels: PosTag (part of speech), Morphological Tagging, and Lemma. Users can search on all three tags.
Some of the features that make TS Corpus special are listed below. It is a Turkish corpus that is:

- One of the largest Turkish corpora built so far (491 million tokens),
- Tagged for part of speech (PosTag),
- Morphologically tagged (Morphological Tagging),
- Annotated with lemmas and searchable by lemma,
- Usable through online access,
- Able to present 7 different kinds of linguistic statistics to the user together,
- Built on the CWB/CQP infrastructure,
- Designed to let users save their results in different formats,
- Offered with open access.; April 05, 2013
Unknown said...: Some more resources from YTU's NLP group: http://www.kemik.yildiz.edu.tr/?id=28; May 10, 2013
Alejandro Gutman said...: I suggest our website (The Language Gulper) as a possible useful link: http://www.languagesgulper.com/

It describes almost 200 languages, including articles about Turkic and Turkish. Besides, it has articles about another nine Turkic languages: http://www.languagesgulper.com/eng/Turkic.html; July 03, 2013
Ergun said...: English-Turkish Parallel Corpus: 1984_en-tr_SentenceAligned_ParallelCorpus.zip [Usage: For training SMT systems or running sentence alignment experiments.]
https://github.com/bicici/SMTData; January 28, 2014
Unknown said...: Nuve would be a good addition: https://github.com/hrzafer/nuve. It's been very useful at my company.; October 13, 2015
Magister said...: Kemal Oflazer's page is no longer available and a Turkish WordNet page doesn't seem to exist anywhere on the web. Do you have an updated link? Thank you; December 08, 2015
Deniz Yuret said...: Dear Giuliano, please contact Kemal Oflazer directly for the Turkish WordNet.; December 11, 2015
Zargan Ltd. said...: Here is a link to Turkish Wordnet: https://bitbucket.org/ozlemc/twn/downloads; May 14, 2016
Anonymous said...: Why is zemberek not listed here?; February 04, 2018
Deniz Yuret said...: Zemberek has been added.; February 05, 2018
Anonymous said...: Can we use the dictionary without restrictions?
Thank you; August 12, 2019
Deniz Yuret said...: I don't know whether there are any restrictions; you can check the original source.; August 19, 2019
Halil Ural said...: Hello Professor Deniz,

You could also add the Snowball Turkish Stemmer.

https://github.com/otuncelli/turkish-stemmer-python

Sample Turkish vocabulary for Snowball:

https://raw.githubusercontent.com/snowballstem/snowball-data/master/turkish/voc.txt

Resha Turkish Stemmer

https://github.com/hrzafer/resha-turkish-stemmer

Turkish Stemmer Deep Learning Based Seq2Seq

https://github.com/deeplearningturkiye/kelime_kok_ayirici; March 01, 2020
