February 16, 2009

Ergun's English-Turkish machine translation notes

Here are some useful notes from Ergun Bicici on getting started with Turkish-English machine translation, followed by some suggestions from Murat Alperen on collecting Turkish-English parallel text data.

Turkish-English parallel text from Kemal Oflazer, "Statistical Machine Translation into a Morphologically Complex Language", invited paper, in Proceedings of CICLing 2008 (Conference on Intelligent Text Processing and Computational Linguistics), Haifa, Israel, February 2008 (lowercased and converted to UTF-8):
en-tr.zip

The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems, some are not.

Here is the Turkish text for training the language model:
lm.tr.gz

The directions for the Moses baseline system:
http://www.statmt.org/wmt09/baseline.html

The link for the scripts:
http://www.statmt.org/wmt08/scripts.tgz

Be careful to put the stems and suffixes back together before computing the BLEU score. Scoring the split output artificially inflates the score, since stems and suffixes are then matched as separate tokens.
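A minimal sketch of the rejoining step, in Python. It assumes that each separated suffix is its own whitespace-delimited token beginning with '+' (for example "ev +ler +de"); that convention, the function name rejoin_suffixes and the sample segmentation are illustrative assumptions, so check them against the actual files in en-tr.zip before relying on this.

```python
def rejoin_suffixes(line: str, marker: str = "+") -> str:
    """Attach every token that starts with `marker` to the token before it.

    Assumption (not confirmed by the dataset): a split suffix appears as a
    separate token such as "+ler". A bare "+" or a marked token with nothing
    before it is left unchanged.
    """
    words = []
    for token in line.split():
        is_suffix = token.startswith(marker) and len(token) > len(marker)
        if is_suffix and words:
            # Glue the suffix (without its marker) onto the previous word.
            words[-1] += token[len(marker):]
        else:
            words.append(token)
    return " ".join(words)


if __name__ == "__main__":
    # Hypothetical segmentation, for illustration only.
    print(rejoin_suffixes("ev +ler +de oturuyor +um"))
```

Run on the sample line, this prints "evlerde oturuyorum". The same function would be applied to both the system output and the reference before scoring.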

To compute the score, do not use the mteval scorer at http://www.statmt.org/wmt09/baseline.html, because it retokenizes the input and splits off all the '+' characters that are used to denote suffixes. Either use the multi-bleu Perl script or comment out the language-dependent part of NormalizeText in mteval.
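To see how much tokenization can move the score, here is a small Python sketch of clipped n-gram precision, the quantity that BLEU is built from, computed over whitespace tokens only. It is not the multi-bleu or mteval code, it omits the brevity penalty and the averaging over n-gram orders, and the function name ngram_precision and the example sentences (with a hypothetical '+' segmentation) are assumptions made for illustration.

```python
from collections import Counter


def ngram_precision(hypothesis: str, reference: str, n: int) -> float:
    """Clipped n-gram precision over whitespace tokens (no retokenization)."""
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


if __name__ == "__main__":
    # The same hypothesis/reference pair, once split and once rejoined.
    split_hyp, split_ref = "ev +ler +de oturuyor +uz", "ev +ler +de oturuyor +um"
    joined_hyp, joined_ref = "evlerde oturuyoruz", "evlerde oturuyorum"
    for n in (1, 2):
        print(n, ngram_precision(split_hyp, split_ref, n),
              ngram_precision(joined_hyp, joined_ref, n))
```

In this toy pair a single wrong suffix costs one token out of five when the text is split (unigram precision 4/5, bigram precision 3/4), but one word out of two when it is rejoined (unigram precision 1/2, bigram precision 0), which is why the choice of tokenization has to be the same for every system being compared.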

For Turkish dictionaries and other resources, please see Turkish language resources.



February 11, 2009

Turkish morphology presentation

Please click on "Full Post" to view an introductory Turkish morphology presentation. To learn more about Turkish morphology, I recommend this paper by Kemal Oflazer. For morphological disambiguation, check out Haşim Sak's webpage or this paper by Deniz Yuret and Ferhan Türe. You can try analyzing a Turkish word here:
Full post... Related link

February 02, 2009

English-Turkish automatic translation

Google has finally added Turkish to the languages it translates
automatically. If you would like to try it: http://translate.google.com

I think this technology is important for giving the Turkish
population that does not speak English access to the knowledge
accumulated on the internet, and I have been working on it myself
for a few years. One of the biggest obstacles is the need for a
large amount of English-Turkish parallel text that can be used for
research purposes (about 100 million words = 1000 books). To
collect this text I spent a year or two on the phone with
government agencies, international organizations, publishers, news
organizations, law and translation firms, university departments
and so on; having failed to get a positive answer, I got tired and
gave up. The sad part is that the big obstacle I ran into was not
a legal matter such as copyright or intellectual property, but
people's indifference. For now I am working with my students on a
toy system developed from one or two million words of text. I wish
I had been the one to write Google's system. But the race is not
over yet; as an example of the system's quality, here is its
translation of this paragraph (originally written in Turkish)...

This technology does not speak English in the Turkish population
on the Internet access to knowledge is important to think and a
few years, I am working on. One of the biggest obstacles can be
used for research purposes in the amount of installed
English-English parallel text is needed (about 100 million word =
1000 books). This text to be able to collect a two-year phone and
the state institutions, international organizations, publishing,
news agency, law and translation companies, universities sections
and with a positive answer so do not get tired and I've given
up. The major obstacle faced by the unfortunate job of
broadcasting the The right to legal issues such as property, not
the ideas, people is indifference. Currently, one of two million
words of text I'm dealing with a system developed for students
with toys. I would like it to Google's system. But the contest
yet not finished, as an example of the quality of the system of
this paragraph I give a translation...

