February 16, 2009

Ergun's English-Turkish machine translation notes

Here are some useful notes from Ergun Bicici on getting started with Turkish-English machine translation, followed by some suggestions by Murat Alperen on collecting Turkish-English parallel text data.

Turkish-English parallel text from Kemal Oflazer, "Statistical Machine Translation into a Morphologically Complex Language", invited paper, in Proceedings of CICLING 2008 (Conference on Intelligent Text Processing and Computational Linguistics), Haifa, Israel, February 2008 (lowercased and converted to UTF-8):
en-tr.zip

The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems, some are not.

Here is the Turkish text for training the language model:
lm.tr.gz

The directions for the Moses baseline system:
http://www.statmt.org/wmt09/baseline.html

The link for the scripts:
http://www.statmt.org/wmt08/scripts.tgz

Be careful to put the stems and suffixes back together before computing the BLEU score. Splitting them artificially increases the score.
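A small sketch of why scoring on split tokens inflates the result. It computes the clipped n-gram precision that BLEU is built on, for one made-up sentence pair. The tokens "ev +ler +de" and "ev +ler +den" are hypothetical examples of the '+' suffix notation, not lines from the dataset, and `ngram_precision` is an illustrative helper, not part of any scoring script.

```python
from collections import Counter

def ngram_precision(hyp, ref, n=1):
    """Clipped n-gram precision of a hypothesis against one reference.

    Both arguments are lists of tokens.
    """
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total

# Hypothetical example: the system got the stem and first suffix right,
# but the last suffix wrong.
ref_split = "ev +ler +de".split()
hyp_split = "ev +ler +den".split()
split_p = ngram_precision(hyp_split, ref_split)

# The same pair after gluing suffixes back onto the stem: one wrong word.
ref_joined = ["ev+ler+de"]
hyp_joined = ["ev+ler+den"]
joined_p = ngram_precision(hyp_joined, ref_joined)

print(split_p, joined_p)
```

On split tokens the wrong word still earns credit for two of its three pieces, while on whole words it earns none.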

To compute the score, do not use the mteval scorer at http://www.statmt.org/wmt09/baseline.html, because it retokenizes the input and splits off all the '+' characters that are used to denote suffixes. Either use the multi-bleu perl script, or comment out the language-dependent part of NormalizeText in mteval.

For Turkish dictionaries and other resources please see Turkish language resources.


10 comments:

Anonymous said...

Dear Deniz,
Since I also worked on statistical translation for a while, I went hunting for parallel text on the internet in order to train the Moses system. In the end, although not at the scale you are looking for, I managed to find what I think is enough text to build an experimental system. In fact, in the summer of 2008 I forwarded the list I had found to Google, and I believe they used these texts while developing their current system. The list I mentioned is at the address below.

Another idea that occurred to me is whether we could build our own artificial corpus by generating sentences with rules, producing both the source-language and the target-language version of a text at the same time. After all, I think that purely phrase-based SMT will not give the expected results for the English-Turkish language pair, and that the representations newly developed under factored SMT will also have to be used.

http://markmail.org/message/ty4hqb53qpo33w6s

Regards
msalperen@gmail.com

Deniz Yuret said...

I would like to thank msalperen@gmail.com for the comment. For reference, I am copying below the two email messages from the link in the comment:

Deniz Yuret said...

Subject: English-Turkish Parallel Texts
From: msal...@gmail.com
Date: Jul 25, 2008 10:52:53 am
List: com.googlegroups.google-translate

Here is a list of English-Turkish parallel files that I've found across the web in bulk (at least enough to get started on something). I've successfully used sptoolkit and then hunalign to align the corpus as parallel text. In the trials I ran, hunalign aligned the sentence pairs correctly, so these files are good candidates to work on. There are also some techniques for gathering parallel text using Google search. I will describe some of them here if asked.

Here's the list, which contains the corresponding items in both languages. They total 5 million words or more by my estimate.

1-6
http://www.dpt.gov.tr/bgyu/bkp/dap.html

7-14
http://www.dpt.gov.tr/bgyu/bkp/dokap.html

15
http://www.dpt.gov.tr/bgyu/kirsalka/kirsalka.html

16-22
http://ekutup.dpt.gov.tr/kep/kep.asp

23-25
http://ekutup.dpt.gov.tr/ovp/ovp.asp

26
http://www.teknikengel.gov.tr/index.cfm?action=icerik&id=70

27
http://www.bddk.org.tr/turkce/Basel-II/Basel-II.aspx

28-29
http://www.tbb.org.tr/english/v12/legislation.htm
http://www.tbb.org.tr/v12/TemelDuzenlemeler.htm

30
http://www.bis.org/publ/bcbs109.htm
http://www.bddk.org.tr/turkce/Basel-II/Basel-II.aspx

31
http://www.nato.int/docu/review/2006/issue1/turkish/main.htm
http://www.nato.int/docu/review/2006/issue1/english/main.htm

32
http://www.muhasebat.gov.tr/yayin/indexE.php
http://www.muhasebat.gov.tr/yayin/index.php

33
http://www.hm.saglik.gov.tr/pdf/kitaplar/200711061632380.neredennereyeturkce.pdf
http://www.hm.saglik.gov.tr/pdf/kitaplar/200709241502590.NeredenNereyeEn20070924.pdf

34
http://www.hm.saglik.gov.tr/pdf/kitaplar/200704061330550.hhtr.pdf
http://www.hm.saglik.gov.tr/pdf/kitaplar/200704061334530.hhing.pdf

35
http://www.hm.saglik.gov.tr/pdf/kitaplar/ulusalsagheseng.pdf
http://www.hm.saglik.gov.tr/pdf/kitaplar/ulusalsaghestr.pdf

36
http://www.hm.saglik.gov.tr/pdf/kitaplar/sdping.pdf
http://www.hm.saglik.gov.tr/pdf/kitaplar/ulusalsaghestr.pdf

37
http://www.hm.saglik.gov.tr/pdf/kitaplar/200802051624140.SAIK_calistay_tr.pdf
http://www.hm.saglik.gov.tr/pdf/kitaplar/200802051627580.SAIK_calistay_en.pdf

38
http://www.hm.saglik.gov.tr/pdf/kitaplar/200704061342050.NBDing.pdf
http://www.hm.saglik.gov.tr/images/amazon/200704061339590.NBDtr.png

39
http://ab.calisma.gov.tr/dnn/Docs/yayinlar/WGI.pdf

40
http://ab.calisma.gov.tr/dnn/Docs/yayinlar/WGII.pdf

41
http://www.kulturturizm.gov.tr/genel/text/eng/TST2023.pdf
http://www.kultur.gov.tr/TR/Tempdosyalar/189566__TTStratejisi2023.pdf

42-47
http://www.hakikatkitabevi.com/download/download.htm#turkce
http://www.hakikatkitabevi.com/download/download.htm#english

48
http://www.eurydice.org/portal/page/portal/Eurydice/ByCountryResults?countryCode=TR
http://www.eurydice.org/ressources/eurydice/eurybase/pdf/0_integral/TR_EN.pdf
http://www.eurydice.org/ressources/eurydice/eurybase/pdf/0_integral/TR_TR.pdf

49
http://www.eurydice.org/ressources/eurydice/pdf/041DN/041_TR_EN.pdf
http://www.eurydice.org/ressources/eurydice/pdf/041DN/041_TR_TR.pdf

50
http://www.stgm.org.tr/docs/1200663137Cevresel%20Ayrimcilik-JorgeDaniel%20Taillant.doc
http://www.cedha.org.ar/docs/doc24.doc

51
http://www.invest.gov.tr/documents/investorguide_tr.pdf
http://www.invest.gov.tr/documents/investorguide_tr.pdf

52
http://www.tusiad.org/tusiad_cms.nsf/bec/BB2C1347DC29D8A6C225746700355DB5?OpenDocument
http://www.tusiad.org/tusiad_cms.nsf/bec/BB2C1347DC29D8A6C225746700355DB5?OpenDocument

53
http://www.tusiad.org/tusiad_cms_eng.nsf/bd/571E8607889539A8C2257364003822C2?OpenDocument
http://www.tusiad.org/tusiad_cms.nsf/bd/D568C5F7B70692F5C225733E00432454?OpenDocument

54
http://www.tusiad.org/tusiad_cms_eng.nsf/Raporlar?OpenForm&Seq=1&TumRaporlar#_RefreshKW_sene

55
http://www.undp.org.tr/publicationsDocuments/NHDR_Tr.pdf
http://www.undp.org.tr/publicationsDocuments/NHDR_En.pdf

56
http://www.undp.org.tr/povRedDocuments/Competitiveness_Agenda_Report_tr.pdf
http://www.undp.org.tr/publicationsDocuments/Competitiveness_Agenda_(EN).pdf

57
http://www.undp.org.tr/publicationsDocuments/CSR_Report_tr.pdf
http://www.undp.org.tr/publicationsDocuments/CSR_Report_en.pdf

58
http://www.undp.org.tr/publicationsDocuments/brosur_en.pdf
http://www.undp.org.tr/publicationsDocuments/brosur_tr.pdf

59
http://www.undp.org.tr/publicationsDocuments/Practical_Guide_2008_tr.pdf
http://www.undp.org.tr/publicationsDocuments/Practical_Guide_2008_En.pdf

60
http://www.ihb.gov.tr/yayinlar/insan_haklari_nedir_kitap.pdf
http://www.ihb.gov.tr/yayinlar/What_are_human_rights.pdf

61
http://www.hrw.org/reports/2005/turkey0305/turkey0305text.pdf
http://www.hrw.org/turkish/reports/turkey0305/turkey0305trweb.pdf

62
http://hrw.org/reports/2008/turkey0508/turkey0508tuweb.pdf

Deniz Yuret said...

Subject: Re: English-Turkish Parallel Texts
From: msal...@gmail.com
Date: Jul 25, 2008 11:33:28 am
List: com.googlegroups.google-translate

You can gather translated text from a single page that contains both versions, or you can find the translated versions of a document from an index page that links to both.

For the first, try for frequently used phrases, such as:

"bazı hallerde" "in some cases"

Running this, for instance, returns the following page as a search result, which includes the English translation of an article on chemistry (source and target text on the same page):

http://www.kimpeks.com/modules.php?name=Content&pa=showpage&pid=39

For the second, search for phrases like

"also available in turkish". Note that the query is in English but targets the translated text in Turkish.

This returns the following web page among the search results, which links to a number of documents:

http://www.esiweb.org/index.php?lang=en&id=156

Note that this page also links to parallel text for some low-resource languages

(This document is also available in Albanian. ;)

Another example of this technique:

"this page in turkish" and the associated web result: http://www.khazaria.com/khazar-history.html

in which you can find a link to the translated version: http://www.khazaria.com/turkce/hazartarih.html

Focus on organizational or government web sites; .org and .gov sites will yield more translated text than any other domain.
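The two search techniques can be sketched as a pair of query builders. This is only an illustration: the function names are made up, and the optional `site:` restriction is an assumption added here to reflect the advice about .org and .gov sites, not something the message itself uses.

```python
def same_page_query(tr_phrase, en_phrase):
    """First technique: look for pages containing a frequent phrase in both languages."""
    return f'"{tr_phrase}" "{en_phrase}"'

def cue_phrase_query(cue, domain=None):
    """Second technique: look for index pages that announce a translated version.

    `domain` is an optional restriction such as ".org" (an assumption of this sketch).
    """
    query = f'"{cue}"'
    if domain:
        query += f" site:{domain}"
    return query

# The example phrases are the ones quoted in the message above.
queries = [
    same_page_query("bazı hallerde", "in some cases"),
    cue_phrase_query("also available in turkish"),
    cue_phrase_query("this page in turkish", domain=".org"),
]
for q in queries:
    print(q)
```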

Hope this helps. By the way, we should announce parallel links here to inform Google about more resources which will benefit ourselves in terms of new language pairs ;) by

Murat Alperen said...

There is also a news source which includes stories with high quality translations in 10 Balkan languages. I think this is also a very valuable multilingual resource since those language pairs are very rare. Here it is:

http://www.setimes.com/

Regards

Tux said...

"...Be careful to put the stems and suffixes back together before computing the BLEU score. Splitting them artificially increases the score..."

Is there a script that is available to do this for Turkish stems and suffixes?
Thanks for all the info. Very helpful.

Deniz Yuret said...

About wordifying (putting stems and suffixes back together) the Turkish MT output: typically all the suffixes start with the '+' character. If you delete the spaces preceding any '+' character you will be 99% there. There are a few exceptions, such as literal uses of '+' in the text; you can try to escape those, but they are rare enough that they will not make a large difference in the score.
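A minimal sketch of that rule in Python, assuming one sentence per line and suffix tokens that begin with '+'. The function name `wordify` and the sample sentence are illustrative, not taken from the dataset. As noted above, a literal '+' in the text would also be glued to the previous token.

```python
import re

def wordify(line):
    """Delete the whitespace preceding any '+' so suffixes rejoin their stems."""
    return re.sub(r"\s+\+", "+", line)

# Hypothetical split output: two words, the first with two suffixes.
print(wordify("ev +ler +de kal +dı"))
```

Apply the same function to both the system output and the reference before scoring, so that the two stay in the same format.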

Tux said...

I actually thought this was about restoring the right form of the vowel-dependent suffixes (like +mak becoming "mek" or "mak" depending on the preceding characters).
I can imagine this might be quite tricky in Turkish, for some uncommon combinations. That is why I thought maybe there is a script available for this.

Thank you for the very quick reply :).

Deniz Yuret said...

For recovering the actual compound word with vowel harmony etc. you would have to use Oflazer's finite state implementation, which I believe can be run in both analysis and synthesis directions. For the purposes of the BLEU score, however, this is unnecessary as long as both the system output and the gold reference are in the same format.
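To give a feel for what the synthesis direction involves, here is a toy sketch of the single case mentioned in the question (+mak surfacing as "mak" or "mek"). It is not Oflazer's finite state implementation and covers none of the harder phenomena. It assumes lowercased stems, as in the dataset, and the function name is made up for illustration.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

def attach_infinitive(stem):
    """Pick -mak or -mek from the last vowel of a lowercased stem."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "mak"
        if ch in FRONT_VOWELS:
            return stem + "mek"
    raise ValueError(f"no vowel found in stem: {stem!r}")

for stem in ["al", "gel", "bul", "gör"]:
    print(stem, "->", attach_infinitive(stem))
```

Real suffixes also involve four-way harmony, consonant changes, and buffer letters, which is why a full morphological generator is needed for general output.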

Deniz Yuret said...

A message from Murat about setimes:

Dear Deniz,

When I looked through setimes.com this evening, I saw that it contains far more parallel text than on my earlier visits. As you may recall, I had compiled a list of parallel text files on the internet and sent it first to Google Groups and then to your blog page. I estimate that the texts at setimes.com total more than 3 million words per language. To examine the site's content more closely, you can try the Google queries below. The newsbriefs section has around 12-13 thousand texts averaging 100 words each, while the other sections have relatively longer but fewer texts. However we look at it, there is a corpus of at least 3 million words there right now. A similar site is the NATO Review magazine, whose archive is also fairly large.

Taking all this into account, together with an artificial corpus compiled from dictionaries, I think a size of 6-7 million or even 10 million words is within reach.

site:setimes.com inurl:/features/ inurl:/tr/ -archivelist inurl:/newsbriefs/
site:setimes.com inurl:/features/ inurl:/tr/ -archivelist inurl:/articles/
site:setimes.com inurl:/features/ inurl:/tr/ -archivelist inurl:/roundup/
site:setimes.com inurl:/features/ inurl:/tr/ -archivelist inurl:/blogreview/
http://www.nato.int/docu/review/2007/issue4/lang.html

Regards