Here are some useful notes from Ergun Bicici on getting started with Turkish-English machine translation, followed by some suggestions by Murat Alperen on collecting Turkish-English parallel text data.
Turkish-English parallel text from Kemal Oflazer, "Statistical Machine Translation into a Morphologically Complex Language", invited paper, in Proceedings of CICLing 2008 (Conference on Intelligent Text Processing and Computational Linguistics), Haifa, Israel, February 2008 (lowercased and converted to UTF-8):
The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems, some are not.
Here is the Turkish text to develop the language model:
The directions for the Moses baseline system:
The link for the scripts:
Be careful to put the stems and suffixes back together before computing the BLEU score. Scoring the split output artificially inflates the score, since stems and suffixes are then matched as separate tokens rather than as whole words.
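The rejoining step can be sketched in a few lines of Python. This is only a minimal sketch under an assumption that should be checked against the actual data: each separated suffix is taken to be its own whitespace-delimited token beginning with '+', and it is re-attached to the preceding token by plain concatenation. The function name `rejoin_suffixes` and the sample sentence are made up for illustration.

```python
def rejoin_suffixes(line: str, marker: str = "+") -> str:
    """Attach every token that starts with `marker` to the token before it.

    Assumption: suffix tokens look like '+ler' and are glued back on by
    simple concatenation after dropping the marker. A lone marker token,
    or a suffix with nothing before it, is left unchanged.
    """
    words = []
    for token in line.split():
        is_suffix = token.startswith(marker) and len(token) > len(marker)
        if is_suffix and words:
            words[-1] += token[len(marker):]
        else:
            words.append(token)
    return " ".join(words)


# Hypothetical example line, not taken from the dataset.
print(rejoin_suffixes("ev +ler +de otur +uyor"))  # evlerde oturuyor
```

Apply the same function to both the system output and the reference translations, so that the two sides are tokenized identically when they are scored.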
To compute the score, do not use the mteval scorer at http://www.statmt.org/wmt09/baseline.html, because it retokenizes the input and splits off all the '+' characters that are used to denote suffixes. Either use the multi-bleu Perl script, or comment out the language-dependent part of NormalizeText in mteval.
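To see why retokenization matters, here is a small Python illustration. It is not mteval's code: `pad_punctuation` is a hypothetical, simplified stand-in for a tokenizer that puts spaces around punctuation, and the sample string is invented. It only shows how such a step changes the token sequence when '+' marks suffixes.

```python
import re


def pad_punctuation(text: str) -> str:
    """Simplified stand-in for a punctuation-splitting tokenizer.

    Puts spaces around '+' and a few other symbols, then collapses
    runs of whitespace. This is an illustration only.
    """
    padded = re.sub(r"([+.,!?;:])", r" \1 ", text)
    return " ".join(padded.split())


hypothesis = "ev +ler +de"  # hypothetical split output
print(hypothesis.split())                   # ['ev', '+ler', '+de']
print(pad_punctuation(hypothesis).split())  # ['ev', '+', 'ler', '+', 'de']
```

Three tokens become five, and every '+' turns into a token of its own that can match the reference, so the n-gram counts no longer reflect the words that were actually produced.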
For Turkish dictionaries and other resources please see Turkish language resources.