I am an associate professor in Computer Engineering at Koç University in Istanbul working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. This year I am helping organize SemEval-2013. For prospective students here are some research topics, papers, classes, blog posts and past students.

June 20, 2013

NAACL 2013 Notes

NAACL was great fun this year. It seems everybody is interested in Semantics and ways of representing it using vectors, matrices, tensors, quantum operators, deriving these representations using neural networks, linear algebra, etc. Some open questions that are active research areas are (1) how to represent ambiguity, (2) how to represent asymmetric lexical relations, (3) how to represent larger chunks like phrases and clauses. Here are some notes and highlights:
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
Slides and video from the deep learning tutorial. Papers:
http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf
http://ronan.collobert.com/senna/
People seem more interested in finding good representations for supervised tasks than in solving a task in a purely unsupervised way. The primary way they get word representations is contrastive estimation (unsupervised pretraining). Maybe we can compete with paradigmatic representations based on an LM. At the very least, LM-based representations will be faster to compute and converge.
Mikolov has an NN-based LM, which may be better to use than SRILM:
http://www.fit.vutbr.cz/~imikolov/rnnlm/
He also has a paper showing that vector space representations are amazingly good: e.g. x(dogs) - x(dog) = x(cats) - x(cat), or x(dog) - x(animal) = x(chair) - x(furniture). We should see if this kind of thing is true on the scode sphere...
http://www.aclweb.org/anthology/N13-1090.pdf
A quote from the paper that highlights the advantages of vector space models compared to symbolic (or equivalently one-hot models; I realized symbolic representations are also vector representations, just pretty dumb ones with very high dimensions and all vectors being orthogonal): "As pointed out by the original proposers, one of the main advantages of these models is that the distributed representation achieves a level of generalization that is not possible with classical n-gram language models; whereas an n-gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences."
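To make the x(dogs) - x(dog) = x(cats) - x(cat) observation concrete, here is a minimal sketch of the vector offset test. The 3-d toy vectors and the function names are made up for illustration; they are not taken from the paper or from scode, and real experiments would load trained vectors instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word whose vector is
    closest (by cosine) to x(b) - x(a) + x(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical toy vectors, hand-built so that the plural offset is shared.
toy = {
    "dog":   [1.0, 0.0, 0.0],
    "dogs":  [1.0, 0.0, 1.0],
    "cat":   [0.0, 1.0, 0.0],
    "cats":  [0.0, 1.0, 1.0],
    "chair": [0.5, 0.5, 0.0],
}
print(analogy(toy, "dog", "dogs", "cat"))  # prints "cats"
```

Running the same function on scode vectors would be one way to check whether the offsets hold on the sphere.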
Listening to Richard Socher on parsing, relation, sentiment and paraphrase recognition etc. I don't think we should try to use upos classes but rather the actual 25D vectors for extrinsic evaluation problems... Check out his work from the last couple years... www.socher.org has his code. He also talked about representing ambiguous words with multiple vectors. Some papers on ambiguity detection and representation:
http://aclweb.org/anthology/N/N10/N10-1013.pdf
http://aclweb.org/anthology/P/P12/P12-1092.pdf
For ambiguity, how about trying something like this: let each word be represented in scode by 5 vectors instead of one. While scode is running, whichever X is closest to the Y vector gets attracted. When it converges, the hope is that the 5 vectors of unambiguous words will end up on top of each other, while those of ambiguous words will converge to different places...
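A rough sketch of the "attract only the nearest vector" step described above. This is not the actual scode update: it ignores the repulsion term and any normalization onto the sphere, and the function names, learning rate and toy numbers are all assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def attract_nearest(prototypes, y, rate=0.1):
    """Move only the prototype closest to y a fraction `rate` of the way
    toward y; the word's other prototypes stay where they are.
    Returns the index of the prototype that was moved."""
    nearest = min(range(len(prototypes)),
                  key=lambda i: squared_distance(prototypes[i], y))
    prototypes[nearest] = [p + rate * (t - p)
                           for p, t in zip(prototypes[nearest], y)]
    return nearest

# Hypothetical word with 2 prototypes (the note suggests 5) in 2-d.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
moved = attract_nearest(prototypes, [1.0, 0.9], rate=0.5)
print(moved, prototypes)
```

If the idea works, repeated updates like this should leave the prototypes of an unambiguous word piled up in one place and pull those of an ambiguous word toward different groups of Y's.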
Nobody seems to have understood what I said above; I explained it to everyone one by one, and I will write it up properly when I have time. Presumably, when scode converges, the Y's paired with ambiguous X's should be spread over a wider area in a multimodal way. Did we test this? We could look at the type or token weighted variance of the Y's seen together with an X. Going one step further, we can view the vector of this X as the center of a Gaussian that generates the related Y's. Then we can ask whether there is more than one Gaussian, using a non-parametric clustering model like DP. Did we try this? (If only someone wrote down the things we try somewhere so we don't forget :)
After getting into many conversations about these vector representations, I saw that our CL draft is weak in its current state. Either we solve ambiguity properly, or at least we compare with the other vectors: C&W (senna), RNNLM (mikolov), CCA based, quantum, tensor based, etc. It would be enough to write to these people, just get their vectors, and try clustering.
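The variance check mentioned above could look something like the following sketch. It assumes the Y vectors paired with a given X and their co-occurrence counts are already available as plain lists; the function name and the toy numbers are hypothetical.

```python
def weighted_variance(ys, counts):
    """Weighted mean squared distance of the Y vectors from their weighted
    centroid. Pass co-occurrence counts for a token weighted variance, or
    all ones for a type weighted variance."""
    total = sum(counts)
    dim = len(ys[0])
    centroid = [sum(c * y[d] for y, c in zip(ys, counts)) / total
                for d in range(dim)]
    spread = sum(c * sum((y[d] - centroid[d]) ** 2 for d in range(dim))
                 for y, c in zip(ys, counts))
    return spread / total

# Hypothetical Y vectors seen with one X, and how often each was seen.
ys = [[0.0, 0.0], [2.0, 0.0]]
print(weighted_variance(ys, [1, 1]))  # type weighted: 1.0
print(weighted_variance(ys, [3, 1]))  # token weighted: 0.75
```

A single variance number cannot tell a wide unimodal spread from a multimodal one, which is presumably why the note goes on to suggest a mixture or DP clustering model.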
http://www.cs.columbia.edu/~scohen/naacl13tutorial/
The slides from the spectral tutorial. It turns out co-training and CCA are methods that work with co-occurring variables as well. It would be nice to compare them with scode and our WSD CL paper. Collins says co-clustering may also be relevant to what we do with scode.
http://en.wikipedia.org/wiki/Biclustering
Guess what Lyle Ungar (one of the tutorial hosts) does with CCA: (1) find 30-d representations for words based on co-occurrence with syntagmatic context features, (2) then feed these vectors to supervised prediction problems like POS tagging...

http://www.aclweb.org/anthology/N13-1014.pdf
Learning a POS tagger with little annotation. Another evaluation metric between supervised and unsupervised...
Mali: They used two languages; for one of them, Malagasy, there is a test file on github. I built an LM using the train+dev data they used and a wikipedia dump file. 16% of the words are unknown, which is the main problem. With 64 subs the m2o score is 64.1%; the type accuracy scores they obtained for different model configurations are 71, 72, 74 and 76 percent. I think we stayed at this level because our LM is bad. I just noticed they have English experiments; I think I will compare there as well. The English one is on WSJ and 15-24 was used as the test set. Since we are unsupervised we can use all of the data. There are two annotators for English; one got 78% and the other 76%. Right now we are better than both with 80% m2o. I know this guy, I even talked to him; he is doing something that combines Ravi and our dictionary cleaning, and he persistently does not cite us.
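For readers unfamiliar with the m2o (many-to-one) score quoted above: each induced cluster is mapped to the gold tag it most often co-occurs with, and accuracy is computed under that mapping. A minimal sketch, with a hypothetical function name and toy data:

```python
from collections import Counter, defaultdict

def many_to_one(induced, gold):
    """Many-to-one accuracy: map every induced cluster to its most frequent
    gold tag, then count the tokens whose gold tag matches that mapping."""
    by_cluster = defaultdict(Counter)
    for cluster, tag in zip(induced, gold):
        by_cluster[cluster][tag] += 1
    correct = sum(tags.most_common(1)[0][1] for tags in by_cluster.values())
    return correct / len(gold)

# Hypothetical toy example: 5 tokens, 2 induced clusters.
induced = [0, 0, 0, 1, 1]
gold = ["NN", "NN", "VB", "VB", "JJ"]
print(many_to_one(induced, gold))  # 0.6
```

Because several clusters may map to the same tag, the score tends to rise with the number of clusters, so it is only comparable across systems that use similar numbers of clusters.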
https://lirias.kuleuven.be/bitstream/123456789/397128/1/NAACL2013VulicMoensFinal.pdf
Volkan: cross-lingual word similarities; my guess is we can do something here as well.
http://www.transacl.org/papers/
TACL started publishing! The latest TACL has three papers connecting language to action / world, the first of which is being presented now. Aydin, this is the type of domain in which I think MT techniques could help. There is going to be an ACL tutorial, and data/code is already available.
http://www.transacl.org/wp-content/uploads/2013/03/paper49.pdf http://www.transacl.org/wp-content/uploads/2013/03/paper25.pdf http://www.transacl.org/wp-content/uploads/2013/05/paper193.pdf
http://aclweb.org/anthology/N/N13/N13-1071.pdf
Best short paper uses ontonotes to study coreference. I didn't know ontonotes had coreference. We should look at all the information and possible tasks in ontonotes. More generally it would be nice to have a repository of standard datasets and tasks on our website.
http://www.aclweb.org/anthology/N13-1104.pdf
A frame induction paper building on Chambers&Jurafsky 2011: http://www.stanford.edu/~jurafsky/acl2011-chambers-templates.pdf
http://www.aclweb.org/anthology-new/N/N13/N13-1105.pdf
They even use quantum theory to derive word vectors! (based on syntagmatic representations :( I now have an official reason to study quantum theory :)
This morning's invited talk turned out to be productive. I did the variance computation from the second paragraph below. That is, do the variances of the Y's observed together with a given X tell us something about the ambiguity of that X? I am sending the code I wrote and the resulting graph as attachments. There seems to be a positive correlation but it is not very conclusive. The basic problem in this ambiguity detection work may be that we are trying to compare against gold tags and gold tag perplexity. As you know, scode prefers to split NNs into 5 classes, NNPs into 3, VBs into 4, etc. So some words that the PTB calls unambiguous may be ambiguous by our standards. Besides, scode does not know that we are dealing with part-of-speech, so it may also want to separate words that have word sense ambiguity. Conclusion: we need to find a criterion independent of gold tag perplexity for ambiguity detection and handling. Could we ask a question like: does splitting a word in two improve the likelihood function in the scode math?
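As a reminder of the quantity being compared against, here is a small sketch of gold tag perplexity, assuming it is defined as 2 raised to the entropy of the gold tag distribution of a word. That definition, the function name and the toy tags are assumptions for illustration, not something spelled out in the note.

```python
import math
from collections import Counter

def tag_perplexity(tags):
    """Perplexity (2 ** entropy in bits) of the empirical tag distribution
    of a word, given the list of gold tags of its occurrences."""
    counts = Counter(tags)
    n = len(tags)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2 ** entropy

print(tag_perplexity(["NN", "NN", "NN", "NN"]))  # 1.0, unambiguous
print(tag_perplexity(["NN", "VB", "NN", "VB"]))  # 2.0, two equally likely tags
```

A word with perplexity 1 under this measure can still be split by scode, for example into several NN subclasses, which is exactly the mismatch discussed above.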
http://www.aclweb.org/anthology/N13-1120.pdf
Vector-based relational similarity could be a good extrinsic evaluation for our scode.
http://www.transacl.org/wp-content/uploads/2013/05/paper231.pdf
Dan Roth's preposition paper seems relevant to stateml. There is no verb, no event, just a stated relation...
http://www.cc.gatech.edu/~jeisenst/
Jacob suggested Kevin Murphy's new ML book and Collins' NLP notes. Here is a nice amazon review comparing various ML texts. BRML, ITILA, GPML, and ESL are freely available online.
http://www.transacl.org/wp-content/uploads/2013/03/paper631.pdf
Husnu: They got an improvement by adding a phonetic feature (word duration) to the DMV model. Mali: we were also saying we should use this type of feature for POS induction... Maybe the author's dataset will be useful. The model is purely lexical by the way...
http://www.aclweb.org/anthology/N13-1134.pdf
Uses tensor arithmetic to represent word composition. They seem to get disambiguation for free...
http://www.transacl.org/wp-content/uploads/2013/03/paper75.pdf
Unsupervised CCG grammar induction using HDP. State of the art on 15 languages. Performance does not fall as fast with longer sentences.
You shall know a word by the company it keeps.
Everybody uses this quotation from (Firth, J. R. 1957:11). Must find a paradigmatic version to wake people up! You can know a word better by what it can replace :)
http://www.aclweb.org/anthology/N13-1092.pdf
There is a new paraphrase database.
More coming...


May 28, 2013

AI-KU: Using Substitute Vectors and Co-Occurrence Modeling For Word Sense Induction and Disambiguation

Baskaya, Osman and Sert, Enis and Cirik, Volkan and Yuret, Deniz. Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). June, 2013. Atlanta, Georgia, USA. (Download PDF, see the proceedings).

Abstract:
Word sense induction aims to discover different senses of a word from a corpus by using unsupervised learning approaches. Once a sense inventory is obtained for an ambiguous word, word sense discrimination approaches choose the best-fitting single sense for a given context from the induced sense inventory. However, there may not be a clear distinction between one sense and another, although for a context, more than one induced sense can be suitable. Graded word sense method allows for labeling a word in more than one sense. In contrast to the most common approach which is to apply clustering or graph partitioning on a representation of first or second order co-occurrences of a word, we propose a system that creates a substitute vector for each target word from the most likely substitutes suggested by a statistical language model. Word samples are then taken according to probabilities of these substitutes and the results of the co-occurrence model are clustered. This approach outperforms the other systems on graded word sense induction task in SemEval-2013.
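The sampling step in the abstract can be illustrated with a short sketch: given the substitute distribution for one occurrence of a target word, draw substitute words in proportion to their probabilities. The probabilities, the example context and the function name below are hypothetical, and the real system would obtain the distribution from a statistical language model.

```python
import random

def sample_substitutes(substitute_probs, n, seed=0):
    """Draw n substitute words, each with probability proportional to its
    weight in `substitute_probs` (a dict mapping word to probability)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    words = list(substitute_probs)
    weights = [substitute_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

# Hypothetical substitute vector for "bank" in "I went to the bank to deposit money".
subs = {"office": 0.5, "store": 0.3, "atm": 0.2}
print(sample_substitutes(subs, 5))
```

The sampled (target, substitute) pairs would then be the input to the co-occurrence model whose output is clustered.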


May 19, 2013

Pitfalls of studying language in isolation

Studies of language acquisition and language understanding display a remarkable lack of attention to the subject matter of the utterances being studied.  This is probably because nobody knows how to represent and process meaning whereas the forms of utterances are readily available.  Thus "language acquisition" has come to mean the study of learning how to construct utterances "of the right form", and studies of language understanding focus on translating forms of utterances into other symbolic forms equally devoid of the richness and detail of the things the utterance is supposed to convey.
A real theory of language acquisition should study how babies learn to decode form-meaning mappings in an environment where lots of things are going on in addition to what is being said.  A real theory of language understanding should study what kinds of rich interconnected concepts and embodied simulations get triggered by words and constructions, how we decide what to simulate given the scant detail in descriptions, and what inferences are made possible beyond what is explicitly stated.  

All this is AI-complete you say?  Well, by limiting ourselves to studying language in isolation, we may have come to the end of the line where the ~80% accuracy limit of machine learning based computational linguistics (on almost any linguistic problem you can think of) is preventing us from building truly transformative applications.  Maybe we are shooting ourselves in the foot, and maybe, just maybe, some problems that look difficult right now are difficult not because we are missing the right machine learning algorithm or sufficient labeled data but because we are ignoring the constraints imposed by the meaning side of things.  We may have finally run out of options other than to try and crack the real problem, i.e. modeling what utterances are ABOUT.



April 05, 2013

Turkish Language Resources

This post contains links to various Turkish language resources that I have collected. Please send a comment if you find Turkish resources that you would like to see on this page.

TS Corpus

Taner Sezer's TS Corpus is a 491M token general purpose Turkish corpus. See comments below for details.

BounWebCorpus

Hasim Sak's page contains some useful Turkish language resources and code in addition to a large web corpus.

Bibliography

Özgür Yılmazel's Bibliography on Turkish Information Retrieval and Natural Language Processing.

tr-disamb.tgz

Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.

correctparses_03.txt.gz, train.merge.gz

Turkish morphology training files. Semi-automatically tagged, so accuracy is limited. The two files contain the same data, except that the second file also includes the ambiguous parses (the first parse on each line is correct).

test.1.2.dis.gz, test.merge.gz

Turkish morphology test files; the second one includes ambiguous parses (the first parse on each line is correct). The data is hand tagged and has good accuracy.

tr-tagger.tgz

Turkish morphological tagger, includes Oflazer's finite state machines for Turkish. From Kemal Oflazer. Please use with permission. Requires the publicly available Xerox Finite State software.

turklex.tgz, pc_kimmo.tgz

Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu

Milliyet1.bz2, Milliyet2.bz2, Milliyet3.bz2

Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 encoded, in three 11MB parts. From Kemal Oflazer. Please use with permission.

Turkish wordnet

From Kemal Oflazer. Please use with permission.

METU-Sabanci Turkish Treebank

Turkish treebank with dependency annotations. Please use with permission.

sozluk.txt.gz

English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.

sozluk-boun.txt.gz

Turkish word list (25822 words, 73K) Originally from www.cmpe.boun.edu.tr/courses/cmpe230

Avrupa Birliği Temel Terimler Sözlüğü

(Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)

BilisimSozlugu.zip

Bilişim Sözlüğü by Bülent Sankur (Originally from: www.bilisimsozlugu.com, Oct 9, 2006)

turkish.el

Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.

en-tr.zip, lm.tr.gz

Turkish English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 -- Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to utf8). The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems, some are not. lm.tr.gz is the Turkish text used to develop the language model.



February 27, 2013

Bret Victor

Bret Victor - Inventing on Principle from CUSEC on Vimeo.

Bret Victor's inspirational talk with his views on (1) how to help fragile ideas flourish, and (2) how to live your life. For more from Bret, check out his website.