June 20, 2013

NAACL 2013 Notes

NAACL was great fun this year. It seems everybody is interested in Semantics and ways of representing it using vectors, matrices, tensors, quantum operators, deriving these representations using neural networks, linear algebra, etc. Some open questions that are active research areas are (1) how to represent ambiguity, (2) how to represent asymmetric lexical relations, (3) how to represent larger chunks like phrases and clauses. Here are some notes and highlights:
Slides and video from the deep learning tutorial. Papers: http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf http://ronan.collobert.com/senna/
People seem more interested in finding good representations for supervised tasks than in solving a task in a purely unsupervised way. The primary way they get word representations is contrastive estimation (unsupervised pretraining). Maybe we can compete with paradigmatic representations based on an LM. At the very least, LM based representations will be faster to compute and converge.
Mikolov has an NN based LM (RNNLM), which may be better to use than srilm: http://www.fit.vutbr.cz/~imikolov/rnnlm/ He also has a paper showing that vector space representations are amazingly good: e.g. x(dogs) - x(dog) = x(cats) - x(cat) or x(dog) - x(animal) = x(chair) - x(furniture). We should see if this kind of thing is true on the scode sphere... http://www.aclweb.org/anthology/N13-1090.pdf
A quote from the paper that highlights the advantages of vector space models compared to symbolic ones (or equivalently one-hot models; I realized symbolic representations are also vector representations, just pretty dumb ones with very high dimensions and all symbols equivalent to one-hot orthogonal vectors): "As pointed out by the original proposers, one of the main advantages of these models is that the distributed representation achieves a level of generalization that is not possible with classical n-gram language models; whereas an n-gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences."
Listening to Richard Socher on parsing, relation, sentiment and paraphrase recognition etc.
I don't think we should try to use upos classes but rather the actual 25D vectors for extrinsic evaluation problems... Check out his work from the last couple of years... www.socher.org has his code. He also talked about representing ambiguous words with multiple vectors. Some papers on ambiguity detection and representation: http://aclweb.org/anthology/N/N10/N10-1013.pdf http://aclweb.org/anthology/P/P12/P12-1092.pdf
For ambiguity, how about trying something like this: represent each word in scode with 5 vectors instead of one. While scode is running, whichever X vector is closest to the Y vector is the one that gets attracted. When it converges, the hope is that the 5 vectors of unambiguous words will end up on top of each other, while those of ambiguous words will converge to different places... Nobody understood what I said above, so I explained it to everyone individually; I will write it up properly when I have time.
Presumably, when scode converges, the Y's paired with ambiguous X's should be spread over a wider area in a multimodal way. Have we tested this? We could look at the type or token weighted variance of the Y's seen together with an X. Going one step further, we can view the vector of this X as the center of a Gaussian that generates the associated Y's. Then we can ask whether there is more than one Gaussian, using a non-parametric clustering model like DP. Have we tried this? (If only someone wrote down the things we try somewhere, so we wouldn't forget :)
After sitting through many talks on these vector representations, I saw that our CL draft is weak in its current form. Either we solve ambiguity properly, or we at least compare with the other vectors: C&W (senna), RNNLM (Mikolov), CCA based, quantum, tensor based, etc. It would be enough to write to these people, just get their vectors, and try clustering.
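The multi-vector idea above (several X vectors per word, with only the one nearest to Y being attracted) can be sketched as a winner-take-all update. This is only an illustration under assumptions: it shows the attraction step alone and leaves out everything else the real scode update does, and the function names, the learning rate `rate`, and the use of squared Euclidean distance are hypothetical choices, not taken from scode.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def attract_closest(x_vectors, y, rate=0.1):
    """Move only the X vector nearest to y one step toward y.

    x_vectors: list of candidate vectors for a single word (e.g. 5 of them).
    y:         the vector of the co-occurring Y.
    rate:      step size (hypothetical parameter).
    Returns the index of the vector that was moved; x_vectors is updated in place.
    """
    winner = min(range(len(x_vectors)), key=lambda k: sq_dist(x_vectors[k], y))
    x = x_vectors[winner]
    x_vectors[winner] = [xi + rate * (yi - xi) for xi, yi in zip(x, y)]
    return winner
```

If the hope described above holds, repeated calls would pull the vectors of an unambiguous word toward the same region, while an ambiguous word would end up with vectors in different places.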
The slides from the spectral tutorial. It turns out co-training and CCA are methods that work with co-occurring variables as well. It would be nice to compare them with scode and our WSD CL paper. Collins says co-clustering may also be relevant to what we do with scode: http://en.wikipedia.org/wiki/Biclustering Guess what Lyle Ungar (one of the tutorial hosts) does with CCA: (1) find 30-d representations for words based on co-occurrence with syntagmatic context features, (2) then feed these vectors to supervised prediction problems like POS tagging...

Learning a POS tagger with little annotation. Another evaluation metric between supervised and unsupervised... Mali: They used two languages; for one of them, Malagasy, there is a test file on github. I built an LM from the train+dev data they used and a Wikipedia dump file. 16% of the words are unknown, which is the main problem. The 64 subs m2o score is 64.1%; the type accuracy scores they obtained for different model configurations are 71, 72, 74 and 76 percent. I think we got stuck at this level because our LM is bad. I just noticed they also have English experiments, so I will compare there as well... The English one is on WSJ, with 15-24 used as the test set. Since we are unsupervised, we can use all of the data. There are two annotators for English; one got 78% and the other 76%. With m2o at 80%, we are currently better than both.
Volkan: cross-lingual word similarities; I am guessing we could do something here as well.
TACL started publishing! Latest TACL has three papers connecting language to action / world, the first of which is being presented now. Aydin, this is the type of domain in which MT techniques I think could help. There is going to be an ACL tutorial, and data/code is already available. http://www.transacl.org/wp-content/uploads/2013/03/paper49.pdf http://www.transacl.org/wp-content/uploads/2013/03/paper25.pdf http://www.transacl.org/wp-content/uploads/2013/05/paper193.pdf
Best short paper uses OntoNotes to study coreference. I didn't know OntoNotes had coreference. We should look at all the information and possible tasks in OntoNotes. More generally, it would be nice to have a repository of standard datasets and tasks on our website.
A frame induction paper building on Chambers & Jurafsky 2011: http://www.stanford.edu/~jurafsky/acl2011-chambers-templates.pdf . Here is a 2008 paper: http://acl.eldoc.ub.rug.nl/mirror/P/P08/P08-1090.pdf
They even use quantum theory to derive word vectors! (based on syntagmatic representations :( I now have an official reason to study quantum theory :)
Tag perplexity vs substitute variance:
This morning's invited talk was productive. I did the variance calculation from the second paragraph below. That is, does the variance of the Y's observed together with a given X tell us anything about the ambiguity of that X? I am sending the code I wrote and the resulting graph as an attachment. There seems to be a positive correlation, but it is not very conclusive. The basic problem in this ambiguity detection work may be that we are trying to compare against gold tags and gold tag perplexity. As you know, scode prefers to split NNs into 5 classes, NNPs into 3, VBs into 4, and so on. So some words that the PTB calls unambiguous may be ambiguous by our standards. Besides, scode does not know that we are dealing with part-of-speech, so it may also want to separate words that have word sense ambiguity. Conclusion: we need to find a criterion for ambiguity detection and handling that is independent of gold tag perplexity. Can we ask a question like: does splitting a word in two improve the likelihood function in the scode math?
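A minimal sketch of the kind of variance computation discussed here: the spread of the Y vectors observed with a given X, as a possible ambiguity signal. This is not the code mentioned in the email, which is not reproduced in these notes. The function name and the input layout (one vector per Y type plus its token count) are assumptions made for illustration.

```python
def weighted_variance(y_vectors, counts):
    """Token-weighted variance of the Y vectors seen with one X.

    y_vectors: one vector per Y type that co-occurs with the X.
    counts:    number of tokens observed for each Y type.
    Returns the weighted mean squared distance to the weighted centroid.
    Passing counts of all 1 gives the type-weighted version.
    """
    total = float(sum(counts))
    dim = len(y_vectors[0])
    centroid = [
        sum(c * y[d] for y, c in zip(y_vectors, counts)) / total
        for d in range(dim)
    ]
    return sum(
        c * sum((y[d] - centroid[d]) ** 2 for d in range(dim))
        for y, c in zip(y_vectors, counts)
    ) / total
```

A single number like this cannot tell a wide unimodal spread from a multimodal one, which is one reason a clustering model over the Y's may be the more informative test.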
Vector based relational similarity could be a good extrinsic evaluation for our scode.
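As a rough illustration of what such an evaluation could look like, here is a sketch of the vector offset test from the notes above (x(dogs) - x(dog) = x(cats) - x(cat)): given a, b and c, pick the word whose vector is closest in cosine to x(b) - x(a) + x(c). The toy vectors and the function names are made up for the example; real word vectors (scode or otherwise) would have to be plugged in.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector offset method.

    vectors: dict mapping each word to its vector.
    The three query words are excluded from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical toy vectors, chosen only so that the offset is easy to see.
toy = {
    "dog": [1.0, 0.0, 0.0],
    "dogs": [1.0, 0.0, 1.0],
    "cat": [0.0, 1.0, 0.0],
    "cats": [0.0, 1.0, 1.0],
    "chair": [1.0, 1.0, 0.0],
}
print(solve_analogy(toy, "dog", "dogs", "cat"))
```

Accuracy over a list of such questions would then be the extrinsic score.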
Dan Roth's preposition paper seems relevant to stateml. There is no verb, no event, just a stated relation...
Jacob suggested Kevin Murphy's new ML book and Collins' NLP notes. Here is a nice amazon review comparing various ML texts. BRML, ITILA, GPML, and ESL are freely available online.
From: john.pate@mq.edu.au
Hi Deniz, It was nice to meet you! I've put the brent forced alignment on my Macquarie webspace: http://web.science.mq.edu.au/~jpate/br_tmpdata.33.map+score This is the Large Brent dataset from: http://dx.doi.org/10.1017/S0305000910000085 The first column is the number of the phone. The second column is the number of the utterance. The third column identifies the mother-child dyad, recording session and start and end timestamp. The fourth column identifies how many 10ms frames into the utterance the phone starts (so if it says 10, the phone starts 100ms from the start of the utterance timestamp). The fifth column identifies the end of the phone, the sixth provides the gold-standard phone, and the seventh column is the log-likelihood of the alignment. I've attached the original train/test partition. The DMV library is located at: https://github.com/jpate/predictabilityParsing To see examples of how to run it, look at: https://github.com/jpate/runDMV I recommend always using the fold-unfold implementation, since it's faster and gives identical results; you can see how by following the - -foldUnfold flags in the runDMV repository. Also, if you were interested in the language acquisition angle of my talk, you might have a look at chapter 6 of my dissertation: http://web.science.mq.edu.au/~jpate/jkpate_dissertation.pdf It also has experiments on hand-annotated break index, and, in section 6.4, an examination of the posterior distribution over trees to try to tease apart the relative contribution of predictability effects and prosodic structure. Best, John -- John K Pate http://jkpate.net/
Husnu: They got an improvement by adding a phonetic feature (word duration) to the DMV model.
Mali: We were also saying we should use this type of feature for POS induction... Maybe his dataset will be useful. Model purely lexical by the way...
uses tensor arithmetic to represent word composition. they seem to get disambiguation for free...
Unsupervised CCG grammar induction using HDP. State of the art on 15 languages. Performance does not fall as fast with longer sentences.
You shall know a word by the company it keeps.
Everybody uses this quotation from (Firth, J. R. 1957:11). Must find a paradigmatic version to wake people up! You can know a word better by what it can replace :)
There is a new paraphrase database.
The first two talks of *SEM made me think of limitations of our word vectors:
- we have one word = one vector
- how do we represent sets (apple and banana)
- how do we represent ambiguity (either apple or banana)
- how do we represent probability (80% apple, 20% banana) and is this different from ambiguity
- we use cosine for similarity
- similarity is symmetric, lexical relations usually aren't: apple implies fruit, but fruit does not imply apple. Probabilistically, if I see apple, 100% of the time I see fruit. But if I see fruit, maybe it is apple 20% of the time. Symmetric cosine is not going to cut it.
- so how do we represent entailment, or generally set inclusion and/or intersection (cucumber and pickle do not entail each other but there is a large intersection) in a vector space (or matrix, tensor, quantum) model, and how do we learn this from data
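To make the asymmetry point concrete, here is a toy sketch comparing cosine with a simple feature-inclusion score: the fraction of one word's context mass that falls on contexts the other word also has. The inclusion score is just one possible choice, not something proposed in the talks, and the context counts for apple and fruit are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inclusion(u, v):
    """Fraction of u's feature mass on features that v also has (asymmetric)."""
    total = sum(u.values())
    if total == 0:
        return 0.0
    shared = sum(w for f, w in u.items() if v.get(f, 0.0) > 0)
    return shared / total


# Hypothetical context counts: every context of apple is also a fruit context.
apple = {"eat": 2.0, "red": 1.0}
fruit = {"eat": 2.0, "red": 1.0, "yellow": 1.0, "tropical": 2.0}

print(cosine(apple, fruit), cosine(fruit, apple))        # same value both ways
print(inclusion(apple, fruit), inclusion(fruit, apple))  # differs by direction
```

With these counts all of apple's mass is covered by fruit, but only half of fruit's mass is covered by apple, which is the direction-dependent behavior cosine cannot express.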
Hello professor, I will look at the papers you mentioned while I am in America. In this case there are two directions: Representation based: Have these people created new representations? Will it be like paradigmatic vs C&W, RNNLM, CCA? Algorithm based: Or do these methods use syntagmatic representations, and we will integrate paradigmatic information into these methods? Deniz: I had been thinking representation based, but the other one may give even better results in the longer term.

Laura Rimell additionally suggested the BEAGLE representation. http://www.indiana.edu/~clcl/BEAGLE/ References to the other representations should be in the slides of the two tutorials: http://www.cs.columbia.edu/~scohen/naacl13tutorial/ http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial Senna (C&W): from my first email in this thread: http://ronan.collobert.com/senna/ CCA: http://www.cs.columbia.edu/~scohen/naacl13tutorial/ Look especially at the work Lyle Ungar does.
Abstract meaning representations seem to be getting popular (from Kevin Knight's panel ppt). This seems to suggest the future of SMT: string, tree, and graph automata... nobody knows how to go to graphs yet...
On 6/15/2013 10:09 AM, Deniz Yuret wrote: Dear Kevin, I couldn't catch you after the panel, but one thing in your slides stuck with me: the column for learning string to graph mappings was empty! Is there really no theory / work about this? (I was hoping I would learn how to do this from MT folk and apply to other learning problems :)
deniz, tree transducers go back to 1970, and after three decades of work, there is a lot of excellent theory -- even so, we had to adapt some of the standard devices to work in machine translation ... likewise, there is a lot of graph grammar work, but i'm not satisfied yet that we've found the right thing for us, i.e., for unordered, sparse semantic feature structures, entities playing multiple roles, english realization with pronouns, zero pronouns, control structures, reflexives, etc. we started exploring, though! you can see a couple of papers on my webpage with daniel quernheim (on dag acceptors and dag-to-tree transducers), plus a couple of newer papers (including rejected ones, of course) on hyperedge replacement grammars (HRG). we'll have a paper at acl (first author david chiang) about a parsing algorithm for HRG and a synchronous version of HRG aimed at transduction (e.g., meaning to english, and english to meaning). of course, unification grammar is super-appropriate & powerful, but probabilizing/inverting/learning it has been hard. hope that helps -- also wish we had the solution already! maybe we do & haven't recognized it yet... kevin
On combining vectors:
Laura Rimell uses circular convolution: http://www.aclweb.org/anthology/S13-1011.pdf Stephen Wu uses structured vectorial semantics: http://www.aclweb.org/anthology/S13-1021.pdf They both use random indexing to represent words. We could use random indexing on our subst vectors: assign a random vector to each substitute, each token then becomes a probability weighted sum of substitute vectors.
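The random indexing idea for substitute vectors can be sketched as follows. This assumes the common sparse ternary form of random indexing (a few +1/-1 entries, the rest 0); the dimension, the number of nonzero entries, the function names and the toy substitute distribution are all hypothetical.

```python
import random


def random_index_vector(dim, nonzero, rng):
    """Sparse random vector: `nonzero` entries set to +1 or -1, the rest 0."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def token_vector(substitute_probs, index_vectors):
    """Probability-weighted sum of the random vectors of a token's substitutes.

    substitute_probs: {substitute word: probability} for one token.
    index_vectors:    {word: random index vector}, all of the same dimension.
    """
    dim = len(next(iter(index_vectors.values())))
    out = [0.0] * dim
    for word, prob in substitute_probs.items():
        for d, value in enumerate(index_vectors[word]):
            out[d] += prob * value
    return out


rng = random.Random(0)  # fixed seed so the example is repeatable
index_vectors = {w: random_index_vector(20, 4, rng) for w in ("dog", "cat", "car")}
print(token_vector({"dog": 0.7, "cat": 0.3}, index_vectors))
```

Tokens with similar substitute distributions then get similar vectors, and the dimension stays fixed no matter how large the substitute vocabulary is.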
New dataset from Google on syntactic n-grams.
David Forsyth had a nice invited talk. Julia Hockenmaier seems to have a good vision-to-language dataset. We should find his slides...
Jason Weston from Google gave an invited talk in the vision-language workshop. He embeds images in the same space as words. He solves the ambiguity problem using multiple vectors for each word, as I suggested. Words map to multiple vectors, context (in his case an image) maps to one. Similarity is determined using the closest match. The number of senses is determined using cross validation. The ECCV 2012 paper discusses ambiguity.
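A small sketch of the closest-match similarity described here: a word has several sense vectors, the context has one, and the score is that of the best matching sense. Using the dot product as the similarity and the names below are assumptions for illustration, not details taken from the talk.

```python
def closest_sense(sense_vectors, context):
    """Return (index, score) of the sense vector that best matches the context.

    sense_vectors: list of vectors, one per sense of the word.
    context:       a single vector (e.g. an embedded image).
    The score is the dot product, and the word-context similarity is the
    maximum of these scores over the senses.
    """
    scores = [sum(s * c for s, c in zip(sense, context))
              for sense in sense_vectors]
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best, scores[best]
```

Choosing how many sense vectors each word gets is a separate model selection step (cross validation in the talk) and is not shown.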
The melodi system from semeval also seems to represent ambiguity.
The spatial annotation task could potentially be used for (or converted to) a visual inference task. Part of the difficulty as you explained is to decide on the "correct" annotation format of a sentence. However regardless of the annotation, people's understanding of spatial structure is probably pretty consistent. Have you thought of a spatial task where annotation can be implicit but its details would not matter as much: i.e. a textual entailment, visual inference, spatial question answering type task? Something that would involve non-obvious spatial thinking such that pure keyword matching would go wrong but simple spatial labeling / modeling may get right... I would be very interested in such a task both as a participant and as an organizer. In the vision-language workshop yesterday people were talking about a visual entailment type task in the panel. But by visual entailment they mean questions that can be answered looking at a picture + text. Dealing with pixels is difficult and I think misses the point. Directly working with pure spatial concepts and representations may be a good way to go...
Oleksandr's story annotation tool. Here is Mark's related tool for reference: http://projects.csail.mit.edu/workbench and a link to the narrative workshop.

Ray Mooney on work other than WordsEye on language visualization: This is a paper by Jerry Zhu at U Wisc
Schuler, Badler, Palmer have papers on using NL instructions to guide behavior of virtual people.
Kevin Knight's Bayes tutorial.

Islam Beltagy on asymmetric relations in VSM: I think the two papers below summarize the attempts in doing asymmetric distributional semantics. This one compares between the known techniques: Identifying hypernyms in distributional semantic spaces Lenci, Alessandro and Benotto, Giulia and this one extends the same technique to phrase level. Entailment above the word level in distributional semantics Baroni, Marco and Bernardi, Raffaella and Do, Ngoc-Quynh and Shan, Chung-chieh
Turney and Pantel's review of vector space semantics. Could be useful to find applications and methods.
