September 25, 2019

Morphological analysis using a sequence decoder

Ekin Akyürek, Erenay Dayanık, Deniz Yuret (2019). Transactions of the Association for Computational Linguistics, 7, 567-579. (PDF, arXiv)

Abstract: We introduce Morse, a recurrent encoder-decoder model that produces morphological analyses of each word in a sentence. The encoder turns the relevant information about the word and its context into a fixed-size vector representation, and the decoder generates the sequence of characters for the lemma followed by a sequence of individual morphological features. We show that generating morphological features individually rather than as a combined tag allows the model to handle rare or unseen tags and outperform whole-tag models. In addition, generating morphological features as a sequence rather than, e.g., an unordered set allows our model to produce an arbitrary number of features that represent multiple inflectional groups in morphologically complex languages. We obtain state-of-the-art results in nine languages of different morphological complexity under low-resource, high-resource, and transfer learning settings. We also introduce TrMor2018, a new high-accuracy Turkish morphology dataset. Our Morse implementation and the TrMor2018 dataset are available online to support future research.
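To make the decoder's output format concrete, here is a minimal Python sketch of the target sequence described in the abstract: the characters of the lemma followed by individual morphological features. It is only an illustration, not the Morse implementation (which is written in Julia/Knet). The function names, the `</s>` end token, and the example word and tag names are assumptions chosen for the sketch.

```python
from typing import List, Set, Tuple

END = "</s>"  # hypothetical end-of-sequence token, assumed for this sketch


def to_target_sequence(lemma: str, features: List[str]) -> List[str]:
    """Lay out one analysis as lemma characters, then one token per feature."""
    return list(lemma) + list(features) + [END]


def from_target_sequence(tokens: List[str], feature_vocab: Set[str]) -> Tuple[str, List[str]]:
    """Split a decoded token sequence back into (lemma, features).

    Tokens found in feature_vocab are treated as features; everything
    else before END is treated as a lemma character.
    """
    lemma_chars: List[str] = []
    features: List[str] = []
    for tok in tokens:
        if tok == END:
            break
        if tok in feature_vocab:
            features.append(tok)
        else:
            lemma_chars.append(tok)
    return "".join(lemma_chars), features


# Illustrative example only: an assumed Turkish lemma and assumed tag names.
FEATURES = {"Noun", "A3sg", "Pnon", "Acc"}
target = to_target_sequence("masal", ["Noun", "A3sg", "Pnon", "Acc"])
lemma, feats = from_target_sequence(target, FEATURES)
```

Because each feature is its own token, a combination of features that never occurred together in training can still be emitted, and the sequence can grow to any length. A whole-tag model, which treats `Noun+A3sg+Pnon+Acc` as a single label, can only output labels it has already seen.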

See https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet and https://github.com/ai-ku/TrMor2018 for the new Turkish dataset.



September 09, 2019

Overview of CLEF 2019 Lab ProtestNews: Extracting Protests from News in a Cross-context Setting

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yuret, C. Yoltar, B. Gürel, F. Duruşan, O. Mutlu, A. Akdemir. In CLEF 2019 Working Notes. September 2019. (PDF, Proceedings)

Abstract: We present an overview of the CLEF-2019 Lab ProtestNews on Extracting Protests from News in the context of generalizable natural language processing. The lab consists of document-, sentence-, and token-level information classification and extraction tasks, referred to as task 1, task 2, and task 3 respectively. The tasks required the participants to identify protest-relevant information from English local news at one or more of these levels in a cross-context setting, which is cross-country in the scope of this lab. The training and development data were collected from India, and the test data were collected from India and China. The lab attracted 58 teams; 12 of these teams submitted results and 9 submitted working notes. We observed that neural networks yield the best results and that performance drops significantly for the majority of the submissions in the cross-country setting, i.e. on the data from China.

