Mehmet Ali Yatbaz, Enis Sert, Deniz Yuret. COLING 2014. (PDF. This is a token-based and multilingual extension of our EMNLP 2012 model. Up-to-date versions of the code can be found on github.)
Abstract:
We develop an instance (token) based extension of the state-of-the-art word (type) based part-of-speech induction system introduced in (Yatbaz et al., 2012). Each word instance is represented by a feature vector that combines information from the target word and probable substitutes sampled from an n-gram model representing its context. Modeling ambiguity with an instance-based model does not lead to significant gains in overall part-of-speech tagging accuracy because most words in running text are used in their most frequent class (e.g. 93.69% in the Penn Treebank). However, it is important to model ambiguity because the most frequent words are ambiguous, and not modeling them correctly may negatively affect upstream tasks. Our main contribution is to show that an instance-based model can achieve significantly higher accuracy on ambiguous words at the cost of a slight degradation on unambiguous ones, maintaining a comparable overall accuracy. On the Penn Treebank, the overall many-to-one accuracy of the system is within 1% of the state of the art (80%), while on highly ambiguous words it is up to 70% better. In multilingual experiments our results are significantly better than or comparable to the best published word- or instance-based systems on 15 out of 19 corpora in 15 languages. The vector representations for words used in our system are available for download for further experiments.
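To make the token-level representation concrete, here is a minimal Python sketch of the substitute-sampling idea. A toy bigram model stands in for the large n-gram language model used in the paper, and the helper names (substitute_distribution, instance_features) are illustrative, not taken from the released code.

import random
from collections import Counter, defaultdict

# Toy corpus; the paper samples substitutes from a large n-gram model.
corpus = "the dog runs . the cat runs . a dog sleeps . the man walks .".split()

# Bigram counts let us score how well a substitute fits a context.
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

vocab = sorted(set(corpus))

def substitute_distribution(left, right):
    # P(w | left, right) proportional to P(w | left) * P(right | w);
    # a crude stand-in for the paper's n-gram substitute model.
    scores = {}
    for w in vocab:
        p_left = bigrams[left][w] / max(1, sum(bigrams[left].values()))
        p_right = bigrams[w][right] / max(1, sum(bigrams[w].values()))
        scores[w] = p_left * p_right
    z = sum(scores.values()) or 1.0
    return {w: s / z for w, s in scores.items()}

def instance_features(target, left, right, k=5, seed=0):
    # One feature vector per token: the target word itself plus k
    # substitutes sampled from the context's substitute distribution.
    dist = substitute_distribution(left, right)
    words, probs = zip(*dist.items())
    subs = random.Random(seed).choices(words, weights=probs, k=k)
    return [("word", target)] + [("sub", s) for s in subs]

print(instance_features("dog", left="the", right="runs"))

In the paper the per-instance vectors built this way are then clustered to induce part-of-speech tags; the sketch only illustrates the feature construction step.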
Full post...
August 23, 2014
August 08, 2014
Volkan Cirik, M.S. 2014
Current position: PhD student, Carnegie Mellon University, Pittsburgh (LinkedIn).
M.S. Thesis: Analysis of SCODE Word Embeddings based on Substitute Distributions in Supervised Tasks. Koç University, Department of Computer Engineering. August, 2014. (PDF, Presentation, word vectors (github), word vectors (dropbox))
Publications: bibtex.php
Abstract
One of the interests of the Natural Language Processing (NLP) community is to find representations for lexical items using large amounts of unlabeled data. Inducing low-dimensional, continuous, dense word vectors, or word embeddings, has become the principal technique for finding representations for words. Word embeddings address the issues of the classical categorical representation of words by capturing syntactic and semantic information about words in the dimensions of a vector. These representations have been shown to be successful across NLP tasks including Named Entity Recognition, Part-of-speech Tagging, Parsing, and Semantic Role Labeling.
In this work, I analyze a word embedding method in supervised Natural Language Processing (NLP) tasks. The framework maps words onto a sphere such that words co-occurring in similar contexts lie close together. The similarity of contexts is measured by the distribution of substitutes that can fill them. I compare word embeddings, including more recent representations, on Named Entity Recognition (NER), Chunking, and Dependency Parsing. I also examine the framework in a multilingual setup. The results show that the examined method achieves results as good as or better than the other word embeddings. The framework consistently improves the baseline systems across languages and achieves state-of-the-art results in multilingual dependency parsing.
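As a rough illustration of the sphere geometry described above, the following sketch uses hypothetical toy vectors (not the released SCODE embeddings). Once vectors are projected onto the unit sphere, the dot product is the cosine similarity, so words that occur in similar contexts end up as nearby points.

import numpy as np

# Hypothetical toy vectors standing in for the released SCODE embeddings.
emb = {
    "cat":  np.array([0.9, 0.1, 0.2]),
    "dog":  np.array([0.8, 0.2, 0.3]),
    "run":  np.array([0.1, 0.9, 0.1]),
    "walk": np.array([0.2, 0.8, 0.2]),
}
# Project every vector onto the unit sphere.
emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}

def neighbors(word, k=2):
    # On the unit sphere, ranking by dot product (cosine similarity)
    # is equivalent to ranking by Euclidean closeness.
    q = emb[word]
    sims = {w: float(q @ v) for w, v in emb.items() if w != word}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

print(neighbors("cat"))  # "dog" ranks first: similar contexts, nearby points

In supervised experiments like those in the thesis, such vectors are plugged into the baseline NER, chunking, and parsing systems as additional dense features.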
Full post...