Mehmet Ali Yatbaz, Enis Sert, Deniz Yuret. COLING 2014. (PDF. This is a token-based and multilingual extension of our EMNLP 2012 model. Up-to-date versions of the code can be found on GitHub.)
Abstract:
We develop an instance (token) based extension of the state of
the art word (type) based part-of-speech induction system
introduced in (Yatbaz et al. 2012). Each word instance is
represented by a feature vector that combines information from
the target word and probable substitutes sampled from an n-gram
model representing its context. Modeling ambiguity using an
instance based model does not lead to significant gains in
overall accuracy in part-of-speech tagging because most words in
running text are used in their most frequent class (e.g. 93.69%
in the Penn Treebank). However, it is important to model
ambiguity because most frequent words are ambiguous and not
modeling them correctly may negatively affect downstream tasks.
Our main contribution is to show that an instance based model can
achieve significantly higher accuracy on ambiguous words at the
cost of a slight degradation on unambiguous ones, maintaining a
comparable overall accuracy. On the Penn Treebank, the overall
many-to-one accuracy of the system is within 1% of the
state-of-the-art (80%), while on highly ambiguous words it is up
to 70% better. On multilingual experiments our results are
significantly better than or comparable to the best published
word or instance based systems on 15 out of 19 corpora in 15
languages. The vector representations for words used in our
system are available for download for further experiments.
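
For readers unfamiliar with the many-to-one accuracy quoted above, here is a minimal sketch of how that metric is commonly computed: each induced cluster is mapped to the gold tag it co-occurs with most often, and the mapped labels are then scored against the gold tags. This is not the evaluation script used in the paper; the function name `many_to_one_accuracy` and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(induced, gold):
    """Score induced cluster ids against gold tags (hypothetical helper).

    Every cluster is mapped to its most frequent gold tag; several
    clusters may map to the same tag, hence "many-to-one".
    """
    if len(induced) != len(gold):
        raise ValueError("induced and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # Count how often each gold tag occurs inside each induced cluster.
    tag_counts_by_cluster = defaultdict(Counter)
    for cluster, tag in zip(induced, gold):
        tag_counts_by_cluster[cluster][tag] += 1

    # A token counts as correct if its gold tag is the majority tag
    # of the cluster it was assigned to.
    correct = sum(max(counts.values())
                  for counts in tag_counts_by_cluster.values())
    return correct / len(gold)


# Toy data (made up for illustration): six tokens, three induced clusters.
induced = [0, 0, 0, 1, 1, 2]
gold = ["NN", "NN", "VB", "VB", "VB", "DT"]
score = many_to_one_accuracy(induced, gold)
print(f"many-to-one accuracy: {score:.4f}")
```

In the toy data, cluster 0 maps to NN (2 of 3 tokens correct), cluster 1 to VB (2 of 2) and cluster 2 to DT (1 of 1), so 5 of the 6 tokens are scored as correct. Because the mapping is chosen after the fact, the metric tends to rise with the number of clusters, so scores are best compared between systems that induce the same number of clusters.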
August 23, 2014