Mehmet Ali Yatbaz, Enis Sert, Deniz Yuret. EMNLP 2012. (Download the paper, presentation, code, fastsubs paper, LM training data (250MB), WSJ substitute data (1GB), scode output word vectors (5MB), and scode visualization demo (may take a few minutes to load). More up-to-date versions of the code can be found on GitHub.)
Abstract: We investigate paradigmatic representations of word context in the domain of unsupervised syntactic category acquisition. Paradigmatic representations of word context are based on potential substitutes of a word in contrast to syntagmatic representations based on properties of neighboring words. We compare a bigram based baseline model with several paradigmatic models and demonstrate significant gains in accuracy. Our best model based on Euclidean co-occurrence embedding combines the paradigmatic context representation with morphological and orthographic features and achieves 80% many-to-one accuracy on a 45-tag 1M word corpus.
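The many-to-one accuracy quoted in the abstract is the usual evaluation measure for unsupervised tagging: each induced cluster is mapped to the gold tag it co-occurs with most often, and the score is the fraction of tokens labeled correctly under that mapping. The following is a minimal sketch of that computation, not the evaluation code used in the paper; the function name `many_to_one_accuracy` and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(induced, gold):
    """Score induced cluster ids against gold tags (hypothetical helper).

    Each cluster is mapped to its most frequent gold tag; several
    clusters may map to the same tag, hence "many-to-one".
    """
    if len(induced) != len(gold):
        raise ValueError("induced and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # Count how often each gold tag appears inside each induced cluster.
    tag_counts = defaultdict(Counter)
    for cluster, tag in zip(induced, gold):
        tag_counts[cluster][tag] += 1

    # Tokens carrying the majority tag of their cluster count as correct.
    correct = sum(max(counts.values()) for counts in tag_counts.values())
    return correct / len(gold)


# Toy data (invented for illustration): six tokens, three induced clusters.
induced = [0, 0, 0, 1, 1, 2]
gold = ["NN", "NN", "VB", "VB", "VB", "DT"]
score = many_to_one_accuracy(induced, gold)
print(f"many-to-one accuracy: {score:.3f}")
```

In the toy data, cluster 0 maps to NN (2 of 3 tokens correct), cluster 1 to VB (2 of 2), and cluster 2 to DT (1 of 1), giving 5 correct tokens out of 6. Because the mapping is chosen after clustering, the measure tends to increase with the number of clusters, so scores are comparable only at a fixed cluster count such as the 45 tags used here.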