January 09, 2014

FASTSUBS: An Efficient and Exact Procedure for Finding the Most Likely Lexical Substitutes Based on an N-gram Language Model

Deniz Yuret. Nov. 2012. Signal Processing Letters, IEEE. Volume: 19, Issue: 11, Pages: 725-728. DOI, URL. (Download the pdf, code, and data (1GB). Our EMNLP-2012 paper uses FASTSUBS to get the best published result in part of speech induction.).
Update of Jan 9, 2014: The latest version gets rid of all glib dependencies (glib is broken, it blows up your code without warning if your arrays or hashes get too big). It also fixes a bug: The previous versions of the code and the paper assumed that the log back-off weights were upper bounded by 0 which they are not. In my standard test of generating the top 100 substitutes for all the 1,222,974 positions in the PTB, this caused a total of 66 low probability substitutes to be missed and 31 to be listed out of order. Finally a multithreaded version, fastsubs-omp is implemented. The number of threads can be controlled using the environment variable OMP_NUM_THREADS. The links above have been updated to the latest version.

Abstract: Lexical substitutes have found use in areas such as paraphrasing, text simplification, machine translation, word sense disambiguation, and part of speech induction. However the computational complexity of accurately identifying the most likely substitutes for a word has made large scale experiments difficult. In this paper I introduce a new search algorithm, FASTSUBS , that is guaranteed to find the K most likely lexical substitutes for a given word in a sentence based on an n-gram language model. The computation is sub-linear in both K and the vocabulary size V . An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available at http://goo.gl/jzKH0.


talioris said...

Hello Yuret, I tried your FASTSUBS code with a toy example and it gives me a segmentation fault:

fastsubs ~/tmp/fastsubs/test.arpa < ~/tmp/fastsubs/test.wlist
[t=0 m=12812288] Get substitutes until count=-1 OR probability=1
[t=0 m=12828672] Loading model file /users/spraak/jpeleman/tmp/fastsubs/test.arpa
Segmentation fault

I compiled your source code by simply running make which gave no errors. Anything I might have overlooked?

Thanks in advance!


Deniz Yuret said...

Hi Joris,

I think we figured it out. One of my students increased the size of
some constants like _LINE and BLOCKSIZE to handle bigger data which
somehow broke the code. We are trying to figure out why that could
be, but in the meantime we reverted these values back to their old
values, and the latest from github should work.

Let me know if you have any problems.

Thanks for pointing this out.