January 14, 2014

Smoothing a Tera-word Language Model

Deniz Yuret. In The 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT) (Download PDF, code)
Update of Jan 14, 2014: The latest version gets rid of all glib dependencies (glib is broken: it blows up your code without warning if your arrays or hashes get too big). The glookup program reads ngram patterns with wildcards from stdin and prints their counts from the Web1T Google ngram data. The glookup.pl script quickly searches for a given pattern in uncompressed Google Web1T data. Use the C version for bulk processing and the Perl version to get a few counts quickly. The model.pl script optimizes and tests various language models. See the README in the code repository for details. Typical usage:
$ model.pl -patterns < text > patterns
$ glookup -p web1t_path < patterns > counts
$ model.pl -counts counts < text
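To illustrate the idea behind the first two steps, here is a toy sketch of wildcard n-gram counting in Python. This is not glookup's actual pattern syntax or implementation (see the README in the code repository for that); the underscore wildcard and the function names are assumptions for illustration only.

```python
# Toy sketch of wildcard n-gram counting.
# NOT glookup's actual pattern syntax; "_" is assumed here
# to stand for a wildcard that matches any single token.
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def wildcard_count(counts, pattern):
    """Sum the counts of n-grams matching a pattern,
    where "_" matches any token in that position."""
    return sum(c for ng, c in counts.items()
               if all(p == "_" or p == t for p, t in zip(pattern, ng)))

tokens = "the cat sat on the mat".split()
bigrams = ngram_counts(tokens, 2)
# Bigrams beginning with "the": ("the","cat") and ("the","mat").
print(wildcard_count(bigrams, ("the", "_")))  # prints 2
```

In the real pipeline, model.pl emits the patterns a smoothing method needs, glookup resolves them against the Web1T counts in bulk, and model.pl then reads the counts back to estimate and test the model.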

Abstract: Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of (MacKay and Peto, 1995) with the modified back-off estimates of (Kneser and Ney, 1995) that leads to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
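For reference, these are the standard baseline forms the abstract mentions (the paper's combined algorithm itself is in the PDF; the notation below is the usual textbook presentation, not copied from the paper). Interpolated Kneser-Ney discounting subtracts a fixed discount $D$ and backs off to a lower-order distribution:

$$
P_{\mathrm{KN}}(w \mid c) \;=\; \frac{\max\bigl(n(c\,w) - D,\; 0\bigr)}{n(c)}
\;+\; \frac{D \, N_{1+}(c\,\bullet)}{n(c)} \, P_{\mathrm{KN}}(w \mid c'),
$$

where $n(\cdot)$ is a count, $c'$ is the context $c$ with its first word dropped, and $N_{1+}(c\,\bullet)$ is the number of distinct words observed after $c$. The Dirichlet prior form of MacKay and Peto instead mixes in the lower-order estimate with a pseudo-count weight $\alpha$:

$$
P_{\mathrm{Dir}}(w \mid c) \;=\; \frac{n(c\,w) + \alpha \, P(w \mid c')}{n(c) + \alpha}.
$$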

Andrew said...


I used your C code "glookup" and tried to search for the count of an ngram "hellow world i" in Web1T, and the memory usage quickly shot up into gigabytes and the program crashed! What should be done?

Deniz Yuret said...

Hi Andrew, unfortunately an update to glib broke the code. It has now been rewritten without glib and should work fine.