glookup

(c) Deniz Yuret 2007

Download here

glookup reads ngram patterns (possibly containing wildcards) from stdin, finds their counts in a single pass over the Google ngram data, and prints the results.

Each input line should contain a single pattern made up of space-separated tokens, where '_' is the wildcard token that matches any word. The output has up to three tab-separated counts next to each pattern:

n0: the total count of the ngrams matching a given pattern.

n1: the number of distinct ngrams matching a given pattern. This is only output for patterns with wildcards.

n2: the number of distinct words that appear in the final position of the ngrams matching a given pattern. This is only output for patterns that end with a wildcard and have more than one wildcard. It is needed for Kneser-Ney smoothing.
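
To make the three counts concrete, here is a minimal Python sketch of how they could be computed in one pass over a small in-memory list of ngrams. It is not glookup's actual implementation: the function names (matches, lookup), the toy data in TOY_NGRAMS, and the brute-force loop over patterns are all assumptions made for illustration. It also assumes each distinct ngram appears exactly once in the data, so that counting matching lines gives n1.

```python
WILDCARD = "_"

# Made-up toy data: (ngram, count) pairs, each distinct ngram listed once.
TOY_NGRAMS = [
    ("the cat sat", 10),
    ("the dog sat", 5),
    ("the cat ran", 3),
    ("a cat ran", 2),
]


def matches(pattern, tokens):
    """Return True if tokens has the same length as pattern and agrees
    with it at every position that is not a wildcard."""
    return len(pattern) == len(tokens) and all(
        p == WILDCARD or p == t for p, t in zip(pattern, tokens)
    )


def lookup(patterns, ngram_counts):
    """Compute n0, n1, n2 for every pattern in one pass over the data.

    Returns a list of (pattern, counts) pairs, where counts holds n0,
    then n1 if the pattern has a wildcard, then n2 if the pattern ends
    with a wildcard and has more than one wildcard.
    """
    parsed = [p.split() for p in patterns]
    n0 = [0] * len(parsed)                    # total count of matches
    n1 = [0] * len(parsed)                    # distinct matching ngrams
    last_words = [set() for _ in parsed]      # distinct final words

    # Single pass over the data; every pattern is tested against each
    # ngram (naive, but enough to show what the counts mean).
    for ngram, count in ngram_counts:
        tokens = ngram.split()
        for i, pattern in enumerate(parsed):
            if matches(pattern, tokens):
                n0[i] += count
                n1[i] += 1
                last_words[i].add(tokens[-1])

    results = []
    for i, pattern in enumerate(parsed):
        wildcards = pattern.count(WILDCARD)
        counts = [n0[i]]
        if wildcards > 0:
            counts.append(n1[i])
        if wildcards > 1 and pattern[-1] == WILDCARD:
            counts.append(len(last_words[i]))
        results.append((patterns[i], counts))
    return results


if __name__ == "__main__":
    for pattern, counts in lookup(["the cat sat", "the _ sat", "the _ _"], TOY_NGRAMS):
        # Pattern followed by its tab-separated counts (column order assumed).
        print(pattern + "\t" + "\t".join(str(c) for c in counts))
```

On this toy data, "the cat sat" gets only n0 = 10; "the _ sat" gets n0 = 15 and n1 = 2; and "the _ _" gets n0 = 18, n1 = 3, and n2 = 2, because the three matching ngrams end in only two distinct words ("sat" and "ran").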

Please see the README file and the user manual for more information.