[Language_] A statistical language model assigns probabilities to the strings of a language. Speech recognition and other language technologies rely on language models to judge the likelihood of an utterance. The field has long been dominated by n-gram language models, which estimate the probability of a word conditional on the previous n-1 words (where n is typically 3). Data sparseness is a big problem: no corpus contains enough samples of all n-word sequences. Here I present some experiments using Google counts as a language model.
Some rough estimates to give the reader an intuition for the sparseness of the data: English has roughly 100,000 unique words, so the number of possible bigrams is on the order of 10 billion and the number of possible trigrams is on the order of 10^15. The distributions are far from flat: the most frequent word, "the", occurs approximately 5% of the time. The frequency of a "typical" English noun is in the 1/10,000 - 1/50,000 range, so getting a good sampling of its concordance (say, observing it 1000 times) would require a corpus of 50 million words. A typical novel contains 200,000 words. A year of the Wall Street Journal has 20 million. The Gigaword news corpus from LDC contains roughly one billion (as the name suggests).
How many words does Google have? Well, I took some sample words and did some calculations. The columns of the table below give:
1. the word
2. BNC frequency in a million words
3. Google count in millions
4. implied total number of words in Google if it had the same ratios as the BNC
a 21626 16770 7.75455e+11
in 18214 15360 8.43307e+11
of 29391 14030 4.77357e+11
the 61847 13880 2.24425e+11
to 16284 13850 8.50528e+11
and 26817 12740 4.75072e+11
whose 198 395 1.99495e+12
sculpture 13 64.8 4.98462e+12
current 133 1930 1.45113e+13
reliable 22 226 1.02727e+13
down 98 1540 1.57143e+13
execution 14 153 1.09286e+13
addressed 27 185 6.85185e+12
planted 14 31.9 2.27857e+12
ordered 49 179 3.65306e+12
division 90 553 6.14444e+12
barely 23 66.3 2.88261e+12
dancing 12 138 1.15e+13
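The last column follows from the first two: if the word's relative frequency in Google matched its BNC frequency, the total index size would be the Google count divided by that frequency. A sketch (function and variable names are mine, not from the original):

```python
def implied_google_size(bnc_per_million, google_count_millions):
    """Estimate Google's total word count, assuming the word's relative
    frequency in Google matches its BNC frequency."""
    bnc_freq = bnc_per_million / 1e6          # relative frequency in the BNC
    google_count = google_count_millions * 1e6
    return google_count / bnc_freq

# First table row, the word "a"
print(implied_google_size(21626, 16770))   # ~7.75e+11, matching the table
```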
My first intuition of using the most frequent words backfired. See, Google gives document counts, not occurrence counts, so words that occur multiple times in a document get under-counted. Less frequent words therefore give a more accurate guess (their document and occurrence counts will be close). Looking at the above data, I take the number of English words indexed by Google to be 1e13.
We still need to do smoothing. I used the baseline smoothing described by Chen and Goodman and looked for the optimal parameters on some test data (it was hard to find pieces of text that do not already exist in Google). Basically, I searched for the interpolation coefficients for the k-gram estimates (k = 0...n) that gave me the highest likelihood. The biggest bang for the buck comes from 4-grams; in fact, the increase in model likelihood from 3-grams to 4-grams is comparable to the one from 2-grams to 3-grams. Going to 5-grams, on the other hand, is not so hot.
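The interpolation itself is just a weighted sum of the k-gram estimates. A minimal sketch, with illustrative counts and coefficients (the function and its arguments are my own naming, not from the original setup):

```python
def interpolated_prob(counts, coeffs, vocab_size=1_000_000):
    """Linearly interpolate k-gram estimates, k = 0..n.

    counts: (numerator, denominator) count pairs for the
            unigram..n-gram conditional estimates (illustrative).
    coeffs: a0..an mixing weights, assumed to sum to 1.
    """
    p = coeffs[0] / vocab_size            # a0 times the uniform estimate
    for a, (num, den) in zip(coeffs[1:], counts):
        if den > 0:
            p += a * num / den            # a_k times k-gram relative frequency
    return p

# Toy example: unigram count 500/1e6 mixed with bigram count 3/40
p = interpolated_prob([(500, 1_000_000), (3, 40)], [0.01, 0.04, 0.95])
print(p)
```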
Here are the results for the file A2/A20.txt from the BNC (132 lines, 1557 words). a0 is multiplied by 1/vocabulary_size, which is assumed to be 1e6. a1 is multiplied by the frequency of the word, which is taken to be google_count/1e13. a2..a6 are the coefficients of the bigram...six-gram conditional probabilities.
bits a0 a1 a2 a3 a4 a5 a6
N=1 13.7334 .01925 .98075
N=2 9.3085 .01103 .03460 .95436
N=3 8.9581 .01167 .03649 .62960 .32224
N=4 8.5835 .01214 .03965 .52324 .15047 .27450
N=5 8.5453 .01218 .04013 .51352 .14090 .23877 .05449
N=6 8.5345 .01220 .04044 .50807 .14147 .23397 .04314 .02070
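The "bits" column is the average negative log2 probability per word, i.e. the model's cross-entropy on the test text. A sketch of how such a figure is computed (the probabilities here are made up for illustration):

```python
import math

def bits_per_word(probs):
    """Average negative log2 probability, i.e. cross-entropy per word."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# Toy per-word probabilities for a 4-word text
print(bits_per_word([0.01, 0.002, 0.05, 0.001]))
```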
The resulting bit rate is 8.5 bits per word. The best models reported by Chen and Goodman dip below 7 bits with 10 million words of training data and trigrams. The possible explanations are:
- I ignored sentence breaks. (The starting words of a sentence end up conditioned on the last words of the previous sentence.)
- They only used news data. (Their results on the Brown corpus are significantly worse.)
- Google counts are not very accurate. Here are some examples (a longer phrase should never have a higher count than a phrase it contains):
223 cherished memory is
497 most cherished memory is
958 author of the legend
1010 and author of the legend
- The bias from document vs. occurrence counts affects the most frequent words badly. (The frequency of "the" is 0.1% according to my estimate.)
- I did nothing special for numbers, proper nouns, etc. (They are not clear about the treatment of unknown words, but most likely they just relied on 1/vocabulary_size, which they took to be about 50,000.)
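The count inconsistency noted above can be stated as an invariant: for true occurrence counts, a phrase can never occur more often than a phrase it contains. A quick check against the quoted numbers:

```python
# (count, phrase) pairs quoted above, in substring/superstring pairs
examples = [
    (223, "cherished memory is"),
    (497, "most cherished memory is"),
    (958, "author of the legend"),
    (1010, "and author of the legend"),
]

# For genuine occurrence counts, count(superstring) <= count(substring).
for (c_short, short), (c_long, long_) in zip(examples[::2], examples[1::2]):
    assert short in long_                 # sanity: it really is a substring
    if c_long > c_short:
        print(f"inconsistent: {long_!r} ({c_long}) > {short!r} ({c_short})")
```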
Which of these reasons dominates, or whether there is another significant reason, is a matter for future research :)