A statistical language model assigns probabilities to the strings of a language.  Speech recognition and other language technologies rely on language models to judge the likelihood of an utterance.  The field has long been dominated by n-gram language models, which estimate the probability of a word conditional on the previous n-1 words (where n is typically 3).  Data sparseness is a big problem: no corpus contains enough samples of all n-word sequences.  Here I present some experiments using Google counts as a language model.
Some rough estimates to give the reader an intuition for the sparseness of the data:  English has roughly 100,000 unique words.  This means the number of possible bigrams is on the order of 10 billion and the number of possible trigrams is on the order of 10^15.  The distributions are far from flat: the most frequent word, "the", occurs approximately 5% of the time.  The frequency of a "typical" English noun is in the 1/10,000 to 1/50,000 range, so getting a good sampling of its concordance (say, observing it 1000 times) would require a corpus of 50 million words.  A typical novel contains 200,000 words.  A year of the Wall Street Journal has 20 million.  The Gigaword news corpus from LDC contains roughly one billion (as the name suggests).
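These are easy to reproduce; a quick back-of-the-envelope sketch in Python, using only the numbers quoted above:

    # Back-of-envelope arithmetic from the paragraph above
    vocab = 100_000                      # rough number of unique English words
    print(f"possible bigrams:  {vocab ** 2:.0e}")   # ~1e10
    print(f"possible trigrams: {vocab ** 3:.0e}")   # ~1e15
    noun_freq = 1.0 / 50_000             # low end for a "typical" noun
    observations = 1000                  # enough to sample its concordance
    print(f"corpus needed: {observations / noun_freq:.0e} words")  # 5e7 = 50 million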
How many words does Google have?  Well, I took some sample words and did some calculations.  The columns of the table below give:
1. word
2. BNC frequency per million words
3. Google count in millions of words
4. implied total number of words in Google if it had the same ratios as the BNC
a       21626   16770   7.75455e+11
in      18214   15360   8.43307e+11
of      29391   14030   4.77357e+11
the     61847   13880   2.24425e+11
to      16284   13850   8.50528e+11
and     26817   12740   4.75072e+11
whose   198     395     1.99495e+12
sculpture       13      64.8    4.98462e+12
current 133     1930    1.45113e+13
reliable        22      226     1.02727e+13
down    98      1540    1.57143e+13
execution       14      153     1.09286e+13
addressed       27      185     6.85185e+12
planted 14      31.9    2.27857e+12
ordered 49      179     3.65306e+12
division        90      553     6.14444e+12
barely  23      66.3    2.88261e+12
dancing 12      138     1.15e+13
My first intuition, to use the most frequent words, backfired.  Google gives document counts, not occurrence counts, so words that occur multiple times in a document get undercounted.  Less frequent words therefore give a more accurate guess (their document and occurrence counts will be close).  Looking at the above data, I take the number of English words indexed by Google to be 1e13.
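The last column of the table is just a proportion: if a word accounts for f occurrences per million words in the BNC and Google reports c million hits, the implied total is c * 1e6 / (f / 1e6) words.  A minimal sketch of that calculation, with counts copied from the table above:

    # Implied total number of words indexed by Google, assuming its word
    # frequencies match the BNC; counts copied from the table above.
    samples = {
        # word: (BNC count per million words, Google count in millions)
        "the": (61847, 13880),
        "whose": (198, 395),
        "reliable": (22, 226),
        "dancing": (12, 138),
    }
    for word, (bnc_per_million, google_millions) in samples.items():
        bnc_fraction = bnc_per_million / 1e6        # word's share of all tokens
        implied_total = google_millions * 1e6 / bnc_fraction
        print(f"{word:10s} {implied_total:.3e}")
    # the most frequent words imply totals around 1e11-1e12 (undercounted);
    # the rarer words cluster around 1e13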
We still need to do smoothing.  I used the baseline smoothing described by Goodman and Chen and looked for the optimal parameters on some test data (it was hard to find pieces of text that do not already exist in Google).  Basically, I searched for coefficients to multiply each k-gram estimate, for k=0...n, that gave the highest likelihood.  The biggest bang for the buck comes from 4-grams; in fact, the increase in model likelihood from 3-grams to 4-grams is comparable to the one from 2-grams to 3-grams.  Going to 5-grams, not so hot.
Here are the results for the file A2/A20.txt from the BNC (132 lines, 1557 words).  a0 is multiplied by 1/vocabulary_size, which is assumed to be 1e6.  a1 is multiplied by the frequency of the word, which is taken to be google_count/1e13.  a2..a6 are the coefficients of the bigram...six-gram conditional probabilities.
                bits    a0      a1      a2      a3      a4      a5      a6
        N=1     13.7334 .01925  .98075
        N=2     9.3085  .01103  .03460  .95436
        N=3     8.9581  .01167  .03649  .62960  .32224
        N=4     8.5835  .01214  .03965  .52324  .15047  .27450
        N=5     8.5453  .01218  .04013  .51352  .14090  .23877  .05449
        N=6     8.5345  .01220  .04044  .50807  .14147  .23397  .04314  .02070
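To make the model concrete, here is a minimal sketch of the interpolated estimate described above.  The get_count() helper is hypothetical (it stands in for whatever returns a Google count for a phrase, which is not shown here); the commented example uses the N=4 coefficients from the table.

    import math

    VOCAB_SIZE = 1e6      # assumed vocabulary size for the a0 term
    TOTAL_WORDS = 1e13    # estimated number of English words indexed by Google

    def get_count(phrase):
        """Hypothetical helper: Google count for a phrase (tuple of words)."""
        raise NotImplementedError("plug in your own count source")

    def word_bits(words, i, coeffs):
        """Bits for words[i] under the interpolated model; coeffs = [a0, ..., aN]."""
        p = coeffs[0] / VOCAB_SIZE                             # uniform term
        p += coeffs[1] * get_count((words[i],)) / TOTAL_WORDS  # unigram frequency term
        for k in range(2, len(coeffs)):                        # bigram ... N-gram terms
            if i - (k - 1) < 0:
                continue                                       # not enough history yet
            history = tuple(words[i - (k - 1):i])
            denom = get_count(history)
            if denom > 0:
                p += coeffs[k] * get_count(history + (words[i],)) / denom
        return -math.log2(p)

    # Average per-word bits for a text, e.g. with the N=4 coefficients above:
    #   coeffs = [.01214, .03965, .52324, .15047, .27450]
    #   bits = sum(word_bits(words, i, coeffs) for i in range(len(words))) / len(words)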
The resulting bit rate is 8.5 bits per word.  The best models reported by Goodman and Chen dip below 7 bits with 10 million words of data and trigrams.  Possible explanations are:
- I ignored sentence breaks.  (The starting word of a sentence is conditioned on the last words of the previous sentence.)
- They only used news data.  (Their results on the Brown corpus are significantly worse.)
- Google counts are not very accurate.  Here are some examples (a longer phrase should never have a higher count than a phrase it contains):
        223     cherished memory is
        497     most cherished memory is
        958     author of the legend
        1010    and author of the legend
- Bias due to document vs. occurrence counts affects the most frequent words badly.  (The frequency of "the" comes out to about 0.1% according to my estimate, 13880 million hits out of 1e13 words, versus roughly 6% in the BNC.)
- I did nothing special for numbers, proper nouns, etc.  (They are not explicit about the treatment of unknown words, but most likely they just relied on 1/vocabulary_size, which they took to be about 50,000.)
Which of these reasons dominates, or whether there is another significant reason, is a matter for future research :)