October 05, 2006

Why you should not use the Penn Treebank to train a parser

I wrote a script to find inconsistently labeled constituents in the Penn Treebank. The script is inconsistent.scm; it uses escape.pl as a prefilter for the mrg format, and its output is inconsistent.out.gz. The treebank has 49208 sentences and 1173766 tokens. There are 735722 constituents (counting only those that have more than one non-empty child). 36306 constituent strings appear more than once; of these, 5646 (15.6%) have multiple parses. It is reasonable to guess that the annotation of the whole treebank is at least this inconsistent. A cursory analysis of the output shows the following types of ambiguities:

1. POS tagging inconsistencies (3088, 8.5%):
((NP (JJ clerical) (NNS workers)) . 1)
((NP (NN clerical) (NNS workers)) . 1)

2. constituent label inconsistencies (1126, 3.1%):
((VP (VBZ remains) (ADJP (JJ sound))) . 1)
((VP (VBZ remains) (NP (JJ sound))) . 1)

* Note that some constituent label changes may be legitimate,
but this does not seem to be very frequent:
((NP (CD 2) (NN %)) . 29)
((ADJP (CD 2) (NN %)) . 18)

3. bracketing changes (1432, 3.9%):
((PP (IN At) (NP (DT the) (ADJP (RB very) (JJS least)))) . 4)
((PP (IN At) (NP (DT the) (RB very) (JJS least))) . 3)

* Note that about half of these constituents are NPs, which
generally have incomplete bracketing in the Penn Treebank. About 25%
are PPs, and the rest include VP, ADJP, etc.

* Here is an example demonstrating all of the above:
((ADVP (RB earlier) (NP (DT this) (NN year))) . 1)
((NP (ADVP (RBR earlier)) (DT this) (NN year)) . 4)
((ADVP (RBR earlier) (DT this) (NN year)) . 1)
((NP (RB earlier) (DT this) (NN year)) . 8)
((NP (RBR earlier) (DT this) (NN year)) . 70)
((ADVP (RBR earlier) (NP (DT this) (NN year))) . 12)
((NP (JJR earlier) (DT this) (NN year)) . 3)
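
The procedure described above can be sketched in a few lines of Python. This is not the original inconsistent.scm, only a minimal illustration under stated assumptions: every node is written as (LABEL ...), any unlabeled wrapper parenthesis has already been stripped, empty elements are the ones tagged -NONE-, and labels and word case are compared as-is. All function names (read_trees, prune, inconsistent, classify) are made up for this example, and the classify rule is just one plausible way to separate the three types listed above; how the original counts were assigned when several differences co-occur is not stated.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"\(|\)|[^\s()]+")

def read_trees(text):
    """Parse bracketed text into nested lists: [label, child, ...]; a leaf is [tag, word]."""
    stack, trees = [], []
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            (stack[-1] if stack else trees).append(node)
        else:
            stack[-1].append(tok)
    return trees

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def prune(node):
    """Drop -NONE- leaves and any node left with no words (assumption: that is what 'empty' means)."""
    if is_leaf(node):
        return None if node[0] == "-NONE-" else node
    kids = [k for k in (prune(c) for c in node[1:]) if k is not None]
    return [node[0]] + kids if kids else None

def words(node):
    return [node[1]] if is_leaf(node) else [w for c in node[1:] for w in words(c)]

def show(node):
    if is_leaf(node):
        return "(%s %s)" % (node[0], node[1])
    return "(%s %s)" % (node[0], " ".join(show(c) for c in node[1:]))

def constituents(node):
    """Yield every node that has more than one non-empty child."""
    if is_leaf(node):
        return
    if len(node) > 2:
        yield node
    for c in node[1:]:
        yield from constituents(c)

def inconsistent(trees):
    """Map each constituent string that received more than one parse to a Counter of its parses."""
    parses = defaultdict(Counter)
    for tree in trees:
        tree = prune(tree)
        if tree is None:
            continue
        for c in constituents(tree):
            parses[" ".join(words(c))][show(c)] += 1
    return {s: p for s, p in parses.items() if len(p) > 1}

def shape(node):
    """Bracketing only: nested tuples of words with every label and tag removed."""
    return node[1] if is_leaf(node) else tuple(shape(c) for c in node[1:])

def phrase_labels(node):
    """Phrase labels in pre-order, ignoring part-of-speech tags."""
    if is_leaf(node):
        return []
    return [node[0]] + [l for c in node[1:] for l in phrase_labels(c)]

def classify(a, b):
    """Hypothetical rule: bracketing difference first, then phrase labels, otherwise POS tags."""
    if shape(a) != shape(b):
        return "bracketing"
    if phrase_labels(a) != phrase_labels(b):
        return "label"
    return "pos"

if __name__ == "__main__":
    sample = "(NP (JJ clerical) (NNS workers)) (NP (NN clerical) (NNS workers))"
    for string, parses in inconsistent(read_trees(sample)).items():
        for parse, count in parses.items():
            print("(%s . %d)" % (parse, count))
```

The demo at the bottom prints each competing parse with its count, in the same (parse . count) layout as the listings above.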

1 comment:

Deniz said...

Found out that Dickinson and Meurers have been doing similar work on error detection and correction since 2003. Relevant papers can be found on Dickinson's page. There is an NSF-funded project: DECCA.