November 15, 2010

Next Generation Parser Evaluation

An ACL Workshop Proposal by Laura Rimell and Deniz Yuret.

This workshop aims to foster the development of innovative, targeted, formalism-independent parser evaluation resources and methods that will guide us in building the next generation of parsers.

Under many of our existing evaluation measures, parsing accuracy appears to have plateaued around the 90% mark. To continue making meaningful improvements to parsing technology, we first need to clarify what this 90% represents. Do our evaluations measure semantically relevant syntactic phenomena? Do they accurately represent multiple domains, languages, and formalisms? How relevant are they for downstream tasks? Do they reflect the level of inter-annotator agreement? We also need to identify and understand the "missing 10%": there is a growing awareness in the community that parsers may perform poorly on less frequent but semantically important syntactic phenomena, but in fact we are not even certain whether such crucial phenomena are represented in our current evaluation schemes. We need new ways of highlighting the specific areas where parsers need to improve.

We believe parser evaluation should:

- be relevant for multiple formalisms, languages, and domains
- be targeted towards finding parser weaknesses
- focus on semantically important tasks
- be extrinsic or task-oriented as well as intrinsic
- be based on schemes with high inter-annotator agreement
- show us how we can improve parser training methods

The workshop builds on the insights gained from the COLING-08 workshop on Cross-Framework and Cross-Domain Parser Evaluation. This earlier workshop made particular inroads towards framework-independent parser evaluation by fostering discussion of formalism-independent schemes, especially grammatical relation schemes.

Despite the advances made in cross-framework evaluation, such evaluations still suffer from a loss of accuracy arising from conversion between output formats. One recent answer to this problem is the PETE task. In PETE (Yuret et al., 2010), parser evaluation is performed using simple syntactic entailment questions. Given the sentence "The man who stole my car went to jail", the annotator is asked to judge entailments like "The man went to jail" or "My car went to jail". This scheme is formalism-independent, has high inter-annotator agreement, and focuses evaluation on semantically relevant distinctions. A new version of PETE will form the shared task for this workshop.
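To make the entailment-based scheme concrete, here is a minimal sketch of how such judgments could be scored. The `EntailmentItem` record, the `accuracy` function, and the `always_yes` baseline are hypothetical names chosen for illustration; they are not the PETE data format or scoring tool. The two items reuse the example sentence above, with the gold labels an annotator would be expected to give.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EntailmentItem:
    """One evaluation item (hypothetical layout, not the PETE file format)."""
    text: str        # sentence handed to the parser
    hypothesis: str  # candidate entailment to be judged
    gold: bool       # annotator judgment: True if the text entails the hypothesis


TEXT = "The man who stole my car went to jail"
ITEMS = [
    EntailmentItem(TEXT, "The man went to jail", True),
    EntailmentItem(TEXT, "My car went to jail", False),
]


def accuracy(items: List[EntailmentItem],
             judge: Callable[[str, str], bool]) -> float:
    """Fraction of items on which the system's yes/no judgment matches gold."""
    correct = sum(judge(item.text, item.hypothesis) == item.gold for item in items)
    return correct / len(items)


def always_yes(text: str, hypothesis: str) -> bool:
    """Trivial baseline that answers 'entailed' for every item."""
    return True


if __name__ == "__main__":
    print(accuracy(ITEMS, always_yes))  # 0.5 on these two items
```

A real system would replace `always_yes` with a function that parses the text and derives the judgment from the parser output, which is what keeps the evaluation independent of any particular output format.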

Another known weakness in existing evaluation measures, including ones based on grammatical relation formalisms, is that they are aggregate measures, in which syntactic phenomena are de facto weighted by frequency rather than by degree of syntactic difficulty or semantic importance. Thus such measures are likely to have disproportionate contributions from high-frequency, "easy" grammatical phenomena such as determiners and subjects; while frequency weighting is obviously important, it makes it difficult to discern the phenomena where parsers really need to improve.
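The effect of frequency weighting can be illustrated with a small worked example. The counts below are invented purely for illustration and do not come from any parser or treebank; the phenomenon names and function names are likewise assumptions. The sketch contrasts an aggregate (frequency-weighted) accuracy with per-phenomenon accuracies and their unweighted mean.

```python
from collections import defaultdict

# Hypothetical results, one entry per dependency: (phenomenon, parser_was_correct).
# All counts are made up to illustrate the arithmetic.
RESULTS = (
    [("determiner", True)] * 480 + [("determiner", False)] * 20
    + [("subject", True)] * 370 + [("subject", False)] * 30
    + [("unbounded_dependency", True)] * 50
    + [("unbounded_dependency", False)] * 50
)


def aggregate_accuracy(results):
    """Accuracy over all dependencies, so frequent phenomena dominate."""
    return sum(ok for _, ok in results) / len(results)


def per_phenomenon_accuracy(results):
    """Accuracy computed separately for each phenomenon."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for phenomenon, ok in results:
        total[phenomenon] += 1
        correct[phenomenon] += ok
    return {p: correct[p] / total[p] for p in total}


def macro_accuracy(results):
    """Unweighted mean of the per-phenomenon accuracies."""
    by_phenomenon = per_phenomenon_accuracy(results)
    return sum(by_phenomenon.values()) / len(by_phenomenon)


if __name__ == "__main__":
    print(f"aggregate: {aggregate_accuracy(RESULTS):.3f}")
    for name, acc in per_phenomenon_accuracy(RESULTS).items():
        print(f"{name}: {acc:.3f}")
    print(f"macro: {macro_accuracy(RESULTS):.3f}")
```

With these invented counts the aggregate score is 0.900, even though accuracy on the rare phenomenon is only 0.500; the unweighted mean over phenomena is 0.795. An aggregate figure alone would hide the weak spot.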

One answer to this problem is to focus evaluation on syntactic phenomena which we know to be difficult for parsers, such as the unbounded dependency evaluations performed in Rimell et al. (2009) and Nivre et al. (2010). This area is wide open for development: we have known for a long time that parsers have difficulty with phenomena like coordination and PP attachment, but are there other problematic constructions? We should also focus on finding new ways of determining which phenomena are most difficult, and hence where we need to focus parser training efforts. Also crucial is finding ways to measure the importance of parser errors for downstream tasks, especially semantic tasks, and weighting parser performance accordingly.

Third, many evaluations are still intrinsic, and while intrinsic evaluations play an important role -- especially for developing new parsers, and for fine-grained comparisons with previous work -- it is increasingly clear that performance on intrinsic evaluations doesn't always predict task performance.

Recent papers such as Miyao et al. (2008, 2009) and Miwa et al. (2010) focus on task-based evaluation, especially for the biomedical domain. We need more evaluations that focus on a greater range of tasks, languages, and domains (or even subdomains, since the field has barely begun to address how the vocabulary and writing conventions across e.g. biomedical subdomains may affect parsing accuracy).

Finally, unlike studies in other NLP subfields, almost no parser evaluation studies discuss the relevance of inter-annotator agreement. It may be that the 90% evaluation plateau reflects the limits of inter-annotator agreement, but we lack a clear picture of how these figures correspond. New, more natural annotation methods may help in this area.

At this workshop we especially encourage papers that consider how techniques and resources from other NLP subfields can be brought to bear on parser evaluation. Perhaps resources annotated with information on compound nouns, subcategorization frames, selectional preferences, or textual entailments may serve as gold standards. Perhaps new gold standards may be created by exploiting shallow parsing or novel approaches to human annotation. Perhaps we can learn something from sentence simplification, semantic parsing, or active learning. Ultimately we are interested in finding new and exciting ways to identify where we need to improve our parsers.

The workshop will have two parts.

Part I: PETE-2 shared task. This will be an updated version of the successful SemEval-2010 shared task on Parser Evaluation using Textual Entailments. As noted in Yuret et al. (2010), two important improvements to the task are re-balancing the composition of syntactic phenomena covered in the task dataset, and automating the entailment generation process. Both of these improvements will be made for the new PETE-2 dataset.

Anyone will be welcome to submit a system to the shared task portion of the workshop, and reports on the shared task will make up part of the workshop program. For teams not wishing to build their own RTE system to interpret their parser output, we will offer a simple system that generates RTE judgments from Stanford Dependency output, based on the top performing systems from SemEval-10 PETE.

Part II: Papers. We invite full-length papers which present evaluation resources, tools, techniques, or ideas; results of new evaluations; or new methods for targeted parser training based on evaluation results. We welcome submissions on all related topics, including but not limited to:

- new formalism-independent evaluation resources
- new domain-specific or cross-domain evaluation resources
- new language-specific or multi-lingual evaluation resources
- new evaluation resources targeted to specific syntactic phenomena
- new approaches to identifying syntactic phenomena that are difficult for parsers
- evaluation schemes that consider semantic relevance
- new extrinsic or task-based evaluations
- schemes for improvement of a parser based on evaluation results
- evaluation techniques that consider inter-annotator agreement
- ideas for bringing insights from other NLP subfields to bear on parser evaluation

Desired Workshop Length: one day

Estimated Number of Attendees: 25


Laura Rimell
Computer Laboratory
University of Cambridge
William Gates Building
15 JJ Thomson Ave
United Kingdom
+44 (0)1223 334696

Statement of research interests and areas of expertise: Rimell has worked on domain adaptation for parsing and is interested in novel parser evaluation methods. She has worked on the evaluation of a variety of treebank, grammar-based, and dependency parsers on unbounded dependencies and was a contributor to the COLING-08 parser evaluation workshop as well as a member of the top-performing team in the SemEval-10 PETE task. She is also currently working on acquisition of lexical resources and has an interest in their relationship with parsing and parser evaluation.

Deniz Yuret
Department of Computer Engineering
Koc University

Statement of research interests and areas of expertise: Yuret has worked on unsupervised parsing and various unsupervised disambiguation problems, including word senses, semantic relations, and morphology. He was the organizer of the SemEval-10 PETE task and is currently co-organizing the next SemEval.
