1. Definition: A targeted textual entailment (TTE) task uses entailment questions to test a specific competence of a system, such as word sense disambiguation, semantic relation recognition, or parsing. Even if we do not know the best theory underlying a competence, we know what having that competence enables people to do. For example:
"They had a board meeting today."
==> "They had a committee meeting today." [yes]
==> "They had a plank meeting today." [no]
"John opened the car door."
==> "The door is part of the car." [yes]
==> "The car produced the door." [no]
"I saw the bird using a telescope."
==> "I used a telescope." [yes]
==> "The bird used a telescope." [no]
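The examples above share a common shape: a premise, a candidate hypothesis, and a yes/no/uncertain judgment. As a minimal sketch (the class and field names here are illustrative, not from any existing toolkit), a TTE item could be represented like this:

```python
from dataclasses import dataclass

@dataclass
class TTEItem:
    """One targeted-entailment question (hypothetical container)."""
    premise: str      # e.g. "They had a board meeting today."
    hypothesis: str   # e.g. "They had a committee meeting today."
    label: str        # gold answer: "yes", "no", or "uncertain"

item = TTEItem("They had a board meeting today.",
               "They had a committee meeting today.",
               "yes")
```

A full task would bundle many such items, each probing one competence (word sense, part-whole relations, attachment, etc.).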
2. Motivation: Targeted textual entailment tasks address the following issues:
2.1 Most current shared tasks use or favor a specific inventory, representation, or linguistic theory. In WSD, WordNet is the usual sense inventory even though everybody complains about it. In the semantic-relations area we have FrameNet, PropBank, NomBank, various logical formalisms, and competing sets of noun-noun relations. The parsing community is split into a constituency camp and a dependency camp that rarely compare results. Formulating TTE tasks in these fields would let systems be tested on a level playing field no matter which inventory, representation, or linguistic theory they use.
2.2 Large annotation efforts struggle to achieve high inter-annotator agreement (ITA). My hypothesis is that most annotators understand the sentences they are supposed to annotate equally well, but do not understand the formalism well enough to label consistently. By asking simple entailment questions where all they need to do is choose yes/no/uncertain, it is hoped that (i) annotators will need no training in a particular formalism, (ii) annotation will proceed faster, and (iii) final ITA will be higher. (We throw away the examples that get a lot of "uncertain" answers.)
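The filtering step at the end of 2.2 can be sketched concretely. The following is a minimal illustration, assuming each item has collected a list of yes/no/uncertain votes; the two thresholds are arbitrary choices for the sketch, not values from the text:

```python
from collections import Counter

def keep_item(answers, max_uncertain_frac=0.3, min_agreement=0.8):
    """Decide whether an annotated item survives filtering.

    answers: list of "yes"/"no"/"uncertain" votes from annotators.
    The threshold values are illustrative assumptions.
    """
    counts = Counter(answers)
    n = len(answers)
    # Discard items that drew too many "uncertain" votes.
    if counts["uncertain"] / n > max_uncertain_frac:
        return False
    # Among the decided votes, require a clear majority label.
    decided = [a for a in answers if a != "uncertain"]
    top_count = Counter(decided).most_common(1)[0][1]
    return top_count / len(decided) >= min_agreement

keep_item(["yes", "yes", "yes", "uncertain"])        # kept
keep_item(["yes", "no", "uncertain", "uncertain"])   # discarded
```

Because annotators only answer entailment questions, this filter is the only place where the formalism-independent yes/no/uncertain labels are aggregated.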
3. Methodology: For a TTE task to be useful and challenging, the examples should be chosen close to the border that separates the positives from the negatives. In other words, the positive examples should have non-trivial alternatives and the negative examples should be "near-misses". (Not all of my examples in Section 1 are good by this criterion.) For example, in the WSD TTE task (which is basically lexical substitution), the substitute should be chosen such that (i) it is a near-synonym for one of the target's senses, and/or (ii) it has a high probability of occurring in the given context. In the parsing task, examples should be based on decision points where a typical parser can go either way, or where the n-best parses disagree. This suggests that examples can be generated automatically by taking the best automated systems of the day and focusing on the decisions about which they are least confident. This "active learning" methodology will uncover weaknesses that next-generation systems can focus on.
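The least-confidence selection just described can be sketched as follows, assuming we have a current system that assigns a probability distribution over its possible decisions for each candidate example (the entropy-ranking scheme is one standard active-learning heuristic, used here as an illustration rather than a fixed part of the proposal):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (higher = less confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_hard_examples(candidates, k):
    """Return the k candidates the current system is least confident about.

    candidates: list of (example, probability_distribution) pairs,
    e.g. a parser's distribution over attachment choices at one decision
    point. Illustrative sketch, not a published TTE pipeline.
    """
    ranked = sorted(candidates, key=lambda c: entropy(c[1]), reverse=True)
    return [example for example, _ in ranked[:k]]

candidates = [
    ("easy attachment", [0.99, 0.01]),   # system nearly certain: skip
    ("toss-up attachment", [0.5, 0.5]),  # maximal uncertainty: good TTE item
    ("medium attachment", [0.8, 0.2]),
]
select_hard_examples(candidates, 1)  # -> ["toss-up attachment"]
```

Ranking by entropy rather than by the top probability alone also handles multi-way decisions, such as choosing among n-best parses.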