November 16, 2010

Naive Bayes is a joint Maximum Entropy Model

Naive Bayes and Maximum Entropy models are both popular in NLP applications. In this post I will show that Naive Bayes is really a type of joint Maximum Entropy model (and much easier to compute). Maximum Entropy models aim to find the distribution that maximizes entropy while still obeying certain feature expectation constraints. Maximizing entropy with feature expectation constraints turns out to be equivalent to maximizing likelihood in a log-linear (aka exponential) model. Logistic regression is a specific type of conditional maxent model. Naive Bayes is a specific type of joint maxent model.

Consider a classification problem with a D-dimensional input vector x and a discrete output variable y. Naive Bayes models assume that the individual components of x are independent given the output y:

\[ p(\mathbf{x}, y) = p(y) \prod_{d=1}^D p(x_d|y) \]

The factors on the right hand side are then estimated using maximum likelihood, which boils down to counting frequencies if x and y are both discrete. In comparison, here is a joint maximum entropy model:

\[ p(\mathbf{x}, y) = \frac{1}{Z} \exp(\sum_{m=1}^M \lambda_m f_m(\mathbf{x}, y)) \]

where the f_m are arbitrary real-valued "feature functions" of the input and the output, and Z is a normalization constant. This looks nothing like Naive Bayes at first glance, but if we choose feature functions that each look at the output y together with at most one component of the input x:

\[ p(\mathbf{x}, y) = \frac{1}{Z} \exp(\lambda_0 f_0(y)) \prod_{d=1}^D \exp(\lambda_d f_d(x_d, y)) \]

the similarity becomes more apparent. In fact if the feature functions are binary, this expression will have exactly the same form as the Naive Bayes expression, and maximizing likelihood will give exactly the same answers. Let us illustrate with an example:

\[ \begin{array}{cccc}
x_1 & x_2 & y & n \\
1 & 1 & 1 & 3 \\
1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 3 \\
\end{array} \]

Here x1 and x2 are inputs, y is the output, and n is the number of times a particular combination has been observed. The Naive Bayes maximum likelihood estimates will give:

\[ \begin{array}{l}
p(y=1) = p(y=0) = 1/2 \\
p(x_1=1|y=1) = p(x_2=1|y=1) = 3/4 \\
p(x_1=1|y=0) = p(x_2=1|y=0) = 1/4 \\
\end{array} \]

which will lead to the following not-so-good estimates:

\[ \begin{array}{lll}
x_1 & x_2 & p(y=1|x_1, x_2) \\
1 & 1 & 0.9 \\
1 & 0 & 0.5 \\
0 & 1 & 0.5 \\
0 & 0 & 0.1 \\
\end{array} \]
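One way to check these numbers is to carry out the counting directly. Here is a minimal Python sketch (the dataset is hard-coded from the table above; the function and variable names are just illustrative) that estimates the Naive Bayes parameters by counting and then computes p(y=1|x1, x2) for each input combination:

```python
from collections import Counter
from itertools import product

# (x1, x2, y) -> count, copied from the table above
data = {(1, 1, 1): 3, (1, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 3}
N = sum(data.values())

# Maximum likelihood estimates by counting
py = Counter()    # counts of each y value
pxy = Counter()   # counts of x_d = 1 for each (d, y)
for (x1, x2, y), n in data.items():
    py[y] += n
    pxy[(1, y)] += n * x1
    pxy[(2, y)] += n * x2

def p_joint(x1, x2, y):
    """Naive Bayes joint probability p(x1, x2, y) = p(y) p(x1|y) p(x2|y)."""
    p = py[y] / N
    for d, x in ((1, x1), (2, x2)):
        p1 = pxy[(d, y)] / py[y]      # p(x_d = 1 | y)
        p *= p1 if x == 1 else 1 - p1
    return p

for x1, x2 in product((1, 0), repeat=2):
    p1 = p_joint(x1, x2, 1)
    p0 = p_joint(x1, x2, 0)
    print(x1, x2, p1 / (p1 + p0))     # prints 0.9, 0.5, 0.5, 0.1
```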

The equivalent joint maximum entropy model has 10 feature functions. The following table specifies when each feature function returns 1, with a star indicating "don't care":

\[ \begin{array}{lllll}
f & y & x_1 & x_2 & \lambda \\
f_0 & 0 & * & * & \lambda_0 \\
f_1 & 1 & * & * & \lambda_0 \\
f_2 & 0 & 0 & * & \lambda_1 \\
f_3 & 0 & 1 & * & \lambda_2 \\
f_4 & 1 & 0 & * & \lambda_2 \\
f_5 & 1 & 1 & * & \lambda_1 \\
f_6 & 0 & * & 0 & \lambda_1 \\
f_7 & 0 & * & 1 & \lambda_2 \\
f_8 & 1 & * & 0 & \lambda_2 \\
f_9 & 1 & * & 1 & \lambda_1 \\
\end{array} \]

Because of the symmetry of this example, not all lambda coefficients will be distinct; in fact only three distinct lambdas are necessary, as indicated in the last column of the table. The maximum likelihood estimates of the lambdas will satisfy the expectation constraints: the empirical frequency of each feature will be equal to its expected frequency under the model:

\[ \frac{1}{N} \sum_{n=1}^N f(\mathbf{x}_n, y_n) = \sum_{\mathbf{x},y} p(\mathbf{x}, y) f(\mathbf{x}, y) \]

Here the sum on the left is over the instances in the dataset, whereas the sum on the right is over all possible x, y pairs. This equation can be derived by setting the derivative of the log likelihood expression to zero. The following parameters satisfy these constraints and thus maximize the likelihood:

\[ \lambda_0 = \lambda_2 = 0, \lambda_1 = \log(3), Z = 32 \]

This joint maximum entropy model gives exactly the same results as the Naive Bayes model. For example:

\[ p(x_1=1, x_2=1, y=1) = \frac{1}{Z} \exp(\lambda_0 + 2 \lambda_1) = \frac{9}{32} \]
\[ p(x_1=1, x_2=1, y=0) = \frac{1}{Z} \exp(\lambda_0 + 2 \lambda_2) = \frac{1}{32} \]
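These numbers are easy to verify numerically. The following Python sketch (the code and names are mine, not from the original derivation) plugs lambda_0 = lambda_2 = 0 and lambda_1 = log 3 into the joint model, recomputes Z by summing over all eight (x1, x2, y) combinations, and also checks one of the expectation constraints:

```python
import math
from itertools import product

lam0, lam1, lam2 = 0.0, math.log(3), 0.0

def score(x1, x2, y):
    """Sum of weighted features: lambda_0 fires for every y, and each input
    dimension contributes lambda_1 if it matches y and lambda_2 otherwise,
    following the feature table above."""
    return lam0 + sum(lam1 if x == y else lam2 for x in (x1, x2))

states = list(product((0, 1), repeat=3))                  # all (x1, x2, y)
Z = sum(math.exp(score(*s)) for s in states)              # 32
print(math.exp(score(1, 1, 1)) / Z)                       # 9/32, same as Naive Bayes
print(math.exp(score(1, 1, 0)) / Z)                       # 1/32

# One expectation constraint: f(x, y) = 1 iff x1 = 1 and y = 1.
empirical = 3 / 8                                         # 3 of the 8 training instances
expected = sum(math.exp(score(*s)) / Z
               for s in states if s[0] == 1 and s[2] == 1)
print(empirical, expected)                                # both 0.375
```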

This example is from Chris Manning's NLP class (lecture 7) where he compares the Naive Bayes model to a conditional Maximum Entropy model:

\[ p(y | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp(\sum_{m=1}^M \lambda_m f_m(\mathbf{x}, y)) \]

A conditional maxent model satisfies conditional expectation constraints and maximizes conditional likelihood (the constraints can again be derived by setting the derivative of the log conditional likelihood to zero):

\[ \sum_{n=1}^N f(\mathbf{x}_n, y_n) = \sum_{n=1}^N \sum_y p(y|\mathbf{x}_n) f(\mathbf{x}_n, y) \]

Using the same set of feature functions and parameters, one can derive the values that satisfy the constraints:

\[ \lambda_0 = \lambda_2 = 0, \lambda_1 = \frac{1}{2} \log(3) \]

The conditional model will give probabilities that match the data perfectly:

\[ \begin{array}{lll}
x_1 & x_2 & p(y=1|x_1, x_2) \\
1 & 1 & 3/4 \\
1 & 0 & 1/2 \\
0 & 1 & 1/2 \\
0 & 0 & 1/4 \\
\end{array} \]
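Again this is easy to check numerically. A minimal Python sketch (illustrative names only): with lambda_0 = lambda_2 = 0 and lambda_1 = (1/2) log 3, normalizing over y separately for each input recovers the table above:

```python
import math
from itertools import product

lam1, lam2 = 0.5 * math.log(3), 0.0   # lambda_0 = 0 cancels in the conditional

def score(x1, x2, y):
    # Same matching/mismatching features as in the joint model above.
    return sum(lam1 if x == y else lam2 for x in (x1, x2))

for x1, x2 in product((1, 0), repeat=2):
    Zx = sum(math.exp(score(x1, x2, y)) for y in (0, 1))   # Z(x) depends on the input
    print(x1, x2, math.exp(score(x1, x2, 1)) / Zx)
# prints 3/4, 1/2, 1/2, 1/4 (up to floating point), matching the data
```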

However it is important to understand that this model handles feature interactions better than Naive Bayes not because of any magic associated with maxent models, but as a result of maximizing conditional instead of joint likelihood. When you maximize joint likelihood, maxent gives the same results as Naive Bayes. Of course with maxent, one always has the option of defining features with cross terms (more than one input variable) to handle feature interaction.


November 15, 2010

Next Generation Parser Evaluation

An ACL Workshop Proposal by Laura Rimell and Deniz Yuret.

This workshop aims to foster the development of innovative, targeted, formalism-independent parser evaluation resources and methods that will guide us in building the next generation of parsers.

Under many of our existing evaluation measures, parsing accuracy appears to have plateaued around the 90% mark. To continue making meaningful improvements to parsing technology, we first need to clarify what this 90% represents. Do our evaluations measure semantically-relevant syntactic phenomena? Do they accurately represent multiple domains, languages, and formalisms? How relevant are they for downstream tasks? Do they reflect the level of inter-annotator agreement? We also need to identify and understand the "missing 10%": there is a growing awareness in the community that parsers may perform poorly on less frequent but semantically important syntactic phenomena, but in fact we are not even certain whether such crucial phenomena are represented in our current evaluation schemes. We need new ways of highlighting the specific areas where parsers need to improve.

We believe parser evaluation should:

- be relevant for multiple formalisms, languages, and domains
- be targeted towards finding parser weaknesses
- focus on semantically important tasks
- be extrinsic or task-oriented as well as intrinsic
- be based on schemes with high inter-annotator agreement
- show us how we can improve parser training methods

The workshop builds on the insights gained from the COLING-08 workshop on Cross-Framework and Cross-Domain Parser Evaluation. This earlier workshop made particular inroads towards framework-independent parser evaluation by fostering discussion of formalism-independent schemes, especially grammatical relation schemes.

Despite the advances made in cross-framework evaluation, such evaluations still suffer from a loss of accuracy arising from conversion between output formats. One recent answer to this problem is the PETE task. In PETE (Yuret et al., 2010, http://pete.yuret.com) parser evaluation is performed using simple syntactic entailment questions. Given the sentence "The man who stole my car went to jail", the annotator is asked to judge entailments like "The man went to jail" or "My car went to jail". This scheme is formalism-independent, has high inter-annotator agreement, and focuses evaluation on semantically relevant distinctions. A new version of PETE will form the shared task for this workshop.

Another known weakness in existing evaluation measures, including ones based on grammatical relation formalisms, is that they are aggregate measures, in which syntactic phenomena are de facto weighted by frequency rather than by degree of syntactic difficulty or semantic importance. Thus such measures are likely to have disproportionate contributions from high-frequency, "easy" grammatical phenomena such as determiners and subjects; while frequency weighting is obviously important, it makes it difficult to discern the phenomena where parsers really need to improve.

One answer to this problem is to focus evaluation on syntactic phenomena which we know to be difficult for parsers, such as the unbounded dependency evaluations performed in Rimell et al. (2009) and Nivre et al. (2010). This area is wide open for development: we have known for a long time that parsers have difficulty with phenomena like coordination and PP attachment, but are there other problematic constructions? We should also focus on finding new ways of determining which phenomena are most difficult, and hence where we need to focus parser training efforts. Also crucial is finding ways to measure the importance of parser errors for downstream tasks, especially semantic tasks, and weighting parser performance accordingly.

Third, many evaluations are still intrinsic, and while intrinsic evaluations play an important role -- especially for developing new parsers, and for fine-grained comparisons with previous work -- it is increasingly clear that performance on intrinsic evaluations doesn't always predict task performance.

Recent papers such as Miyao et al. (2008, 2009) and Miwa et al. (2010) focus on task-based evaluation, especially for the biomedical domain. We need more evaluations that focus on a greater range of tasks, languages, and domains (or even subdomains, since the field has barely begun to address how the vocabulary and writing conventions across e.g. biomedical subdomains may affect parsing accuracy).

Finally, unlike other NLP subfields, almost no parser evaluation studies discuss the relevance of inter-annotator agreement. It may be that the 90% evaluation plateau reflects the limits of inter-annotator agreement, but we lack a clear picture of how these figures correspond. New, more natural annotation methods may help in this area.

At this workshop we especially encourage papers that consider how techniques and resources from other NLP subfields can be brought to bear on parser evaluation. Perhaps resources annotated with information on compound nouns, subcategorization frames, selectional preferences, or textual entailments may serve as gold standards. Perhaps new gold standards may be created by exploiting shallow parsing or novel approaches to human annotation. Perhaps we can learn something from sentence simplification, semantic parsing, or active learning. Ultimately we are interested in finding new and exciting ways to identify where we need to improve our parsers.

The workshop will have two parts.

Part I: PETE-2 shared task. This will be an updated version of the successful SemEval-2010 shared task on Parser Evaluation using Textual Entailments. As noted in Yuret et al. (2010), two important improvements to the task are re-balancing the composition of syntactic phenomena covered in the task dataset, and automating the entailment generation process. Both of these improvements will be made for the new PETE-2 dataset.

Anyone will be welcome to submit a system to the shared task portion of the workshop, and reports on the shared task will make up part of the workshop program. For teams not wishing to build their own RTE system to interpret their parser output, we will offer a simple system that generates RTE judgments from Stanford Dependency output, based on the top performing systems from SemEval-10 PETE.

Part II: Papers. We invite full-length papers which present evaluation resources, tools, techniques, or ideas; results of new evaluations; or new methods for targeted parser training based on evaluation results. We welcome submissions on all related topics, including but not limited to:

- new formalism-independent evaluation resources
- new domain-specific or cross-domain evaluation resources
- new language-specific or multi-lingual evaluation resources
- new evaluation resources targeted to specific syntactic phenomena
- new approaches to identifying syntactic phenomena that are difficult for parsers
- evaluation schemes that consider semantic relevance
- new extrinsic or task-based evaluations
- schemes for improvement of a parser based on evaluation results
- evaluation techniques that consider inter-annotator agreement
- ideas for bringing insights from other NLP subfields to bear on parser evaluation

Desired Workshop Length: one day

Estimated Number of Attendees: 25

Organizers:

Laura Rimell
Computer Laboratory
University of Cambridge
William Gates Building
15 JJ Thomson Ave
Cambridge
CB3 0FD
United Kingdom
+44 (0)1223 334696
laura.rimell@cl.cam.ac.uk

Statement of research interests and areas of expertise: Rimell has worked on domain adaptation for parsing and is interested in novel parser evaluation methods. She has worked on the evaluation of a variety of treebank, grammar-based, and dependency parsers on unbounded dependencies and was a contributor to the COLING-08 parser evaluation workshop as well as a member of the top-performing team in the SemEval-10 PETE task. She is also currently working on acquisition of lexical resources and has an interest in their relationship with parsing and parser evaluation.

Deniz Yuret
Department of Computer Engineering
Koc University
Istanbul
Turkey
+90-212-338-1724
dyuret@ku.edu.tr

Statement of research interests and areas of expertise: Yuret has worked on unsupervised parsing and various unsupervised disambiguation problems, including word senses, semantic relations, and morphology. He was the organizer of the SemEval-10 PETE task and is currently co-organizing the next SemEval.


Computational Models of Narrative

AAAI 2010 Fall Symposium (Proceedings), November 11-13, 2010, Arlington, VA. Mark Finlayson, Program Chair.

Narratives are ubiquitous. We use them to educate, communicate, convince, explain, and entertain. As far as we know every society has narratives, which suggests they are deeply rooted and serve an important cognitive function: that narratives do something for us. It is clear that, to fully explain human intelligence, beliefs, and behaviors, we will have to understand and explain narrative.

The symposium will bring together researchers with a wide variety of perspectives to share what is known about the fundamentals of the computational modeling of narrative and to explore the forefront of that knowledge. We seek participation from as wide a variety of approaches as possible, including not only AI researchers and technologists, but also psychologists, cognitive scientists, linguists, philosophers, narrative theorists, anthropologists, educators, storytellers, and neuroscientists.


November 03, 2010

Probability twisters

Of all the math problems I collect, these are my favorites. They do not require anything more than elementary math, but they do seem to trigger a software bug in most people's brains. Choose between several different arguments that lead to different answers for each problem. (Updated Nov 2010: two siblings problem)
  1. (Two siblings) If you pick a random family with two kids, the probability that both are girls is obviously 1/4 (assuming girls and boys are equally likely). However, simple variations of this problem easily lead to some confusion:
    • Variation 1: If you ask the family whether they have at least one girl, and they say yes, the two girl probability is 1/3.
    • Variation 2: If you see one of their kids on the street and notice that she is a girl, the two girl probability is 1/2.
    You can verify these answers by imagining the sample space of all (say four million) two-child families and assuming equal numbers of boy-boy, boy-girl, girl-boy, and girl-girl families (say one million each); there is also an enumeration sketch after this list. What is tricky to understand is why these two variations have different answers when it seems like they give you the exact same information. Here are some more variations:
    • Variation 3: If you learn that the older sibling is a girl, the two girl probability is 1/2.
    • Variation 4: If you learn that the family has one girl named Florida, the two girl probability is approximately 1/2.
    • Variation 5: If you learn that the family has one girl born on a Wednesday, the two girl probability is 13/27.

  2. (Umit - Monty Hall Problem) You are a participant on the game show "Let's Make a Deal." Monty Hall shows you three closed doors. He tells you that two of the closed doors have a goat behind them and that one of the doors has a new car behind it. You pick one door, but before you open it, Monty opens one of the two remaining doors and shows that it hides a goat. He then offers you a chance to switch doors with the remaining closed door. Is it to your advantage to do so?
    • Argument 1: It does not matter. The probability of finding the car in the remaining two doors was equal in the beginning, and they are still equal now. The fact that you put your hand on one of them cannot increase or decrease its probability of having the car under it.
    • Argument 2: If we repeated this experiment a million times, you would get the car only one third of the time by sticking to your first door. People who consistently switch would win the other two thirds. Therefore you should switch. (The simulation after this list confirms these rates.)
    • Argument 3: Think about what you would do if there were a thousand doors, rather than three, and Monty Hall opened 998 doors with goats behind them.
    • Bibliography: http://math.rice.edu/~pcmi/mathlinks/montyurl.html

  3. (Encyclopedia of Bridge) You are South with three small of a suit, and dummy has QJ9. You desperately need a trick from this suit. You lead low to the Queen, and East wins with the King. When you get a second chance, you lead low to the J9 and West plays low. Should you play the Jack or the 9?
    • Argument 1: If either opponent has A10, it does not matter. If East has the Ace and West the 10, you want to play the 9. If it is the other way around, you want to play the Jack. Both sides are equally likely to have the Ace so it does not matter what you play.
    • Argument 2: You should play the Jack because East has the Ace only 1/3 of the time. If East had AK, he would play the King to the first trick only half the time. If he had K10, he would always play the King. Since we know he played the King, it is twice as likely that he has the K10 and not AK.
    • Note: Note the similarity with the Monty Hall problem.

  4. (Memduh - Two envelope problem) I offer you a pick between two envelopes with money. One envelope has twice as much money as the other. You pick one, and out comes 10 dollars. Now I give you a chance to switch. Would you like to switch? How much are you willing to pay to switch?
    • Argument 1: Of course you switch. The expected amount of money in the other envelope is 0.5x5 + 0.5x20 = 12.5 dollars. In fact you are willing to pay up to 2.5 dollars to switch.
    • Argument 2: What if I asked you the question before you opened the envelope and saw the 10 dollars? Using the same reasoning, you can assume there is A dollars in the envelope and compute 0.5x(A/2) + 0.5x(2A) = 1.25A for the expected money in the other envelope. So you would switch. Just before you open your new envelope, I ask you whether you would like to switch again? What would your answer be?
    • Note: In fact if I can find two people that believe in Argument 1, I can build a money machine. Just keep giving them two envelopes with 5 and 10 dollars and charge for switching... :^) (Of course I charge them whatever comes out of the first envelope for playing the game, so that it is a zero sum game.)
    • Bibliography: http://www.u.arizona.edu/~chalmers/papers/envelope.html

  5. (Neal) I pick two real numbers. You look at one of them. Can you find a strategy that lets you guess whether you are looking at the larger or the smaller number with a success rate greater than 1/2?
    • Argument 1: Obviously you cannot find such a strategy.
    • Argument 2: Take a probability distribution that is non-zero over all the real numbers (standard normal for example). Draw a random number from this distribution and respond assuming that the hidden number is equal to your random number. There are three cases: (i) Your random number will be smaller than both my numbers, in which case you have 50% chance of winning. (ii) Your random number will be larger than both my numbers, in which case you have 50% chance of winning. (iii) Your random number will be between my two numbers, in which case you have 100% chance of winning. The average is greater than 50%.
    • Note: Using a similar argument one can show that you could in fact make a profit in the two envelope problem by employing a mixed strategy.

  6. (Alkan) You draw a random line that cuts a circle with unit radius. What is the probability that the chord will be longer than sqrt(3)? (A simulation of the three sampling schemes appears after this list.)
    • Argument 1: Consider the distance between the midpoint of the chord and the center of the circle. If this distance is less than 1/2 the chord will be longer than sqrt(3). Therefore the answer is 1/2.
    • Argument 2: Draw a tangent at one of the points the line intersects the circle. Consider the angle between this tangent and the chord. If this angle is between 60 and 120 degrees, the chord will be longer than sqrt(3). Therefore the answer is 1/3.
    • Argument 3: Consider the midpoint of the chord. If this midpoint is within a concentric circle with half the radius, the chord will be longer than sqrt(3). The area of a circle with half the radius is 1/4th of the original. Therefore the answer is 1/4.
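
For the two-siblings problem (item 1), the sample-space argument can be carried out by brute-force enumeration. Below is a minimal Python sketch (the weekday encoding and function names are mine) that builds all equally likely (sex, weekday) pairs for two children and then conditions on the evidence in variations 1, 2, and 5; variation 4 works the same way as variation 5, just with a much rarer trait than "born on a Wednesday":

```python
from fractions import Fraction

# Each child is a (sex, weekday) pair; all 14 x 14 family combinations are
# equally likely, which is enough to cover variations 1, 2, and 5.
children = [(s, d) for s in "GB" for d in range(7)]
families = [(c1, c2) for c1 in children for c2 in children]

def p_two_girls(evidence):
    """P(both girls | evidence), where evidence filters the family."""
    kept = [f for f in families if evidence(f)]
    both = [f for f in kept if all(c[0] == "G" for c in f)]
    return Fraction(len(both), len(kept))

# Variation 1: the family answers yes to "do you have at least one girl?"
print(p_two_girls(lambda f: any(c[0] == "G" for c in f)))      # 1/3

# Variation 5: at least one girl born on a Wednesday (day 2, say).
print(p_two_girls(lambda f: any(c == ("G", 2) for c in f)))    # 13/27

# Variation 2: you meet one of the two kids at random and she is a girl.
# The evidence is about the encounter, so weight each family by the chance
# that the randomly met child is a girl.
num = sum(Fraction(sum(c[0] == "G" for c in f), 2)
          for f in families if all(c[0] == "G" for c in f))
den = sum(Fraction(sum(c[0] == "G" for c in f), 2) for f in families)
print(num / den)                                               # 1/2
```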
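The Monty Hall game (item 2) can likewise be settled by simulation, as Argument 2 suggests. A minimal sketch (function names are mine):

```python
import random

def monty(switch, trials=100_000):
    """Estimate the win probability of always switching (or always staying)."""
    wins = 0
    for _ in range(trials):
        doors = [0, 1, 2]
        car = random.choice(doors)
        pick = random.choice(doors)
        # Monty opens a door that hides a goat and is not the contestant's pick.
        opened = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            pick = next(d for d in doors if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty(switch=False))   # about 1/3: sticking with the first door
print(monty(switch=True))    # about 2/3: always switching
```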
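Finally, the chord problem (item 6) is Bertrand's paradox: the three arguments correspond to three different ways of generating a "random" chord, and each sampling scheme reproduces its own answer. A minimal simulation sketch (function names are mine):

```python
import math
import random

TRIALS = 100_000
THRESHOLD = math.sqrt(3)

def chord_random_distance():
    # Argument 1: the chord's distance from the center is uniform on [0, 1].
    d = random.uniform(0, 1)
    return 2 * math.sqrt(1 - d * d)

def chord_random_endpoints():
    # Argument 2: two endpoints uniform on the circle (equivalently, a uniform
    # angle between the chord and the tangent).
    a, b = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))

def chord_random_midpoint():
    # Argument 3: the chord's midpoint is uniform inside the circle.
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        r2 = x * x + y * y
        if r2 <= 1:
            return 2 * math.sqrt(1 - r2)

for chord in (chord_random_distance, chord_random_endpoints, chord_random_midpoint):
    p = sum(chord() > THRESHOLD for _ in range(TRIALS)) / TRIALS
    print(chord.__name__, round(p, 2))   # about 0.50, 0.33, 0.25
```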
