## December 23, 2010

### Research focus

In 1976 John McCarthy, one of the founders of artificial intelligence, wrote a memo discussing the problem of getting a computer to understand a story from the New York Times:
"A 61-year old furniture salesman, John J. Hug, was pushed down the shaft of a freight elevator yesterday in his downtown Brooklyn store by two robbers while a third attempted to crush him with the elevator car because they were dissatisfied with the $1,200 they had forced him to give them." McCarthy suggested that a real understanding of this story would entail being able to answer questions like: Who was in the store when the events began? Who had the money at the end? Did Mr. Hug know he was going to be robbed? Does he know now that he was robbed? Answering these questions is still beyond the state of the art in natural language processing, because they require common sense knowledge in addition to the text of the story. In fact the problems associated with answering questions only based on the text of the story are only beginning to be solved on a large scale: When and where was Mr. Hug pushed? Who forced who to give$1,200 to whom?
Did the money satisfy the robbers?
To achieve an understanding at this level, we need to address linguistic problems like word sense disambiguation ("push" has 15 senses), named entity recognition (Mr. Hug = John J. Hug), anaphora resolution (him = John J. Hug), parsing (who did what to whom?), semantic relation identification (dissatisfied with $1,200 =$1,200 did not satisfy). The figure below illustrates the main challenge: the ambiguity present in most natural language expressions. Our group studies statistical machine learning methods to address these problems with the eventual goal of natural language understanding by machines.

Different interpretations of the sentence: "I saw the man on the hill with a telescope."

Full post...

## November 16, 2010

### Naive Bayes is a joint Maximum Entropy Model

Naive Bayes and Maximum Entropy models are both popular in NLP applications. In this post I will show that Naive Bayes is really a type of joint Maximum Entropy model (and much easier to compute). Maximum Entropy models aim to find the distribution that maximizes entropy while still obeying certain feature expectation constraints. Maximizing entropy with feature expectation constraints turns out to be equivalent to maximizing likelihood in a log-linear (aka exponential) model. Logistic regression is a specific type of conditional maxent model. Naive Bayes is a specific type of joint maxent model.

Consider a classification problem with a D dimensional input vector x and a discrete output variable y. Naive Bayes models assume that the individual components of x are independent given the output y:

$p(\mathbf{x}, y) = p(y) \prod_{d=1}^D p(x_d|y)$

The factors on the right hand side are then estimated using maximum likelihood, which boils down to counting frequencies if x and y are both discrete. In comparison here is a joint maximum entropy model:

$p(\mathbf{x}, y) = \frac{1}{Z} \exp(\sum_{m=1}^M \lambda_m f_m(\mathbf{x}, y))$

where fm are arbitrary real valued "feature functions" of the input and the output and Z is a normalization constant. This looks nothing like Naive Bayes at first glance, but if we choose feature functions that look at the output y with at most one component of the input x:

$p(\mathbf{x}, y) = \frac{1}{Z} \exp(\lambda_0 f_0(y)) \prod_{d=1}^D \exp(\lambda_d f_d(x_d, y))$

the similarity becomes more apparent. In fact if the feature functions are binary, this expression will have exactly the same form as the Naive Bayes expression, and maximizing likelihood will give exactly the same answers. Let us illustrate with an example:

$\begin{array}{cccc} x_1 & x_2 & y & n \\ 1 & 1 & 1 & 3 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 3 \\ \end{array}$

Here x1 and x2 are inputs, y is the output, and n is the number of times a particular combination has been observed. The Naive Bayes maximum likelihood estimates will give:

$\begin{array}{l} p(y=1) = p(y=0) = 1/2 \\ p(x_1=1|y=1) = p(x_2=1|y=1) = 3/4 \\ p(x_1=1|y=0) = p(x_2=1|y=0) = 1/4 \\ \end{array}$

which will lead to the following not-so-good estimates:

$\begin{array}{lll} x_1 & x_2 & p(y=1|x_1, x_2) \\ 1 & 1 & 0.9 \\ 1 & 0 & 0.5 \\ 0 & 1 & 0.5 \\ 0 & 0 & 0.1 \\ \end{array}$

The equivalent joint maximum entropy model will have 10 feature functions. The following table specifies when each feature function will return 1, star indicating "don't care":

$\begin{array}{lllll} f & y & x_1 & x_2 & \lambda \\ f_0 & 0 & * & * & \lambda_0 \\ f_1 & 1 & * & * & \lambda_0 \\ f_2 & 0 & 0 & * & \lambda_1 \\ f_3 & 0 & 1 & * & \lambda_2 \\ f_4 & 1 & 0 & * & \lambda_2 \\ f_5 & 1 & 1 & * & \lambda_1 \\ f_6 & 0 & * & 0 & \lambda_1 \\ f_7 & 0 & * & 1 & \lambda_2 \\ f_8 & 1 & * & 0 & \lambda_2 \\ f_9 & 1 & * & 1 & \lambda_1 \\ \end{array}$

Because of the symmetry of this example, not all lambda coefficients will be distinct, in fact only three distinct lambdas are necessary as indicated in the last column of the table. The maximum likelihood estimates of the lambdas will satisfy the expectation constraints: the empirical frequency of each feature will be equal to the expected frequency:

$\frac{1}{N} \sum_{n=1}^N f(\mathbf{x}_n, y_n) = \sum_{\mathbf{x},y} p(\mathbf{x}, y) f(\mathbf{x}, y)$

Here the sum on the left is over the instances in the dataset, whereas the sum on the right is over all possible x, y pairs. This equation can be derived by setting the derivative of the log likelihood expression to zero. The following parameters satisfy these constraints and thus maximize the likelihood:

$\lambda_0 = \lambda_2 = 0, \lambda_1 = \log(3), Z = 32$

This joint maximum entropy model gives exactly the same results as the Naive Bayes model. For example:

$p(x_1=1, x_2=1, y=1) = \frac{1}{Z} \exp(\lambda_0 + 2 \lambda_1) = \frac{9}{32}$
$p(x_1=1, x_2=1, y=0) = \frac{1}{Z} \exp(\lambda_0 + 2 \lambda_2) = \frac{1}{32}$

This example is from Chris Manning's NLP class (lecture 7) where he compares the Naive Bayes model to a conditional Maximum Entropy model:

$p(y | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp(\sum_{m=1}^M \lambda_m f_m(\mathbf{x}, y))$

A conditional maxent model will satisfy conditional expectation constraints and maximize conditional likelihood (which can again be derived by setting the derivative of the log conditional likelihood to zero):

$\sum_{n=1}^N f(\mathbf{x}_n, y_n) = \sum_{n=1}^N \sum_y p(y|\mathbf{x}_n) f(\mathbf{x}_n, y)$

Using the same set of feature functions and parameters, one can derive the values that satisfy the constraints:

$\lambda_0 = \lambda_2 = 0, \lambda_1 = \frac{1}{2} \log(3)$

The conditional model will give probabilities that match the data perfectly:

$\begin{array}{lll} x_1 & x_2 & p(y=1|x_1, x_2) \\ 1 & 1 & 3/4 \\ 1 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 0 & 0 & 1/4 \\ \end{array}$

However it is important to understand that this model handles feature interactions better than Naive Bayes not because of any magic associated with maxent models, but as a result of maximizing conditional instead of joint likelihood. When you maximize joint likelihood, maxent gives the same results as Naive Bayes. Of course with maxent, one always has the option of defining features with cross terms (more than one input variable) to handle feature interaction.

Full post...

## November 15, 2010

### Next Generation Parser Evaluation

An ACL Workshop Proposal by Laura Rimell and Deniz Yuret.

This workshop aims to foster the development of innovative, targeted, formalism-independent parser evaluation resources and methods that will guide us in building the next generation of parsers.

Under many of our existing evaluation measures, parsing accuracy appears to have plateaued around the 90% mark. To continue making meaningful improvements to parsing technology, we first need to clarify what this 90% represents. Do our evaluations measure semantically-relevant syntactic phenomena? Do they accurately represent multiple domains, languages, and formalisms? How relevant are they for downstream tasks? Do they reflect the level of inter-annotator agreement? We also need to identify and understand the "missing 10%": there is a growing awareness in the community that parsers may perform poorly on less frequent but semantically important syntactic phenomena, but in fact we are not even certain whether such crucial phenomena are represented in our current evaluation schemes. We need new ways of highlighting the specific areas where parsers need to improve.

We believe parser evaluation should:

- be relevant for multiple formalisms, languages, and domains
- be targeted towards finding parser weaknesses
- focus on semantically important tasks
- be extrinsic or task-oriented as well as intrinsic
- be based on schemes with high inter-annotator agreement
- show us how we can improve parser training methods

The workshop builds on the insights gained from the COLING-08 workshop on Cross-Framework and Cross-Domain Parser Evaluation. This earlier workshop made particular inroads towards framework-independent parser evaluation by fostering discussion of formalism-independent schemes, especially grammatical relation schemes.

Despite the advances made in cross-framework evaluation, such evaluations still suffer from a loss of accuracy arising from conversion between output formats. One recent answer to this problem is the PETE task. In PETE (Yuret et al., 2010, http://pete.yuret.com) parser evaluation is performed using simple syntactic entailment questions. Given the sentence "The man who stole my car went to jail", the annotator is asked to judge entailments like "The man went to jail" or "My car went to jail". This scheme is formalism-independent, has high inter-annotator agreement, and focuses evaluation on semantically relevant distinctions. A new version of PETE will form the shared task for this workshop.

Another known weakness in existing evaluation measures, including ones based on grammatical relation formalisms, is that they are aggregate measures, in which syntactic phenomena are de facto weighted by frequency rather than by degree of syntactic difficulty or semantic importance. Thus such measures are are likely to have disproportionate contributions from high-frequency, "easy" grammatical phenomena such as determiners and subjects; while frequency weighting is obviously important, it makes it difficult to discern the phenomena where parsers really need to improve.

One answer to this problem is to focus evaluation on syntactic phenomena which we know to be difficult for parsers, such as the unbounded dependency evaluations performed in Rimell et al. (2009) and Nivre et al. (2010). This area is wide open for development: we have known for a long time that parsers have difficulty with phenomena like coordination and PP attachment, but are there other problematic constructions? We should also focus on finding new ways of determining which phenomena are most difficult, and hence where we need to focus parser training efforts. Also crucial is finding ways to measure the importance of parser errors for downstream tasks, especially semantic tasks, and weighting parser performance accordingly.

Third, many evaluations are still intrinsic, and while intrinsic evaluations play an important role -- especially for developing new parsers, and for fine-grained comparisons with previous work -- it is increasingly clear that performance on intrinsic evaluations doesn't always predict task performance.

Recent papers such as Miyao et al. (2008, 2009) and Miwa et al. (2010) focus on task-based evaluation, especially for the biomedical domain. We need more evaluations that focus on a greater range of tasks, languages, and domains (or even subdomains, since the field has barely begun to address how the vocabulary and writing conventions across e.g. biomedical subdomains may affect parsing accuracy).

Finally, unlike other NLP subfields, almost no parser evaluation studies discuss the relevance of inter-annotator agreement. It may be that the 90% evaluation plateau reflects the limits of inter-annotator agreement, but we lack a clear picture of how these figures correspond. New, more natural annotation methods may help in this area.

At this workshop we especially encourage papers that consider how techniques and resources from other NLP subfields can be brought to bear on parser evaluation. Perhaps resources annotated with information on compound nouns, subcategorization frames, selectional preferences, or textual entailments may serve as gold standards. Perhaps new gold standards may be created by exploiting shallow parsing or novel approaches to human annotation. Perhaps we can learn something from sentence simplification, semantic parsing, or active learning. Ultimately we are interested in finding new and exciting ways to identify where we need to improve our parsers.

The workshop will have two parts.

Part I: PETE-2 shared task. This will be an updated version of the successful SemEval-2010 shared task on Parser Evaluation using Textual Entailments. As noted in Yuret et al. (2010), two important improvements to the task are re-balancing the composition of syntactic phenomena covered in the task dataset, and automating the entailment generation process. Both of these improvements will be made for the new PETE-2 dataset.

Anyone will be welcome to submit a system to the shared task portion of the workshop, and reports on the shared task will make up part of the workshop program. For teams not wishing to build their own RTE system to interpret their parser output, we will ofter a simple system that generates RTE judgments from Stanford Dependency output, based on the top performing systems from SemEval-10 PETE.

Part II: Papers. We invite full-length papers which present evaluation resources, tools, techniques, or ideas; results of new evaluations; or new methods for targeted parser training based on evaluation results. We welcome submissions on all related topics, including but not limited to:

- new formalism-independent evaluation resources
- new domain-specific or cross-domain evaluation resources
- new language-specific or multi-lingual evaluation resources
- new evaluation resources targeted to specific syntactic phenomena
- new approches to identifying syntactic phenomena that are difficult for parsers
- evaluation schemes that consider semantic relevance
- new extrinsic or task-based evaluations
- schemes for improvement of a parser based on evaluation results
- evaluation techniques that consider inter-annotator agreement
- ideas for bringing insights from other NLP subfields to bear on parser evaluation

Desired Workshop Length: one day

Estimated Number of Attendees: 25

Organizers:

Laura Rimell
Computer Laboratory
University of Cambridge
William Gates Building
15 JJ Thomson Ave
Cambridge
CB3 0FD
United Kingdom
+44 (0)1223 334696
laura.rimell@cl.cam.ac.uk

Statement of research interests and areas of expertise: Rimell has worked on domain adaptation for parsing and is interested in novel parser evaluation methods. She has worked on the evaluation of a variety of treebank, grammar-based, and dependency parsers on unbounded dependencies and was a contributor to the COLING-08 parser evaluation workshop as well as a member of the top-performing team in the SemEval-10 PETE task. She is also currently working on acquisition of lexical resources and has an interest in their relationship with parsing and parser evaluation.

Deniz Yuret
Department of Computer Engineering
Koc University
Istanbul
Turkey
+90-212-338-1724
dyuret@ku.edu.tr

Statement of research interests and areas of expertise: Yuret has worked on unsupervised parsing and various unsupervised disambiguation problems, including word senses, semantic relations, and morphology. He was the organizer of the SemEval-10 PETE task and is currently co-organizing the next SemEval.

Full post...

### Computational Models of Narrative

AAAI 2010 Fall Symposium (Proceedings), November 11-13, 2010, Arlington, VA. Mark Finlayson, Program Chair.

Narratives are ubiquitous. We use them to educate, communicate, convince, explain, and entertain. As far as we know every society has narratives, which suggests they are deeply rooted and serve an important cognitive function: that narratives do something for us. It is clear that, to fully explain human intelligence, beliefs, and behaviors, we will have to understand and explain narrative.

The symposium will bring together researchers with a wide variety of perspectives to share what is known about the fundamentals of the computational modeling of narrative and to explore the forefront of that knowledge. We seek participation from as wide a variety of approaches as possible, including not only AI researchers and technologists, but also psychologists, cognitive scientists, linguists, philosophers, narrative theorists, anthropologists, educators, storytellers, and neuroscientists.

## November 03, 2010

### Probability twisters

Of all the math problems I collect, these are my favorites. They do not require anything more than elementary math, but they do seem to trigger a software bug in most people's brains. Choose between several different arguments that lead to different answers for each problem. (Updated Nov 2010: two siblings problem)
1. (Two siblings) If you pick a random family with two kids and calculate the probability of both being girls the obvious answer would be 1/4 (assuming girls and boys being equally likely). However simple variations of this problem easily lead to some confusion:
• Variation 1: If you ask the family whether they have at least one girl, and they say yes, the two girl probability is 1/3.
• Variation 2: If you see one of their kids on the street and notice that she is a girl, the two girl probability is 1/2.
You can verify these answers by imagining the sample space of all (say four million) two-child families and assuming equal numbers of boy-boy, boy-girl, girl-boy, and girl-girl families (say one million each). What is tricky to understand is why these two variations have different answers when it seems like they give you the exact same information. Here are some more variations:
• Variation 3: If you learn that the older sibling is a girl, the two girl probability is 1/2.
• Variation 4: If you learn that the family has one girl named Florida, the two girl probability is approximately 1/2.
• Variation 5: If you learn that the family has one girl born on a Wednesday, the two girl probability is 13/27.

2. (Umit - Monty Hall Problem) You are a participant on the game show "Let's Make a Deal." Monty Hall shows you three closed doors. He tells you that two of the closed doors have a goat behind them and that one of the doors has a new car behind it. You pick one door, but before you open it, Monty opens one of the two remaining doors and shows that it hides a goat. He then offers you a chance to switch doors with the remaining closed door. Is it to your advantage to do so?
• Argument 1: It does not matter. The probability of finding the car in the remaining two doors was equal in the beginning, and they are still equal now. The fact that you put your hand on one of them cannot increase or decrease its probability of having the car under it.
• Argument 2: If we repeated this experiment a million times, you would get the the car only one third of the time by sticking to your first door. People who consistently switch would win the other two thirds. Therefore you should switch.
• Argument 3: Think about what you would do if there were a thousand doors, rather than three, and Monty Hall opened 998 doors with goats behind them.

3. (Encyclopedia of Bridge) You are South with three small of a suit, and dummy has QJ9. You desperately need a trick from this suit. You lead low to the Queen, and East wins with the King. When you get a second chance, you lead low to the J9 and West plays low. Should you play the Jack or the 9?
• Argument 1: If either opponent has A10, it does not matter. If East has the Ace and West the 10, you want to play the 9. If it is the other way around, you want to play the Jack. Both sides are equally likely to have the Ace so it does not matter what you play.
• Argument 2: You should play the Jack because East has the Ace only 1/3 of the time. If East had AK, he would play the King to the first trick only half the time. If he had K10, he would always play the King. Since we know he played the King, it is twice as likely that he has the K10 and not AK.
• Note: Note the similarity with the Monty Hall problem.

4. (Memduh - Two envelope problem) I offer you a pick between two envelopes with money. One envelope has twice as much money as the other. You pick one, and out comes 10 dollars. Now I give you a chance to switch. Would you like to switch? How much are you willing to pay to switch?
• Argument 1: Of course you switch. The expected amount of money in the other envelope is 0.5x5 + 0.5x20 = 12.5 dollars. In fact you are willing to pay up to 2.5 dollars to switch.
• Argument 2: What if I asked you the question before you opened the envelope and saw the 10 dollars? Using the same reasoning, you can assume there is A dollars in the envelope and compute 0.5x(A/2) + 0.5x(2A) = 1.25A for the expected money in the other envelope. So you would switch. Just before you open your new envelope, I ask you whether you would like to switch again? What would your answer be?
• Note: In fact if I can find two people that believe in Argument 1, I can build a money machine. Just keep giving them two envelopes with 5 and 10 dollars and charge for switching... :^) (Of course I charge them whatever comes out of the first envelope for playing the game, so that it is a zero sum game.)
• Bibliography: http://www.u.arizona.edu/~chalmers/papers/envelope.html

5. (Neal) I pick two real numbers. You look at one of them. Can you find a strategy that lets you guess whether you are looking at the larger or smaller number with more than 1/2 success rate.
• Argument 1: Obviously you cannot find such a strategy.
• Argument 2: Take a probability distribution that is non-zero over all the real numbers (standard normal for example). Draw a random number from this distribution and respond assuming that the hidden number is equal to your random number. There are three cases: (i) Your random number will be smaller than both my numbers, in which case you have 50% chance of winning. (ii) Your random number will be larger than both my numbers, in which case you have 50% chance of winning. (iii) Your random number will be between my two numbers, in which case you have 100% chance of winning. The average is greater than 50%.
• Note: Using a similar argument one can show that you could in fact make a profit in the two envelope problem by employing a mixed strategy.

6. (Alkan) You draw a random line that cuts a circle with unit radius. What is the probability that the chord will be longer than sqrt(3)?
• Argument 1: Consider the distance between the midpoint of the chord and the center of the circle. If this distance is less than 1/2 the chord will be longer than sqrt(3). Therefore the answer is 1/2.
• Argument 2: Draw a tangent at one of the points the line intersects the circle. Consider the angle between this tangent and the chord. If this angle is between 60 and 120 degrees, the chord will be longer than sqrt(3). Therefore the answer is 1/3.
• Argument 3: Consider the midpoint of the chord. If this midpoint is within a concentric circle with half the radius, the chord will be longer than sqrt(3). The area of a circle with half the radius is 1/4th of the original. Therefore the answer is 1/4.

Full post...

## October 27, 2010

### Anharmonicity, mode-coupling and entropy in a fluctuating native protein

A. Kabakçıoğlu, D. Yuret, M. Gür, B. Erman. 2010 Phys. Biol. 7 046005 doi: 10.1088/1478-3975/7/4/046005 (PDF, PDF, HTML, Online, Hermite code)

Abstract: We develop a general framework for the analysis of residue fluctuations that simultaneously incorporates anharmonicity and mode-coupling in a unified formalism. We show that both deviations from the Gaussian model are important for modeling the multidimensional energy landscape of the protein Crambin (1EJG) in the vicinity of its native state. The effect of anharmonicity and mode-coupling on the fluctuational entropy is in the order of a few percent.

## October 19, 2010

### Istanbul Marathon 2010

Some pictures from the Istanbul Marathon...

Here is our route from runkeeper.com:

evolutionrunning.com: How to run without injury.
Galloway's Marathon book: Everybody can do it!
Born to run: To get inspired.

And some pictures:

## September 19, 2010

### Adam's Tongue by Derek Bickerton

Bickerton presents his theory on the origin of language. Some notes:

- Human language is qualitatively distinct from animal communication systems (ACSs): most words are incomplete by themselves and need to be combined to express meaning whereas animal calls are self sufficient.

- A better model for origins of language may be pidgins (languages created by people that live together but do not share a common language).

- A protolanguage would have words as we know them that combine but no well defined rules of syntax or morphology.

- ACSs are all about the here and now, most words refer to things outside the current happenings.

- ACSs are mainly manipulative whereas language mainly informative.

- Communicative units come in three flavors: icons resemble the thing talked about, indices point to them, and symbols do neither. (function words which do not refer at all form a separate group). Displacement (referring to things that are not in the here/now) is only possible with symbols and icons. Iconic signs may have been the first displaced ones paving the way for symbols.

- "Since we usually regard language as no more than the means by which we express our thoughts, it seems natural to think that language should issue from intelligence, rather than vice versa. It seemed equally obvious, to naive observers, that the earth was the center of the universe, and the sun, moon, and planets all went around it." pp.58

- Categories are different from concepts. pp.87. Concepts you can think about or think with, whereas all you can do with categories is to tell whether something belongs in them. pp.205

- The ACSs of ants and bees may be closer to humans because they exhibit displacement (about food sources distant in time and space). ch.7

- Crucial as speciation is, it's still far from completely understood. pp.149

- The real breakthrough in language had to be displacement rather than arbitrariness (of signs). pp.160

- Bickerton's sequence:
1. Animals have concepts that won't merge.
2. Protohumans start talking.
3. Talking produces typically human concepts.
4. Merge appears and starts merging concepts.
5. The brain maybe gets rewired.
6. Capacities for complex thought planning etc. develop.

Chomsky's version has concepts appear first and talking last. pp.189.

- Thoughts like "roses are red" are offline as opposed to online thinking about here and now. (still makes the hearer do stuff: think of roses, think of red, merge...) pp.193

- Categories have to be only detailed enough to distinguish between appropriate reactions (affordances?). pp.206.

- I don't quite know what he's saying about memory pp.207 or recursion pp.244.

- Posted using BlogPress from my iPhone

## September 04, 2010

I have looked for a way to get alerted about new releases from my favorite authors for a long time. I think Amazon used to support this in the past but they no longer do. Barnes and Noble has writer alerts for only a small list of authors. This is such an obvious feature for a bibliophile that I do not understand why nobody supports it. I was about to write my own code but luckily ran into www.authoralerts.com first. Highly recommended.

## August 26, 2010

### Unsupervised Part of Speech Tagging Using Unambiguous Substitutes from a Statistical Language Model

Mehmet Ali Yatbaz and Deniz Yuret. Coling 2010. pp. 1391--1398. Beijing, China. (PDF, Poster)
Abstract: We show that unsupervised part of speech tagging performance can be significantly improved using likely substitutes for target words given by a statistical language model. We choose unambiguous substitutes for each occurrence of an ambiguous target word based on its context. The part of speech tags for the unambiguous substitutes are then used to filter the entry for the target word in the word--tag dictionary. A standard HMM model trained using the filtered dictionary achieves 92.25% accuracy on a standard 24,000 word corpus.

## July 15, 2010

### SemEval-2010 Task 12: Parser Evaluation using Textual Entailments

Deniz Yuret, Aydın Han, Zehra Turgut. Proceedings of the 5th International Workshop on Semantic Evaluation. (SemEval-2010) pp. 51--56. July, 2010. Uppsala, Sweden. (PDF, Presentation, Task website, Proceedings, Journal submission).

Abstract: Parser Evaluation using Textual Entailments (PETE) is a shared task in the SemEval-2010 Evaluation Exercises on Semantic Evaluation. The task involves recognizing textual entailments based on syntactic information alone. PETE introduces a new parser evaluation scheme that is formalism independent, less prone to annotation error, and focused on semantically relevant distinctions.

### L1 Regularized Regression for Reranking and System Combination in Machine Translation

Ergun Bicici, Deniz Yuret. Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. pp. 282--289. July 2010. Uppsala, Sweden. (PDF, Slide, Poster)

Abstract: We use L1 regularized transductive regression to learn mappings between source and target features of the training sets derived for each test sentence and use these mappings to rerank translation outputs. We compare the effectiveness of L1 regularization techniques for regression to learn mappings between features given in a sparse feature matrix. The results show the effectiveness of using L1 regularization versus L2 used in ridge regression. We show that regression mapping is effective in reranking translation outputs and in selecting the best system combinations with encouraging results on different language pairs.

## April 22, 2010

### The Trouble with Physics by Lee Smolin

Smolin's book made me think of a dilemma I face often. I find the current system of scientific funding disturbing. Chief among the "values" of a scientist is absolute honesty. Yet, the project proposals we need to fill periodically ask us to describe what we are going to do in detail for the next three years. I don't know what I am going to do during the next three weeks! It depends on what results I am going to get using my current approach during the next couple of days. Maybe I will have a brilliant idea that will change my whole approach to the problem. Maybe I will be taken over with another problem. Honestly I don't know. The only thing I can promise is that I will put all my working energy on making progress on the problem that I find most promising at the time. But apparently that is not enough to get funding, and we are forced to either (i) bend the truth, or (ii) tie ourselves to an approach that we will most likely find suboptimal in the near future.

To me the answer is simple: scientists should be funded not on promises about the future (which nobody can honestly make, let alone scientists whose job is to explore the unknown), but on past performance. That leaves the problem of young scientists who have no past. There should be a reasonable amount of seed funding for such people, just enough to make sure an adventurous spirit has enough time to risk his career tackling an important and deep problem.

Smolin's book should be required reading by all who manage scientists and scientific funding. If you are not interested in the debate on string theory, just read the last few chapters on how science works based on a shared ethic, and why we should take a bit more risk on "Seers" who tend to obsess about high risk problems and may take a long time (sometimes forever) producing anything valuable.

Chapter 17 proposes the shared ethic among scientists rather than some abstract "scientific method" as chiefly responsible for the success of science. Chapter 18 draws a distinction between two types of scientists "Seers" and "Craftspeople". In fact pretty much the whole book is an elaboration of how and why the scientific establishment does not provide enough room for "Seers" who by nature like to obsess about high risk problems and need much longer incubation times.

I find the shared ethic of science to be one of the most important creations of human culture. I had long held the view that science was about "what is" and not about "what ought to be", thus science and ethics had nothing to do each other. Recently I started to see the ethic of scientists as people, if not the result of their work, as being very relevant. Dennis Overbye describes it best:

"Not only does science not provide any values of its own, say its detractors, it also undermines the ones we already have, devaluing anything it can’t measure, reducing sunsets to wavelengths and romance to jiggly hormones. It destroys myths and robs the universe of its magic and mystery. So the story goes. But this is balderdash. Science is not a monument of received Truth but something that people do to look for truth. That endeavor, which has transformed the world in the last few centuries, does indeed teach values. Those values, among others, are honesty, doubt, respect for evidence, openness, accountability and tolerance and indeed hunger for opposing points of view."

## February 23, 2010

### Biçimbilimsel Çözümleme

*********************************************
KOC UNIVERSITY
ELECTRICAL AND COMPUTER ENGINEERING
ECOE 590 SEMINAR
*********************************************

Date : 23 February 2010, Tuesday
Time : 17:00
Place : ENG B29
Title : Biçimbilimsel Çözümleme
Speaker : Gülşen Cebiroğlu Eryiğit

Doğal Dil İşleme’nin en temel seviyelerinden biri olan biçimbilimsel çözümleme, bir sözcüğün yapısının bilgisayarlar tarafından otomatik olarak çözümlenmesi işlemidir. Biçimbilimsel çözümleme işlemi sonucunda bir sözcüğün en küçük anlamlı birimleri olan morfemlerin (biçimbirimlerin) bulunması ve sözcük yapısının çözümlenmesi hedeflenmektedir. Örneğin “arabalar” sözcüğünün gövdesinin “araba” olduğu ve bu sözcüğün çoğul eki almış bir isim olduğunun otomatik olarak belirlenmesi bir biçimbilimsel çözümleme işlemidir. İşlem sırasında, sözcüğü oluşturan morfemlerin birbirlerinden ayrılmasından yola çıkılarak, bu işleme aynı zamanda Biçimbilimsel Ayrıştırma adı da verilmektedir.
Bu konuşmada, biçimbilimsel çözümleme konusu ayrıntılı olarak ele alınacak, kullanım alanları, genel yaklaşımlar ve Türkçe'nin biçimbilimsel çözümlemesi konusunda bilgi verilecektir.

## February 22, 2010

### CFP: Parser Evaluation using Textual Entailments (PETE)

Parser Evaluation using Textual Entailments (PETE)
(http://pete.yuret.com)

The purpose of this post is to encourage participation in the task "Parser Evaluation using Textual Entailments" in the 5th International Workshop on Semantic Evaluations, SemEval-2010 (http://semeval2.fbk.eu/semeval2.php) collocated with ACL-2010, July
15-16, Uppsala.

This shared task should be of interest to researchers working on
* parsing
* semantic role labeling
* recognizing textual entailments

Parser Evaluation using Textual Entailments (PETE) is a shared task in the SemEval-2010 Evaluation Exercises on Semantic Evaluation. The task involves recognizing textual entailments (RTE) based on syntactic information. Given two text fragments called 'Text' and 'Hypothesis', Textual Entailment Recognition is the task of determining whether the meaning of the Hypothesis is entailed (can be inferred) from the Text. The PETE task focuses on entailments that can be inferred using syntactic information alone.

• Text: The man with the hat was tired.

• Hypothesis-1: The man was tired. (YES)

• Hypothesis-2: The hat was tired. (NO)

Our goals in introducing this task are:

• To focus parser evaluation on semantically relevant phenomena.

• To introduce a parser evaluation scheme that is formalism independent.

• To introduce a targeted textual entailment task focused on a single linguistic competence.

• To be able to collect high quality evaluation data from untrained annotators.

The following criteria were used when constructing the entailments:

• They should be decidable using only syntactic inference.

• They should be easy to decide by untrained annotators.

• They should be challenging for state of the art parsers.

You can find more details about our entailment generation process in the PETE Guide. You can download the development and test datasets including gold answers and system scores here: PETE_gold.zip. There is no training data. The evaluation is similar to other RTE tasks. There is a Google group semeval-pete for task related messages.

Instructions:
Important Dates:

• February 19 - the development (trial) data available.

• March 26 - the test data available.

• April 2 - end of submission period for the task.

• April 17 - Submission of description papers.

• May 6 - Notification of acceptance.

• July 15-16 - Workshop at ACL 2010, Uppsala.

Parsers:

Here are some links for publicly available parsers that can be used in this task. You do not have to use any of these parsers, in fact you do not have to use a conventional parsing algorithm at all -- outside the box approaches are highly encouraged. However, to get a quick baseline system using an existing parser may be a good way to start.

• PETE Guide: A description of the entailment generation process (February, 2010).
• D09-1085.pdf: Rimell, L., S. Clark, and M. Steedman. Unbounded Dependency Recovery for Parser Evaluation. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (August, 2009).
• thesis.pdf: Onder Eker's MS thesis (August, 2009).

• semeval-abstract.pdf: The PETE task abstract (December, 2008).
• pete.pdf: The initial PETE task proposal (September, 2008).
• Workshop on Cross-Framework and Cross-Domain Parser Evaluation (August, 2008)
• natlog-wtep07-final.pdf: Bill MacCartney and Christopher D. Manning. 2007. Natural logic for textual inference. ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 193-200. (June, 2007).
• targeted textual entailments: On targeted textual entailments in general (June 2007).

• a blog post: On the consistency of Penn Treebank annotation (October, 2006).
• lre98.pdf: Carroll, J., E. Briscoe and A. Sanfilippo (1998) `Parser evaluation: a survey and a new proposal'. In Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain. 447-454.

Contact:

• Deniz Yuret dyuret@ku.edu.tr

## February 21, 2010

### Preprocessing with Linear Transformations that Maximize the Nearest Neighbor Classiﬁcation Accuracy

Mehmet Ali Yatbaz and Deniz Yuret. 1st CSE Student Workshop (CSW’10), 21 February 2010, Koc Istinye Campus, Istanbul. (PDF, PPT)

Abstract
We introduce a preprocessing technique for classification problems based on linear transformations. The algorithm incrementally constructs a linear transformation that maximizes the nearest neighbor classification accuracy on the training set. At each iteration the algorithm picks a point in the dataset, and computes a transformation
that moves the point closer to points in its own class and/or away from points in other classes. The composition of the resulting linear transformations lead to statistically significant improvements in instance based learning algorithms.

### L1 Regularization for Learning Word Alignments in Sparse Feature Matrices

Ergun Bicici and Deniz Yuret. 1st CSE Student Workshop (CSW’10), 21 February 2010, Koc Istinye Campus, Istanbul. (PDF, Poster)

Abstract
Sparse feature representations can be used in various domains. We compare the effectiveness of $L_1$ regularization techniques for regression to learn mappings between features given in a sparse feature matrix. We apply these techniques for learning word alignments commonly used for machine translation. The performance of the learned mappings are measured using the phrase table generated on a larger corpus by a state of the art word aligner. The results show the effectiveness of using $L_1$ regularization versus $L_2$ used in ridge regression.

## February 19, 2010

### The Noisy Channel Model for Unsupervised Word Sense Disambiguation

Deniz Yuret and Mehmet Ali Yatbaz. Computational Linguistics, Volume 36, Number 1, March 2010. (Abstract, PDF)

Abstract: We introduce a generative probabilistic model, the noisy channel model, for unsupervised word sense disambiguation. In our model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular meaning S using a possibly ambiguous word W. To reconstruct the intended meaning the hearer uses the distribution of possible meanings in the given context P(S|C) and possible words that can express each meaning P(W|S). We assume P(W|S) is independent of the context and estimate it using WordNet sense frequencies. The main problem of unsupervised WSD is estimating context dependent P(S|C) without access to any sense tagged text. We show one way to solve this problem using a statistical language model based on large amounts of untagged text. Our model uses coarse-grained semantic classes for S internally and we explore the effect of using different levels of granularity on WSD performance. The system outputs fine grained senses for evaluation and its performance on noun disambiguation is better than most previously reported unsupervised systems and close to the best supervised systems.

Full post...

## February 17, 2010

### Emacs Turkish mode

This is for people trying to type Turkish documents on a U.S. keyboard using Emacs. The program provides a turkish-mode in which the correct Turkish accents are added to the ascii version of the last word typed each time the user hits space. The latest version is available here.

It was inspired by Gökhan Tür's deasciifier:
http://www.hlst.sabanciuniv.edu/TL/deascii.html

The program uses decision lists (included at the end of this file) which was created based on 1 million words of Turkish news text using the GPA algorithm. For more information on GPA see the Greedy prepend algorithm for decision list induction.

To activate the program first load this file into emacs:
Then turn on the turkish mode:
M-x turkish-mode

When Turkish mode is enabled, the space, tab, and enter keys correct the previous word by adding Turkish accents. For corrections use C-t to toggle the accent of the character under the cursor.

## February 14, 2010

### Permutasyon, Kombinasyon ve 12 kardeşleri

Bilgisayar Olimpiyat Kampından Notlar I

Temel sayma problemlerinin çoğu iki sonlu kümenin elemanları arasında kaç çeşit eşleme yapılabileceği sorularına indirgenebilir. Örneğin bir basketbol takımının K oyuncusunu bir otelin N odasına yerleştirmeye çalıştırdığımızı düşünelim. Öncelikle bir odada kaç kişi kalabileceği sorusuna göre üç farklı tip problem tanımlayabiliriz:

A. Bir odada herhangi sayıda oyuncu kalabilir.
B. Bir odada en çok bir kişi kalmalı (K <= N).
C. Bir odada en az bir kişi kalmalı (K >= N).

Diğer yandan K ya da N küme elemanlarının kimliklerinin önemli olup olmadığı, ya da permutasyonlarının (değişik sıralamalarının) farklı çözüm olarak sayılıp sayılmadığı da bize dört farklı tip problem verir. Bu problem tiplerini değişik objelerle örneklendirecek olursak:

0. K ve N elemanlarının kimlikleri önemli. (K kişiyi N odaya dağıtıyoruz)
1. K'lar birbiriyle eş, N'ler farklı. (K topu N odaya dağıtıyoruz)
2. K'lar farklı, N'ler birbiriyle eş. (K kişiyi N takıma ayırıyoruz)
3. K'lar da N'ler de birbiriyle eş. (K topu N gruba ayırıyoruz)

Burada birbiriyle eşlenmesi daha kolay olduğu için kişi yerine top, oda yerine takım ya da grup kullandık. (0) probleminde kimin hangi odada kaldığı önemliyken, (1) probleminde sadece hangi odada kaç top olduğu önemli. (2) probleminde ABCD isminde dört kişiyi (AB) ve (CD) gibi iki takıma ayırmakla (CD) ve (AB) gibi iki takıma ayırmak aynı çözüm sayılıyor. (3) probleminde ise ne topların ne grupların kimliği önemli, burada örneğin 6 topu 3+2+1 olarak ayırmak ile 4+1+1 olarak ayırmak farklı çözümler sayılsa da 3+2+1 ile 1+2+3 aynı çözüm sayılıyor.

Matematikçi Gian-Carlo Rota bu seçenekleri göz önüne alarak temel sayma problemlerini A0, A1, ..., C2, C3 şeklinde 12 kategoride sınıflandırmayı öneriyor. Örneğin lisede öğrendiğimiz permutasyon B0, kombinasyon B1 problemlerine denk geliyor. Malesef bu sınıflamanın dışında kalan Catalan objelerini, permutasyon tiplerini, ağaç tiplerini sayma problemleri de var fakat onları başka bir notta ele alırız.

Bu problemlerin bazıları şu teknikle çözülebilir: saymamız istenilen çözümlerden herhangi birini oluşturmak için K aşama tanımlayalım. i'nci aşamada $n_i$ potansiyel farklı seçenek olsun. İzleyebileceğimiz farklı yol sayısı bu durumda $n_1 n_2 \ldots n_k$ çarpımı olarak ifade edilebilir. Hemen bir iki örnek:

A0. K kişiyi N odaya herhangi bir sınırlama olmadan dağıtmak için her kişiye tek tek hangi odayı istediklerini sorarız. K sorunun (aşamanın) her birine N farklı cevap (seçenek) gelme olasılığı var. Dolayısıyla potansiyel çözüm sayımız $N^K$ olur.

B0. K kişiyi N odaya her odada en fazla bir kişi kalacak şekilde dağıtmak için yine herkese tek tek hangi odayı istediklerini sorabiliriz. Fakat sırası gelen kişi kendinden önce seçilmiş odaları seçemez, boş odalardan birini seçebilir. Yani birinci kişi N oda, ikinci kişi (N-1) oda, üçüncü kişi (N-2) odadan birini seçebilir. Bu durumda çözüm sayımız N(N-1)...(N-K+1)=$N^\underline{K}$ olur.

Bazan yukarıda verdiğimiz sayma tekniği kendi başına yeterli olmaz, çözümleri bilerek fazla ya da eksik sayıp sonradan bir düzeltme yapmak gerekebilir.

B1. K topu N odaya her odada en fazla bir top olacak şekilde dağıtma problemine bakalım. Topların kimliği önemli olmadığına göre buna N odadan K tanesini seçme problemi olarak da bakabiliriz. Çözüme B0 gibi başlayıp her topa hangi odayı istediğini soralım. Örneğin 3 top ve 5 oda varsa 123, 231, 245 gibi $N^\underline{K}$ farklı cevap alabiliriz. Fakat topların kimlikleri önemli olmadığına göre 123, 132, 213, 231, 312, 321 cevaplarının hep aynı 3 odayı doldurdukları için aynı konfigürasyon olarak sayılmaları gerekir. Diğer bir deyişle aslında B0 çözümü kullanarak biz her konfigürasyonu 6=3!=K! defa saymış olduk. Doğru cevabı elde etmek için B0'daki cevabı K!'e bölerek $N^{\underline{K}} / K! = \tbinom{N}{K}$ formülünü elde ederiz.

Bazı problemlerde en pratik çözüm ise önce eldeki problemi saymasını bildiğimiz başka bir probleme dönüştürmekten geçer:

A1. K topu N odaya herhangi bir sınırlama olmaksızın dağıtmayı düşünelim. Bu topları N odaya koymakla onları sıraya dizip aralarına N-1 duvar yerleştirmek birbirine eş iki problem olarak görülebilir. İkinci problemde duvarlarla toplar K+N-1 pozisyon işgal ederler. Bu pozisyonlardan hangi K tanesini topların işgal edeceğini sorarsak $\binom{K+N-1}{K}$ çözümünü elde ederiz.

C1. K topu N odaya her odada en az bir top olacak şekilde dağıtma (K>=N) probleminde ise önce her odaya birer top koyarak boş oda kalmamasını garantiler, daha sonra kalan K-N top üzerinde A1 çözümünü uygulayabiliriz: $\binom{K-1}{K-N}$.

Son olarak bu problemlerin bir kısmının kapalı bir formülle çözümü yoktur. Bu durumlarda özyinelemeli (recursive) formüller aramak tek çıkar yol olabilir:

C2. K kişiyi N takıma (her takımda en az bir kişi olacak şekilde, K>=N) ayırma problemini ele alalım. Özyinelemeli çözümlerin ana fikri problemin daha küçük bir halini çözebildiğimizi varsaymak, ve buna dayanarak orijinal problemi çözmektir. C2 için önce K-1 kişiyi takımlara ayırmanın yollarını sayabildiğimizi varsayalım. K'inci kişiyi ne yapacağımızı düşünelim. Bu son kişi ya kendi başına bir takım olacak (bu durumda geri kalan K-1 kişi N-1 takım oluşturacak), ya da diğerlerinin oluşturduğu takımlardan birinin içine eklenecek (bu durumda geri kalan K-1 kişinin N takım oluşturması, ve son kişinin bunlardan birini seçmesi gerek). Bu dediklerimizi formüle dökersek: f(K, N) = f(K-1, N-1) + N f(K-1, N). Formülü tamamlamak için problemin en küçük halini elle çözebiliriz f(1,1) = 1. Bu çözüm bize genel olarak K elemanlı bir kümenin elemanlarının kaç değişik şekilde N parçaya ayrılabileceğini verir ve hesapladığımız f fonksiyonu da ikinci tip Stirling sayısı olarak bilinir.

C3. K topu N gruba ayırma problemi için ise en küçük grupta tek top olan çözümlerle birden fazla top olan çözümleri ayrı ayrı sayalım. En küçük grupta tek top olan çözümlerin sayısını geri kalan K-1 topu N-1 gruba ayırarak bulabiliriz. En küçük grupta birden fazla top olan çözümleri ise (daha önce C1'de yaptığımız gibi) önce her gruba garanti olan birer topu koyup geri kalan K-N topu N gruba ayırarak sayabiliriz. Formüle dökecek olursak: f(K, N) = f(K-1, N-1) + f(K-N, N) ve f(1, 1) = 1. Bu çözüm bize genel olarak bir K tamsayısının kaç değişik şekilde N parçaya ayrılabileceğini verir.

Diğer çözdüğümüz problemlerde de benzer tekniklerle özyinelemeli formüller bulabiliriz. Örneğin:

A0. K-1 kişi hangi odalarda kalacaklarına karar verdikten sonra son kişi yine N odadan birini seçer.
$N^K$: f(N,K) = N f(N,K-1)

$N^{\underline{K}}$: f(N,K) = f(N-1,K) + K f(N-1,K-1)
$\tbinom{N}{K}$: f(N,K) = f(N-1,K-1) + f(N-1,K)