## October 29, 2011

### Gamma distribution

The Gamma distribution is often used as a prior for positive random variables just like the Gaussian distribution for real valued random variables.  The purpose of this post is to build some intuition about how the two parameters, the shape parameter "a" and the scale parameter "b", effect the behavior of a Gamma random variable.  In particular we will show that for vague Gamma parameters (a<<1) the distribution almost acts like an upper bound on the random variable.

Here is the Gamma PDF:

$f(x) = \frac{1}{\Gamma(a) b} (\frac{x}{b})^{a-1} e^{-x/b} \;\; x\geq 0; a , b>0$

The mean is ab and the variance is ab².  When a=1 it is equivalent to the exponential distribution.  In fact when a is an integer, it is equivalent to the sum of (a) independent exponentially distributed random variables each of which has a mean of (b).  It is shaped like the exponential distribution with a spike at 0 for a<1, but has a mode at (a-1)b for a>1 (see the Wikipedia article).

MacKay suggests representing the positive real variable x in terms of its logarithm z=ln x (ITILA, pp. 314).  This will give us a better idea about the order of magnitude of typical x in terms of a and b.  The distribution in terms of z is:

$f(z) = \frac{1}{\Gamma(a)} (\frac{x}{b})^a e^{-x/b} \;\; z \in \Re; x=e^z; a, b>0$

We can get an idea about the shape of f(z) by looking at its first two derivatives with respect to z:

$f'(z) = f(z) (a-\frac{x}{b})$
$f''(z) = f(z) (a^2 - (2a+1)\frac{x}{b} + (\frac{x}{b})^2)$

The graph above shows f(z) and its two derivatives for a=1/10 and b=10. The first derivative tells us that f(z) has a single mode at x=ab. Note that x=ab is the mean of f(x) but only the mode (not the mean) of f(z). The curve raises slowly on the left of the mode and falls sharply on the right. The second derivative has two roots that give us the values with the minimum and the maximum slope:

$x = ab + \frac{b}{2} \pm \frac{b}{2} \sqrt{1+4a}$.

Now we are going to look at the limit where a<<1, typically used as a vague prior. The height of the mode at x=ab is aae-a/Γ(a). Γ(a) is well approximated by 1/a for small a, aa and e-a both go to 1, so f(z) ≈ a at the mode.

Next, let's look at the right side (x>ab) where f(z) seems to fall sharply.  According to the roots of the second derivative given above, the minimum slope occurs at around x=b (if we ignore the terms with a<<1).  The value of f(z) when x=b is 1/(e Γ(a)).  Γ(a) is well approximated by 1/a for small a, so this value is approximately a/e.  The slope at x=b is approximately -a/e and if we fit a line at that point the line would cross 0 at x=eb. Thus for small a, the probability can be considered negligible for x>eb.

Next, let's look at the left side (x < ab) where f(z) appears more flat.  The maximum slope occurs around x=a²b (if we approximate √ 1+4a with 1+2a-2a²).  The slope at x=a²b is approximately a² which gives a flat shape for x<ab when a<<1.

In summary, when used with a<<1, f(z) rises slowly for x<ab (with approximate slope a²) and falls sharply for x>ab (with approximate slope -a/e).  You are unlikely to see x values larger than eb from such a distribution, but you may see values much smaller than the mean ab.  Thus a vague Gamma prior is practically putting an upper bound on your positive value.  The figure below shows how the f(z) distribution starts looking like a step function as the shape parameter approaches 0 (b=1/a and the peak heights have been matched for comparison).

I should also note that in the limit where a→0 and ab=1, we get an improper prior where f(z) becomes flat and the Gamma distribution becomes indifferent to the order of magnitude of the random variable. However it flattens a lot faster on the left than on the right.

Full post...

## August 16, 2011

### Ergun Biçici, Ph.D. 2011

Current position: Researcher at Bogazici University, Istanbul. (webpage).
PhD Thesis:The Regression Model of Machine Translation. Koç University, Department of Computer Engineering. August, 2011. (PDF, Presentation).
Publications: bibtex.php, scholar

Abstract:
Machine translation is the task of automatically finding the translation of a source sentence in the target language. Statistical machine translation (SMT) use parallel corpora or bilingual paired corpora that are known to be translations of each other to find a likely translation for a given source sentence based on the observed translations. The task of machine translation can be seen as an instance of estimating the functions that map strings to strings.

Regression based machine translation (RegMT) approach provides a learning framework for machine translation, separating learning models for training, training instance selection, feature representation, and decoding. We use the transductive learning framework for making the RegMT approach computationally more scalable and consider the model building step independently for each test sentence. We develop training instance selection algorithms that not only make RegMT computationally more scalable but also improve the performance of standard SMT systems. We develop better training instance selection techniques than previous work from given parallel training sentences for achieving more accurate RegMT models using less training instances.

We introduce L_1 regularized regression as a better model than L_2 regularized regression for statistical machine translation. Our results demonstrate that sparse regression models are better than L_2 regularized regression for statistical machine translation in predicting target features, estimating word alignments, creating phrase tables, and generating translation outputs. We develop good evaluation techniques for measuring the performance of the RegMT model and the quality of the translations. We use F_1 measure, which performs good when evaluating translations into English according to human judgments. F_1 allows us to evaluate the performance of the RegMT models using the target feature prediction vectors or the coefficients matrices learned or a given SMT model using its phrase table without performing the decoding step, which can be computationally expensive.

Decoding is dependent on the representation of the training set and the features used. We use graph decoding on the prediction vectors represented in n-gram or word sequence counts space found in the training set. We also decode using Moses after transforming the learned weight matrix representing the mappings between the source and target features to a phrase table that can be used by Moses during decoding. We demonstrate that sparse L_1 regularized regression performs better than L_2 regularized regression in the German-English translation task and in the Spanish-English translation task when using small sized training sets. Graph based decoding can provide an alternative to phrase-based decoding in translation domains having low vocabulary.

Full post...

## July 30, 2011

### RegMT System for Machine Translation, System Combination, and Evaluation

Ergun Bicici; Deniz Yuret. Proceedings of the Sixth Workshop on Statistical Machine Translation. pp. 323-329. Edinburgh, Scotland. July, 2011. (PDF, BIB, Proceedings, Poster)

Abstract: We present the results we obtain using our RegMT system, which uses transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. Our training instance selection methods perform feature decay for proper selection of training instances, which plays an important role to learn correct feature mappings. RegMT uses L2 regularized regression as well as L1 regularized regression for sparse regression estimation of target features. We present translation results using our training instance selection methods, translation results using graph decoding, system combination results with RegMT, and performance evaluation with the
F1 measure over target features as a metric for evaluating translation quality.

### Instance Selection for Machine Translation using Feature Decay Algorithms

Ergun Bicici; Deniz Yuret. Proceedings of the Sixth Workshop on Statistical Machine Translation. pp. 272-283. Edinburgh, Scotland. July, 2011. (PDF, BIB, Proceedings, Presentation)

Abstract: We present an empirical study of instance selection techniques for machine translation. In an active learning setting, instance selection minimizes the human effort by identifying the most informative sentences for translation. In a transductive learning setting, selection of training instances relevant to the test set improves the final translation quality. After reviewing the state of the art in the field, we generalize the main ideas in a class of instance selection algorithms that use feature decay. Feature decay algorithms increase diversity of the training set by devaluing features that are already included. We show that the feature decay rate has a very strong effect on the final translation quality whereas the initial feature values, inclusion of higher order features, or sentence length normalizations do not. We evaluate the best instance selection methods using a standard Moses baseline using the whole 1.6 million sentence English-German section of the Europarl corpus. We show that selecting the best 3000 training sentences for a specific test sentence is sufficient to obtain a score within 1 BLEU of the baseline, using 5% of the training data is sufficient to exceed the baseline, and a ~ 2 BLEU improvement over the baseline is possible by optimally selected subset of the training data. In out-of-domain translation, we are able to reduce the training set size to about 7% and achieve a similar performance with the baseline.

## June 21, 2011

### ACL 2011 Tutorials

Here are some notes from the ACL tutorials:

Formal and Empirical Grammatical Inference
Jeffrey Heinz, Colin de la Higuera and Menno van Zaanen

Jeffrey and Colin motivated and presented the main results for formal grammatical inference. Even though many theoretical results are negative, learning is usually possible by restricting the model class (to a well defined subset used by natural languages) or assuming a non-distribution-free setting. Colin's book was recommended as a good introduction to the theory. It would be interesting to see if Turkish morphotactics or morphophonemics fall into one of the easy-to-learn model subclasses. You can download the slides.

There is a recent trend for encoding prior knowledge in learning problems not in the prior distributions but later in the learning process. The prior knowledge usually comes in the form of feature-class expectations and guiding the model toward the correct expectations is only possible after considering the input. Posterior regularization, constraint driven learning, and generalized expectation criteria seem to be related implementations of this idea. You can download the slides.

Dual Decomposition for Natural Language Processing
Michael Collins and Alexander M Rush

I did not attend this one but here are the slides.

Full post...

## May 23, 2011

### Semantic Structures by Ray Jackendoff

Jackendoff divides the study of the language faculty into three components: phonological, syntactic, and conceptual. In speech recognition a model that tells us what word sequences are likely is essential to recognition accuracy, because the acoustic information is often ambiguous (e.g. "She kissed this guy." vs. "She kissed the sky."). Stated in Jackendoff's terms, a probabilistic model of the syntactic component is helpful in disambiguating what is going on in the phonological component. Similarly I think a model of what is likely in the conceptual component is essential to resolving ambiguities in the syntactic component. If "what is likely to be said" is essential in interpreting "what is heard", then "what is likely to be meant" is similarly essential in interpreting "what is said".

Unfortunately we do not have good conceptual models yet, so computational linguists still try to make do with error prone hand tagging and shallow machine learning to disambiguate senses, references, and relations.

On a side note, each component in Jackendoff's work is modeled after the generative paradigm which, for the syntactic component, is described as follows:
1. Speakers can understand and create an indefinite number of sentences they have never heard before.
2. Therefore the repertoire of syntactic structures cannot be characterized as a finite list of sentences.
3. Nor can it be characterized as an infinite list of possible sentences because we have finite brains.
4. Thus it MUST be mentally encoded in terms of a finite set of primitives and a finite set of principles of combination that collectively generate the class of possible sentences.

Am I the only one befuddled by this argument? Primitives plus means of combination is certainly one way to create infinity using finite means, but why assume it is the only way? Dynamic systems, random processes, who knows what else can lead to infinite possible outcomes from a finite initial endowment. Why just present two strawmen, finite and infinite lists, as the only alternatives to discrete primitives and combination? Why after a couple of paragraphs further narrow the description to "the argument from creativity to the NECESSITY for principles or rules in syntactic knowledge"? Discrete primitives with finite and definite constraints and rules of combination is one way to build a representational system, unlikely to be the correct way for all three components of language, and certainly not the only way.

## May 11, 2011

### The Noisy Channel Model for Unsupervised Word Sense Disambiguation

Seminar presentation at ITU on our CL paper.
Full post...

## March 24, 2011

### From Eternity to Here by Sean Carroll

Why is the future different than the past? The fundamental rules of physics seem as symmetric in time as they are for different directions of space, yet you can't unscramble an egg and you can't remember the future. The answer seems to lie in the concept of entropy, which tends to increase over time in a closed system unlike energy and momentum which stay constant. Yet, entropy is one of these concepts that lead to quite a bit of confusion, sort of like (and sometimes for the same reasons as) probability. Sean Carrol's wonderful book tackles the mysteries of entropy and arrow of time. Here are some entropy puzzles and some further reading:

Entropy puzzles:
• If you find yourself in a universe with low entropy it is as likely to have come from a higher entropy state as it is to evolve into a higher entropy state. So how do you know you are not a Boltzmann brain?
• Take a box with a partition in it, with gas A on one side, gas B on the other side, and both gases are at the same temperature and pressure. Remove the partition. If gas A and B are different gases, there is an entropy that arises due to the mixing. If the gases are the same, no additional entropy is calculated. What if you thought they were the same gas and years later it was discovered that they happened to be two different isotopes? (See Gibbs paradox and E.T. Jaynes' paper). More generally this microstate / macrostate business seems completely user defined and arbitrary, so how can it have real physical effects?
• In a reversible system there must be just as many paths that decrease the entropy as that increase the entropy. Why don't we observe as many of the first type as the second?
• A gas squeezed in the corner of a room will tend to spread thereby increase its disorder and entropy. If we add an attractive force like gravity matter seems to clump together rather than spread out. How does clumping together increase entropy?
• A rotting plant turns into dust and gas which increases disorder and entropy. A seed turns a bunch of gas and dust into a full grown tree which seems to decrease entropy. This can only happen because the seed is not a closed system and is using the energy from the sun and ends up increasing the overall entropy of the universe at the end. When, how, and why does this type of thing happen?

## March 17, 2011

### Bologna Translation Service (2011-2013)

The Bologna Translation Service is designed for semi-automatic translation of course syllabi and study programmes from 9 languages - Dutch, English, Finnish, French, German, Portuguese, Spanish, Swedish and Turkish - into English. It helps speed up translation and revision processes, reducing costs and human resources at the same time. The project has received funding from the European Community (ICT-PSP 4th call) under Grant Agreement no 270915 from March 2011 to March 2013. Project partners included Cross Language (Belgium), Convertus (Sweden), ALS - Traslan Teoranta (Ireland), Eleka Ingeniaritza Linguisitikoa (Spain), and Koç University (Turkey).

Full post...

## January 03, 2011

### Plausibility vs. Inference

Here are two examples of common sense judgement:

(1) I saw the Statue of Liberty flying over New York. => I was flying over New York.
(2) I gave the book to Mary. => Mary has the book.

The first one involves syntactic disambiguation. Either the subject or the object could be doing the flying (consider "I saw the airplane flying over New York.") Our "common sense" tells us that the Statue is too big to fly, so I am the more likely one flying.

The second one involves a semantic inference. The meaning of give involves a physical transfer or a transfer of possession, as a consequence the item given ends up with the recipient. Our "common sense" is full of such little factoids (here is another: "Oswald killed Kennedy." => "Kennedy is dead.") which let us see beyond what is explicitly stated in the text.

I want to emphasize that these examples are qualitatively different and calling them both "common sense judgements" may be confusing. The first one is a plausibility judgement (which is more likely to fly: me or the statue?). The second one is an exact inference, i.e. "give" definitely causes transfer and "kill" definitely causes death. To solve the first one we need a model of what is more likely to be happening in the world. To solve the second one we need more traditional inference of what entails what.

Disambiguation problems in computational linguistics (word sense disambiguation, resolving syntactic ambiguities, etc.) rely on plausibility judgements, not exact inference. A lot of work in AI "common sense reasoning" will not help there because traditionally reasoning and inference work focus on exact judgements.

As far as I can see nobody is working on plausibility judgements explicitly. Researchers use corpus statistics as a proxy to solve disambiguation problems. This may be obscuring the real issue: I think the right way to do linguistic disambiguation is to have a model of what is plausible in the world.

Full post...