I am an associate professor in Computer Engineering at Koç University in Istanbul working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. For prospective students here are some research topics, papers, classes, blog posts and past students.
Koç Üniversitesi Bilgisayar Mühendisliği Bölümü'nde öğretim üyesiyim ve Yapay Zeka Laboratuarı'nda çalışıyorum. Bundan önce MIT Yapay Zeka Laboratuarı'nda çalıştım ve Inquira, Inc. şirketini kurdum. Araştırma konularım doğal dil işleme ve yapay öğrenmedir. İlgilenen öğrenciler için araştırma konuları, makaleler, verdiğim dersler, Türkçe yazılarım, ve mezunlarımız.

August 08, 2014

Volkan Cirik, M.S. 2014

Current position: Masters in Language Technologies, Carnegie Mellon University, Language Technologies Institute (LinkedIn).
M.S. Thesis: Analysis of SCODE Word Embeddings based on Substitute Distributions in Supervised Tasks. Koç University, Department of Computer Engineering. August, 2014. (PDF, Presentation, word vectors (github), word vectors (dropbox))
Publications: bibtex.php

One of the interests of the Natural Language Processing (NLP) community is to find representations for lexical items using large amount of unlabeled data. Inducing low-dimensional, continuous, dense word vectors, or word embeddings, have become the principal technique to find representations for words. Word embeddings address the issues of the classical categorical representation of words by capturing syntactic and semantic information of words in the dimensions of a vector. These representations are shown to be successful across NLP tasks including Named Entity Recognition, Part-of-speech Tagging, Parsing, and Semantic Role Labeling.

In this work, I analyze a word embedding method in supervised Natural Language Processing (NLP) tasks. The framework maps words on a sphere such that words co-occurring in similar contexts lie closely. The similarity of contexts is measured by the distribution of substitutes that can fill them. I compared word embeddings, including more recent representations, in Named Entity Recognition (NER), Chunking, and Dependency Parsing. I examine the framework in a multilingual setup as well. The results show that the examined method achieves as good as or better results compared to the other word embeddings. The framework is consistent in improving the baseline systems across languages and achieves state-of-the-art results in multilingual dependency parsing.

Full post...

June 26, 2014

Probabilistic Modeling of Joint-context in Distributional Similarity

Oren Melamud, Ido Dagan, Jacob Goldberger, Idan Szpektor, and Deniz Yuret. In the Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL-2014). (Download PDF)

Abstract: Most traditional distributional similarity models fail to capture syntagmatic patterns that group together multiple word features within the same joint context. In this work we introduce a novel generic distributional similarity scheme under which the power of probabilistic models can be leveraged to effectively model joint contexts. Based on this scheme, we implement a concrete model which utilizes probabilistic n-gram language models. Our evaluations suggest that this model is particularly well-suited for measuring similarity for verbs, which are known to exhibit richer syntagmatic patterns, while maintaining comparable or better performance with respect to competitive baselines for nouns. Following this, we propose our scheme as a framework for future semantic similarity models leveraging the substantial body of work that exists in probabilistic language modeling.

Full post...

June 11, 2014

Online kernel methods for parsing

Here are the slides from a recent "work in progress" talk I gave at the AI lab.
Full post...

June 08, 2014

SVMs are not sparse

After getting bogged down by 300K support vectors trying a new model for parsing based on the structured kernel perceptron, I decided to look at the sparsity issue. Surely there must be models that express the same hypothesis using fewer support vectors, and possibly sophisticated algorithms like the SVM could find them. Alas, it was not meant to be. After a few weeks of fiddling around I found out that being sparse is not one of the strong suits of the SVM. For example using the same kernel (4th degree polynomial) on the MNIST 9vs4 (separable) problem, SVM ends up with 1047 support vectors, perceptron gives the same performance with 537. In hindsight maybe this is not surprising, for the perceptron every point on the wrong side of the boundary becomes a support vector, for SVM it is every point on the wrong side of the margin. More importantly, if the problem is not separable, both algorithms will keep growing their models linearly with the number of examples: every mistake becomes a support vector and we never run out of mistakes! But that's a topic for another post, for this one let's look at SVMs and whether they are supposed to be sparse.

The SVM algorithm solves the following constrained optimization problem: \[ \min_{w,b} \frac{1}{2}|w|^2 + C\sum_i \xi_i\; \mbox{ s.t. }\; y_i(w^\top x_i+b)\geq 1-\xi_i,\; \xi_i \geq 0 \] which is minimizing a combination of the L2 norm of \(w\) and the total hinge loss \(\xi_i=(1-y_i(w^\top x_i+b))_+\). We know the L2 regularization drives \(w_i\) toward zero but not the whole way, so \(w\) won't be sparse. To see about sparsity of support vectors we need to look at the dual form. The Lagrangian for the above constrained optimization is: \[ L = \frac{1}{2}|w|^2 + C\sum_i \xi_i - \sum_i \alpha_i (y_i(w^\top x_i+b)-1+\xi_i) - \sum_i \beta_i \xi_i \] where \(\alpha_i \geq 0, \beta_i \geq 0\) are called the dual variables, and by setting various derivatives to zero we learn what the solution looks like: \[\begin{eqnarray*} \nabla_w L &=& w - \sum_i \alpha_i y_i x_i = 0 \\ \partial_b L &=& \sum_i \alpha_i y_i = 0 \\ \partial_{\xi_i} L &=& C - \alpha_i - \beta_i = 0 \end{eqnarray*}\] The first equation tells us that we can write \(w\) as a linear combination of training instances with weights \(\alpha_i\), and the last equation tells us that \(\alpha_i + \beta_i = C\) for each training instance. In fact there are three types of instances:

  1. \(\alpha_i = 0,\; \beta_i = C,\; \xi_i = 0,\; y_i(w^\top x_i+b)\gt 1 \): These are the points that satisfy the margin.
  2. \(0\lt\alpha_i\lt C,\; 0\lt\beta_i\lt C,\; \xi_i = 0,\; y_i(w^\top x_i+b)=1 \): These are the points exactly at the margin.
  3. \(\alpha_i = C,\; \beta_i = 0,\; \xi_i \gt 0,\; y_i(w^\top x_i+b)\lt 1 \): These are the points that violate the margin.
By rearranging the terms in \(L\) and writing them in terms of \(\alpha\) we get the following dual problem: \[ \max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j)\; \mbox{ s.t. } 0 \leq \alpha_i \leq C,\; \sum_i \alpha_i y_i = 0\] The first time I saw this I said "Aha!", \(\sum_i \alpha_i\) is the L1 norm of \(\alpha\) (all \(\alpha_i \geq 0\)), so here is the source of sparsity! Then I realized that there was a \(max\), not a \(min\) at the beginning :( In any case this does not look like a simple monotonic function of \(\alpha\), so it is not clear what effect maximizing \(F\) will have on \(\alpha\). We need to look a bit deeper.

It turns out the optimal value of the original minimization problem and the optimal value of the dual maximization problem are the same. The dual always lower bounds the original problem, and in this case the bound is tight. So the following holds at the optimum \(w, \alpha\): \[ \frac{1}{2}|w|^2 + C\sum_i \xi_i = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j) \] The last sum is equal to \(|w|^2\), so rearranging gives us: \[ \|\alpha\|_1 = \|w\|_2^2 + C \sum_i \xi_i \] which means for a separable problem where all \(\xi_i=0\), we have \(\|\alpha\|_1 = \|w\|_2^2\)! If we are minimizing the L2 norm of \(w\), we are also minimizing the L1 norm of the corresponding \(\alpha\)! So how come the perceptron solves the 9vs4 on MNIST with half the number of support vectors?

I think it has to do with the fact that the SVM support vectors are always at (or on the wrong side of) the margin. Sometimes instances on the correct side of the margin may let us express the same hypothesis more compactly. The following picture may help:

We see a bunch of points at the margin, (at least some of) which will be picked by the SVM as support vectors. We also see two points at the poles. A perceptron which encounters one of the two pole points first will never make a mistake again, so it will express the same hypothesis more compactly. RVMs are also known to pick interior points as support vectors and generate more compact models.

Questions for the curious reader:
  • Can the projectron solve the problem given in the figure compactly given a bad ordering of points?
  • Can we maximize the margin while minimizing the number of support vectors?
  • Can we prove learning bounds on hypotheses representable by a small number of support vectors?
  • Does L1 SVM sparsify the dual and does it lead to fewer support vectors?
  • Can we have nonparametric algorithms which stop growing when no longer necessary and still provide good bounds?

Full post...

May 18, 2014

Reddit AMA with Yann Lecun

Yann LeCun recently became the director of Facebook AI research and gave an online interview. Here are some interesting excerpts for students of machine learning:
  • If you are an undergrad, take as many math and physics course as you can, and learn to program. If you are an aspiring grad student: apply to schools where there is someone you want to work with. It's much more important that the ranking of the school (as long as the school is in the top 50).
  • Read, learn from on-line material, try things for yourself. As Feynman said: don't read everything about a topic before starting to work on it. Think about the problem for yourself, figure out what's important, then read the literature. This will allow you to interpret the literature and tell what's good from what's bad.
  • Don't get fooled by people who claim to have a solution to Artificial General Intelligence, who claim to have AI systems that work "just like the human brain", or who claim to have figured out how the brain works (well, except if it's Geoff Hinton making the claim). Ask them what error rate they get on MNIST or ImageNet.
  • We are using Torch for most of our research projects (and some of our development projects) at Facebook. Deep Mind is also using Torch in a big way (largely because my former student and Torch-co-maintainer Koray Kavukcuoglu sold them on it). Since the Deep Mind acquisition, folks in the Google Brain group in Mountain View have also started to use it.
  • Kernel methods are great for many purposes, but they are merely glorified template matching. Despite the beautiful math, a kernel machine is nothing more than one layer of template matchers (one per training sample) where the templates are the training samples, and one layer of linear combinations on top.
  • There is nothing magical about margin maximization. It's just another way of saying "L2 regularization" (despite the cute math).
  • There is no opposition between deep learning and graphical models. Many deep learning approaches can be seen as factor graphs. I posted about this in the past.
  • (On most promising current areas of research): Representation learning (the current crop of deep learning methods is just one way of doing it); learning long-term dependencies; marrying representation learning with structured prediction and/or reasoning; unsupervised representation learning, particularly prediction-based methods for temporal/sequential signals; marrying representation learning and reinforcement learning; using learning to speed up the solution of complex inference problems; theory: do theory (any theory) on deep learning/representation learning; understanding the landscape of objective functions in deep learning; in terms of applications: natural language understanding (e.g. for machine translation), video understanding; learning complex control.
  • (On HTM): Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it's the very idea of deep learning) and will use temporal prediction. But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?). I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice. As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.
  • Do you think having a PhD is important if you want to work in a good research team in the industry?

Full post...