Deniz Yuret's Homepage: 2014

November 12, 2014

Some starting points for deep learning and RNNs

This version 2016-03-22 cross posted at www.aistartups.org, older versions: 2014-11-12, 2014-05-17.

Bengio, LeCun, Jordan, Hinton, Schmidhuber, Ng, de Freitas and OpenAI have done reddit AMA's. These are nice places to start to get a Zeitgeist of the field.

Hinton and Ng lectures at Coursera, UFLDL, CS224d and CS231n at Stanford, the deep learning course at Udacity, and the summer school at IPAM have excellent tutorials, video lectures and programming exercises that should help you get started. (Update 2016-08-22: Unfortunately coursera closed down their old courses, but here are youtube links for Neural Nets by Geoffrey Hinton, Machine Learning by Andrew Ng, Reinforcement Learning by David Silver, Probabilistic Graphical Models by Daphne Koller, Convex Optimization by Stephen Boyd).

The online book by Nielsen, notes for CS231n, and blogs by Karpathy, Olah and Britz have clear explanations of MLPs, CNNs and RNNs. The tutorials at UFLDL and deeplearning.net give equations and code. The encyclopaedic book by Goodfellow et al. is a good place to dive into details. I have a draft book in progress.

Theano, Torch, Caffe, ConvNet, TensorFlow, MXNet, CNTK, Veles, CGT, Neon, Chainer, Blocks and Fuel, Keras, Lasagne, Mocha.jl, Deeplearning4j, DeepLearnToolbox, Currennt, Project Oxford, Autograd (for Torch), Warp-CTC are some of the many deep learning software libraries and frameworks introduced in the last 10 years. convnet-benchmarks and deepframeworks compare the performance of many existing packages. I am working on developing an alternative, Knet.jl, written in Julia supporting CNNs and RNNs on GPUs and supporting easy development of original architectures. More software can be found at deeplearning.net.

Deeplearning.net and homepages of Bengio, Schmidhuber have further information, background and links.

September 30, 2014

Mode coupling points to functionally important residues in myosin II.

Onur Varol, Deniz Yuret, Burak Erman and Alkan Kabakçıoğlu. 2014. Proteins: Structure, Function, and Bioinformatics, vol 82, no 9, pp 1777--1786, September. (PDF)

Abstract: Relevance of mode coupling to energy/information transfer during protein function, particularly in the context of allosteric interactions is widely accepted. However, existing evidence in favor of this hypothesis comes essentially from model systems. We here report a novel formal analysis of the near-native dynamics of myosin II, which allows us to explore the impact of the interaction between possibly non-Gaussian vibrational modes on fluctuational dynamics. We show that an information-theoretic measure based on mode coupling alone yields a ranking of residues with a statistically significant bias favoring the functionally critical locations identified by experiments on myosin II.

August 23, 2014

Unsupervised Instance-Based Part of Speech Induction Using Probable Substitutes

Mehmet Ali Yatbaz, Enis Sert, Deniz Yuret. COLING 2014. (PDF. This is a token based and multilingual extension of our EMNLP 2012 model. Up to date versions of the code can be found at github.) Abstract: We develop an instance (token) based extension of the state of the art word (type) based part-of-speech induction system introduced in (Yatbaz et al. 2012). Each word instance is represented by a feature vector that combines information from the target word and probable substitutes sampled from an n-gram model representing its context. Modeling ambiguity using an instance based model does not lead to significant gains in overall accuracy in part-of-speech tagging because most words in running text are used in their most frequent class (e.g. 93.69% in the Penn Treebank). However it is important to model ambiguity because most frequent words are ambiguous and not modeling them correctly may negatively affect upstream tasks. Our main contribution is to show that an instance based model can achieve significantly higher accuracy on ambiguous words at the cost of a slight degradation on unambiguous ones, maintaining a comparable overall accuracy. On the Penn Treebank, the overall many-to-one accuracy of the system is within 1% of the state-of-the-art (80%), while on highly ambiguous words it is up to 70% better. On multilingual experiments our results are significantly better than or comparable to the best published word or instance based systems on 15 out of 19 corpora in 15 languages. The vector representations for words used in our system are available for download for further experiments.

August 08, 2014

Volkan Cirik, M.S. 2014

Current position: Member of Technical Staff at Cohere, London (LinkedIn).
M.S. Thesis: Analysis of SCODE Word Embeddings based on Substitute Distributions in Supervised Tasks. Koç University, Department of Computer Engineering. August, 2014. (PDF, Presentation, word vectors (github), word vectors (dropbox))
Publications: bibtex.php

Abstract
One of the interests of the Natural Language Processing (NLP) community is to find representations for lexical items using large amount of unlabeled data. Inducing low-dimensional, continuous, dense word vectors, or word embeddings, have become the principal technique to find representations for words. Word embeddings address the issues of the classical categorical representation of words by capturing syntactic and semantic information of words in the dimensions of a vector. These representations are shown to be successful across NLP tasks including Named Entity Recognition, Part-of-speech Tagging, Parsing, and Semantic Role Labeling.

In this work, I analyze a word embedding method in supervised Natural Language Processing (NLP) tasks. The framework maps words on a sphere such that words co-occurring in similar contexts lie closely. The similarity of contexts is measured by the distribution of substitutes that can fill them. I compared word embeddings, including more recent representations, in Named Entity Recognition (NER), Chunking, and Dependency Parsing. I examine the framework in a multilingual setup as well. The results show that the examined method achieves as good as or better results compared to the other word embeddings. The framework is consistent in improving the baseline systems across languages and achieves state-of-the-art results in multilingual dependency parsing.

June 26, 2014

Probabilistic Modeling of Joint-context in Distributional Similarity

Oren Melamud, Ido Dagan, Jacob Goldberger, Idan Szpektor, and Deniz Yuret. In the Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL-2014). (Download W14-1619)

Abstract: Most traditional distributional similarity models fail to capture syntagmatic patterns that group together multiple word features within the same joint context. In this work we introduce a novel generic distributional similarity scheme under which the power of probabilistic models can be leveraged to effectively model joint contexts. Based on this scheme, we implement a concrete model which utilizes probabilistic n-gram language models. Our evaluations suggest that this model is particularly well-suited for measuring similarity for verbs, which are known to exhibit richer syntagmatic patterns, while maintaining comparable or better performance with respect to competitive baselines for nouns. Following this, we propose our scheme as a framework for future semantic similarity models leveraging the substantial body of work that exists in probabilistic language modeling.

June 11, 2014

Online kernel methods for parsing

Here are the slides from a recent "work in progress" talk I gave at the AI lab.

June 08, 2014

SVMs are not sparse

After getting bogged down by 300K support vectors trying a new model for parsing based on the structured kernel perceptron, I decided to look at the sparsity issue. Surely there must be models that express the same hypothesis using fewer support vectors, and possibly sophisticated algorithms like the SVM could find them. Alas, it was not meant to be. After a few weeks of fiddling around I found out that being sparse is not one of the strong suits of the SVM. For example using the same kernel (4th degree polynomial) on the MNIST 9vs4 (separable) problem, SVM ends up with 1047 support vectors, perceptron gives the same performance with 537. In hindsight maybe this is not surprising, for the perceptron every point on the wrong side of the boundary becomes a support vector, for SVM it is every point on the wrong side of the margin. More importantly, if the problem is not separable, both algorithms will keep growing their models linearly with the number of examples: every mistake becomes a support vector and we never run out of mistakes! But that's a topic for another post, for this one let's look at SVMs and whether they are supposed to be sparse.

The SVM algorithm solves the following constrained optimization problem: \[ \min_{w,b} \frac{1}{2}|w|^2 + C\sum_i \xi_i\; \mbox{ s.t. }\; y_i(w^\top x_i+b)\geq 1-\xi_i,\; \xi_i \geq 0 \] which is minimizing a combination of the L2 norm of \(w\) and the total hinge loss \(\xi_i=(1-y_i(w^\top x_i+b))_+\). We know the L2 regularization drives \(w_i\) toward zero but not the whole way, so \(w\) won't be sparse. To see about sparsity of support vectors we need to look at the dual form. The Lagrangian for the above constrained optimization is: \[ L = \frac{1}{2}|w|^2 + C\sum_i \xi_i - \sum_i \alpha_i (y_i(w^\top x_i+b)-1+\xi_i) - \sum_i \beta_i \xi_i \] where \(\alpha_i \geq 0, \beta_i \geq 0\) are called the dual variables, and by setting various derivatives to zero we learn what the solution looks like: \[\begin{eqnarray*} \nabla_w L &=& w - \sum_i \alpha_i y_i x_i = 0 \\ \partial_b L &=& \sum_i \alpha_i y_i = 0 \\ \partial_{\xi_i} L &=& C - \alpha_i - \beta_i = 0 \end{eqnarray*}\] The first equation tells us that we can write \(w\) as a linear combination of training instances with weights \(\alpha_i\), and the last equation tells us that \(\alpha_i + \beta_i = C\) for each training instance. In fact there are three types of instances:

\(\alpha_i = 0,\; \beta_i = C,\; \xi_i = 0,\; y_i(w^\top x_i+b)\gt 1 \): These are the points that satisfy the margin.
\(0\lt\alpha_i\lt C,\; 0\lt\beta_i\lt C,\; \xi_i = 0,\; y_i(w^\top x_i+b)=1 \): These are the points exactly at the margin.
\(\alpha_i = C,\; \beta_i = 0,\; \xi_i \gt 0,\; y_i(w^\top x_i+b)\lt 1 \): These are the points that violate the margin.

By rearranging the terms in \(L\) and writing them in terms of \(\alpha\) we get the following dual problem: \[ \max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j)\; \mbox{ s.t. } 0 \leq \alpha_i \leq C,\; \sum_i \alpha_i y_i = 0\] The first time I saw this I said "Aha!", \(\sum_i \alpha_i\) is the L1 norm of \(\alpha\) (all \(\alpha_i \geq 0\)), so here is the source of sparsity! Then I realized that there was a \(max\), not a \(min\) at the beginning :( In any case this does not look like a simple monotonic function of \(\alpha\), so it is not clear what effect maximizing \(F\) will have on \(\alpha\). We need to look a bit deeper.

It turns out the optimal value of the original minimization problem and the optimal value of the dual maximization problem are the same. The dual always lower bounds the original problem, and in this case the bound is tight. So the following holds at the optimum \(w, \alpha\): \[ \frac{1}{2}|w|^2 + C\sum_i \xi_i = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j) \] The last sum is equal to \(|w|^2\), so rearranging gives us: \[ \|\alpha\|_1 = \|w\|_2^2 + C \sum_i \xi_i \] which means for a separable problem where all \(\xi_i=0\), we have \(\|\alpha\|_1 = \|w\|_2^2\)! If we are minimizing the L2 norm of \(w\), we are also minimizing the L1 norm of the corresponding \(\alpha\)! So how come the perceptron solves the 9vs4 on MNIST with half the number of support vectors?

I think it has to do with the fact that the SVM support vectors are always at (or on the wrong side of) the margin. Sometimes instances on the correct side of the margin may let us express the same hypothesis more compactly. The following picture may help:

We see a bunch of points at the margin, (at least some of) which will be picked by the SVM as support vectors. We also see two points at the poles. A perceptron which encounters one of the two pole points first will never make a mistake again, so it will express the same hypothesis more compactly. RVMs are also known to pick interior points as support vectors and generate more compact models.

Questions for the curious reader:

Can the projectron solve the problem given in the figure compactly given a bad ordering of points?
Can we maximize the margin while minimizing the number of support vectors?
Can we prove learning bounds on hypotheses representable by a small number of support vectors?
Does L1 SVM sparsify the dual and does it lead to fewer support vectors?
Can we have nonparametric algorithms which stop growing when no longer necessary and still provide good bounds?

May 18, 2014

Reddit AMA with Yann Lecun

Yann LeCun recently became the director of Facebook AI research and gave an online interview. Here are some interesting excerpts for students of machine learning:

If you are an undergrad, take as many math and physics course as you can, and learn to program. If you are an aspiring grad student: apply to schools where there is someone you want to work with. It's much more important that the ranking of the school (as long as the school is in the top 50).
Read, learn from on-line material, try things for yourself. As Feynman said: don't read everything about a topic before starting to work on it. Think about the problem for yourself, figure out what's important, then read the literature. This will allow you to interpret the literature and tell what's good from what's bad.
Don't get fooled by people who claim to have a solution to Artificial General Intelligence, who claim to have AI systems that work "just like the human brain", or who claim to have figured out how the brain works (well, except if it's Geoff Hinton making the claim). Ask them what error rate they get on MNIST or ImageNet.
We are using Torch for most of our research projects (and some of our development projects) at Facebook. Deep Mind is also using Torch in a big way (largely because my former student and Torch-co-maintainer Koray Kavukcuoglu sold them on it). Since the Deep Mind acquisition, folks in the Google Brain group in Mountain View have also started to use it.
Kernel methods are great for many purposes, but they are merely glorified template matching. Despite the beautiful math, a kernel machine is nothing more than one layer of template matchers (one per training sample) where the templates are the training samples, and one layer of linear combinations on top.
There is nothing magical about margin maximization. It's just another way of saying "L2 regularization" (despite the cute math).
There is no opposition between deep learning and graphical models. Many deep learning approaches can be seen as factor graphs. I posted about this in the past.
(On most promising current areas of research): Representation learning (the current crop of deep learning methods is just one way of doing it); learning long-term dependencies; marrying representation learning with structured prediction and/or reasoning; unsupervised representation learning, particularly prediction-based methods for temporal/sequential signals; marrying representation learning and reinforcement learning; using learning to speed up the solution of complex inference problems; theory: do theory (any theory) on deep learning/representation learning; understanding the landscape of objective functions in deep learning; in terms of applications: natural language understanding (e.g. for machine translation), video understanding; learning complex control.
(On HTM): Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it's the very idea of deep learning) and will use temporal prediction. But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?). I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice. As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.
Do you think having a PhD is important if you want to work in a good research team in the industry?
Yes.

May 17, 2014

How to learn about deep learning

In this post I will list resources I found useful when I was trying to find out about deep learning. Feel free to add more in the comments.

Neural Networks for Machine Learning: Hinton's course at Coursera is a wonderful introduction to deep learning. I recommend lectures 11-15 that cover Hopfield nets, Boltzmann machines, sigmoid belief nets and sparse auto-encoders.
UFLDL tutorial: Tutorial on unsupervised feature learning and deep learning by Andrew Ng. Includes lecture notes and programming exercises. The first programming exercise shows how sparse auto-encoders can self-learn features found in the visual cortex, and the rest shows how these unsupervised features improve performance on supervised visual recognition tasks like MNIST and STL10.
IPAM graduate summer school on deep learning and feature learning: has video lectures by many of the leaders of the field.
Torch7: is a Matlab-like environment for state-of-the-art machine learning algorithms used at the IPAM summer school and several research groups on deep learning.
deeplearning.net lists these and many other resources.

May 15, 2014

That pesky Z: multinomial vs log-linear generative models

I was thinking about when and how unsupervised learning is possible and practical. First of all we seem to need a generative model (can we even define a discriminative unsupervised model?). A generative model defines a joint probability distribution for all your data, in contrast to conditional models which only define p(y|x) where x is the "input" and y is the "output". LSP defines two types (parametrizations) of generative model: multinomial and log-linear. Multinomial models (Naive Bayes, NGRAM, HMM, PCFG, etc.), are hard to define: you need to imagine an exact sequence of choices that construct each example. Log-linear models are easier to define, they focus more on description than construction: you can pick any property you like as a "feature" and make the probability of each example be defined by the state of its features. When it comes to generating instances or learning parameters though, multinomial models are easier: they can be learnt by simple counting, log-linear models require convex optimization. On the other hand, didn't I show that Naive Bayes, for example, can be seen both as a multinomial model and a generative log-linear model in a previous post? What is going on?

Let us start with the log-linear model. Let \(x\) be the data we are trying to model, \(\phi(x)\) a vector of features, and \(w\) a vector of model parameters. (Here I use \(x\) to denote the whole training set, more typically denoted as \(\{(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\}\), to reduce notational clutter). The generative log-linear model defines the probability of a particular dataset we observed, \(x_o\), as: \[ \begin{eqnarray*} p(x_o) &=& (1/Z) \exp w^\top \phi(x_o) \\ Z &=& \sum_{x} \exp w^\top \phi(x) \end{eqnarray*} \] where \(Z\), the normalization constant, or as physicists call it the partition function, is computed as a sum over all possible observations \(x\). To maximize the log likelihood of the observed data \(x_o\) we compute its gradient with respect to coefficients \(w_k\) that correspond to features \(\phi_k(x)\): \[ \begin{eqnarray*} \log p(x_o) &=& w^\top \phi(x_o) - \log Z \\ \frac{\partial}{\partial w_k} w^\top \phi(x_o) &=& \phi_k(x_o) \\ \frac{\partial}{\partial w_k}\log Z &=& (1/Z) \sum_{x} (\exp w^\top \phi(x)) \, \phi_k(x) \\ &=& \sum_{x} p(x) \phi_k(x) = \langle\phi_k\rangle \\ \frac{\partial}{\partial w_k}\log p(x_o) &=& \phi_k(x_o) - \langle\phi_k\rangle \end{eqnarray*} \] which shows that at the maximum, where the derivative is zero, the model expectation of the k'th feature \(\langle\phi_k\rangle\) should be equal to its value in the observed data \(\phi_k(x_o)\). That is very nice, but it does not tell us how to set \(w_k\) to achieve that maximum, for that we still need to run an optimizer. To compute the gradient for the optimizer, or even the likelihood of a particular \(w\) we need to compute a sum over all possible \(x\), which is completely impractical. The typical solution is to sample a random \(x\) from the distribution \(p(x)\) and use \(\phi_k(x)\) as an unbiased estimate to replace \(\langle\phi_k\rangle\). But sampling from a distribution with an unknown \(Z\) is no easy matter either. Now you know why generative log-linear models are not very popular.

Let us now turn to multinomial models like HMM. How do they define a distribution over all possible observations, yet avoid computing very large sums? They (very carefully) get rid of the big \(Z\). In particular, they come up with a generative story where the observed data is built piece by piece, and the choice for each piece depends only on a subset of the previously selected pieces. For example, an HMM is a simple machine which picks a current state based on its previous state (or last few states), and its current output based on the current state. To compute the probability of a particular state-output sequence we just multiply the conditional probabilities for each of the choices.

It is difficult to find a notation to express multinomial models that is simple, general, and that makes their relation to log-linear models obvious. Here is an attempt: let \(d\) be a decision, \(c\) be a condition, \(p(d|c)\) the probability of making decision \(d\) under condition \(c\), and \(n_o(c,d)\) the number of times we made decision \(d\) under condition \(c\) while constructing the observed data \(x_o\). For example in an HMM, \(d\) would be a member of states and outputs, \(c\) would be a member of states. We can then express the probability of the observed data \(x_o\) as: \[ \begin{eqnarray*} p(x_o) &=& \prod_{c,d} p(d|c)^{n_o(c,d)} \\ \log p(x_o) &=& \sum_{c,d} n_o(c,d) \log p(d|c) \end{eqnarray*} \] If we define \(\phi\) and \(w\) vectors indexed by \((c,d)\) pairs as: \[ \begin{eqnarray*} \phi_{c,d}(x_o) &=& n_o(c,d)\\ w_{c,d} &=& \log p(d|c) \end{eqnarray*} \] we can write the log likelihood as: \[ \log p(x_o) = w^\top \phi(x_o) \] which looks exactly like the log-linear model except for the missing \(Z\) term! Of course that doesn't mean we can pick any old \(w\) vector we want. In particular the probability of our decisions under each condition should sum to 1: \[ \forall c \sum_d p(d|c) = \sum_d \exp w_{c,d} = 1 \] To find the \(w\) that satisfies these constraints and maximizes log-likelihood, we use Lagrange multipliers: \[ \begin{eqnarray*} L(w,\lambda) &=& w^\top \phi(x_o) + \sum_c \lambda_c (1 - \sum_d \exp w_{c,d}) \\ \frac{\partial}{\partial w_{c,d}} L &=& \phi_{c,d}(x_o) - \lambda_c \exp w_{c,d} \\ &=& n_o(c,d) - \lambda_c p(d|c) = 0 \\ p(d|c) &=& n_o(c,d) / \lambda_c \end{eqnarray*} \] which shows that the maximum likelihood estimates for conditional probabilities \(p(d|c)\) are proportional to the observed counts \(n_o(c,d)\). To satisfy the sum to 1 constraint, we obtain \(p(d|c) = n_o(c,d)/n_o(c)\).

So, multinomial models are specific cases of log-linear models after all. However they avoid the big \(Z\) calculation by carefully defining the features and the corresponding parameters so as to ensure \(Z=1\). They do this by decomposing the observation into many small parts and making sure the decisions that generate each part are individually normalized. If each decision involves one of 10 choices and we need to make 20 decisions to construct the whole observation the set of all possible observations would have \(10^{20}\) elements. Normalizing 20 decisions with 10 elements each is easier than normalizing a set with \(10^{20}\) elements.

Questions for the interested reader:

Show that normalization of each part ensures the normalization of the full observation.
The log-linear solution above used an explicit Z, whereas the multinomial solution used constrained optimization. Can you use constrained optimization for log-linear models or an explicit Z for multinomial models, and does that make a difference?
LSP mentions disadvantages of locally normalized conditional models compared to globally normalized conditional models. Do the same arguments apply to globally normalized generative log-linear models vs locally normalized generative multinomial models?
Unsupervised learning can be thought of as hiding part of x from view and learning weights that maximize the likelihood given the visible portion. Is inference with multinomial models still easy in the unsupervised case? How about log-linear models?
In the world of neural networks, log-linear models are analogous to "energy based", "undirected" or "symmetric" models like Hopfield nets and Boltzmann machines whereas multinomial models are analogous to "causal", "directed" models like feed-forward neural nets or sigmoid belief nets. Show that analogous arguments apply to these models.
In the world of probabilistic graphical models, log-linear models are instances of "undirected" or "Markov" networks, whereas multinomial models are instances of "directed, acyclic" or "Bayesian" networks. Show that analogous arguments apply to these networks.

April 14, 2014

On the emergence of visual cortex receptive field properties

I have always found the story of the development of our understanding of the visual cortex fascinating. Within a span of four decades we went from "The visual neurons do what?" to "That's exactly what we would do if we were to engineer a visual system." to "Look: if I show some random network some random images, it self organizes exactly in that way!" Successful scientific theories seem to end up showing us the inevitability of what we observe: "Oh, it had to be that way!" The trick is to come up with the right explanation (a la David Deutsch) for why things are the way they are...

The light sensitive neurons of the retina (120 million rods and 6 million cones) pass their electrical signals through several intermediate layers to ganglion cells, which extend about 1.5 million cell fibers (the optic nerve) into the brain for further processing. Passing through the LGN in mid-brain, the signals end up in the visual cortex at the back where their processing eventually lead to our perception of visual objects and events.

Now, the rods and cones fire when light falls on them, they are simple photoreceptors. Ganglions, on the other hand, receive inputs from a large number of photoreceptors, so we'd like to know their receptive field. The receptive field of a ganglion is the region of the retina that effects its firing when stimulated. Kuffler (1953) showed that cat ganglion cells have a receptive field with a "center surround" pattern:

Note that these cells will respond strongly to a patch of light with the right size and location, but unlike rods and cones, they will not respond strongly to a uniformly bright surface because the excitatory (+) and the inhibitory (-) areas will approximately cancel out.

Hubel and Wiesel (1959) went further and recorded from the neurons of the cat visual cortex to identify their receptive fields. Their attempts at trying to elicit response from cortical neurons with spots of light were unsuccessful in the beginning. A couple of hours into the experiment they accidentally discovered a neuron that "went off like a machine gun" when inserting a glass slide into the projector. It turned out the neuron liked the straight edge of the slide that was moving on the screen. They discovered cortical neurons (which they called simple cells) respond best not to spots, but to bars of light oriented in a particular direction.

So far we have covered the "what?" part of our story. In the following decades research on computer vision took off at the MIT AI Lab and elsewhere. The engineering approach of trying to build systems that could actually see helped us understand why nature may have "designed" the receptive fields the way it has. For example Marr and Hildreth (1980) argue that the center surround receptive fields can be thought of as convolution with the second derivative of a Gaussian and the simple cells detect the zero-crossings of this convolution which facilitate edge detection.

Engineering approaches may answer the "why" but still leave open the question of how neurons know to connect in this intricate pattern. Embryo development is still not completely understood and while it may be reasonable that the same developmental processes that can build "fingers and toes" may create some functional regions in the brain, it is probably not reasonable to expect instructions for neuron #134267542 to connect to neuron # 49726845. At this point in our story came Olshausen and Field (1996) who showed that if you train a neural network with patches of natural images and bias it to preserve information and promote sparseness, you automagically get receptive fields similar to those of Hubel and Wiesel's simple cells:

So the poor neurons had no choice in the matter. It turns out to be inevitable that under some simple conditions random networks exposed to natural images would connect up in a way that an engineer would want them to connect and perform functions similar to that of the neurons in mammalian brains!

April 12, 2014

Our Mathematical Universe by Max Tegmark

Consider a computer simulation of a mini universe, a universe complex enough to give rise to sentient beings. As Greg Egan points out in Permutation City, (and as is obvious to anybody who has played The Sims), the subjective flow of time in the simulated universe could be many times faster or slower than ours. In fact, once we have the simulation going, we can fast forward, rewind back, play the frames out of order, and it wouldn't matter one bit to the subjective experience of the simulated folk. What if we stopped the simulation? What if we saved the whole history on a DVD and put it on a shelf? Again, the subjective experience of the simulated, which arise from the relationships between their successive moments, would be uneffected. We would just be missing the opportunity to observe them in action, and interact with them.

That brings us to the next natural question: do they need the simulation, the DVD, do they need us at all? Don't they just "exist" independently whether or not we happen to be watching, recording, interacting or anybody has ever thought of them? Max Tegmark advances the Mathematical Universe Hypothesis, which says (1) our universe at the most basic level IS (not just described by) a mathematical structure (an electron or a photon can be defined exactly by a handful of numbers and there is nothing left over), and that (2) any mathematically consistent (computable?) structure must exist in the same sense as we do!

Needless to say this has led to quite a bit of spirited discussion around the web, I would recommend Scott Aaronson's blog post and the comments therein.

April 06, 2014

Monument Valley

For all of you Escher fans out there...

March 25, 2014

Emre Ünal, M.S. 2014

Current position: Co-founder at Manolin, Istanbul. (email, LinkedIn).
M.S. Thesis: A Language Visualization System. Koç University Department of Computer Engineering, March 2014. (PDF, Presentation, Related video, paper)
Publications: bibtex.php

Abstract: In this thesis, a novel language visualization system is presented that converts natural language text into 3D scenes. The system is capable of understanding some concrete nouns, visualizable adjectives and spatial prepositions from full natural language sentences and generating 3D static scenes using these sentences. It is a rule based system that uses natural language processing tools, 3D model galleries and language resources during the process. Several techniques are shown that deals with the generality and ambiguity of the language in order to visualize the natural language text. A question answering module is built as well to answer certain types of spatial inference questions after the scene generation process is completed. The system demonstrates a new way of solving spatial inference problems by not only using the language itself but with the extra information provided by the visualization process.

February 14, 2014

Machine learning in 10 pictures

I find myself coming back to the same few pictures when explaining basic machine learning concepts. Below is a list I find most illuminating.

1. Test and training error: Why lower training error is not always a good thing: ESL Figure 2.11. Test and training error as a function of model complexity.

2. Under and overfitting: PRML Figure 1.4. Plots of polynomials having various orders M, shown as red curves, fitted to the data set generated by the green curve.

3. Occam's razor: ITILA Figure 28.3. Why Bayesian inference embodies Occam’s razor. This figure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes’ theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. This probability of the data given model Hi, P (D | Hi), is called the evidence for Hi. A simple model H1 makes only a limited range of predictions, shown by P(D|H1); a more powerful model H2, that has, for example, more free parameters than H1, is able to predict a greater variety of data sets. This means, however, that H2 does not predict the data sets in region C1 as strongly as H1. Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C1, the less powerful model H1 will be the more probable model.

4. Feature combinations: (1) Why collectively relevant features may look individually irrelevant, and also (2) Why linear methods may fail. From Isabelle Guyon's feature extraction slides.

5. Irrelevant features: Why irrelevant features hurt kNN, clustering, and other similarity based methods. The figure on the left shows two classes well separated on the vertical axis. The figure on the right adds an irrelevant horizontal axis which destroys the grouping and makes many points nearest neighbors of the opposite class.

6. Basis functions: How non-linear basis functions turn a low dimensional classification problem without a linear boundary into a high dimensional problem with a linear boundary. From SVM tutorial slides by Andrew Moore: a one dimensional non-linear classification problem with input x is turned into a 2-D problem z=(x, x^2) that is linearly separable.

7. Discriminative vs. Generative: Why discriminative learning may be easier than generative: PRML Figure 1.27. Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.

8. Loss functions: Learning algorithms can be viewed as optimizing different loss functions: PRML Figure 7.5. Plot of the ‘hinge’ error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.

9. Geometry of least squares: ESL Figure 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection yˆ represents the vector of the least squares predictions.

10. Sparsity: Why Lasso (L1 regularization or Laplacian prior) gives sparse solutions (i.e. weight vectors with more zeros): ESL Figure 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t2, respectively, while the red ellipses are the contours of the least squares error function.

February 12, 2014

Mehmet Ali Yatbaz, Ph.D. 2014

Current position: Member of technical staff at OpenAI, San Francisco. (Linkedin, Publications).
PhD Thesis: Linguistic Category Induction and Tagging Using the Paradigmatic Context Representations with Substitute Words. Koç University Department of Computer Engineering, February 2014. (PDF, Presentation, Keynote)
M.S. Thesis: Stretch: A Feature Weighting Method for The k Nearest Neighbor Algorithms. Koç University Department of Computer Engineering, October 2007. (PDF, Presentation).
Publications: bibtex.php

Linguistic Category Induction and Tagging Using the Paradigmatic Context Representations with Substitute Words

This thesis introduces a new paradigmatic representation of word contexts. Paradigmatic representations of word context are constructed from the potential substitutes of a word, in contrast to syntagmatic representations, which are constructed from the properties of neighboring words. The potential substitutes are calculated by using a statistical language model that is trained on raw text without any annotation or supervision. Thus, each context is represented as a distribution of substitute words. This thesis introduces models with different properties that can incorporate the new paradigmatic representation, and discusses the applications of these models to the tagging task in natural language processing (NLP).

In a standard NLP tagging task, the goal is to build a model in which the input is a sequence of observed words, and the output, depending on the level of supervision, is a sequence of cluster-ids or predefined tags. We define 5 different models with different properties and supervision requirements. The first model ignores the identity of the word, and clusters the substitute distributions without requiring supervision at any level. The second model represents the co-occurrences of words with their substitute words, and thus incorporates the word identity and context information at the same time. To construct the co-occurrence representation, this model discretizes the substitute distribution. The third model uses probabilistic voting to estimate the distribution of tags in a given context. Unlike the first and second models, this model requires the availability of a word-tag dictionary which can provide all possible tags of each given word. The fourth model proposes two extensions to the standard HMM-based tagging models in which both the word identity and the dependence between consecutive tags are taken into consideration. The last one introduces a generative probabilistic model, the noisy channel model, for the taggin tasks in which the word-tag frequencies are available. In this model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular tag T using a possibly ambiguous word W. To reconstruct the intended message (T), the hearer uses the distribution of possible tags in the given context Pr(TjC) and the possible words that can express each tag Pr(WjT).

The models are applied and analyzed on NLP tagging tasks with different characteristics. The first two models are tested on unsupervised part-of-speech (POS) induction in which the objective is to cluster syntactically similar words into the same group. The probabilistic voting model is tested on the morphological disambiguation of Turkish, with the objective of disambiguating the correct morphological parse of a word, given the available parses. The HMM-based model is applied to the part-of-speech tagging of English, with the objective of determining the correct POS tag of a word, given the available tags. Finally, the last model is tested on the word-sense disambiguation of English, with the objective of determining the correct sense of a word, given the word-sense frequencies.

January 27, 2014

Turkish Language Resources

This post contains links to various Turkish language resources that I have collected. Please send a comment if you find Turkish resources that you would like to see on this page.

Turkish Data Depository (TDD): Our new website that we hope will act as a central location for all computational resources and models for Turkish. Most resources listed below and more will be found there.
ITU Turkish Natural Language Processing Pipeline: This webpage provides the Turkish NLP Tools and API developed in Istanbul Technical University by our Natural Language Processing group led by Asst. Prof. Dr. Gülşen Eryiğit including a Tokenizer, Normalization tools, Deasciifier, Vocalizer, Spelling Corrector, Turkish word detector, Morphological Analyzer, Morphological Disambiguator, Named Entity Recognizer, and Dependency Parser.
TS Corpus: Taner Sezer's TS Corpus is a 491M token general purpose Turkish corpus. See comments below for details.
BounWebCorpus: Hasim Sak's page contains some useful Turkish language resources and code in addition to a large web corpus. (2021-08-26: link broken)
Bibliography: Özgür Yılmazel's Bibliography on Turkish Information Retrieval and Natural Language Processing.
tr-disamb.tgz: Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.
correctparses_03.txt.gz, train.merge.gz: Turkish morphology training files. Semi-automatically tagged, has limited accuracy. Two files have the same data except the second file also includes the ambiguous parses (the first parse on each line is correct). A newer version can be found in TrMor2006.
test.1.2.dis.gz, test.merge.gz: Turkish morphology test files, second one includes ambiguous parses (the first parse on each line is correct). The data is hand tagged, it has good accuracy. A newer version can be found in TrMor2006.
tr-tagger.tgz: Turkish morphological tagger, includes Oflazer's finite state machines for Turkish. From Kemal Oflazer. Please use with permission. Requires the publically available Xerox Finite State software. For a better tagger use Morse.
turklex.tgz, pc_kimmo.tgz: Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu
Milliyet.clean.bz2: Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 encoded, originally was in three 11MB parts. From Kemal Oflazer. Please use with permission.
Turkish wordnet: From Kemal Oflazer. Please use with permission. (2021-08-26: link broken)
METU-Sabanci Turkish Treebank: Turkish treebank with dependency annotations. Please use with permission.
sozluk.txt.gz: English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.
sozluk-boun.txt.gz
: Turkish word list (25822 words, 73K) Originally from www.cmpe.boun.edu.tr/courses/cmpe230
Avrupa Birliği Temel Terimler Sözlüğü: (Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)
BilisimSozlugu.zip: Bilişim Sözlüğü by Bülent Sankur (Originally from: www.bilisimsozlugu.com, Oct 9, 2006)
turkish.el: Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.
en-tr.zip, lm.tr.gz: Turkish English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 -- Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to utf8). The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems, some are not. lm.tr.gz is the Turkish text used to develop the language model.
Zemberek (blog) (github): Zemberek doğal dil işleme kütüphanesi.

January 20, 2014

dlib: Deniz's C Library

(Download the latest version from Github)

Abstract: The main design goal of dlib is to make C programming as practical and concise as Perl, and yet to retain its fine grained control over memory use. For years I have been starting my research projects in Perl, because I can get results quickly. Then at some point I rewrite everything in C, because I run out of memory (I usually work with large statistical models). The tiny memory overheads for each string, array, hash, or object add up when you have billions of them. Yet, I cringe at the prospect of expressing the same ideas in C using many more lines of code than Perl. Why should that be the case?

Introduction
Compilation
File input
Tokenization
Dynamic arrays
Hash tables

Introduction

The main design goal of dlib is to make C programming as practical and concise as Perl, and yet to retain its fine grained control over memory use. For years I have been starting my research projects in Perl, because I can get results quickly. Then at some point I rewrite everything in C, because I run out of memory (I usually work with large statistical models). The tiny memory overheads for each string, array, hash, or object add up when you have billions of them. Yet, I cringe at the prospect of expressing the same ideas in C using many more lines of code than Perl. Why should that be the case?

As a motivational example, let's say we want to "count the number of times each word appears in a file". It takes 11 words to express this program in English (7 if you don't count the function words). Perl is just as concise (7 tokens excluding punctuation):

while(<>) {
  for (split) {
    $cnt{$_}++;
  }
}

Imagine what a nightmare this tiny program is in standard C (or especially Java). The main feature that distinguishes Perl among high level languages and enables a Perl programmer to write concise code is that it provides reasonable defaults for everything, so that things you express most often can be written concisely (human languages and compression algorithms use the same trick). And of course as with every modern language these days Perl has built in dynamic arrays, hashes, and regexps and makes them practical to use with its concise syntax. This is my feeble, incomplete, and evolving attempt to import some of these ideas into C. Among other things, dlib introduces some iteration constructs for file I/O and string tokenization and built-in hashes to make programs of this sort almost as concise:

forline (str, NULL) {
  fortok (tok, str) {
    cnt(tok)++;
  }
}

Compilation

To compile your C code with dlib, #include "dlib.h" at the beginning of your files and add dlib.c to the files to be compiled. My typical gcc options are:

-O3 -D_GNU_SOURCE -std=c99 -pedantic -Wall

I tried to stick with the C99 standard but used some extensions that can be turned off. Use the -D_GNU_SOURCE compiler flag if you want to compile with these extensions without warnings. Define the following flags with -D compiler options if you don't have, or don't want these extensions:

_NO_POPEN   Do not use pipes in File I/O.
_NO_GETLINE Do not use GNU getline.
_NO_PROC    Do not use the proc filesystem for memory reporting.
_NO_MUSABLE Do not use GNU malloc_usable_size for memory reporting.
NDEBUG      Turn off debug output and assert checks (from assert.h).

File input

forline(s, f), is an iteration construct which executes the statements in its body with the undeclared string variable s set to each line in file f. All the file handle creation, opening and closing, allocating and freeing of string buffers etc. are taken care of behind the scenes. The following example prints the length of each line in "file.txt":

forline (str, "file.txt") {
  printf("%d\n", strlen(str));
}

There are some special f arguments:

If f==NULL, stdin is read.
If pipes are available, and f starts with <, as in f=="< cmd args", the cmd is run with args and its stdout is read.
If pipes are available, compressed files with .gz, .xz, and .bz2 extensions are automatically handled.
Otherwise a regular file with path f is read.

Tokenization

fortok(t, s) is an iteration construct which executes the statements in its body with the undeclared string variable t set to each whitespace separated token in string s. Empty tokens are skipped. It modifies and tokenizes s the same way strtok does, but unlike strtok it is reentry safe (i.e. multiple nested fortok loops are ok). fortok3(t, s, d) takes an additional character array d to specify delimiter characters any one of which will act as a delimiter. fortok(t, s) is equivalent to fortok3(t, s, " \f\n\r\t\v"). Examples:

char *str = strdup("  To be    or not");
// need strdup because fortok won't work with constant strings
fortok (tok, str) {
  printf("[%s]", tok); // prints "[To][be][or][not]"
}

char *pwd = strdup(":root::/root:/bin/bash:");
fortok3 (tok, pwd, ":/") {
  printf("[%s]", tok); // prints "[root][root][bin][bash]"
}

split(char *str, const char *delim, char **argv, size_t argv_len) returns the tokens of a string in an array. This is useful because one often needs to refer to the n'th token in a string rather than iterating over them. split splits the str into tokens delimited by single characters from delim (including empty tokens) and sets the pointers in the argv array to successive tokens. argv should have enough space to hold argv_len pointers. Split stops when argv_len tokens are reached or str runs out. It modifies str by replacing delimiter characters with '\0'. Returns the number of tokens placed in argv. Example:

char *pwd = strdup(":root::/root:/bin/bash:");
char **toks = malloc(10 * sizeof(char *));
int n = split(pwd, ":/", toks, 10);
for (int i = 0; i < n; i++) {
  printf("[%s]", toks[i]); // prints "[][root][][][root][][bin][bash][]"
}

Note the difference in delimiter handling between split and fortok. fortok and fortok3 follow strtok and see multiple delimiter characters in a row as a single delimiter, whereas split, following strsep, will perceive empty fields. fortok is intended for free text, split for structured data.

NOTES: Neither split nor fortok can handle a multi-character delimiter like "::". Not sure if using different delimiter handling strategies in the two will be confusing for the user. Probably need something like chop or trim eventually.

Dynamic arrays

darr_t (pronounced "dar-ty") is a general purpose dynamic array type in dlib.

darr_t darr(size_t n, type t) returns a new array that has an initial capacity of at least n and element type t. n==0 is ok. The elements are not initialized.
void darr_free(darr_t a) frees the space allocated for a.
size_t len(darr_t a) gives the number of elements in a. A new array has len(a) == 0.
size_t cap(darr_t a) gives the current capacity of a. It will grow as needed, up to a maximum of (1<<58) elements.

A regular C array reference, such as a[i], can be used as an l-value (a[i] = 10), an r-value (x = a[i]), or both (a[i]++). I think this type of access makes the code more readable and I wanted the same flexibility with darr_t. So instead of the usual set, get, add etc. functions we have a single macro val(darr_t a, size_t i, type t) which gives a reference to the i'th element of a that can be used as an l-value or an r-value. All of the following are valid expressions and do what they look like they are supposed to:

darr_t a = darr(0, int);  // initialize array
val(a, i, int) = 10;      // set an element
int x = val(a, i, int);   // get an element
val(a, i, int)++;         // increment an element
int *p = &val(a, i, int); // get a pointer to an element
val(a, len(a), int) = 5;  // add a new element, increments len(a)

The user can request or set any index of a darr_t from 0 to (1<<58-1). The darr_t will never complain and resize itself to permit the access (if memory allows). len(a) will be one plus the largest index accessed (read or write). Some may think this gives too much power to the user to shoot themselves in the foot. A read access to a never initialized element will return a random value. An accidental read or write to a very large index may blow up the memory. Oh well, don't do it.

Hash tables

Hash tables are implemented as dynamic arrays (darr_t) with some search logic. You can initialize and free a hash table using darr and darr_free exactly like you do with dynamic arrays. The macro D_HASH defines an inline function xget (the prefix x is user specified) that searches the array for an element matching a given key and returns a pointer to it:

etype *xget(darr_t htable, ktype key, bool insert);

The etype and ktype are user specified types for array elements and their keys respectively. The boolean argument insert determines what happens when a matching element cannot be found. If insert == true one will be created (in a user specified manner) and added to the array and xget will return its pointer. If insert == false, xget will return NULL.

forhash(etype, eptr, htable, isnull)

is an iteration construct for hash tables which executes the statements in its body with etype *eptr bound to each element of darr_t htable for which the macro or function isnull returns false.

In order to define the xget function, the macro D_HASH takes the following nine arguments (I am open to suggestions to reduce this number). The last six can be macros (preferrable) or functions.

prefix: The prefix added to get to allow multiple hash table types (e.g. x for xget).
etype: Type of array element.
ktype: Type of key.
keyof: ktype keyof(etype e) returns the key for element e.
kmatch: bool kmatch(ktype a, ktype b) returns true if two keys match.
khash: size_t khash(ktype k) is a hash function for keys.
einit: etype einit(ktype k) returns a new element with key matching k.
isnull: bool isnull(etype e) returns true if array element e is empty.
mknull: void mknull(etype e) modifies array element e to become empty.

NOTE: This horribly convoluted interface is necessary to have a general enough implementation that can support sets (where the key and the element are one and the same) and maps (where the key is a function of the element); element types that are primitive (uint32_t, char*) or compound (structs, bit-fields), keys that can be components of compound elements (e.key) or arbitrary functions of primitive ones (e & mask), arrays that store the elements themselves or their pointers etc. It is not intended for daily use. Think of it as your hash table code generator. Once you correctly generate the code for the few hash table types you use, you will hopefully never need D_HASH again.

Here is an example hash table for counting strings:

#include <stdio.h>
#include "dlib.h"

typedef struct strcnt_s { char *key; size_t cnt; } strcnt_t;
#define keyof(e) ((e).key)
extern size_t fnv1a(const char *k);  // string hash fn defined in dlib
#define strmatch(a,b) (!strcmp((a),(b)))
#define newcnt(k) ((strcnt_t) { strdup(k), 0 })
#define keyisnull(e) ((e).key == NULL)
#define keymknull(e) ((e).key = NULL)

D_HASH(s, strcnt_t, char *, keyof, strmatch, fnv1a, newcnt, keyisnull, keymknull)

Given the function sget defined by D_HASH above, we can now write our word counting example from the introduction.

#define cnt(k) sget(htable, (k), true)->cnt

int main() {
  darr_t htable = darr(0, strcnt_t);
  forline (str, NULL) {
    fortok (tok, str) {
      cnt(tok)++;
    }
  }

And print the counts:

  forhash (strcnt_t, e, htable, keyisnull) {
    printf("%s\t%lu\n", e->key, e->cnt);
  }
}