I am an associate professor in Computer Engineering at Koç University in Istanbul working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. For prospective students here are some research topics, papers, classes, blog posts and past students.
I am a faculty member in the Computer Engineering Department at Koç University and work at the Artificial Intelligence Laboratory. Before this I worked at the MIT Artificial Intelligence Laboratory and founded the company Inquira, Inc. My research areas are natural language processing and machine learning. For interested students: research topics, papers, the courses I teach, my Turkish writings, and our graduates.

April 14, 2014

On the emergence of visual cortex receptive field properties

I have always found the story of the development of our understanding of the visual cortex fascinating. Within a span of four decades we went from "The visual neurons do what?" to "That's exactly what we would do if we were to engineer a visual system." to "Look: if I show some random network some random images, it self-organizes exactly in that way!" Successful scientific theories seem to end up showing us the inevitability of what we observe: "Oh, it had to be that way!" The trick is to come up with the right explanation (a la David Deutsch) for why things are the way they are...

The light-sensitive neurons of the retina (120 million rods and 6 million cones) pass their electrical signals through several intermediate layers to the ganglion cells, which extend about 1.5 million fibers (the optic nerve) into the brain for further processing. Passing through the LGN (lateral geniculate nucleus) in the thalamus, the signals end up in the visual cortex at the back of the brain, where their processing eventually leads to our perception of visual objects and events.

Now, the rods and cones are simple photoreceptors: they fire when light falls on them. Ganglion cells, on the other hand, receive input from a large number of photoreceptors, so we would like to know their receptive fields. The receptive field of a ganglion cell is the region of the retina that affects its firing when stimulated. Kuffler (1953) showed that cat ganglion cells have receptive fields with a "center surround" pattern:

Note that these cells will respond strongly to a patch of light with the right size and location, but unlike rods and cones, they will not respond strongly to a uniformly bright surface because the excitatory (+) and the inhibitory (-) areas will approximately cancel out.

Hubel and Wiesel (1959) went further and recorded from neurons in the cat visual cortex to identify their receptive fields. Their attempts to elicit responses from cortical neurons with spots of light were unsuccessful at first. A couple of hours into the experiment they accidentally discovered a neuron that "went off like a machine gun" as they inserted a glass slide into the projector. It turned out the neuron liked the straight edge of the slide moving across the screen. They had discovered that cortical neurons (which they called simple cells) respond best not to spots but to bars of light oriented in a particular direction.

So far we have covered the "what?" part of our story. In the following decades research on computer vision took off at the MIT AI Lab and elsewhere. The engineering approach of trying to build systems that could actually see helped us understand why nature may have "designed" the receptive fields the way it has. For example, Marr and Hildreth (1980) argue that the center-surround receptive fields can be thought of as convolving the image with the Laplacian of a Gaussian (a second-derivative operator), and that the simple cells detect the zero-crossings of this convolution, which facilitates edge detection.
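
A minimal sketch of that idea in Python (my own illustration, not from Marr and Hildreth): convolve a grayscale image with a Laplacian-of-Gaussian filter, which behaves like a center-surround receptive field, and mark the zero-crossings of the response as candidate edges. It assumes numpy and scipy; the helper name, the image array and the sigma value are placeholders.

    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def zero_crossing_edges(img, sigma=2.0):
        """Boolean edge map from sign changes of the Laplacian-of-Gaussian response."""
        log = gaussian_laplace(img.astype(float), sigma=sigma)  # center-surround-like filtering
        edges = np.zeros(log.shape, dtype=bool)
        # mark a pixel if the LoG response changes sign relative to its right or lower neighbor
        edges[:-1, :] |= np.signbit(log[:-1, :]) != np.signbit(log[1:, :])
        edges[:, :-1] |= np.signbit(log[:, :-1]) != np.signbit(log[:, 1:])
        return edges

    # usage (hypothetical): edges = zero_crossing_edges(some_grayscale_array, sigma=2.0)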

Engineering approaches may answer the "why" but still leave open the question of how neurons know to connect in this intricate pattern. Embryo development is still not completely understood, and while it may be reasonable that the same developmental processes that can build "fingers and toes" also create some functional regions in the brain, it is probably not reasonable to expect instructions for neuron #134267542 to connect to neuron #49726845. At this point in our story came Olshausen and Field (1996), who showed that if you train a neural network on patches of natural images and bias it to preserve information and promote sparseness, you automagically get receptive fields similar to those of Hubel and Wiesel's simple cells:

So the poor neurons had no choice in the matter. It turns out to be inevitable that, under some simple conditions, random networks exposed to natural images will connect up the way an engineer would want them to, and will perform functions similar to those of the neurons in mammalian brains!
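
To make that result a bit more concrete, here is a rough plain-numpy sketch of sparse coding (my own simplification, not Olshausen and Field's actual procedure): alternately infer codes Z that minimize ||X - Z D^T||^2 + lam*||Z||_1 with a few ISTA steps, then take a gradient step on the dictionary D. The function names are mine, and the random X below is a stand-in for whitened patches cut from natural images; on real patches the columns of D tend to come out as localized, oriented, simple-cell-like filters.

    import numpy as np

    def ista_codes(X, D, lam=0.1, n_steps=50):
        """Infer sparse codes Z for patches X (n_patches x patch_dim) given dictionary D."""
        L = np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the quadratic term
        Z = np.zeros((X.shape[0], D.shape[1]))
        for _ in range(n_steps):
            grad = (Z @ D.T - X) @ D                   # gradient of 0.5*||X - Z D^T||^2 w.r.t. Z
            Z = Z - grad / L
            Z = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft-threshold (L1 step)
        return Z

    def learn_dictionary(X, n_atoms=64, lam=0.1, n_iters=100, lr=0.1, seed=0):
        """Alternate sparse coding of X with gradient updates of the dictionary."""
        rng = np.random.default_rng(seed)
        D = rng.standard_normal((X.shape[1], n_atoms))
        D /= np.linalg.norm(D, axis=0)
        for _ in range(n_iters):
            Z = ista_codes(X, D, lam)
            D += lr * (X - Z @ D.T).T @ Z / len(X)     # reduce reconstruction error
            D /= np.linalg.norm(D, axis=0) + 1e-12     # keep atoms unit norm
        return D

    # placeholder data: in practice X would be whitened 8x8 patches from natural images
    X = np.random.default_rng(0).standard_normal((1000, 8 * 8))
    D = learn_dictionary(X)                            # each column of D is a learned "receptive field"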

Full post...

April 12, 2014

Our Mathematical Universe by Max Tegmark

Consider a computer simulation of a mini universe, one complex enough to give rise to sentient beings. As Greg Egan points out in Permutation City (and as is obvious to anybody who has played The Sims), the subjective flow of time in the simulated universe could be many times faster or slower than ours. In fact, once we have the simulation going, we can fast-forward, rewind, play the frames out of order, and it wouldn't matter one bit to the subjective experience of the simulated folk. What if we stopped the simulation? What if we saved the whole history on a DVD and put it on a shelf? Again, the subjective experience of the simulated, which arises from the relationships between their successive moments, would be unaffected. We would just be missing the opportunity to observe them in action and interact with them.

That brings us to the next natural question: do they need the simulation, the DVD, do they need us at all? Don't they just "exist" independently whether or not we happen to be watching, recording, interacting or anybody has ever thought of them? Max Tegmark advances the Mathematical Universe Hypothesis, which says (1) our universe at the most basic level IS (not just described by) a mathematical structure (an electron or a photon can be defined exactly by a handful of numbers and there is nothing left over), and that (2) any mathematically consistent (computable?) structure must exist in the same sense as we do!

Needless to say, this has led to quite a bit of spirited discussion around the web; I would recommend Scott Aaronson's blog post and the comments therein.

Full post... Related link

April 06, 2014

Monument Valley

For all of you Escher fans out there...

Full post... Related link

March 25, 2014

Emre Ünal, M.S. 2014

Current position: Co-founder at Manolin (email).
M.S. Thesis: A Language Visualization System. Koç University Department of Computer Engineering, March 2014. (PDF, Presentation, Related video, paper) Abstract: In this thesis, a novel language visualization system is presented that converts natural language text into 3D scenes. The system is capable of understanding some concrete nouns, visualizable adjectives and spatial prepositions in full natural language sentences and of generating static 3D scenes from these sentences. It is a rule-based system that uses natural language processing tools, 3D model galleries and language resources during the process. Several techniques are shown that deal with the generality and ambiguity of language in order to visualize natural language text. A question answering module is also built to answer certain types of spatial inference questions after the scene generation process is completed. The system demonstrates a new way of solving spatial inference problems by using not only the language itself but also the extra information provided by the visualization process.
Full post...

February 14, 2014

Machine learning in 10 pictures

I find myself coming back to the same few pictures when explaining basic machine learning concepts. Below is a list of the ones I find most illuminating.

1. Test and training error: Why lower training error is not always a good thing: ESL Figure 2.11. Test and training error as a function of model complexity.

2. Under- and overfitting: PRML Figure 1.4. Plots of polynomials having various orders M, shown as red curves, fitted to the data set generated by the green curve. (A code sketch of this appears after the list.)

3. Occam's razor: ITILA Figure 28.3. Why Bayesian inference embodies Occam's razor. This figure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes' theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. This probability of the data given model Hi, P(D|Hi), is called the evidence for Hi. A simple model H1 makes only a limited range of predictions, shown by P(D|H1); a more powerful model H2, that has, for example, more free parameters than H1, is able to predict a greater variety of data sets. This means, however, that H2 does not predict the data sets in region C1 as strongly as H1. Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C1, the less powerful model H1 will be the more probable model.

4. Feature combinations: (1) Why collectively relevant features may look individually irrelevant, and also (2) Why linear methods may fail. From Isabelle Guyon's feature extraction slides.

5. Irrelevant features: Why irrelevant features hurt kNN, clustering, and other similarity based methods. The figure on the left shows two classes well separated on the vertical axis. The figure on the right adds an irrelevant horizontal axis which destroys the grouping and makes many points nearest neighbors of the opposite class.

6. Basis functions: How non-linear basis functions turn a low-dimensional classification problem without a linear boundary into a high-dimensional problem with a linear boundary. From SVM tutorial slides by Andrew Moore: a one-dimensional non-linear classification problem with input x is turned into a 2-D problem z=(x, x^2) that is linearly separable. (A code sketch of this appears after the list.)

7. Discriminative vs. Generative: Why discriminative learning may be easier than generative: PRML Figure 1.27. Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.

8. Loss functions: Learning algorithms can be viewed as optimizing different loss functions: PRML Figure 7.5. Plot of the ‘hinge’ error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.

9. Geometry of least squares: ESL Figure 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.

10. Sparsity: Why Lasso (L1 regularization or Laplacian prior) gives sparse solutions (i.e. weight vectors with more zeros): ESL Figure 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function. (A code sketch of this appears after the list.)
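
For readers who prefer to poke at these ideas in code, here is a small sketch of points 1 and 2 (my own toy example, not taken from ESL or PRML): fit polynomials of increasing degree to noisy samples of sin(2πx) and watch training error keep dropping while test error typically turns back up once the model starts fitting noise.

    import numpy as np

    rng = np.random.default_rng(0)
    def target(x):
        return np.sin(2 * np.pi * x)

    x_train = rng.uniform(0, 1, 15)
    y_train = target(x_train) + rng.normal(0, 0.2, 15)
    x_test = rng.uniform(0, 1, 200)
    y_test = target(x_test) + rng.normal(0, 0.2, 200)

    for degree in [0, 1, 3, 9]:
        coeffs = np.polyfit(x_train, y_train, degree)                    # least-squares polynomial fit
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")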
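
A tiny sketch of point 6 (again my own example): a one-dimensional problem where class membership depends on |x| > 1 has no linear boundary in x, but the mapping z = (x, x^2) makes it separable by a single threshold on the second coordinate.

    import numpy as np

    x = np.linspace(-3, 3, 200)
    labels = (np.abs(x) > 1).astype(int)       # class depends on |x|, not linearly on x
    z = np.column_stack([x, x ** 2])           # lift to 2-D: z = (x, x^2)
    pred = (z[:, 1] > 1).astype(int)           # a single linear threshold in z-space
    print("accuracy of the linear rule in z-space:", np.mean(pred == labels))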
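
And a quick sketch of point 10, assuming scikit-learn (the data is synthetic, with only 3 of 20 features relevant): the L1-penalized lasso drives most coefficients exactly to zero, while ridge regression merely shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -3.0, 1.5]              # only the first 3 features matter
    y = X @ w_true + rng.normal(0, 0.5, 100)

    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)
    print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)), "/ 20")  # typically most of the 17
    print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)), "/ 20")  # typically none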

Full post...