I am an associate professor in Computer Engineering at Koç University in Istanbul, working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. For prospective students, here are some research topics, papers, classes, blog posts and past students.
I am a faculty member in the Computer Engineering Department at Koç University, where I work at the Artificial Intelligence Laboratory. Before this I worked at the MIT Artificial Intelligence Laboratory and co-founded Inquira, Inc. My research topics are natural language processing and machine learning. For interested students: research topics, papers, my classes, my posts in Turkish, and our graduates.

January 21, 2015

Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms

Ergun Biçici and Deniz Yuret. 2015. Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms. IEEE Transactions on Audio, Speech and Language Processing, vol 23, no 2, pp 339--350, February. IEEE. (URL, PDF)

Abstract: We introduce FDA5 for efficient parameterization, optimization, and implementation of feature decay algorithms (FDA), a class of instance selection algorithms that use feature decay. FDA increase the diversity of the selected training set by devaluing features (i.e. n-grams) that have already been included. FDA5 decides which instances to select based on three functions, controlled by 5 parameters, used for initializing feature values, decaying feature values, and scaling sentence scores. We present optimization techniques that allow FDA5 to adapt these functions to in-domain and out-of-domain translation tasks for different language pairs. In a transductive learning setting, selection of training instances relevant to the test set can improve the final translation quality. In machine translation experiments performed on the 2 million sentence English-German section of the Europarl corpus, we show that a subset of the training set selected by FDA5 can gain up to 3.22 BLEU points compared to a randomly selected subset of the same size, can gain up to 0.41 BLEU points compared to using all of the available training data while using only 15% of it, and can come within 0.5 BLEU points of the full training set result using only 2.7% of the full training data. FDA5 peaks at around 8M words, or 15% of the full training set. In an active learning setting, FDA5 minimizes human effort by identifying the most informative sentences for translation: it gains up to 0.45 BLEU points using 3/5 of the available training data compared to using all of it, and 1.12 BLEU points compared to a random training set. In translation tasks involving English and Turkish, a morphologically rich language, FDA5 can gain up to 11.52 BLEU points compared to a randomly selected subset of the same size, can achieve the same BLEU score using as little as 4% of the data compared to random instance selection, and can exceed the full dataset result by 0.78 BLEU points. With 1M words, FDA5 is able to cut the time to build a statistical machine translation system roughly in half, using only 3% of the space for the phrase table and 8% of the overall space compared with a baseline system using all of the available training data, while losing only 0.58 BLEU points to the baseline system in out-of-domain translation.
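The underlying selection loop is simple to sketch. The following is illustrative only: the released FDA5 tool parameterizes the initialization, decay and scaling functions and would use a priority queue rather than this naive rescan, and the names feats, featval and budget are stand-ins.

    % Illustrative sketch of greedy FDA instance selection. feats{s} is a
    % cell array of n-gram feature strings for training sentence s; featval
    % is a containers.Map from feature to value, initialized from the
    % test-set n-grams; budget is the number of sentences to select.
    selected = zeros(1, budget);
    remaining = 1:numel(feats);
    for k = 1:budget
        scores = zeros(size(remaining));
        for i = 1:numel(remaining)              % score = average feature value
            fs = feats{remaining(i)};
            scores(i) = sum(cellfun(@(f) featval(f), fs)) / numel(fs);
        end
        [~, i] = max(scores);
        selected(k) = remaining(i);
        for f = feats{selected(k)}              % decay covered features so the
            featval(f{1}) = featval(f{1}) / 2;  % next pick favors diversity
        end                                     % (/2 is one simple decay choice)
        remaining(i) = [];
    end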



January 20, 2015

Parallel processing for natural language

In this post I will explore how to parallelize certain types of machine learning / natural language processing code in an environment with multiple cpu cores and/or a gpu. The running example I will use is a transition-based parser, but the same techniques should apply to other similar models used for sequence labeling, chunking, etc. We will see the relative contributions of mini-batching, parallel processing, and using the gpu. The ~24x speed-up that we get means we can parse the ~1M words of the Penn Treebank in 9 minutes rather than 3.5 hours.

Here is the serial version of the main loop. The language is matlab, but I hope it is clear enough as pseudo-code. The specifics of the model, the parser and the features are not all that important. As a baseline, this code takes 10.9 ms/word for parsing, and most of that time is spent in "getfeatures" and "predict".
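A minimal sketch of such a serial loop follows; only "getfeatures" and "predict" are the functions named above, while initparser, isdone and move are illustrative stand-ins for the rest of the parser interface.

    parses = cell(1, numel(corpus));      % one parse per sentence
    for s = 1:numel(corpus)               % the "for" in line 2 mentioned below
        parser = initparser(corpus{s});   % initial parser state
        while ~isdone(parser)             % until the sentence is fully parsed
            f = getfeatures(parser);      % summarize the parser state
            y = predict(model, f);        % score possible transitions
            parser = move(parser, y);     % apply the best transition
        end
        parses{s} = parser;               % store the finished parse
    end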


To speed up "predict", the simplest trick is to perform the matrix operations on the gpu. Many common machine learning models including the neural network, kernel perceptron, svm etc. can be applied using a few matrix operations. In my case declaring the weights of my neural net model as gpuArrays instead of regular arrays improves the speed to 6.24 ms/word without any change in the code.

To speed up "getfeatures" the gpu is useless: feature calculation typically consists of ad-hoc code that tries to summarize the parser state, the sentence and the model in a vector. However we can parse multiple sentences in parallel using multiple cores. Replacing the "for" in line 2 with "parfor" and using a pool of 12 cores improves the performance to 5.03 ms/word with the gpu and 3.70 ms/word without the gpu (here the single gpu in the machine creates a bottleneck for the parallel processes).

A common trick for speeding up machine learning models is to use mini-batches instead of computing the answers one at a time. Consider a common operation: multiplying a weight matrix, representing support vectors or neural network weights, with a column vector, representing a single instance. If we want to perform this operation on 100 instances, we can do it one at a time in a for loop, or we can concatenate all the instances into a 100-column matrix and perform a single matrix multiplication. Here are some comparisons; each variation measures the time for processing 10K instances:
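A sketch of the variations being timed (the matrix sizes are illustrative; the gpu versions are obtained by wrapping W and X in gpuArray):

    % 10K instances, one matrix-vector product at a time:
    W = rand(1000, 1000); X = rand(1000, 10000);
    tic; for i = 1:size(X,2), y = W * X(:,i); end; toc
    % The same 10K instances in mini-batches of 100 columns:
    tic; for i = 1:100:size(X,2), Y = W * X(:, i:i+99); end; toc
    % gpu variants: W = gpuArray(W); X = gpuArray(X); then repeat the loops,
    % calling wait(gpuDevice) before toc for accurate gpu timings.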


This is almost a 100x speed-up going from single instances on the cpu to mini-batches on the gpu! Unfortunately it is not trivial to use mini-batches with history-based models, i.e. models where the features of the next instance depend on your answers to the previous instances. In that case it is impossible to ask for "the next 100 instances" before you start providing answers. However, typically the sentences are independent of one another, and nothing prevents us from asking for "the instances representing the initial states of the next 100 sentences" and concatenating them into a matrix. Then we can calculate 100 answers in parallel and use them to compute the 100 next states, and so on. The sentence lengths are different, so they will reach their final states at different times, but we can handle that with some bookkeeping. The following version of the code groups sentences into minibatches and processes them in parallel:
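Here is a minimal sketch of the mini-batched loop; nfeatures and batchsize are illustrative, and the helper names follow the serial sketch above:

    batchsize = 100;                      % number of sentences per mini-batch
    parses = cell(1, numel(corpus));
    for b = 1:batchsize:numel(corpus)
        batch = b:min(b+batchsize-1, numel(corpus));
        parsers = cellfun(@initparser, corpus(batch), 'UniformOutput', false);
        active = find(~cellfun(@isdone, parsers));
        while ~isempty(active)
            f = zeros(nfeatures, numel(active));
            for j = 1:numel(active)       % one column per unfinished sentence
                f(:,j) = getfeatures(parsers{active(j)});
            end
            y = predict(model, f);        % one matrix operation per batch step
            for j = 1:numel(active)       % advance each parser by one step
                parsers{active(j)} = move(parsers{active(j)}, y(:,j));
            end
            active = find(~cellfun(@isdone, parsers));
        end
        parses(batch) = parsers;          % bookkeeping for finished sentences
    end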


This code runs at 2.80 ms/word with the cpu and 1.67 ms/word with the gpu. If we replace the for loop with parfor we get 1.08 ms/word with the cpu and 0.46 ms/word with the gpu.

Here is a summary of the results:

                          ms/word
    baseline              10.9
    gpu                    6.24
    parfor                 3.70
    gpu+parfor             5.03
    minibatch              2.80
    minibatch+gpu          1.67
    minibatch+parfor       1.08
    minibatch+gpu+parfor   0.46


November 12, 2014

Some starting points for deep learning and RNNs

Hinton, Bengio, LeCun and Jordan have done reddit AMAs.

A Google search for "fast neural network code" reveals cuda-convnet and cuda-convnet2 by Alex Krizhevsky, Hinton's student (convnet2 supports K20 and K40). There is also fann, which has no GPU support. fbm, by Radford Neal (another Hinton student), implements Bayesian neural nets and has wonderful documentation; it can also do gradient descent, but I doubt it is competitive speed-wise. UFLDL (or the old version) is a great tutorial (with matlab code) by Andrew Ng. More software can be found at deeplearning.net.
Bengio's group wrote Theano and PyLearn2. Bengio also has a draft deep learning book. His research page summarizes the main problems nicely and gives pointers to a couple of review papers (one on neural net language models).
LeCun uses Torch, written by Collobert, as do DeepMind and some Google groups. Torch is based on the simple language Lua and is a descendant of the machine learning language Lush (and Lush2) developed by LeCun and Bottou.
Schmidhuber invented LSTM. He is working on an RNN book; there is no draft yet. His student Alex Graves wrote RNNLIB. DeepMind (Alex Graves et al.) invented the Neural Turing Machine (NTM).
Learnxinyminutes.com is a great resource for quickly learning new languages, including Lua. Too bad nobody is using Julia, and Julia doesn't support GPUs yet (except for experimental add-ons). Is Matlab (e.g. Ng's ufldl code) still competitive? Ng uses L-BFGS for optimization instead of gradient descent, which is faster; does convnet use sophisticated optimization or plain gradient descent? I will explore this in the next post.

Also see my earlier post.



September 30, 2014

Mode coupling points to functionally important residues in myosin II.

Onur Varol, Deniz Yuret, Burak Erman and Alkan Kabakçıoğlu. 2014. Mode coupling points to functionally important residues in myosin II. Proteins: Structure, Function, and Bioinformatics, vol 82, no 9, pp 1777--1786, September. (PDF)

Abstract: Relevance of mode coupling to energy/information transfer during protein function, particularly in the context of allosteric interactions is widely accepted. However, existing evidence in favor of this hypothesis comes essentially from model systems. We here report a novel formal analysis of the near-native dynamics of myosin II, which allows us to explore the impact of the interaction between possibly non-Gaussian vibrational modes on fluctuational dynamics. We show that an information-theoretic measure based on mode coupling alone yields a ranking of residues with a statistically significant bias favoring the functionally critical locations identified by experiments on myosin II.


August 23, 2014

Unsupervised Instance-Based Part of Speech Induction Using Probable Substitutes

Mehmet Ali Yatbaz, Enis Sert and Deniz Yuret. 2014. Unsupervised Instance-Based Part of Speech Induction Using Probable Substitutes. COLING 2014. (This is a token-based, multilingual extension of our EMNLP 2012 model. Up-to-date versions of the code can be found on github.)

Abstract: We develop an instance (token) based extension of the state-of-the-art word (type) based part-of-speech induction system introduced in (Yatbaz et al. 2012). Each word instance is represented by a feature vector that combines information from the target word and probable substitutes sampled from an n-gram model representing its context. Modeling ambiguity using an instance based model does not lead to significant gains in overall accuracy in part-of-speech tagging because most words in running text are used in their most frequent class (e.g. 93.69% in the Penn Treebank). However, it is important to model ambiguity because the most frequent words are ambiguous, and not modeling them correctly may negatively affect downstream tasks. Our main contribution is to show that an instance based model can achieve significantly higher accuracy on ambiguous words at the cost of a slight degradation on unambiguous ones, maintaining a comparable overall accuracy. On the Penn Treebank, the overall many-to-one accuracy of the system is within 1% of the state-of-the-art (80%), while on highly ambiguous words it is up to 70% better. In multilingual experiments our results are significantly better than or comparable to the best published word or instance based systems on 15 out of 19 corpora in 15 languages. The vector representations for words used in our system are available for download for further experiments.
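The instance representation is easy to sketch. The following is illustrative only: samplesubstitutes stands in for sampling from the n-gram model, wordid, vocabsize, lm and ntags are assumed helpers, and plain k-means stands in for the paper's actual embedding and clustering pipeline.

    % Illustrative sketch: represent each token by its word identity plus
    % K probable substitutes sampled from an n-gram model of its context.
    K = 100;
    X = sparse(vocabsize, numel(tokens));
    for t = 1:numel(tokens)
        X(wordid(tokens{t}), t) = 1;                 % target word feature
        subs = samplesubstitutes(lm, tokens, t, K);  % stand-in sampler
        for k = 1:K
            w = wordid(subs{k});
            X(w, t) = X(w, t) + 1/K;                 % substitute features
        end
    end
    % Cluster instance vectors and evaluate with a many-to-one mapping to
    % gold tags (k-means here is a stand-in for the paper's method).
    clusters = kmeans(full(X)', ntags);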