January 21, 2015

Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms

Ergun Biçici and Deniz Yuret. 2015. Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms. IEEE Transactions on Audio, Speech and Language Processing, vol 23, no 2, pp 339--350, February. IEEE. (URL, PDF, code)

Abstract: We introduce FDA5 for efficient parameterization, optimization, and implementation of feature decay algorithms (FDA), a class of instance selection algorithms that use feature decay. FDA increase the diversity of the selected training set by devaluing features (i.e. n-grams) that have already been included. FDA5 decides which instances to select based on three functions used for initializing and decaying feature values and scaling sentence scores controlled with 5 parameters. We present optimization techniques that allow FDA5 to adapt these functions to in-domain and out-of-domain translation tasks for different language pairs. In a transductive learning setting, selection of training instances relevant to the test set can improve the final translation quality. In machine translation experiments performed on the 2 million sentence English-German section of the Europarl corpus, we show that a subset of the training set selected by FDA5 can gain up to 3.22 BLEU points compared to a randomly selected subset of the same size, can gain up to 0.41 BLEU points compared to using all of the available training data using only 15% of it, and can reach within 0.5 BLEU points to the full training set result by using only 2.7% of the full training data. FDA5 peaks at around 8M words or 15% of the full training set. In an active learning setting, FDA5 minimizes the human effort by identifying the most informative sentences for translation and FDA gains up to 0.45 BLEU points using 3/5 of the available training data compared to using all of it and 1.12 BLEU points compared to random training set. In translation tasks involving English and Turkish, a morphologically rich language, FDA5 can gain up to 11.52 BLEU points compared to a randomly selected subset of the same size, can achieve the same BLEU score using as little as 4% of the data compared to random instance selection, and can exceed the full dataset result by 0.78 BLEU points. 
FDA5 is able to reduce the time to build a statistical machine translation system to about half with 1M words using only 3% of the space for the phrase table and 8% of the overall space when compared with a baseline system using all of the training data available yet still obtain only 0.58 BLEU points difference with the baseline system in out-of-domain translation.
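To make the feature decay idea concrete, here is a minimal Python sketch of greedy instance selection. It is a simplification and not the FDA5 parameterization: every test-set n-gram starts at value 1.0, a feature's value is multiplied by a constant `decay` each time a selected sentence contains it, and sentence scores are divided by sentence length. The names `ngrams`, `select_instances`, and `decay` are illustrative assumptions; FDA5 itself uses parameterized initialization, decay, and score scaling functions.

```python
def ngrams(tokens, max_n=2):
    """All n-grams of the token list up to length max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def select_instances(candidates, test_sentences, k, decay=0.5, max_n=2):
    """Greedy feature-decay selection (simplified; not the FDA5 functions).

    candidates and test_sentences are lists of token lists.
    Returns the indices of up to k selected candidates, in selection order.
    """
    # Features are the n-grams of the test set; each starts at value 1.0
    # (assumption: a constant stands in for a parameterized initialization).
    value = {}
    for sent in test_sentences:
        for f in ngrams(sent, max_n):
            value[f] = 1.0

    def score(i):
        # Sum of current feature values, normalized by sentence length.
        feats = set(ngrams(candidates[i], max_n))
        total = sum(value.get(f, 0.0) for f in feats)
        return total / max(len(candidates[i]), 1)

    selected = []
    remaining = set(range(len(candidates)))
    while remaining and len(selected) < k:
        # Ties are broken in favor of the lowest index.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        # Devalue the features just covered so later picks favor new n-grams.
        for f in set(ngrams(candidates[best], max_n)):
            if f in value:
                value[f] *= decay
    return selected
```

With two identical candidates that match the test set, the second copy scores lower once the first has been selected, so a different sentence is picked next. This is the diversity effect the abstract describes.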


January 20, 2015

Parallel processing for natural language

In this post I will explore how to parallelize certain types of machine learning / natural language processing code in an environment with multiple cpu cores and/or a gpu. The running example I will use is a transition-based parser, but the same techniques should apply to other similar models used for sequence labeling, chunking, etc. We will see the relative contributions of mini-batching, parallel processing, and using the gpu. The ~24x speed-up that we get means we can parse the ~1M words of Penn Treebank in 9 minutes rather than 3.5 hours. (This post uses Matlab; here is a Julia version.)

Here is the serial version of the main loop. The language is Matlab, but I hope it is clear enough as pseudo-code. The specifics of the model, the parser, and the features are not all that important. As a baseline, this code takes 10.9 ms/word for parsing, and most of that time is spent in "getfeatures" and "predict".
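For a concrete picture of the loop structure, here is a minimal Python sketch. It is not the post's Matlab listing: the toy parser makes exactly one move per word, and the feature function, the weight layout, and the name `parse_corpus` are hypothetical stand-ins. Only the names "getfeatures" and "predict" come from the text.

```python
def getfeatures(words, i, history):
    """Toy feature vector for the state at word i (hypothetical features)."""
    prev_move = history[-1] if history else 0
    return [1.0, float(len(words[i])), float(prev_move)]

def predict(weights, x):
    """Score every move with one matrix-vector product: scores = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def parse_corpus(weights, corpus):
    """Serial loop: one sentence at a time, one parser state at a time."""
    all_moves = []
    for words in corpus:                 # outer loop: sentences are independent
        history = []
        for i in range(len(words)):      # inner loop: states within a sentence
            x = getfeatures(words, i, history)
            scores = predict(weights, x)
            move = max(range(len(scores)), key=scores.__getitem__)
            history.append(move)         # the next features depend on this answer
        all_moves.append(history)
    return all_moves
```

The inner loop is inherently sequential because each feature vector depends on the previous move, while the iterations of the outer loop do not depend on one another.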

To speed up "predict", the simplest trick is to perform the matrix operations on the gpu. Many common machine learning models including the neural network, kernel perceptron, svm etc. can be applied using a few matrix operations. In my case declaring the weights of my neural net model as gpuArrays instead of regular arrays improves the speed to 6.24 ms/word without any change in the code.

To speed up "getfeatures" the gpu is useless: feature calculation typically consists of ad-hoc code that tries to summarize the parser state, the sentence and the model in a vector. However, we can parse multiple sentences in parallel using multiple cores. Replacing the "for" in line 2 with "parfor" and using a pool of 12 cores improves the performance to 5.03 ms/word with the gpu (down from 6.24) and 3.70 ms/word without the gpu (down from 10.9). Here the single gpu in the machine creates a bottleneck for the parallel processes, which is why the cpu-only version comes out ahead.
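The same pattern, mapping independent sentences over a pool of workers, can be sketched in Python with the standard `concurrent.futures` module. This is only an analogue of Matlab's parfor, and `parse_sentence` with its decision rule is a hypothetical stand-in for a real parser. Note the assumption about threads: under CPython they show the structure but will not speed up cpu-bound pure-Python code, for which a process pool would be the closer match.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_sentence(words):
    """Toy stand-in for parsing one sentence: one move per word."""
    moves = []
    for w in words:
        prev = moves[-1] if moves else 0
        moves.append((len(w) + prev) % 3)   # hypothetical decision rule
    return moves

def parse_parallel(corpus, workers=4):
    """Parse sentences independently; map() yields results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_sentence, corpus))
```

Because the sentences share no state, the parallel result is identical to running `parse_sentence` over the corpus in a plain loop.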

A common trick for speeding up machine learning models is to use mini-batches instead of computing the answers one at a time. Consider a common operation: multiplying a weight matrix, representing support vectors or neural network weights, with a column vector, representing a single instance. If we want to perform this operation on 100 instances, we can do it one at a time in a for loop, or we can concatenate all instances into a 100-column matrix and perform a single matrix multiplication. Here are some comparisons; each variation measures the time for processing 10K instances:
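To make the two formulations concrete, here is a small pure-Python sketch with hypothetical helper names (`matvec`, `matmul`, `one_at_a_time`, `minibatch`). It only demonstrates that both give the same answers. It says nothing about timing, since the speed-up comes from handing one large multiplication to an optimized matrix library or the gpu, which plain Python lists do not do.

```python
def matvec(W, x):
    """W (m x n) times a single instance x (length n)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(W, X):
    """W (m x n) times X (n x b), where X is stored as a list of n rows."""
    cols = list(zip(*X))                 # the b columns, each of length n
    return [[sum(w * v for w, v in zip(row, col)) for col in cols]
            for row in W]

def one_at_a_time(W, instances):
    """One matrix-vector product per instance, in a loop."""
    return [matvec(W, x) for x in instances]

def minibatch(W, instances):
    """Concatenate instances as columns, multiply once, split the result."""
    X = [list(row) for row in zip(*instances)]   # n x b: one column per instance
    Y = matmul(W, X)                             # m x b: one column of scores each
    return [list(col) for col in zip(*Y)]        # back to one vector per instance
```

For example, with `W = [[1, 2], [3, 4], [0, 1]]` and the instances `[1, 0]`, `[0, 1]`, `[2, 2]`, both functions return the same three score vectors.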

This is almost a 100x speed-up going from single instances on the cpu to mini-batches on the gpu! Unfortunately, it is not trivial to use mini-batches with history-based models, i.e. models where the features of the next instance depend on your answers to the previous instances. In that case it is impossible to ask for "the next 100 instances" before you start providing answers. However, the sentences are typically independent of one another, and nothing prevents us from asking for "the instances representing the initial states of the next 100 sentences" and concatenating these together in a matrix. Then we can calculate 100 answers in parallel and use them to give us the 100 next states, and so on. The sentence lengths are different, so they will reach their final states at different times, but we can handle that with some bookkeeping. The following version of the code groups sentences into minibatches and processes them in parallel:
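As a rough Python illustration of that bookkeeping (not the post's Matlab listing), the sketch below advances all sentences of a mini-batch in lockstep and drops each one from the active set when it reaches its final state. The toy parser makes one move per word, and `getfeatures`, `predict_batch`, and `parse_minibatch` are hypothetical implementations chosen for brevity.

```python
def getfeatures(words, i, prev_move):
    """Toy feature vector for the state at word i (hypothetical features)."""
    return [1.0, float(len(words[i])), float(prev_move)]

def predict_batch(W, X):
    """Scores for a whole batch: X holds one feature vector per active state."""
    return [[sum(w * v for w, v in zip(row, x)) for row in W] for x in X]

def parse_minibatch(W, batch):
    """Advance every sentence in the batch one state at a time, together."""
    pos = [0] * len(batch)               # index of the next word per sentence
    moves = [[] for _ in batch]          # answers given so far per sentence
    active = [s for s in range(len(batch)) if batch[s]]
    while active:
        # Gather the current state of every unfinished sentence.
        X = [getfeatures(batch[s], pos[s], moves[s][-1] if moves[s] else 0)
             for s in active]
        scores = predict_batch(W, X)     # one batched call for all active states
        for s, sc in zip(active, scores):
            moves[s].append(max(range(len(sc)), key=sc.__getitem__))
            pos[s] += 1
        # Bookkeeping: sentences that reached their final state drop out.
        active = [s for s in active if pos[s] < len(batch[s])]
    return moves
```

The batch shrinks as shorter sentences finish, so the last few steps run on fewer columns than the first ones.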

This code runs at 2.80 ms/word with the cpu and 1.67 ms/word with the gpu. If we replace the for loop with parfor we get 1.08 ms/word with the cpu and 0.46 ms/word with the gpu.

Here is a summary of the results:

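version               ms/word
serial (cpu)          10.9
gpu                   6.24
parfor                3.70
gpu+parfor            5.03
minibatch             2.80
minibatch+gpu         1.67
minibatch+parfor      1.08
minibatch+gpu+parfor  0.46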
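The speed-up of each variant over the serial baseline follows directly from the ms/word figures quoted in the text. A short Python sketch of that arithmetic, where the variant labels other than "minibatch+gpu+parfor" are shorthand assumed here:

```python
# ms/word for each variant, as reported in the text above
timings = {
    "serial": 10.9,
    "gpu": 6.24,
    "parfor": 3.70,
    "gpu+parfor": 5.03,
    "minibatch": 2.80,
    "minibatch+gpu": 1.67,
    "minibatch+parfor": 1.08,
    "minibatch+gpu+parfor": 0.46,
}

def speedups(timings, baseline="serial"):
    """Speed-up factor of every variant relative to the baseline timing."""
    base = timings[baseline]
    return {name: base / t for name, t in timings.items()}
```

The best variant comes out at 10.9 / 0.46, or roughly 23.7, which matches the ~24x figure quoted at the start of the post.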
