Here is the serial version of the main loop. The language is matlab, but I hope it is clear enough as pseudo-code. The specifics of the model, the parser and the features are not all that important. As a baseline, this code takes 10.9 ms/word for parsing, and most of that time is spent in "getfeatures" and "predict".
To speed up "predict", the simplest trick is to perform the matrix operations on the gpu. Many common machine learning models including the neural network, kernel perceptron, svm etc. can be applied using a few matrix operations. In my case declaring the weights of my neural net model as gpuArrays instead of regular arrays improves the speed to 6.24 ms/word without any change in the code.
To speed up "getfeatures" the gpu is useless: feature calculation typically consists of ad-hoc code that tries to summarize the parser state, the sentence and the model in a vector. However we can parse multiple sentences in parallel using multiple cores. Replacing the "for" in line 2 with "parfor" and using a pool of 12 cores improves the performance to 5.03 ms/word with the gpu and 3.70 ms/word without the gpu (here the single gpu in the machine creates a bottleneck for the parallel processes).
A common trick for speeding up machine learning models is to use mini-batches instead of computing the answers one at a time. Consider a common operation: multiplying a weight matrix, representing support vectors or neural network weights, with a column vector, representing a single instance. If you want to perform this operation on 100 instances, we can do this one at a time in a for loop, or we can concatenate all instances into a 100 column matrix and perform a single matrix multiplication. Here are some comparisons, each variation measures the time for processing 10K instances:
This is almost a 100x speed-up going from single instances on the cpu to mini-batches on the gpu! Unfortunately it is not trivial to use mini-batches with history based models, i.e. models where the features of the next instance depend on your answers to the previous instances. In that case it is impossible to ask for "the next 100 instances" before you start providing answers. However typically the sentences are independent of one another and nothing prevents us from asking for "the instances representing the initial states of the next 100 sentences" and concatenate these together in a matrix. Then we can calculate 100 answers in parallel and use them to give us the 100 next states etc. The sentence lengths are different, and they will reach their final states at different times, but we can handle that with some bookkeeping. The following version of the code groups sentences into minibatches and processes them in parallel:
This code runs at 2.80 ms/word with the cpu and 1.67 ms/word with the gpu. If we replace the for loop with parfor we get 1.08 ms/word with the cpu and 0.46 ms/word with the gpu.
Here is a summary of the results: