March 06, 2015

Alec Radford's animations for optimization algorithms

Alec Radford has created some great animations comparing optimization algorithms SGD, Momentum, NAG, Adagrad, Adadelta, RMSprop (unfortunately no Adam) on low dimensional problems. Also check out his presentation on RNNs.

"Noisy moons: This is logistic regression on noisy moons dataset from sklearn which shows the smoothing effects of momentum based techniques (which also results in over shooting and correction). The error surface is visualized as an average over the whole dataset empirically, but the trajectories show the dynamics of minibatches on noisy data. The bottom chart is an accuracy plot."
"Beale's function: Due to the large initial gradient, velocity based techniques shoot off and bounce around - adagrad almost goes unstable for the same reason. Algos that scale gradients/step sizes like adadelta and RMSProp proceed more like accelerated SGD and handle large gradients with more stability."
"Long valley: Algos without scaling based on gradient information really struggle to break symmetry here - SGD gets no where and Nesterov Accelerated Gradient / Momentum exhibits oscillations until they build up velocity in the optimization direction. Algos that scale step size based on the gradient quickly break symmetry and begin descent."
"Saddle point: Behavior around a saddle point. NAG/Momentum again like to explore around, almost taking a different path. Adadelta/Adagrad/RMSProp proceed like accelerated SGD."

Full post...

Parsing the Penn Treebank in 60 seconds

Parsers (as well as many other natural language processing algorithms) work by (1) extracting features for the current state, and (2) using a machine learning model to predict the best action / structure based on those features. The feature extraction code is typically messy and irregular and best performed on (possibly multiple) CPUs. The machine learning models can typically be accelerated significantly using GPUs. In this post I will use a greedy transition based parser with a neural network model and figure out how to use both the GPU and the multiple CPU cores effectively. We will take the parser speed from 55 ms/word (with a single CPU) to 0.055 ms/word (using 20 CPU cores and two K40 GPUs). At this speed we can parse the whole Penn Treebank (approx 1M words) in less than 60 seconds.

Some related links:
The code used in this post
Parallel processing for natural language (same idea, Matlab version)
Beginning deep learning with 500 lines of Julia

A greedy transition based parser parses a sentence using the following steps:

function gparse(s::Sentence, n::Net, f::Features)
    p = ArcHybrid(wcnt(s))       # initialize parser state
    while (v = valid(p); any(v)) # while we have valid moves
        x = features(p, s, f)    # extract features
        y = predict(n, x)        # score moves
        y[!v] = -Inf             # ignore invalid moves
        move!(p, indmax(y))      # make the max score move
    return p.head

Here n is a machine learning model, f is a specification of what features to extract, and p represents the parser state. The parser works by extracting features representing the sentence and the current parser state, using the model to score possible moves, and executing the highest scoring move until no valid moves are left.

To parse a whole corpus (array of sentences), we just map gparse to each sentence. Julia can distinguish which gparse we mean by looking at the types of arguments.

typealias Corpus AbstractVector{Sentence}
gparse(c::Corpus, n::Net, f::Features)=map(s->gparse(s,n,f), c)

For our first experiment we will parse some sentences on a single CPU core (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz):

julia> nworkers()  # we use a single core
julia> blas_set_num_threads(1)  # including blas
julia> using KUnet
julia> KUnet.gpu(false)  # no gpu yet
julia> using KUparser
julia> @time KUparser.gparse(dev, net, feats);
elapsed time: 2244.3076923076924 seconds

The corpus, dev, is the Penn Treebank WSJ section 22 (1700 sentences, 40117 words); net is a standard feed forward neural network with 1326 input units, 20000 hidden units in a single layer, and 3 output units; feats is a specification of features to be extracted. The parsing speed is 55.944 ms/word. More than 99% of that time is spent on "predict".

In order to speed up "predict", we will use the GPU (NVIDIA Tesla K20m):

julia> gnet=copy(net,:gpu)
julia> @time KUparser.gparse(dev, gnet, feats);
elapsed time: 148.56374417550305 seconds

This gives us 3.704 ms/word, a 15x speed-up. However the GPU can be better utilized if we somehow manage to batch our feature vectors and compute scores for multiple instances in parallel. The problem is parsing a sentence is a serial process, you need the state resulting from the last move in order to compute the features for the next move. The solution is to parse multiple sentences in parallel (thanks to Alkan Kabakcioglu for suggesting this). Different sentences have no dependencies on each other and we can keep track of their states and predict their moves in bulk. The second version of gparse takes an additional "batchsize" argument specifying how many sentences to parse in parallel. This needs some bookkeeping (requiring 80 lines of code for gparse v2 instead of the 10 line beauty you see above), so I won't cut-and-paste it here, you can see the source code if you wish. Here are some experiments with the batched gparse:

julia> @time KUparser.gparse(dev, gnet, feats, 1);
elapsed time: 148.725787323 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 10);
elapsed time: 48.573996933 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 100);
elapsed time: 25.502507879 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 1700); 
elapsed time: 22.079269825 seconds

As we increase the number of sentences processed in parallel (doing all 1700 sentences in the corpus in parallel in the last example), we get 0.550 ms/word, a 100x speedup from where we started. At this point the time spent on prediction is about a third of the time spent on feature extraction, so let's take another look at features. We will use Julia's parallel computing primitives to group the sentences to be processed on separate cores. The third version of gparse takes yet another argument specifying the number of cpu cores:

function gparse(corpus::Corpus, net::Net, fmat::Features, batch::Integer, ncpu::Integer)
    d = distribute(corpus, workers()[1:ncpu])
    n = copy(net, :cpu)
    p = pmap(procs(d)) do x
        gparse(localpart(d), copy(n, :gpu), fmat, batch)

The distribute command distributes the corpus equally among ncpu workers, and localpart gives each worker its own subset. We copy the net back and forth between the CPU and the GPU because I couldn't figure out how to pass GPU pointers between different workers. Finally pmap is the parallel map which calls gparse v2 on each worker for the appropriate subset of the corpus, pmerge merges the results. This time we will run the parser on the training set (Sections 02-21, ~40k sentences, ~950k words)

julia> addprocs(20)
julia> require("CUDArt")
julia> @everywhere CUDArt.device((myid()-1)%CUDArt.devcount())
julia> require("KUparser")
julia> @time KUparser.gparse(trn, gnet, feats, 2000, 20);
elapsed time: 52.13701401 seconds

The server has 20 CPU cores and 2 GPUs. We create 20 workers, and assign equal numbers to each GPU. Parsing 950k words takes 52 seconds (0.055 ms/word), a 1000x speedup from where we started.

Full post...