I am an associate professor in Computer Engineering at Koç University in Istanbul working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. For prospective students here are some research topics, papers, classes, blog posts and past students.

May 17, 2015

Beginning deep learning with 500 lines of Julia (Version 0.1)

Click here for Version 0.0 of this tutorial. OK, first a disclaimer: this version of KUnet.jl, my deep learning code for Julia, is a bit more than 500 lines, but it is still under 1000 lines and it supports convolution and pooling, new activation and loss functions, arrays of arbitrary dimensionality with 32 and 64 bit floats etc. See here for installation instructions. We will use the MNIST dataset to illustrate basic usage of KUnet:

julia> include(Pkg.dir("KUnet/test/mnist.jl"))

This may take a while the first time you run it, as the data needs to be downloaded.

Next we tell Julia we intend to use KUnet, and some variables from MNIST:

julia> using KUnet
julia> using MNIST: xtrn, ytrn, xtst, ytst

The MNIST variables are Float32 matrices. Each column of the x matrices holds the pixel values of a 28x28 image, scaled to [0.0,1.0]. The y matrices have 10 rows, one per class, with a single nonzero entry in each column marking the correct class.

julia> xtrn, ytrn, xtst, ytst
(
784x60000 Array{Float32,2}: ...
10x60000 Array{Float32,2}: ...
784x10000 Array{Float32,2}: ...
10x10000 Array{Float32,2}: ...
)
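
If you want to sanity-check this encoding yourself, a couple of quick inspections along these lines should work (assuming the one-hot labels use ones, which the text above does not spell out):

julia> size(xtrn)          # (784,60000): one 28x28 image per column
julia> sum(ytrn[:,1])      # 1.0f0 if each label column is one-hot with a single 1
julia> indmax(ytrn[:,1])   # row index of the correct class for the first training image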

Before using KUnet, we should specify the array type and the element type we want to use. The array type determines whether KUnet uses the GPU, and the element type should match that of the data.

julia> KUnet.atype(CudaArray)   # CudaArray or Array
julia> KUnet.ftype(Float32)     # Float32 or Float64

Let's construct a neural net with a single layer of 64 hidden units using the relu activation function and the cross entropy loss function.

julia> net = [ Mmul(64,784), Bias(64), Relu(),
               Mmul(10,64),  Bias(10), XentLoss() ]

Each element of the net array represents an operation, e.g. Mmul multiplies its input with a weight matrix, Bias adds a bias vector, Relu applies the rectified linear transformation to each element, etc. They are all subtypes of an abstract type called Layer. The full list of Layers currently implemented is: Bias, Conv, Drop, Logp, Mmul, Pool, Relu, Sigm, Soft, Tanh, LogpLoss, QuadLoss, SoftLoss, XentLoss.

A Net is simply a 1-D array of Layers. Here are the definitions from net.jl and bias.jl:

abstract Layer
typealias Net Array{Layer,1}
type Bias <: Layer; b::Param; Bias(b::Param)=new(b); end

If you are not happy with the default Layer constructors, you can specify your own parameters. For example, the Mmul(64,784) constructor fills a (64,784) weight matrix with random weights from a Gaussian distribution with std=0.01. If we want a different initialization, we could create a weight matrix any way we want and pass it to the Mmul constructor instead. Note that the weight matrix for an Mmul layer with 784 inputs and 64 outputs has size (64, 784).

julia> w1 = randn(64, 784) * 0.05
julia> l1 = Mmul(w1)

Training parameters like the learning rate (lr) can be specified at layer construction, or using setparam! on the whole network or on individual layers.

julia> l1 = Mmul(64,784; lr=0.01)
julia> setparam!(l1; lr=0.01)
julia> setparam!(net; lr=0.01)

It is also possible to save nets to JLD files using savenet(fname::String, n::Net) and read them using loadnet(fname::String). Let's save our initial random network for replicability.

julia> savenet("net0.jld", net)

OK, now that we have some data and a network, let's proceed with training. Here is a convenience function to measure the classification accuracy:

julia> accuracy(y,z)=mean(findmax(y,1)[2] .== findmax(z,1)[2])

Let's do 100 epochs of training:

@time for i=1:100
    train(net, xtrn, ytrn; batch=128)
    println((i, accuracy(ytst, predict(net, xtst)), 
                accuracy(ytrn, predict(net, xtrn))))
end

If you take a look at the definition of train in net.jl, you will see that it takes a network net, the input x, and the desired output y, and after splitting the data into minibatches it just calls backprop and update. Here is a simplified version:

train(net, x, y)=(backprop(net, x, y); update(net))
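
The actual train in net.jl also handles minibatching and other options; a minimal sketch of the minibatch loop (my own simplification, not the package code) might look like this:

function train(net, x, y; batch=128)
    for i=1:batch:size(x,2)                  # go over the columns in minibatches
        j = min(i+batch-1, size(x,2))
        backprop(net, x[:,i:j], y[:,i:j])    # forward and backward pass on the minibatch
        update(net)                          # apply the parameter updates
    end
end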

The backprop function calls forw which computes the network output, and back which computes the gradients of the network parameters with respect to the loss function:

backprop(net, x, y)=(forw(net, x); back(net, y))

The forw and back functions for a Net simply call the forw and back functions of each layer in order, feeding the output of one to the input of the next:

forw(n::Net, x)=(for i=1:length(n); x=forw(n[i], x); end)
back(n::Net, y)=(for i=length(n):-1:1; y=back(n[i], y); end)

The forw function for a layer takes the layer input x, and returns the layer output y. The back function for a layer takes the loss gradient wrt its output dy and returns the loss gradient wrt its input dx. If the layer has a parameter p, back also computes the loss gradient p.diff wrt its current value p.data. You can take a look at individual layer definitions (e.g. in mmul.jl, bias.jl, relu.jl, etc.) to see how this is done for each layer.
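
To make this contract concrete, here is a toy elementwise layer made up for illustration (it is not part of KUnet) that squares its input; in practice its forw and back would have to extend KUnet's own generic functions:

type Square <: Layer; x; Square()=new(nothing); end

forw(l::Square, x)=(l.x = x; x .* x)     # save the input, return y = x.^2
back(l::Square, dy)=(2 .* l.x .* dy)     # dx = dy .* dy/dx = 2x .* dy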

The final layer of the network (XentLoss in our case) is a subtype of LossLayer. LossLayer is a special type of layer: its forw does nothing but record the network output y. Its back expects the desired output z (not a gradient) and computes the loss gradient wrt the network output dy. A LossLayer also implements the function loss(l::LossLayer,z) which returns the actual loss value given the desired output z. KUnet currently implements the following LossLayers: LogpLoss, QuadLoss, SoftLoss, XentLoss.
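
As an illustration of this contract, a quadratic-loss layer could look roughly like the following sketch (made up for illustration, not the actual QuadLoss implementation):

type MyQuadLoss <: LossLayer; y; MyQuadLoss()=new(nothing); end

forw(l::MyQuadLoss, y)=(l.y = y)                          # just record the network output
back(l::MyQuadLoss, z)=(l.y - z)                          # dy for the loss 0.5*|y-z|^2
loss(l::MyQuadLoss, z)=0.5*sum((l.y - z).^2)/size(z,2)    # average loss over columns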

The update function for a net calls the update function for each of its layers, which in turn calls the update function on layer parameters:

update(n::Net)=(for l in n; update(l); end)
update(l::Bias)=update(l.b)

The update function for a parameter p is used to update its values (p.data) given the loss gradients (p.diff). Its behavior is controlled by the following parameters: lr, l1reg, l2reg, adagrad, momentum, nesterov. Here is a simplified definition of update from param.jl (p.ada, p.mom, and p.nes are temporary arrays initialized to 0):

function update(p::Param; o...)
    isdefined(p,:l1reg)    && (p.diff += p.l1reg * sign(p.data))
    isdefined(p,:l2reg)    && (p.diff += p.l2reg * p.data)
    isdefined(p,:adagrad)  && (p.ada += p.diff.^2; p.diff /= p.adagrad + sqrt(p.ada))
    isdefined(p,:momentum) && (p.diff += p.momentum * p.mom; p.mom[:] = p.diff)
    isdefined(p,:nesterov) && (p.nes *= p.nesterov; p.nes += p.diff; p.diff += p.nesterov * p.nes)
    isdefined(p,:lr)       && (p.diff *= p.lr)
    p.data -= p.diff
end
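
For example, assuming setparam! accepts these fields as keyword arguments just like lr, one could turn on L2 regularization and momentum for the whole net with:

julia> setparam!(net; lr=0.1, l2reg=1e-4, momentum=0.9)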

Our training should print out the test set and training set accuracy at the end of every epoch.

(1,0.3386,0.3356)
(2,0.7311,0.7226666666666667)
(3,0.821,0.8157333333333333)
...
(99,0.9604,0.9658166666666667)
(100,0.9604,0.96605)
elapsed time: 39.738191211 seconds (1526525108 bytes allocated, 3.05% gc time)

Note that for actual research we should not be looking at the test set accuracy at this point. We should instead split the training set into a training and a development portion and do all our playing around with those. We should also run each experiment 10 times with different random seeds and measure standard errors, etc. But, this is just a KUnet tutorial.
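
If you want to follow that advice, a simple way to carve a development set out of the training data is along these lines (a sketch assuming the column layout described earlier; the 10000-column split is an arbitrary choice):

julia> ndev = 10000
julia> xdev, ydev = xtrn[:,1:ndev], ytrn[:,1:ndev]
julia> xsub, ysub = xtrn[:,ndev+1:end], ytrn[:,ndev+1:end]

You would then train on (xsub, ysub), tune hyperparameters against (xdev, ydev), and touch (xtst, ytst) only once at the end.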

It seems the training set accuracy is not that great. Maybe increasing the learning rate will help:

net = loadnet("net0.jld")
setparam!(net, lr=0.5)

# same for loop...

(1,0.9152,0.9171833333333334)
(2,0.9431,0.9440333333333333)
(3,0.959,0.9611666666666666)
...
(59,0.9772,0.9999833333333333)
(60,0.9773,1.0)
...
(100,0.9776,1.0)

Wow! We got 100% training set accuracy in 60 epochs. This should drive home the importance of setting a good learning rate.

But the test set is still lagging behind. What if we try increasing the number of hidden units:

for h in (128, 256, 512, 1024)
    net = [Mmul(h,784), Bias(h), Relu(), Mmul(10,h),  Bias(10), XentLoss()]
    setparam!(net; lr=0.5)
    for i=1:100
        train(net, xtrn, ytrn; batch=128)
        println((i, accuracy(ytst, predict(net, xtst)), 
                    accuracy(ytrn, predict(net, xtrn))))
    end
end

# Number of epochs and test accuracy when training accuracy reaches 1.0:
# 128:  (43,0.9803,1.0)
# 256:  (42,0.983,1.0)
# 512:  (36,0.983,1.0)
# 1024: (30,0.9833,1.0)

This improvement is unexpected: we were already overfitting with 64 hidden units, and common wisdom says not to increase the capacity of the network by adding more hidden units in that situation. Maybe we should try dropout:

net = [Drop(0.2), Mmul(1024,784), Bias(1024), Relu(), 
       Drop(0.5), Mmul(10,1024),  Bias(10), XentLoss()]

# lr=0.5, same for loop
...
(100,0.9875,0.9998166666666667)
elapsed time: 122.898730432 seconds (1667849932 bytes allocated, 0.96% gc time)

Or bigger and bigger nets:

net = [Drop(0.2), Mmul(4096,784),  Bias(4096), Relu(), 
       Drop(0.5), Mmul(4096,4096), Bias(4096), Relu(), 
       Drop(0.5), Mmul(10,4096),   Bias(10), XentLoss()]

# lr=0.5, same for loop
...
(100,0.9896,0.9998166666666667)
elapsed time: 804.242212488 seconds (1080 MB allocated, 0.02% gc time in 49 pauses with 0 full sweep)

Or maybe we should try convolution. Here is an implementation of LeNet:

net = [Conv(5,5,1,20), Bias(20), Relu(), Pool(2),
       Conv(5,5,20,50), Bias(50), Relu(), Pool(2),
       Mmul(500,800), Bias(500), Relu(),
       Mmul(10,500), Bias(10), XentLoss()]
setparam!(net; lr=0.1)

# Need to reshape the input arrays for convolution:
xtrn2 = reshape(xtrn, 28, 28, 1, size(xtrn, 2))
xtst2 = reshape(xtst, 28, 28, 1, size(xtst, 2))

# same for loop
...
(100,0.9908,1.0)
elapsed time: 360.722851006 seconds (5875158944 bytes allocated, 1.95% gc time)

OK, that's enough fiddling around. I hope this gave you enough to get your hands dirty. We are already among the better results on the MNIST website. I am sure you can do better playing around with the learning rate, dropout probabilities, momentum, adagrad, regularization, and numbers, sizes, types of layers etc. But be careful, it could become addictive :)



May 07, 2015

CUDNN Julia port

I just finished implementing CUDNN.jl, a Julia wrapper for the NVIDIA cuDNN GPU accelerated deep learning library. It consists of a low level interface and a high level interface. The low level interface wraps each function from libcudnn.so in a Julia function in libcudnn.jl and each data type from cudnn.h in a Julia datatype in types.jl. These were generated semi-automatically using Clang and are documented in the cuDNN Library User Guide. For the high level interface defined in CUDNN.jl, I kept the original names from the C library and provided more convenient type signatures, return values, and keyword arguments with reasonable defaults.  Next step will be integrating this with KUnet.jl.

April 24, 2015

Big questions

While browsing some interesting talks on the web, I realized some of the "big questions" that cause much confusion fall into one of two categories.
  1. Physics does not have X. Is X something real and missing in physics, or is X an illusion, a modeling tool, a result of our limited information, agent based perspective, or stupidity? Examples: consciousness, free will, causality and directionality of time.
  2. Physics does have X which doesn't make sense. Is X something real, or a modeling tool? Examples: randomness, entropy, various sorts of quantum weirdness.
Here are some of the related links:
  • I was trying to find out what fellow MIT AI Lab alum Gary Drescher is up to after writing Good and Real (which covers most of the big questions mentioned above). I found this video of a talk on causality and choice he gave at the Singularity Summit in 2009. He argues convincingly that determinism and choice are not at odds with each other.
  • Speaking of causality, Michael Nielsen (author of Quantum Computation and Quantum Information, as well as a free online book on Neural Networks and Deep Learning) has a nice article summarizing the key points of Judea Pearl's Causality book. The epilogue of the book (well worth the read if you don't have the time or patience for the whole book) is a lecture Pearl gave some years ago where he covers the history of the confusion about causality, why physics (if we model the whole universe) does not require it, how economics and other social sciences (where the lack of controlled experiments makes causality detection difficult) still do not adequately model it, and how it is actually possible in some cases to derive causality from purely observational data without resorting to controlled experiments.
  • Another regular at the Singularity Summit is Eliezer Yudkowsky. I first discovered his posts on Less Wrong, which has a lot of good material that identifies common sources of confusion when thinking about the big questions.


March 06, 2015

Alec Radford's animations for optimization algorithms

Alec Radford has created some great animations comparing optimization algorithms SGD, Momentum, NAG, Adagrad, Adadelta, RMSprop (unfortunately no Adam) on low dimensional problems. Also check out his presentation on RNNs.

"Noisy moons: This is logistic regression on noisy moons dataset from sklearn which shows the smoothing effects of momentum based techniques (which also results in over shooting and correction). The error surface is visualized as an average over the whole dataset empirically, but the trajectories show the dynamics of minibatches on noisy data. The bottom chart is an accuracy plot."
"Beale's function: Due to the large initial gradient, velocity based techniques shoot off and bounce around - adagrad almost goes unstable for the same reason. Algos that scale gradients/step sizes like adadelta and RMSProp proceed more like accelerated SGD and handle large gradients with more stability."
"Long valley: Algos without scaling based on gradient information really struggle to break symmetry here - SGD gets no where and Nesterov Accelerated Gradient / Momentum exhibits oscillations until they build up velocity in the optimization direction. Algos that scale step size based on the gradient quickly break symmetry and begin descent."
"Saddle point: Behavior around a saddle point. NAG/Momentum again like to explore around, almost taking a different path. Adadelta/Adagrad/RMSProp proceed like accelerated SGD."


Parsing the Penn Treebank in 60 seconds

Parsers (as well as many other natural language processing algorithms) work by (1) extracting features for the current state, and (2) using a machine learning model to predict the best action / structure based on those features. The feature extraction code is typically messy and irregular and best performed on (possibly multiple) CPUs. The machine learning models can typically be accelerated significantly using GPUs. In this post I will use a greedy transition based parser with a neural network model and figure out how to use both the GPU and the multiple CPU cores effectively. We will take the parser speed from 55 ms/word (with a single CPU) to 0.055 ms/word (using 20 CPU cores and two K40 GPUs). At this speed we can parse the whole Penn Treebank (approx 1M words) in less than 60 seconds.

Some related links:
The code used in this post
Parallel processing for natural language (same idea, Matlab version)
Beginning deep learning with 500 lines of Julia

A greedy transition based parser parses a sentence using the following steps:

function gparse(s::Sentence, n::Net, f::Features)
    p = ArcHybrid(wcnt(s))       # initialize parser state
    while (v = valid(p); any(v)) # while we have valid moves
        x = features(p, s, f)    # extract features
        y = predict(n, x)        # score moves
        y[!v] = -Inf             # ignore invalid moves
        move!(p, indmax(y))      # make the max score move
    end
    return p.head
end

Here n is a machine learning model, f is a specification of what features to extract, and p represents the parser state. The parser works by extracting features representing the sentence and the current parser state, using the model to score possible moves, and executing the highest scoring move until no valid moves are left.

To parse a whole corpus (array of sentences), we just map gparse to each sentence. Julia can distinguish which gparse we mean by looking at the types of its arguments.

typealias Corpus AbstractVector{Sentence}
gparse(c::Corpus, n::Net, f::Features)=map(s->gparse(s,n,f), c)

For our first experiment we will parse some sentences on a single CPU core (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz):

julia> nworkers()  # we use a single core
1
julia> blas_set_num_threads(1)  # including blas
julia> using KUnet
julia> KUnet.gpu(false)  # no gpu yet
julia> using KUparser
julia> @time KUparser.gparse(dev, net, feats);
elapsed time: 2244.3076923076924 seconds

The corpus, dev, is the Penn Treebank WSJ section 22 (1700 sentences, 40117 words); net is a standard feed forward neural network with 1326 input units, 20000 hidden units in a single layer, and 3 output units; feats is a specification of features to be extracted. The parsing speed is 55.944 ms/word. More than 99% of that time is spent on "predict".
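
For reference, a network of those dimensions could be built with the KUnet layers from the tutorial above roughly as follows (a sketch only: the activation and loss functions used in the actual experiments are not stated in this post, so Relu and XentLoss are just placeholders):

net = [Mmul(20000,1326), Bias(20000), Relu(),
       Mmul(3,20000),    Bias(3),     XentLoss()]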

In order to speed up "predict", we will use the GPU (NVIDIA Tesla K20m):

julia> gnet=copy(net,:gpu)
julia> @time KUparser.gparse(dev, gnet, feats);
elapsed time: 148.56374417550305 seconds

This gives us 3.704 ms/word, a 15x speed-up. However, the GPU can be better utilized if we somehow manage to batch our feature vectors and compute scores for multiple instances in parallel. The problem is that parsing a sentence is a serial process: you need the state resulting from the last move in order to compute the features for the next move. The solution is to parse multiple sentences in parallel (thanks to Alkan Kabakcioglu for suggesting this). Different sentences have no dependencies on each other, so we can keep track of their states and predict their moves in bulk. The second version of gparse takes an additional "batchsize" argument specifying how many sentences to parse in parallel. This needs some bookkeeping (requiring 80 lines of code for gparse v2 instead of the 10-line beauty you see above), so I won't cut and paste it here; you can see the source code if you wish. Here are some experiments with the batched gparse:

julia> @time KUparser.gparse(dev, gnet, feats, 1);
elapsed time: 148.725787323 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 10);
elapsed time: 48.573996933 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 100);
elapsed time: 25.502507879 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 1700); 
elapsed time: 22.079269825 seconds

As we increase the number of sentences processed in parallel (doing all 1700 sentences in the corpus in parallel in the last example), we get 0.550 ms/word, a 100x speedup from where we started. At this point the time spent on prediction is about a third of the time spent on feature extraction, so let's take another look at features. We will use Julia's parallel computing primitives to group the sentences to be processed on separate cores. The third version of gparse takes yet another argument specifying the number of CPU cores:

function gparse(corpus::Corpus, net::Net, fmat::Features, batch::Integer, ncpu::Integer)
    d = distribute(corpus, workers()[1:ncpu])
    n = copy(net, :cpu)
    p = pmap(procs(d)) do x
        gparse(localpart(d), copy(n, :gpu), fmat, batch)
    end
end

The distribute command distributes the corpus equally among ncpu workers, and localpart gives each worker its own subset. We copy the net back and forth between the CPU and the GPU because I couldn't figure out how to pass GPU pointers between different workers. Finally, pmap is the parallel map which calls gparse v2 on each worker for the appropriate subset of the corpus, and pmerge merges the results. This time we will run the parser on the training set (Sections 02-21, ~40k sentences, ~950k words):

julia> addprocs(20)
julia> require("CUDArt")
julia> @everywhere CUDArt.device((myid()-1)%CUDArt.devcount())
julia> require("KUparser")
julia> @time KUparser.gparse(trn, gnet, feats, 2000, 20);
elapsed time: 52.13701401 seconds

The server has 20 CPU cores and 2 GPUs. We create 20 workers, and assign equal numbers to each GPU. Parsing 950k words takes 52 seconds (0.055 ms/word), a 1000x speedup from where we started.

