Click here for a newer version (Knet7) of this tutorial. The code used in this version (KUnet) has been deprecated.

There are a number of deep learning packages out there, but most sacrifice readability for efficiency. This has two disadvantages: (1) It is difficult for a beginning student to understand what the code is doing, which is a shame because the code can sometimes be a lot simpler than the underlying math. (2) Every other day new ideas come out for optimization, regularization, etc. If the package already has the trick implemented, great. But if not, it is difficult for a researcher to test the new idea using impenetrable code with a steep learning curve. So I started writing KUnet.jl, which currently implements backprop with basic units like relu, standard loss functions like softmax, dropout for generalization, L1-L2 regularization, and optimization using SGD, momentum, Adagrad, Nesterov's accelerated gradient, etc. in less than 500 lines of Julia code. Its speed is competitive with the fastest GPU packages (here is a benchmark). For installation and usage information, please refer to the GitHub repo. The rest of this post presents (a slightly cleaned up version of) the code as a beginner's neural network tutorial (modeled after Honnibal's excellent parsing example).

#### Contents

- Layers and nets
- The layer function
- Prediction
- Loss functions
- Backpropagation
- Training
- Activation functions
- Preprocessing functions
- Why Julia?
- Future work

#### Layers and nets

A feed forward neural network consists of layers.

```julia
typealias Net Array{Layer,1}

type Layer w; b; f; fx; ... end
```

#### The layer function (forw)

Given its parameters and functions, here is how a layer computes its output from its input:

```julia
function forw(l::Layer, x)
    l.x = l.fx(l, x)    # preprocess the input
    l.y = l.w * l.x     # multiply with the weight matrix
    l.y = l.y .+ l.b    # add the bias vector (to every column)
    l.y = l.f(l, l.y)   # apply the activation fn to the output
end
```

#### Prediction

Once we have forw for a single layer, calculating a prediction for the whole network is literally a one-liner:

```julia
forw(n::Net, x)=(for l=n x=forw(l,x) end; x)
```
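To see what this composition computes, here is a standalone sketch of a two-layer forward pass using plain arrays (the weights and the relu helper below are made up for illustration and are not KUnet code):

```julia
# Standalone sketch of a two-layer forward pass (not KUnet code).
relu(x) = max.(x, 0)                         # elementwise rectifier

w1, b1 = [1.0 -1.0; 0.5 2.0], [0.0, 1.0]     # made-up layer 1 parameters
w2, b2 = [1.0 1.0], [0.0]                    # made-up layer 2 parameters

predict(x) = w2 * relu(w1 * x .+ b1) .+ b2   # same shape as forw(n::Net, x)

predict([1.0, 2.0])                          # → [5.5]
```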

#### Loss functions

In order to train the network, we need a loss function. A loss function takes y, the network output, and dy, the desired output, and returns a loss value which we try to minimize during training. In addition, it calculates the gradient of the loss with respect to y, which tells us how small changes in the network output will affect the loss. As an example, here is the implementation of softmaxloss (albeit a bit low level) in func.jl:

```julia
function softmaxloss(y, dy)
    yrows,ycols = size(y)
    loss = zero(eltype(y))
    prob = similar(y, yrows)
    for j=1:ycols
        ymax = y[1,j]
        for i=2:yrows
            y[i,j] > ymax && (ymax = y[i,j])
        end
        psum = zero(ymax)
        for i=1:yrows
            yij = y[i,j] - ymax
            prob[i] = exp(yij)
            psum += prob[i]
            dy[i,j] == 1 && (loss += yij)
        end
        loss -= log(psum)
        for i=1:yrows
            prob[i] /= psum
            dy[i,j] = (prob[i] - dy[i,j]) / ycols
        end
    end
    return loss
end
```

The gradient of the loss with respect to y is computed in the last inner loop:

`dy[i,j] = (prob[i] - dy[i,j]) / ycols`

We should note a couple of important design decisions here: (1) y and dy should have the same dimensionality, so use one-of-k encoding for desired classification outputs. (2) The loss function overwrites dy with the gradient of the loss with respect to y.

#### Backpropagation

Backpropagation is the algorithm that calculates the gradient of the loss with respect to the network weights, given the gradient of the loss with respect to the network output. This can be accomplished by taking a backward pass through the layers after the forward pass and the loss calculation:

```julia
function backprop(net::Net, x, dy, loss=softmaxloss)
    y = forw(net, x)   # y: network output
    loss(y, dy)        # dy: desired output -> gradient
    back(net, dy)      # calculate derivatives
end
```

```julia
back(n::Net, dy) = (for i=length(n):-1:1 dy=back(n[i],dy) end)
```

```julia
function back(l::Layer, dy)
    dy = l.f(l, l.y, dy)       # backprop through the activation function
    l.dw = dy * l.x'           # gradient wrt the weight matrix
    l.db = sum!(l.db, dy)      # gradient wrt the bias (sum over instances)
    l.dx = l.w' * dy           # gradient wrt the input
    l.dx = l.fx(l, l.x, l.dx)  # backprop through the preprocessing function
end
```
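A cheap way to make sure such analytic gradients are right is to compare them against a finite-difference estimate. Here is a minimal standalone check on a toy one-parameter loss (illustrative only, not KUnet code):

```julia
# Finite-difference gradient check on a toy loss (not KUnet code).
loss(w) = (w - 3)^2
grad(w) = 2 * (w - 3)       # the analytic gradient, as backprop would compute

w = 1.7
h = 1e-6
numeric = (loss(w + h) - loss(w - h)) / (2h)
abs(numeric - grad(w))      # tiny: the two gradients agree
```

The same idea scales to networks: perturb one weight by h, recompute the loss, and compare against the corresponding entry of l.dw.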

#### Training

The gradients calculated by backprop, l.dw and l.db, tell us how much small changes in the corresponding entries of l.w and l.b will affect the loss (for the last instance, or minibatch). Small steps in the gradient direction will increase the loss; steps in the opposite direction will decrease it. This suggests the following update rule:

```julia
w = w - dw
```

Over the years, people have noted many subtle problems with this approach and suggested improvements:

**Step size:** If the step sizes are too small, the SGD algorithm will take too long to converge. If they are too big it will overshoot the optimum and start to oscillate. So we scale the gradients with an adjustable parameter called the learning rate:

```julia
w = w - learningRate * dw
```
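The effect of the learning rate is easy to see on a toy one-dimensional problem (a made-up quadratic, not KUnet code):

```julia
# Minimize loss(w) = (w - 3)^2 with a fixed learning rate (not KUnet code).
function sgd(w, lr; steps=100)
    for _ in 1:steps
        dw = 2 * (w - 3)   # gradient of (w - 3)^2
        w = w - lr * dw    # the update rule above
    end
    return w
end

sgd(0.0, 0.1)   # converges close to the minimum at 3
sgd(0.0, 1.1)   # too big a step: the iterates oscillate and diverge
```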

**Step direction:** More importantly, it turns out the gradient (or its opposite) is often NOT the direction you want to go in order to minimize loss. Let us illustrate with a simple picture:

The two axes are w1 and w2, two parameters of our network, and the contour plot represents the loss with a minimum at x. If we start at x0, the Newton direction (in red) points almost towards the minimum, whereas the gradient (in green), perpendicular to the contours, points to the right.

Unfortunately, Newton's direction is expensive to compute. However, it is also probably unnecessary, for several reasons: (1) Newton gives us the ideal direction for second-degree objective functions, which our neural network loss almost certainly is not. (2) The loss function whose gradient backprop calculated is *the loss for the last minibatch/instance only*, which is at best a very noisy version of the real loss function, so we shouldn't spend too much effort getting it exactly right.

So people have come up with various approximate methods to improve the step direction. Instead of multiplying each component of the gradient with the same learning rate, these methods scale them separately using their running average (momentum, Nesterov), or RMS (Adagrad, Rmsprop). I realize this necessarily short summary barely covers what has been implemented in KUnet and doesn't do justice to the literature or cover most of the important ideas. The interested reader can start with a standard textbook on numerical optimization, and peruse the latest papers on optimization in deep learning.
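The flavor of these methods can be conveyed in a few lines. The two sketches below are simplified versions of the ideas, not the exact KUnet implementations (the function names and defaults are made up):

```julia
# Simplified sketches of momentum and Adagrad updates (not the KUnet code).

# Momentum: step along a running average of past gradients.
function momentum_update!(w, dw, vel; lr=0.01, mu=0.9)
    vel .= mu .* vel .+ dw                # running average of gradients
    w .-= lr .* vel
end

# Adagrad: scale each component by the RMS of its past gradients.
function adagrad_update!(w, dw, g2; lr=0.01, eps=1e-8)
    g2 .+= dw .^ 2                        # accumulated squared gradients
    w .-= lr .* dw ./ (sqrt.(g2) .+ eps)
end
```

With Adagrad, components that consistently receive large gradients get smaller effective steps, so the learning rate effectively adapts per parameter.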

**Minimize what?** The final problem with gradient descent, other than not telling us the ideal step size or direction, is that it is not even minimizing the right objective! We want small loss on never before seen test data, not just on the training data. The truth is, a sufficiently large neural network with a good optimization algorithm can get arbitrarily low loss on any finite training data (e.g. by just memorizing the answers). And it can typically do so in many different ways (typically many different local minima for training loss in weight space exist). Some of those ways will generalize well to unseen data, some won't. And unseen data is (by definition) not seen, so how will we ever know which weight settings will do well on it? There are at least three ways people deal with this problem: (1) Bayes tells us that we should use all possible networks and weigh their answers by how well they do on training data (see Radford Neal's fbm), (2) New methods like dropout or adding distortions and noise to inputs and weights during training seem to help generalization, (3) Pressuring the optimization to stay in one corner of the weight space (e.g. L1, L2, maxnorm regularization) helps generalization.

KUnet views dropout (and other distortion methods) as a preprocessing step of each layer. The other techniques (learning rate, momentum, regularization etc.) are declared as UpdateParam's for l.w in l.pw and for l.b in l.pb for each layer:

```julia
type UpdateParam learningRate; l1reg; l2reg; maxnorm; adagrad; momentum; nesterov; ... end
```

```julia
setparam!(p::UpdateParam,k,v)
setparam!(l::Layer,k,v)
setparam!(net::Net,k,v)
```

```julia
function update(w, dw, o::UpdateParam)
    initupdate(w, dw, o)
    nz(o,:l1reg)    && l1reg!(o.l1reg, w, dw)
    nz(o,:l2reg)    && l2reg!(o.l2reg, w, dw)
    nz(o,:adagrad)  && adagrad!(o.adagrad, o.ada, dw)
    nz(o,:learningRate,1f0) && (dw = dw .* o.learningRate)
    nz(o,:momentum) && momentum!(o.momentum, o.mom, dw)
    nz(o,:nesterov) && nesterov!(o.nesterov, o.nes, dw)
    w = w .- dw
    nz(o,:maxnorm)  && maxnorm!(o.maxnorm, w)
end

nz(o,n,v=0f0)=(isdefined(o,n) && (o.(n) != v))
```

The "train" function in net.jl is just a wrapper around "backprop" and "update". It goes through the training set once, splitting it into minibatches, and feeding each minibatch through backprop and update.
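Schematically, the splitting works as follows (minibatches is a hypothetical helper written for this post, not the actual net.jl code); train then feeds each column range through backprop and update:

```julia
# Sketch of splitting n instances (stored as columns) into minibatches
# (hypothetical helper, not the actual net.jl code).
minibatches(n, batch) = [(i, min(i + batch - 1, n)) for i in 1:batch:n]

minibatches(10, 4)   # → [(1, 4), (5, 8), (9, 10)]
```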

#### Activation functions

The activation function follows the linear transformation (wx+b) in a layer and is typically a non-linear element-wise function. Without an activation function, multiple layers in a neural network would be useless, because the composition of several linear functions is still a linear function. As an example activation function l.f, here is the definition of the relu (rectified linear unit) from func.jl:

```julia
relu(l,y)=for i=1:length(y) (y[i]<0) && (y[i]=0) end
relu(l,y,dy)=for i=1:length(y) (y[i]==0) && (dy[i]=0) end
```

The GPU versions of the same functions call CUDA kernels through ccall:

```julia
relu(l,y::CudaArray)=ccall((:reluforw,libkunet), Void,(Cint,Cmat),length(y),y)
relu(l,y::CudaArray,dy::CudaArray)=ccall((:reluback,libkunet), Void,(Cint,Cmat,Cmat),length(dy),y,dy)
```

#### Preprocessing functions

A preprocessing function precedes the linear transformation (wx+b), modifying the input of the layer. Preprocessing functions typically add some noise to the input to improve generalization. For example, dropout can be implemented as a preprocessing function of a layer where each input is dropped (replaced by 0) with a given probability. Adding Gaussian noise, or elastic transformations of image inputs, can also be implemented as preprocessing functions. Here is the (simplified) dropout implementation from func.jl:

```julia
function drop(l, x)
    rand!(l.xdrop)
    drop(x, l.xdrop, l.dropout, 1/(1-l.dropout))
end

function drop(l, x, dx)
    drop(dx, l.xdrop, l.dropout, 1/(1-l.dropout))
end

function drop(x, xdrop, dropout, scale)
    for i=1:length(x)
        x[i] = (xdrop[i] < dropout ? 0 : scale * x[i])
    end
end

function drop(x::CudaArray, xdrop::CudaArray, dropout, scale)
    ccall((:drop,libkunet),Void,(Cint,Cmat,Cmat,Cfloat,Cfloat), length(x),x,xdrop,dropout,scale)
end
```
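The 1/(1-dropout) scaling is there so that the expected value of each input stays the same after dropping. A quick standalone check (illustrative only, not KUnet code; written for current Julia):

```julia
# Check that inverted-dropout scaling preserves the mean (not KUnet code).
using Random, Statistics

function drop_sketch(x, dropout)
    xdrop = rand(length(x))       # one uniform sample per input
    scale = 1 / (1 - dropout)
    [xdrop[i] < dropout ? 0.0 : scale * x[i] for i in 1:length(x)]
end

Random.seed!(42)
mean(drop_sketch(ones(100_000), 0.5))   # close to mean of the input, 1.0
```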

## 18 comments:

Thanks a lot for this.

Any idea how much effort may be involved in making this work under Windows?

Also, have you tested with more a challenging dataset like ImageNet?

One of few approachable short tutorials I have been able to digest as a rank beginner, thanks!

Thank you!

Alok, not much work at all: https://www.virtualbox.org/

This is amazing, exactly what is needed: the idea of readable and efficient code. It would be amazing if we could implement new ideas as they come out for people to play around with. Did you see the infinite RBM paper, about growing the layers adaptively?

Many thanks


Alok: I haven't tested the code under Windows, but I know Julia and CUDArt support Windows, so it should not be too difficult. Regarding ImageNet: as the size of the input grows, the number of weights for fully connected layers grows with it. So we would have to implement convolutional nets first to make large image processing feasible.

Jon: thanks for your kind comments. Trying new ideas with minimal coding was one of the main motivations for starting KUnet.jl, I hope it serves that purpose. I hadn't seen the infinite RBM paper, I will check it out (http://arxiv.org/abs/1502.02476).

Check out the KUnet discussion on ycombinator and the references therein: Mocha.jl, deeplearning4j, Boltzmann.jl.

Hi Deniz, thanks for this very useful piece of work!

I had some problems getting the library to work in Julia Version 0.3.6 (2015-02-17 22:12 UTC), Darwin build. In case anybody can find it useful, I had to change

```julia
adagrad!(eps, dw2, dw)=(for i=1:length(dw) (dw2[i] += dw[i] * dw[i]; dw[i] /= (eps + sqrt(dw2[i]))) end)
```

to

```julia
adagrad!(eps, dw2, dw) = for i=1:length(dw)
    dw2[i] += dw[i] * dw[i]
    dw[i] /= (eps + sqrt(dw2[i]))
end
```

in update.jl (line 17), and also rename the Net function in types.jl (line 16) so that it doesn't clash with the type name.

Thanks Dario: I have now fixed the v0.3 compatibility issues, the latest from the repo should work out of the box.

Hi,

I think it is clearer to write

`for l in n`

rather than

`for l=n`

Alok,

You can try out the code at https://try.jupyter.org/. Create a new Julia notebook in the top right and go from there.

I find this code extremely succinct and easy to follow. Very well done. It's great to see an example like this in Julia. A real boost for the language.

Do you have any stats on the performance benefits you refer to at the top of the post?

Andy: I just posted some benchmark results here.

Terrific post. Very timely and elegant indeed.

Thanks Deniz for sharing this with the rest of us. What kind of license do you have in mind for general use?

Emrah: The repo uses an MIT license.

Matthew's parsing tutorial taught me how to do parsing; yours will hopefully teach me about this mysterious thing called deep learning. However, I would have loved it if it were in Python. Thanks anyway!
