Showing posts with label Downloads. Show all posts
Showing posts with label Downloads. Show all posts

March 22, 2022

Turkish Data Depository (TDD)

Talk by Ali Safaya and Taner Sezer on the Turkish Data Depository (TDD) project, which aims to collect Turkish NLP resources such as corpora, labeled data, model weights under https://tdd.ai. You can register on the website and download resources, sign up for the mailing list to get updates.
Full post...

January 22, 2019

Knet v1.2.0: iterators, iterators, iterators...

The new Knet release is all about iterators: iterators for minibatching, iterators for training, iterators for monitoring, convergence etc. Why am I so excited about iterators all of a sudden? Allow me to explain:

Knet has used iterators for data generation since 2015. That was about it until recently when I was looking for a way to improve the training interface. See, at the core of every deep learning project there is a training loop that looks like this:

function train(model,data)
    for (x,y) in data
        # improve model parameters so model(x) approaches y
    end
end
And these things can run for hours or days. You want the user to have full control of this loop: how many iterations to go, how to detect convergence and quit, how to monitor progress, how to take model snapshots or measure dev accuracy every n iterations etc.

My original (non)solution was to write a new `train` function for every experiment. Why restrict the user with a bad interface when they can write their own 5 line loop? (of course then why write any package at all but that's another discussion).

My next (pseudo)solution was to provide a `train` function with lots of keyword arguments. I soon gave up on that idea when it became clear that I was on my way to implementing a Turing complete programming language using keyword arguments.

Then I thought I had a brilliant flash of insight based on callback functions. See if `train` just accepts a callback function that gets called inside the for loop, the user can implement any behavior:
function train(model,data,callback)
    for (x,y) in data
        callback() || break
        # improve model parameters so model(x) approaches y
    end
end
You want to display a progress bar, do something every n iterations, or quit after N iterations? Just implement some callback function with state and you are all set! Brilliant? Everybody hated it. Including me. It turns out callback functions are awkward to write and do not lead to very readable code.

Then finally I rediscovered iterators and iterators that wrap other iterators (inspired by Tqdm.jl). I knew iterators can be these lazy collections that produce their next element only when asked. (Here is a summary with doc links to refresh your memory). See, once you implement the training loop as an iterator you can pause, restart and terminate it whenever you want:
train(model,data) = ((update model and return loss) for (x,y) in data)
What I realized iterators also do is turn the for loop inside out! Make its guts visible so one has explicit control: You can monitor and display its progress, take snapshots or whatever all with very explicit and readable code. Here are some actual examples from Knet v1.2.0. (`sgd` is a train iterator, f is the model, d is the data):

* To display a progress bar use progress(sgd(f,d)).
* To run until convergence use converge(sgd(f,cycle(d))).
* To run multiple epochs use sgd(f,repeat(d,n)).
* To run a given number of iterations use sgd(f,take(cycle(d),n)).
* To do a task every n iterations use:
(task(x) for x in every(n, sgd(f,cycle(d)))).

Each of the functions like `progress`, `converge`, `sgd` etc. take and return iterators. So they can be composed like crazy. Here is how to (1) train a model on dtrn, (2) measuring loss on dtst every 100 iterations, (3) quitting when dtst performance converges, and (4) displaying a progress bar from the Knet tutorial:
a = adam(model,cycle(dtrn))
b = (model(dtst) for _ in every(100,a))
c = converge(b, alpha=0.1)
progress!(c, alpha=1)
The code reads like the English description! Imagine trying to implement this using keyword arguments or callback functions... and that is why I am excited about iterators.

Notes:
* the more nitpicky reader will probably point out that I should have called these things generators or coroutines or streams or something rather than iterators, but you get the idea.
* every(n,itr) = (x for (i,x) in enumerate(itr) if i%n == 0) should be a Julia primitive! (Thank you @CarloLucibello for pointing out that `IterTools.takenth` does the same thing.)
* @lostella has a wonderful post on iterators.
* Here are the relevant links in Julia docs: Interfaces, Collections, Iteration Utilities and Generator expressions.
* Here is a link to the discussion on Julia discourse.

Full post... Related link

May 29, 2018

Wasserstein GAN: a Julia/Knet implementation

Written by Cem Eteke (@ceteke). This repository contains implementation of WGAN and DCGAN in Julia using Knet. Here is a detailed report about WGAN.
Full post...

May 28, 2018

Relational networks: a Julia/Knet implementation

Written by Erenay Dayanık (@ereday). Knet implementation of "A simple neural network module for relational reasoning" by Santoro et al. (2017). (Relational Networks, arXiv:1706.01427, blog post)
Full post...

May 27, 2018

Neural Style Transfer: a Julia notebook

Written by Cemil Cengiz (@cemilcengiz).

This notebook implements deep CNN based image style transfer algorithm from "Image Style Transfer Using Convolutional Neural Networks" (Gatys et al., CVPR 2016). The proposed technique takes two images as input, i.e. a content image (generally a photograph) and a style image (generally an artwork painting). Then, it produces an output image such that the content(objects in the image) resembles the "content image" whereas the style i.e. the texture is similar to the "style image". In order words, it re-draws the "content image" using the artistic style of the "style image".

The images below show an original photograph followed by two different styles applied by the network.


Full post...

September 20, 2016

Introducing Knet8: beginning deep learning with 100 lines of Julia

It has been a year and a half since I wrote the first version of this tutorial and it is time for an update.

Knet (pronounced “kay-net”) is the Koç University deep learning framework implemented in Julia by Deniz Yuret and collaborators. It supports construction of high-performance deep learning models in plain Julia by combining automatic differentiation with efficient GPU kernels and memory management. Models can be defined and trained using arbitrary Julia code with helper functions, loops, conditionals, recursion, closures, array indexing and concatenation. The training can be performed on the GPU by simply using KnetArray instead of Array for parameters and data. Check out the full documentation and the examples directory for more information.

Installation

You can install Knet using Pkg.add("Knet"). Some of the examples use additional packages such as ArgParse, GZip, and JLD. These are not required by Knet and can be installed when needed using additional Pkg.add() commands. See the detailed installation instructions as well as the section on using Amazon AWS to experiment with GPU machines on the cloud with pre-installed Knet images.

Examples

In Knet, a machine learning model is defined using plain Julia code. A typical model consists of a prediction and a loss function. The prediction function takes model parameters and some input, returns the prediction of the model for that input. The loss function measures how bad the prediction is with respect to some desired output. We train a model by adjusting its parameters to reduce the loss. In this section we will see the prediction, loss, and training functions for five models: linear regression, softmax classification, fully-connected, convolutional and recurrent neural networks.

Linear regression

Here is the prediction function and the corresponding quadratic loss function for a simple linear regression model:

predict(w,x) = w[1]*x .+ w[2]

loss(w,x,y) = sumabs2(y - predict(w,x)) / size(y,2)

The variable w is a list of parameters (it could be a Tuple, Array, or Dict), x is the input and y is the desired output. To train this model, we want to adjust its parameters to reduce the loss on given training examples. The direction in the parameter space in which the loss reduction is maximum is given by the negative gradient of the loss. Knet uses the higher-order function grad from AutoGrad.jl to compute the gradient direction:

using Knet

lossgradient = grad(loss)

Note that grad is a higher-order function that takes and returns other functions. The lossgradient function takes the same arguments as loss, e.g. dw = lossgradient(w,x,y). Instead of returning a loss value, lossgradient returns dw, the gradient of the loss with respect to its first argument w. The type and size of dw is identical to w, each entry in dw gives the derivative of the loss with respect to the corresponding entry in w. See @doc grad for more information.

Given some training data = [(x1,y1),(x2,y2),...], here is how we can train this model:

function train(w, data; lr=.1)
    for (x,y) in data
        dw = lossgradient(w, x, y)
        for i in 1:length(w)
            w[i] -= lr * dw[i]
        end
    end
    return w
end

We simply iterate over the input-output pairs in data, calculate the lossgradient for each example, and move the parameters in the negative gradient direction with a step size determined by the learning rate lr.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/housing.jpeg?raw=true

Let’s train this model on the Housing dataset from the UCI Machine Learning Repository.

julia> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
julia> rawdata = readdlm(download(url))
julia> x = rawdata[:,1:13]'
julia> x = (x .- mean(x,2)) ./ std(x,2)
julia> y = rawdata[:,14:14]'
julia> w = Any[ 0.1*randn(1,13), 0 ]
julia> for i=1:10; train(w, [(x,y)]); println(loss(w,x,y)); end
366.0463078055053
...
29.63709385230451

The dataset has housing related information for 506 neighborhoods in Boston from 1978. Each neighborhood is represented using 13 attributes such as crime rate or distance to employment centers. The goal is to predict the median value of the houses given in $1000’s. After downloading, splitting and normalizing the data, we initialize the parameters randomly and take 10 steps in the negative gradient direction. We can see the loss dropping from 366.0 to 29.6. See housing.jl for more information on this example.

Note that grad was the only function used that is not in the Julia standard library. This is typical of models defined in Knet.

Softmax classification

In this example we build a simple classification model for the MNIST handwritten digit recognition dataset. MNIST has 60000 training and 10000 test examples. Each input x consists of 784 pixels representing a 28x28 image. The corresponding output indicates the identity of the digit 0..9.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/firsteightimages.jpg?raw=true

Classification models handle discrete outputs, as opposed to regression models which handle numeric outputs. We typically use the cross entropy loss function in classification models:

function loss(w,x,ygold)
    ypred = predict(w,x)
    ynorm = ypred .- log(sum(exp(ypred),1))
    -sum(ygold .* ynorm) / size(ygold,2)
end

Other than the change of loss function, the softmax model is identical to the linear regression model. We use the same predict, same train and set lossgradient=grad(loss) as before. To see how well our model classifies let’s define an accuracy function which returns the percentage of instances classified correctly:

function accuracy(w, data)
    ncorrect = ninstance = 0
    for (x, ygold) in data
        ypred = predict(w,x)
        ncorrect += sum(ygold .* (ypred .== maximum(ypred,1)))
        ninstance += size(ygold,2)
    end
    return ncorrect/ninstance
end

Now let’s train a model on the MNIST data:

julia> include(Pkg.dir("Knet/examples/mnist.jl"))
julia> using MNIST: xtrn, ytrn, xtst, ytst, minibatch
julia> dtrn = minibatch(xtrn, ytrn, 100)
julia> dtst = minibatch(xtst, ytst, 100)
julia> w = Any[ -0.1+0.2*rand(Float32,10,784), zeros(Float32,10,1) ]
julia> println((:epoch, 0, :trn, accuracy(w,dtrn), :tst, accuracy(w,dtst)))
julia> for epoch=1:10
           train(w, dtrn; lr=0.5)
           println((:epoch, epoch, :trn, accuracy(w,dtrn), :tst, accuracy(w,dtst)))
       end
(:epoch,0,:trn,0.11761667f0,:tst,0.121f0)
(:epoch,1,:trn,0.9005f0,:tst,0.9048f0)
...
(:epoch,10,:trn,0.9196f0,:tst,0.9153f0)

Including mnist.jl loads the MNIST data, downloading it from the internet if necessary, and provides a training set (xtrn,ytrn), test set (xtst,ytst) and a minibatch utility which we use to rearrange the data into chunks of 100 instances. After randomly initializing the parameters we train for 10 epochs, printing out training and test set accuracy at every epoch. The final accuracy of about 92% is close to the limit of what we can achieve with this type of model. To improve further we must look beyond linear models.

Multi-layer perceptron

A multi-layer perceptron, i.e. a fully connected feed-forward neural network, is basically a bunch of linear regression models stuck together with non-linearities in between.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/neural_net2.jpeg?raw=true

We can define a MLP by slightly modifying the predict function:

function predict(w,x)
    for i=1:2:length(w)-2
        x = max(0, w[i]*x .+ w[i+1])
    end
    return w[end-1]*x .+ w[end]
end

Here w[2k-1] is the weight matrix and w[2k] is the bias vector for the k’th layer. max(0,a) implements the popular rectifier non-linearity. Note that if w only has two entries, this is equivalent to the linear and softmax models. By adding more entries to w, we can define multi-layer perceptrons of arbitrary depth. Let’s define one with a single hidden layer of 64 units:

w = Any[ -0.1+0.2*rand(Float32,64,784), zeros(Float32,64,1),
         -0.1+0.2*rand(Float32,10,64),  zeros(Float32,10,1) ]

The rest of the code is the same as the softmax model. We use the same cross-entropy loss function and the same training script. The code for this example is available in mnist.jl. The multi-layer perceptron does significantly better than the softmax model:

(:epoch,0,:trn,0.10166667f0,:tst,0.0977f0)
(:epoch,1,:trn,0.9389167f0,:tst,0.9407f0)
...
(:epoch,10,:trn,0.9866f0,:tst,0.9735f0)

Convolutional neural network

To improve the performance further, we can use convolutional neural networks. We will implement the LeNet model which consists of two convolutional layers followed by two fully connected layers.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/le_net.png?raw=true

Knet provides the conv4(w,x) and pool(x) functions for the implementation of convolutional nets (see @doc conv4 and @doc pool for more information):

function predict(w,x0)
    x1 = pool(max(0, conv4(w[1],x0) .+ w[2]))
    x2 = pool(max(0, conv4(w[3],x1) .+ w[4]))
    x3 = max(0, w[5]*mat(x2) .+ w[6])
    return w[7]*x3 .+ w[8]
end

The weights for the convolutional net can be initialized as follows:

w = Any[ -0.1+0.2*rand(Float32,5,5,1,20),  zeros(Float32,1,1,20,1),
         -0.1+0.2*rand(Float32,5,5,20,50), zeros(Float32,1,1,50,1),
         -0.1+0.2*rand(Float32,500,800),   zeros(Float32,500,1),
         -0.1+0.2*rand(Float32,10,500),    zeros(Float32,10,1) ]

Currently convolution and pooling are only supported on the GPU for 4-D and 5-D arrays. So we reshape our data and transfer it to the GPU along with the parameters by converting them into KnetArrays (see @doc KnetArray for more information):

dtrn = map(d->(KnetArray(reshape(d[1],(28,28,1,100))), KnetArray(d[2])), dtrn)
dtst = map(d->(KnetArray(reshape(d[1],(28,28,1,100))), KnetArray(d[2])), dtst)
w = map(KnetArray, w)

The training proceeds as before giving us even better results. The code for the LeNet example can be found in lenet.jl.

(:epoch,0,:trn,0.12215f0,:tst,0.1263f0)
(:epoch,1,:trn,0.96963334f0,:tst,0.971f0)
...
(:epoch,10,:trn,0.99553335f0,:tst,0.9879f0)

Recurrent neural network

In this section we will see how to implement a recurrent neural network (RNN) in Knet. An RNN is a class of neural network where connections between units form a directed cycle, which allows them to keep a persistent state over time. This gives them the ability to process sequences of arbitrary length one element at a time, while keeping track of what happened at previous elements.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/RNN-unrolled.png?raw=true

As an example, we will build a character-level language model inspired by “The Unreasonable Effectiveness of Recurrent Neural Networks” from the Andrej Karpathy blog. The model can be trained with different genres of text, and can be used to generate original text in the same style.

It turns out simple RNNs are not very good at remembering things for a very long time. Currently the most popular solution is to use a more complicated unit like the Long Short Term Memory (LSTM). An LSTM controls the information flow into and out of the unit using gates similar to digital circuits and can model long term dependencies. See Understanding LSTM Networks by Christopher Olah for a good overview of LSTMs.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/LSTM3-chain.png?raw=true

The code below shows one way to define an LSTM in Knet. The first two arguments are the parameters, the weight matrix and the bias vector. The next two arguments hold the internal state of the LSTM: the hidden and cell arrays. The last argument is the input. Note that for performance reasons we lump all the parameters of the LSTM into one matrix-vector pair instead of using separate parameters for each gate. This way we can perform a single matrix multiplication, and recover the gates using array indexing. We represent input, hidden and cell as row vectors rather than column vectors for more efficient concatenation and indexing. sigm and tanh are the sigmoid and the hyperbolic tangent activation functions. The LSTM returns the updated state variables hidden and cell.

function lstm(weight,bias,hidden,cell,input)
    gates   = hcat(input,hidden) * weight .+ bias
    hsize   = size(hidden,2)
    forget  = sigm(gates[:,1:hsize])
    ingate  = sigm(gates[:,1+hsize:2hsize])
    outgate = sigm(gates[:,1+2hsize:3hsize])
    change  = tanh(gates[:,1+3hsize:end])
    cell    = cell .* forget + ingate .* change
    hidden  = outgate .* tanh(cell)
    return (hidden,cell)
end

The LSTM has an input gate, forget gate and an output gate that control information flow. Each gate depends on the current input value, and the last hidden state hidden. The memory value cell is computed by blending a new value change with the old cell value under the control of input and forget gates. The output gate decides how much of the cell is shared with the outside world.

If an input gate element is close to 0, the corresponding element in the new input will have little effect on the memory cell. If a forget gate element is close to 1, the contents of the corresponding memory cell can be preserved for a long time. Thus the LSTM has the ability to pay attention to the current input, or reminisce in the past, and it can learn when to do which based on the problem.

To build a language model, we need to predict the next character in a piece of text given the current character and recent history as encoded in the internal state. The predict function below implements a multi-layer LSTM model. s[2k-1:2k] hold the hidden and cell arrays and w[2k-1:2k] hold the weight and bias parameters for the k’th LSTM layer. The last three elements of w are the embedding matrix and the weight/bias for the final prediction. predict takes the current character encoded in x as a one-hot row vector, multiplies it with the embedding matrix, passes it through a number of LSTM layers, and converts the output of the final layer to the same number of dimensions as the input using a linear transformation. The state variable s is modified in-place.

function predict(w, s, x)
    x = x * w[end-2]
    for i = 1:2:length(s)
        (s[i],s[i+1]) = lstm(w[i],w[i+1],s[i],s[i+1],x)
        x = s[i]
    end
    return x * w[end-1] .+ w[end]
end

To train the language model we will use Backpropagation Through Time (BPTT) which basically means running the network on a given sequence and updating the parameters based on the total loss. Here is a function that calculates the total cross-entropy loss for a given (sub)sequence:

function loss(param,state,sequence,range=1:length(sequence)-1)
    total = 0.0; count = 0
    atype = typeof(getval(param[1]))
    input = convert(atype,sequence[first(range)])
    for t in range
        ypred = predict(param,state,input)
        ynorm = logp(ypred,2) # ypred .- log(sum(exp(ypred),2))
        ygold = convert(atype,sequence[t+1])
        total += sum(ygold .* ynorm)
        count += size(ygold,1)
        input = ygold
    end
    return -total / count
end

Here param and state hold the parameters and the state of the model, sequence and range give us the input sequence and a possible range over it to process. We convert the entries in the sequence to inputs that have the same type as the parameters one at a time (to conserve GPU memory). We use each token in the given range as an input to predict the next token. The average cross-entropy loss per token is returned.

To generate text we sample each character randomly using the probabilities predicted by the model based on the previous character:

function generate(param, state, vocab, nchar)
    index_to_char = Array(Char, length(vocab))
    for (k,v) in vocab; index_to_char[v] = k; end
    input = oftype(param[1], zeros(1,length(vocab)))
    index = 1
    for t in 1:nchar
        ypred = predict(param,state,input)
        input[index] = 0
        index = sample(exp(logp(ypred)))
        print(index_to_char[index])
        input[index] = 1
    end
    println()
end

Here param and state hold the parameters and state variables as usual. vocab is a Char->Int dictionary of the characters that can be produced by the model, and nchar gives the number of characters to generate. We initialize the input as a zero vector and use predict to predict subsequent characters. sample picks a random index based on the normalized probabilities output by the model.

At this point we can train the network on any given piece of text (or other discrete sequence). For efficiency it is best to minibatch the training data and run BPTT on small subsequences. See charlm.jl for details. Here is a sample run on ‘The Complete Works of William Shakespeare’:

$ cd .julia/Knet/examples
$ wget http://www.gutenberg.org/files/100/100.txt
$ julia charlm.jl --data 100.txt --epochs 10 --winit 0.3 --save shakespeare.jld
... takes about 10 minutes on a GPU machine
$ julia charlm.jl --load shakespeare.jld --generate 1000

    Pand soping them, my lord, if such a foolish?
  MARTER. My lord, and nothing in England's ground to new comp'd.
    To bless your view of wot their dullst. If Doth no ape;
    Which with the heart. Rome father stuff
    These shall sweet Mary against a sudden him
    Upon up th' night is a wits not that honour,
    Shouts have sure?
  MACBETH. Hark? And, Halcance doth never memory I be thou what
    My enties mights in Tim thou?
  PIESTO. Which it time's purpose mine hortful and
    is my Lord.
  BOTTOM. My lord, good mine eyest, then: I will not set up.
  LUCILIUS. Who shall

Under the hood

Coming soon...

Benchmarks

Coming soon...

Contributing

Knet is an open-source project and we are always open to new contributions: bug reports and fixes, feature requests and contributions, new machine learning models and operators, inspiring examples, benchmarking results are all welcome. If you need help or would like to request a feature, please consider joining the knet-users mailing list. If you find a bug, please open a GitHub issue. If you would like to contribute to Knet development, check out the knet-dev mailing list and tips for developers. If you use Knet in your own work, the suggested citation is:

@misc{knet,
  author={Yuret, Deniz},
  title={Knet: Ko\c{c} University deep learning framework.},
  year={2016},
  howpublished={\url{https://github.com/denizyuret/Knet.jl}}
}

Full post...

August 23, 2016

AutoGrad.jl

AutoGrad.jl is an automatic differentiation package for Julia. It is a Julia port of the popular Python autograd package. It can differentiate regular Julia code that includes loops, conditionals, helper functions, closures etc. by keeping track of the primitive operations and using this execution trace to compute gradients. It uses reverse mode differentiation (a.k.a. backpropagation) so it can efficiently handle functions with array inputs and scalar outputs. It can compute gradients of gradients to handle higher order derivatives. Please see the comments in core.jl for a description of how the code works in detail.

Installation

You can install AutoGrad in Julia using:

julia> Pkg.add("AutoGrad")

In order to use it in your code start with:

using AutoGrad
Example

Here is a linear regression example simplified from housing.jl:

using AutoGrad

function loss(w)
    global xtrn,ytrn
    ypred = w[1]*xtrn .+ w[2]
    sum(abs2(ypred - ytrn)) / size(ypred,2)
end

function train(w; lr=.1, epochs=20)
    gradfun = grad(loss)
    for epoch=1:epochs
        g = gradfun(w)
        for i in 1:length(w)
            w[i] -= lr * g[i]
        end
    end
    return w
end

The loss function takes parameters as input and returns the loss to be minimized. The parameter w for this example is a pair: w[1] is a weight matrix, and w[2] is a bias vector. The training data xtrn,ytrn are in global variables. ypred is the predicted output, and the last line computes the quadratic loss. The loss function is implemented in regular Julia.

The train function takes initial parameters and returns optimized parameters. grad is the only AutoGrad function used: it creates a function gradfun that takes the same arguments as loss, but returns the gradient instead. The returned gradient will have the same type and shape as the input argument. The for loop implements gradient descent, where we calculate the gradient and subtract a scaled version of it from the weights.

See the examples directory for more examples, and the extensively documented core.jl for details.

Extending AutoGrad

AutoGrad can only handle a function if the primitives it uses have known gradients. You can add your own primitives with gradients as described in detail in core.jl or using the @primitive and @zerograd macros in util.jl Here is an example:

@primitive hypot(x1::Number,x2::Number)::y  (dy->dy*x1/y)  (dy->dy*x2/y)

The @primitive macro marks the hypot(::Number,::Number) method as a new primitive and the next two expressions define gradient functions wrt the first and second argument. The gradient expressions can refer to the parameters and the return variable (indicated after the final ::) of the method declaration.

Note that Julia supports multiple-dispatch, i.e. a function may have multiple methods each supporting different argument types. For example hypot(x1::Array,x2::Array) is another hypot method. In AutoGrad.jl each method can independently be defined as a primitive and can have its own specific gradient.

Code structure

core.jl implements the main functionality and acts as the main documentation source. util.jl has some support functions to define and test new primitives. interfaces.jl sets up support for common data structures including Arrays, Tuples, and Dictionaries. The numerical gradients are defined in files such as base/math.jl, special/trig.jl that mirror the organization under julia/base.

Current status and future work

The gradient coverage is spotty, I am still adding more gradients to cover the Julia base. Next steps are to make models faster by providing support for GPU operations and overwriting functions (to avoid memory allocation). I should also find out about the efficiency of closures and untyped functions in Julia which are used extensively in the code.

Acknowledgments and references

AutoGrad.jl was written by Deniz Yuret. Large parts of the code are directly ported from the Python autograd package. I'd like to thank autograd author Dougal Maclaurin for his support. See (Baydin et al. 2015) for a general review of automatic differentiation, autograd tutorial for some Python examples, and Dougal's PhD thesis for design principles. JuliaDiff has alternative differentiation tools for Julia. I would like to thank my students Ozan Arkan Can and Emre Yolcu for helpful contributions.

Also see: A presentation, A demo.

Full post...

May 17, 2015

Beginning deep learning with 500 lines of Julia (Version 0.1)

Click here for a newer version (Knet7) of this tutorial. The code used in this version (KUnet) has been deprecated.

Click here for an older version (v0.0) of this tutorial.

OK, first a disclaimer: this version of KUnet.jl, my deep learning code for Julia, is a bit more than 500 lines, but it is still under 1000 lines and it supports convolution and pooling, new activation and loss functions, arrays of arbitrary dimensionality with 32 and 64 bit floats etc. See here for installation instructions. We will use the MNIST dataset to illustrate basic usage of KUnet:

julia> include(Pkg.dir("KUnet/test/mnist.jl"))

This may take a bit the first time you run to download the data.

Next we tell Julia we intend to use KUnet, and some variables from MNIST:

julia> using KUnet
julia> using MNIST: xtrn, ytrn, xtst, ytst

The MNIST variables are Float32 matrices. The x matrices have pixel values scaled to [0.0:1.0] for a 28x28 image on each column. The y matrices have 10 rows indicating the 10 classes with a single nonzero entry for the correct class in each column.

julia> xtrn, ytrn, xtst, ytst
(
784x60000 Array{Float32,2}: ...
10x60000 Array{Float32,2}: ...
784x10000 Array{Float32,2}: ...
10x10000 Array{Float32,2}: ...
)

Before using KUnet, we should specify the array type and the element type we want to use. The array type determines whether KUnet uses the GPU, and the element type should match that of the data.

julia> KUnet.atype(CudaArray)   # CudaArray or Array
julia> KUnet.ftype(Float32)     # Float32 or Float64

Let's construct a neural net with a single layer of 64 hidden units using the relu activation function and the cross entropy loss function.

julia> net = [ Mmul(64,784), Bias(64), Relu(),
               Mmul(10,64),  Bias(10), XentLoss() ]

Each element of the net array represents an operation, e.g. Mmul multiplies its input with a weight matrix, Bias adds a bias vector, Relu applies the rectified linear transformation to each element etc. They are subtypes of an abstract type called Layer. The full list of Layers currently implemented are: Bias, Conv, Drop, Logp, Mmul, Pool, Relu, Sigm, Soft, Tanh, LogpLoss, QuadLoss, SoftLoss, XentLoss.

A Net is simply a 1-D array of Layers. Here are the definitions from net.jl and bias.jl:

abstract Layer
typealias Net Array{Layer,1}
type Bias <: Layer; b::Param; Bias(b::Param)=new(b); end

If you are not happy with the default Layer constructors, you can specify your own parameters. For example, the Mmul(64,784) constructor fills a (64,784) weight matrix with random weights from a Gaussian distribution with std=0.01. If we want a different initialization, we could create a weight matrix any way we want and pass it to the Mmul constructor instead. Note that the weight matrix for an Mmul layer with 784 inputs and 64 outputs has size (64, 784).

julia> w1 = randn(64, 784) * 0.05
julia> l1 = Mmul(w1)

Training parameters like the learning rate (lr) can be specified at layer construction, or using setparam! on the whole network or on individual layers.

julia> l1 = Mmul(64,784; lr=0.01)
julia> setparam!(l1; lr=0.01)
julia> setparam!(net; lr=0.01)

It is also possible to save nets to JLD files using savenet(fname::String, n::Net) and read them using loadnet(fname::String). Let's save our initial random network for replicatibility.

julia> savenet("net0.jld", net)

OK, now that we have some data and a network, let's proceed with training. Here is a convenience function to measure the classification accuracy:

julia> accuracy(y,z)=mean(findmax(y,1)[2] .== findmax(z,1)[2])

Let's do 100 epochs of training:

@time for i=1:100
    train(net, xtrn, ytrn; batch=128)
    println((i, accuracy(ytst, predict(net, xtst)), 
                accuracy(ytrn, predict(net, xtrn))))
end

If you take a look at the definition of train in net.jl, you will see that it takes a network net, the input x, and the desired output y, and after splitting the data into minibatches it just calls backprop and update. Here is a simplified version:

train(net, x, y)=(backprop(net, x, y); update(net))

The backprop function calls forw which computes the network output, and back which computes the gradients of the network parameters with respect to the loss function:

backprop(net, x, y)=(forw(net, x); back(net, y))

The forw and back functions for a Net simply call the forw and back functions of each layer in order, feeding the output of one to the input of the next:

forw(n::Net, x)=(for i=1:length(n); x=forw(n[i], x); end)
back(n::Net, y)=(for i=length(n):-1:1; y=back(n[i], y); end)

The forw function for a layer takes the layer input x, and returns the layer output y. The back function for a layer takes the loss gradient wrt its output dy and returns the loss gradient wrt its input dx. If the layer has a parameter p, back also computes the loss gradient p.diff wrt its current value p.data. You can take a look at individual layer definitions (e.g. in mmul.jl, bias.jl, relu.jl, etc.) to see how this is done for each layer.

The final layer of the network (XentLoss in our case) is a subtype of LossLayer. LossLayer is a special type of layer: its forw does nothing but record the network output y. Its back expects the desired output z (not a gradient) and computes the loss gradient wrt the network output dy. A LossLayer also implements the function loss(l::LossLayer,z) which returns the actual loss value given the desired output z. KUnet currently implements the following LossLayers: LogpLoss, QuadLoss, SoftLoss, XentLoss.

The update function for a net calls the update function for each of its layers, which in turn calls the update function on layer parameters:

update(n::Net)=(for l in n; update(l); end)
update(l::Bias)=update(l.b)

The update function for a parameter p is used to update its values (p.data) given the loss gradients (p.diff). Its behavior is controlled by the following parameters: lr, l1reg, l2reg, adagrad, momentum, nesterov. Here is a simplified definition of update from param.jl (p.ada, p.mom, and p.nes are temporary arrays initialized to 0):

function update(p::Param; o...)
    isdefined(p,:l1reg)    && (p.diff += p.l1reg * sign(p.data))
    isdefined(p,:l2reg)    && (p.diff += p.l2reg * p.data)
    isdefined(p,:adagrad)  && (p.ada += p.diff.^2; p.diff /= p.adagrad + sqrt(p.ada))
    isdefined(p,:momentum) && (p.diff += p.momentum * p.mom; p.mom[:] = p.diff)
    isdefined(p,:nesterov) && (p.nes *= p.nesterov; p.nes += p.diff; p.diff += p.nesterov * p.nes)
    isdefined(p,:lr)       && (p.diff *= p.lr)
    p.data -= p.diff
end

Our training should print out the test set and training set accuracy at the end of every epoch.

(1,0.3386,0.3356)
(2,0.7311,0.7226666666666667)
(3,0.821,0.8157333333333333)
...
(99,0.9604,0.9658166666666667)
(100,0.9604,0.96605)
elapsed time: 39.738191211 seconds (1526525108 bytes allocated, 3.05% gc time)

Note that for actual research we should not be looking at the test set accuracy at this point. We should instead split the training set into a training and a development portion and do all our playing around with those. We should also run each experiment 10 times with different random seeds and measure standard errors, etc. But, this is just a KUnet tutorial.

It seems the training set accuracy is not that great. Maybe increasing the learning rate may help:

net = loadnet("net0.jld")
setparam!(net, lr=0.5)

# same for loop...

(1,0.9152,0.9171833333333334)
(2,0.9431,0.9440333333333333)
(3,0.959,0.9611666666666666)
...
(59,0.9772,0.9999833333333333)
(60,0.9773,1.0)
...
(100,0.9776,1.0)

Wow! We got 100% training set accuracy in 60 epochs. This should drive home the importance of setting a good learning rate.

But the test set is still lagging behind. What if we try increasing the number of hidden units:

for h in (128, 256, 512, 1024)
    net = [Mmul(h,784), Bias(h), Relu(), Mmul(10,h),  Bias(10), XentLoss()]
    setparam!(net; lr=0.5)
    for i=1:100
        train(net, xtrn, ytrn; batch=128)
        println((i, accuracy(ytst, predict(net, xtst)), 
                    accuracy(ytrn, predict(net, xtrn))))
    end
end

# Number of epochs and test accuracy when training accuracy reaches 1.0:
# 128:  (43,0.9803,1.0)
# 256:  (42,0.983,1.0)
# 512:  (36,0.983,1.0)
# 1024: (30,0.9833,1.0)

This improvement is unexpected, we were already overfitting with 64 hidden units, and common wisdom is not to increase the capacity of the network by increasing the hidden units in that situation. Maybe we should try dropout:

net = [Drop(0.2), Mmul(1024,784), Bias(1024), Relu(), 
       Drop(0.5), Mmul(10,1024),  Bias(10), XentLoss()]

# lr=0.5, same for loop
...
(100,0.9875,0.9998166666666667)
elapsed time: 122.898730432 seconds (1667849932 bytes allocated, 0.96% gc time)

Or bigger and bigger nets:

net = [Drop(0.2), Mmul(4096,784),  Bias(4096), Relu(), 
       Drop(0.5), Mmul(4096,4096), Bias(4096), Relu(), 
       Drop(0.5), Mmul(10,4096),   Bias(10), XentLoss()]

# lr=0.5, same for loop
...
(100,0.9896,0.9998166666666667)
elapsed time: 804.242212488 seconds (1080 MB allocated, 0.02% gc time in 49 pauses with 0 full sweep)

Or maybe we should try convolution. Here is an implementation of LeNet:

net = [Conv(5,5,1,20), Bias(20), Relu(), Pool(2),
       Conv(5,5,20,50), Bias(50), Relu(), Pool(2),
       Mmul(500,800), Bias(500), Relu(),
       Mmul(10,500), Bias(10), XentLoss()]
setparam!(net; lr=0.1)

# Need to reshape the input arrays for convolution:
xtrn2 = reshape(xtrn, 28, 28, 1, size(xtrn, 2))
xtst2 = reshape(xtst, 28, 28, 1, size(xtst, 2))

# same for loop
...
(100,0.9908,1.0)
elapsed time: 360.722851006 seconds (5875158944 bytes allocated, 1.95% gc time)

OK, that's enough fiddling around. I hope this gave you enough to get your hands dirty. We are already among the better results on the MNIST website. I am sure you can do better playing around with the learning rate, dropout probabilities, momentum, adagrad, regularization, and numbers, sizes, types of layers etc. But be careful, it could become addictive :)


Full post...

May 07, 2015

CUDNN Julia port

I just finished implementing CUDNN.jl, a Julia wrapper for the NVIDIA cuDNN GPU accelerated deep learning library. It consists of a low level interface and a high level interface. The low level interface wraps each function from libcudnn.so in a Julia function in libcudnn.jl and each data type from cudnn.h in a Julia datatype in types.jl. These were generated semi-automatically using Clang and are documented in the cuDNN Library User Guide. For the high level interface defined in CUDNN.jl, I kept the original names from the C library and provided more convenient type signatures, return values, and keyword arguments with reasonable defaults.  Next step will be integrating this with Knet.jl.
Full post...

March 06, 2015

Parsing the Penn Treebank in 60 seconds

Parsers (as well as many other natural language processing algorithms) work by (1) extracting features for the current state, and (2) using a machine learning model to predict the best action / structure based on those features. The feature extraction code is typically messy and irregular and best performed on (possibly multiple) CPUs. The machine learning models can typically be accelerated significantly using GPUs. In this post I will use a greedy transition based parser with a neural network model and figure out how to use both the GPU and the multiple CPU cores effectively. We will take the parser speed from 55 ms/word (with a single CPU) to 0.055 ms/word (using 20 CPU cores and two K40 GPUs). At this speed we can parse the whole Penn Treebank (approx 1M words) in less than 60 seconds.

Some related links:
The code used in this post
Parallel processing for natural language (same idea, Matlab version)
Beginning deep learning with 500 lines of Julia

A greedy transition based parser parses a sentence using the following steps:

function gparse(s::Sentence, n::Net, f::Features)
    p = ArcHybrid(wcnt(s))       # initialize parser state
    while (v = valid(p); any(v)) # while we have valid moves
        x = features(p, s, f)    # extract features
        y = predict(n, x)        # score moves
        y[!v] = -Inf             # ignore invalid moves
        move!(p, indmax(y))      # make the max score move
    end
    return p.head
end

Here n is a machine learning model, f is a specification of what features to extract, and p represents the parser state. The parser works by extracting features representing the sentence and the current parser state, using the model to score possible moves, and executing the highest scoring move until no valid moves are left.

To parse a whole corpus (array of sentences), we just map gparse to each sentence. Julia can distinguish which gparse we mean by looking at the types of arguments.

typealias Corpus AbstractVector{Sentence}
gparse(c::Corpus, n::Net, f::Features)=map(s->gparse(s,n,f), c)

For our first experiment we will parse some sentences on a single CPU core (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz):

julia> nworkers()  # we use a single core
1
julia> blas_set_num_threads(1)  # including blas
julia> using KUnet
julia> KUnet.gpu(false)  # no gpu yet
julia> using KUparser
julia> @time KUparser.gparse(dev, net, feats);
elapsed time: 2244.3076923076924 seconds

The corpus, dev, is the Penn Treebank WSJ section 22 (1700 sentences, 40117 words); net is a standard feed forward neural network with 1326 input units, 20000 hidden units in a single layer, and 3 output units; feats is a specification of features to be extracted. The parsing speed is 55.944 ms/word. More than 99% of that time is spent on "predict".

In order to speed up "predict", we will use the GPU (NVIDIA Tesla K20m):

julia> gnet=copy(net,:gpu)
julia> @time KUparser.gparse(dev, gnet, feats);
elapsed time: 148.56374417550305 seconds

This gives us 3.704 ms/word, a 15x speed-up. However the GPU can be better utilized if we somehow manage to batch our feature vectors and compute scores for multiple instances in parallel. The problem is parsing a sentence is a serial process, you need the state resulting from the last move in order to compute the features for the next move. The solution is to parse multiple sentences in parallel (thanks to Alkan Kabakcioglu for suggesting this). Different sentences have no dependencies on each other and we can keep track of their states and predict their moves in bulk. The second version of gparse takes an additional "batchsize" argument specifying how many sentences to parse in parallel. This needs some bookkeeping (requiring 80 lines of code for gparse v2 instead of the 10 line beauty you see above), so I won't cut-and-paste it here, you can see the source code if you wish. Here are some experiments with the batched gparse:

julia> @time KUparser.gparse(dev, gnet, feats, 1);
elapsed time: 148.725787323 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 10);
elapsed time: 48.573996933 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 100);
elapsed time: 25.502507879 seconds

julia> @time KUparser.gparse(dev, gnet, feats, 1700); 
elapsed time: 22.079269825 seconds

As we increase the number of sentences processed in parallel (doing all 1700 sentences in the corpus in parallel in the last example), we get 0.550 ms/word, a 100x speedup from where we started. At this point the time spent on prediction is about a third of the time spent on feature extraction, so let's take another look at features. We will use Julia's parallel computing primitives to group the sentences to be processed on separate cores. The third version of gparse takes yet another argument specifying the number of cpu cores:

function gparse(corpus::Corpus, net::Net, fmat::Features, batch::Integer, ncpu::Integer)
    d = distribute(corpus, workers()[1:ncpu])
    n = copy(net, :cpu)
    p = pmap(procs(d)) do x
        gparse(localpart(d), copy(n, :gpu), fmat, batch)
    end
end

The distribute command distributes the corpus equally among ncpu workers, and localpart gives each worker its own subset. We copy the net back and forth between the CPU and the GPU because I couldn't figure out how to pass GPU pointers between different workers. Finally pmap is the parallel map which calls gparse v2 on each worker for the appropriate subset of the corpus, pmerge merges the results. This time we will run the parser on the training set (Sections 02-21, ~40k sentences, ~950k words)

julia> addprocs(20)
julia> require("CUDArt")
julia> @everywhere CUDArt.device((myid()-1)%CUDArt.devcount())
julia> require("KUparser")
julia> @time KUparser.gparse(trn, gnet, feats, 2000, 20);
elapsed time: 52.13701401 seconds

The server has 20 CPU cores and 2 GPUs. We create 20 workers, and assign equal numbers to each GPU. Parsing 950k words takes 52 seconds (0.055 ms/word), a 1000x speedup from where we started.


Full post...

February 28, 2015

Beginning deep learning with 500 lines of Julia (20150228)

Click here for a newer version (Knet7) of this tutorial. The code used in this version (KUnet) has been deprecated.

There are a number of deep learning packages out there. However most sacrifice readability for efficiency. This has two disadvantages: (1) It is difficult for a beginner student to understand what the code is doing, which is a shame because sometimes the code can be a lot simpler than the underlying math. (2) Every other day new ideas come out for optimization, regularization, etc. If the package used already has the trick implemented, great. But if not, it is difficult for a researcher to test the new idea using impenetrable code with a steep learning curve. So I started writing KUnet.jl which currently implements backprop with basic units like relu, standard loss functions like softmax, dropout for generalization, L1-L2 regularization, and optimization using SGD, momentum, ADAGRAD, Nesterov's accelerated gradient etc. in less than 500 lines of Julia code. Its speed is competitive with the fastest GPU packages (here is a benchmark). For installation and usage information, please refer to the GitHub repo. The remainder of this post will present (a slightly cleaned up version of) the code as a beginner's neural network tutorial (modeled after Honnibal's excellent parsing example).

Contents

Layers and nets
The layer function
Prediction
Loss functions
Backpropagation Training
Activation functions
Preprocessing functions
Why Julia?
Future work

Layers and nets

A feed forward neural network consists of layers.
typealias Net Array{Layer,1}
Each layer represents a function which takes an input vector (or a matrix whose columns are input vectors) x, and outputs another vector (or matrix) y. The layer function is determined by a weight matrix w, and optionally a bias vector b, a preprocessing function fx (e.g. dropout), and an activation function f (e.g. sigmoid or relu).
type Layer w; b; f; fx; ... end

The layer function (forw)

Given its parameters and functions, here is how a layer computes its output from its input:
function forw(l::Layer, x)
    l.x = l.fx(l, x)   # preprocess the input
    l.y = l.w * l.x    # multiply with weight matrix
    l.y = l.y .+ l.b   # add the bias vector (to every column)
    l.y = l.f(l,l.y)   # apply the activation fn to the output
end
I hope that the code is self-explanatory. A couple of things to note: the layer has fields l.x, l.y to remember its last input and output, these will come in handy when calculating gradients. The ".+" operation is the "broadcasting" version of "+", which means if l.y is a matrix, then the vector l.b will be added to each of its columns. Finally Julia automatically returns the value of the last expression in a function, so there is no need for an explicit return statement. The version in the actual source, net.jl, is a bit uglier because (1) it tries to make each operation (except multiplying with the weight matrix) optional, and (2) it uses macros from InplaceOps.jl to avoid allocating matrices for each operation (which hopefully will become unnecessary once Julia decides on a standard syntax for in-place operations).

Prediction

Once we have forw for a single layer, calculating a prediction for the whole network is literally a one-liner:
forw(n::Net, x)=(for l=n x=forw(l,x) end; x)
This is the compact function definition syntax of Julia. There is no need to give this function a different name, Julia can distinguish it from forw(l::Layer,x) using the type of the first argument. I find myself thinking of silly names for related functions a lot less in Julia. The loop "for l=n" sets the variable l to each layer of the network n in turn, and we keep feeding the output of one layer as input to the next layer. The output of the final layer is returned. The source code also provides a "predict" function which is just a wrapper around "forw(n::Net,x)", its only difference is to chop up the input into minibatches and feed them to forw one batch at a time, mainly to allow the limited GPU memory to process large datasets.

Loss functions

In order to train the network, we need a loss function. A loss function takes y, the network output, and dy, the desired output, and returns a loss value which we try to minimize during training. In addition, it calculates the gradient of the loss with respect to y, which tells us how small changes in the network output will effect the loss. As an example, here is the implementation of softmaxloss (albeit a bit low level) in func.jl:
function softmaxloss(y, dy)
    yrows,ycols = size(y)
    loss = zero(eltype(y))
    prob = similar(y, yrows)
    for j=1:ycols
        ymax = y[1,j]
        for i=2:yrows y[i,j] > ymax && (ymax = y[i,j]) end
        psum = zero(ymax)
        for i=1:yrows
            yij = y[i,j] - ymax
            prob[i] = exp(yij)
            psum += prob[i]
            dy[i,j] == 1 && (loss += yij)
        end
        loss -= log(psum)
        for i=1:yrows
            prob[i] /= psum
            dy[i,j] = (prob[i] - dy[i,j]) / ycols
        end
    end
    return loss
end
By convention KUnet does not use an activation function in the last layer, so y is just the linear output of the last layer (prediction only cares about the max y). Softmax treats these y values as unnormalized log-probabilities, and all but the last few lines of the code is for normalizing them in a numerically stable manner. In fact only this single line calculates the gradient: dy[i,j] = (prob[i] - dy[i,j]) / ycols. We should note a couple of important design decisions here: (1) y and dy should have the same dimensionality, so use one-of-k encoding for desired classification outputs. (2) The loss function overwrites dy with the gradient of the loss with respect to y.

Backpropagation

Backpropagation is the algorithm that calculates the gradient of the loss with respect to network weights, given the gradient of the loss with respect to the network output. This can be accomplished taking a backward pass through the layers after the forward pass and the loss calculation:
function backprop(net::Net, x, dy, loss=softmaxloss)
    y = forw(net, x)  # y: network output
    loss(y, dy)         # dy: desired output -> gradient
    back(net, dy)       # calculate derivatives
end
Remember that the loss function overwrites its second argument dy with the loss gradient. The backward pass through the network simply calls the backward pass through each layer, from the output to the input:
back(n::Net, dy) = (for i=length(n):-1:1 dy=back(n[i],dy) end)
The backward pass through each layer takes the gradient with respect to its output, calculates the gradient with respect to its weights, and returns the gradient with respect to its input (i.e. the output of the previous layer):
function back(l::Layer, dy)
    dy = l.f(l,l.y,dy)
    l.dw = dy * l.x'
    l.db = sum!(l.db, dy)
    l.dx = l.w' * dy
    l.dx = l.fx(l,l.x,l.dx)
end
Note how the operations in back mirror the ones in forw. The arrays dy, l.dw, l.db and l.dx are the loss gradients with respect to l.y, l.w, l.b, and l.x. We saw l.f and l.fx as the activation and preprocessing functions in forw. Here, their three argument versions calculate gradients instead. That way we don't need to invent new names for the backward versions and run the risk of the user pairing the wrong forward and backward functions. Specifically l.f(l,l.y,dy) takes dy, the gradient wrt the layer output f(wx+b), and returns the gradient wrt the linear output wx+b. From that we compute the gradient with respect to w, b, and x. Finally the three argument version of l.fx calculates the gradient of the original input x (i.e. the output of the previous layer) from the gradient of the preprocessed x so the backpropagation can continue with the previous layer. By the end of this process, each layer has the gradients of its parameters stored in l.dw and l.db. I really think this is one case where the code is easier to understand than the math or the English explanation.

Training

The gradients calculated by backprop, l.dw and l.db, tell us how much small changes in corresponding entries in l.w and l.b will effect the loss (for the last instance, or minibatch). Small steps in the gradient direction will increase the loss, steps in the opposite direction will decrease the loss. This suggests the following update rule:
w = w - dw
This is the basic idea behind Stochastic Gradient Descent (SGD): Go over the training set instance by instance (or minibatch by minibatch). Run the backpropagation algorithm to calculate the loss gradients. Update the weights and biases in the opposite direction of these gradients. Rinse and repeat...
Over the years, people have noted many subtle problems with this approach and suggested improvements:
Step size: If the step sizes are too small, the SGD algorithm will take too long to converge. If they are too big it will overshoot the optimum and start to oscillate. So we scale the gradients with an adjustable parameter called the learning rate:
w = w - learningRate * dw
Step direction: More importantly, it turns out the gradient (or its opposite) is often NOT the direction you want to go in order to minimize loss. Let us illustrate with a simple picture:
The two axes are w1 and w2, two parameters of our network, and the contour plot represents the loss with a minimum at x. If we start at x0, the Newton direction (in red) points almost towards the minimum, whereas the gradient (in green), perpendicular to the contours, points to the right.
Unfortunately Newton's direction is expensive to compute. However, it is also probably unnecessary for several reasons: (1) Newton gives us the ideal direction for second degree objective functions, which our neural network loss almost certainly is not, (2) The loss function whose gradient backprop calculated is the loss for the last minibatch/instance only, which at best is a very noisy version of the real loss function, so we shouldn't spend too much effort getting it exactly right.
So people have come up with various approximate methods to improve the step direction. Instead of multiplying each component of the gradient with the same learning rate, these methods scale them separately using their running average (momentum, Nesterov), or RMS (Adagrad, Rmsprop). I realize this necessarily short summary barely covers what has been implemented in KUnet and doesn't do justice to the literature or cover most of the important ideas. The interested reader can start with a standard textbook on numerical optimization, and peruse the latest papers on optimization in deep learning.
Minimize what? The final problem with gradient descent, other than not telling us the ideal step size or direction, is that it is not even minimizing the right objective! We want small loss on never before seen test data, not just on the training data. The truth is, a sufficiently large neural network with a good optimization algorithm can get arbitrarily low loss on any finite training data (e.g. by just memorizing the answers). And it can typically do so in many different ways (typically many different local minima for training loss in weight space exist). Some of those ways will generalize well to unseen data, some won't. And unseen data is (by definition) not seen, so how will we ever know which weight settings will do well on it? There are at least three ways people deal with this problem: (1) Bayes tells us that we should use all possible networks and weigh their answers by how well they do on training data (see Radford Neal's fbm), (2) New methods like dropout or adding distortions and noise to inputs and weights during training seem to help generalization, (3) Pressuring the optimization to stay in one corner of the weight space (e.g. L1, L2, maxnorm regularization) helps generalization.
KUnet views dropout (and other distortion methods) as a preprocessing step of each layer. The other techniques (learning rate, momentum, regularization etc.) are declared as UpdateParam's for l.w in l.pw and for l.b in l.pb for each layer:
type UpdateParam learningRate; l1reg; l2reg; maxnorm; adagrad; momentum; nesterov; ... end
Even though it is possible to set these parameters for each w and each b separately, KUnet provides convenience functions to set them for the whole layer, or the whole network:
setparam!(p::UpdateParam,k,v)
setparam!(l::Layer,k,v)
setparam!(net::Net,k,v)
The update function takes a parameter w, its gradient dw, and its UpdateParams o, and performs the necessary update. Here is the definition from update.jl:
function update(w, dw, o::UpdateParam)
    initupdate(w, dw, o)
    nz(o,:l1reg) && l1reg!(o.l1reg, w, dw)
    nz(o,:l2reg) && l2reg!(o.l2reg, w, dw)
    nz(o,:adagrad) && adagrad!(o.adagrad, o.ada, dw)
    nz(o,:learningRate,1f0) && (dw = dw .* o.learningRate)
    nz(o,:momentum) && momentum!(o.momentum, o.mom, dw)
    nz(o,:nesterov) && nesterov!(o.nesterov, o.nes, dw)
    w = w .- dw
    nz(o,:maxnorm) && maxnorm!(o.maxnorm, w)
end

nz(o,n,v=0f0)=(isdefined(o,n) && (o.(n) != v))
The reader can peruse update.jl for the definitions of each of the helpers. This is one part of the code I haven't had time to "mathematize". I am hoping the helpers will be unnecessary one day and update will be expressed in easy to understand natural mathematical notation once Julia has in-place array operation syntax that is generic for GPUs.
The "train" function in net.jl is just a wrapper around "backprop" and "update". It goes through the training set once, splitting it into minibatches, and feeding each minibatch through backprop and update.

Activation functions

The activation function follows the linear transformation (wx+b) in a layer and is typically a non-linear element-wise function. Without an activation function, multiple layers in a neural network would be useless because the composition of several linear functions is still a linear function. As an example activation function l.f, here is the definition of the relu (rectified linear unit) from func.jl:
relu(l,y)=for i=1:length(y) (y[i]<0) && (y[i]=0) end
relu(l,y,dy)=for i=1:length(y) (y[i]==0) && (dy[i]=0) end
The two argument version handles the forward calculation, i.e. replacing negative y values with 0. The three argument version handles the backward calculation, i.e. replacing dy gradients with 0 where y was 0. Unlike some other high-level languages, for loops are very efficient in Julia. These can probably be made even faster using "@parallel for" and multiple threads, but people who need performance will want to use the GPU versions:
relu(l,y::CudaArray)=ccall((:reluforw,libkunet),
    Void,(Cint,Cmat),length(y),y)
relu(l,y::CudaArray,dy::CudaArray)=ccall((:reluback,libkunet),
    Void,(Cint,Cmat,Cmat),length(dy),y,dy)
Being able to write GPU kernels directly in Julia is still work in progress. So for now I write a couple of lines of CUDA (the reluforw and reluback functions, which are basically the C versions of relu), compile it to a shared library (libkunet.so), and use Julia's (very convenient) ccall function to directly call them. This does not quite satisfy my "generic readable code" requirement but it is the simplest solution I found at this stage of JuliaGPU development. The nice thing is Julia automatically calls the GPU version if the y argument is a CudaArray and falls back on the pure Julia version otherwise. This allows the user to control whether calculations will be performed on the CPU or the GPU by declaring l.w and l.b as regular arrays or CudaArrays.

Preprocessing functions

A preprocessing function precedes the linear transformation (wx+b) modifying the input of the layer. Preprocessing functions typically add some noise to the input to improve generalization. For example dropout can be implemented as a preprocessing function of a layer where each input is dropped (replaced by 0) with a given probability. Adding Gaussian noise, or elastic transformations of image inputs can also be implemented as preprocessing functions. Here is the (simplified) dropout implementation from func.jl:
function drop(l, x)
    rand!(l.xdrop)
    drop(x, l.xdrop, l.dropout, 1/(1-l.dropout))
end

function drop(l, x, dx)
    drop(dx, l.xdrop, l.dropout, 1/(1-l.dropout))
end

function drop(x, xdrop, dropout, scale)
    for i=1:length(x) 
        x[i] = (xdrop[i] < dropout ? 0 : scale * x[i]) 
    end
end

function drop(x::CudaArray, xdrop::CudaArray, dropout, scale)
    ccall((:drop,libkunet),Void,(Cint,Cmat,Cmat,Cfloat,Cfloat),
          length(x),x,xdrop,dropout,scale)
end
The two argument drop is for the forward calculation, the three argument version is for the backward calculation, and the last two are the CPU and GPU versions of the common helper. This implementation uses three fields of the layer type: l.fx is set to "drop", l.xdrop is an array of random numbers between 0 and 1 that has the same dimensionality as l.x, and l.dropout is the probability of dropping an input. Every time we receive a new x, rand! fills xdrop with new random numbers and a different subset of the input is dropped. The rest of the input is rescaled. Preprocessing is applied during training, not prediction, so the actual forw implementation takes an optional flag "apply_fx" that determines whether fx is to be applied.

Why Julia?

I wanted to write something that is concise, easy to understand, easy to extend, and reasonably efficient. There is a subtle trade-off between conciseness and extensibility: If we use a very high level language that already has a "neural_network_train" function, we can write very concise code but we lose the ability to change the training algorithm. If we use a very low level language that only provides primitive arithmetic operations, all the algorithm details are exposed and modifiable but the code is bulky and difficult to understand. For a happy medium, the code should reflect the level at which I think of the problem: e.g. I should be able to say A*B if I want to multiply two matrices, and I shouldn't constantly worry about the location and element types of my arrays. That restricts the playing field to a handful of languages. Efficiency requires being able to work with a GPU. Ideally the code should be generic, i.e. algorithms should be expressed once and work whether the data is on the GPU or the CPU memory. Julia has fledgling GPU support but it does have concise matrix arithmetic and it excels at generic operations. So I struggled a bit with trying to get generic matrix operations to work on the GPU (which is nowhere nearly complete but currently good enough to run the KUnet code), and to express each algorithm in as mathematical and concise a manner as possible (which is still work in progress, largely due to the lack of GPU generics and a standard syntax for in-place operations). But the progress so far (and the invaluable support I got from the Julia community) convinced me that this is a worthwhile endeavor.

Future work

With very little extra effort (and lines of code) one should be able to extend KUnet to implement convolutional and recurrent nets and new tricks like maxout units, maxnorm regularization, rmsprop optimization etc. You can send me suggestions for improvement (both in coding style and new functionality) using comments to this blog post, or using issues or pull requests on GitHub.

Full post...