Deniz Yuret's Homepage: 2016

December 19, 2016

Julia ve Knet ile Derin Öğrenmeye Giriş

Sunumlar:

Data İstanbul, 21 Aralık 2016 Çarşamba, 19:30 (URL, Sunum,Video).
ODTÜ Yapay Öğrenme ve Bilgi İşlemede Yeni Teknikler Yaz Okulu (OBAYO), 6-9 Eylül, 2016, ODTÜ, Ankara. (URL, Sunum,Video)
İsmail Arı Bilgisayar Bilimleri ve Mühendisliği Bilimsel Eğitim Etkinliği (ISMAIL 2017), 31 Temmuz - 4 Ağustos 2017 Boğaziçi Üniversitesi, İstanbul. (URL).

December 14, 2016

CharNER: Character-Level Named Entity Recognition

Onur Kuru, Ozan Arkan Can and Deniz Yuret. 2016. COLING. Osaka. (PDF,Presentation)

Abstract

We describe and evaluate a character-level tagger for language-independent Named Entity Recognition (NER). Instead of words, a sentence is represented as a sequence of characters. The model consists of stacked bidirectional LSTMs which inputs characters and outputs tag probabilities for each character. These probabilities are then converted to consistent word level named entity tags using a Viterbi decoder. We are able to achieve close to state-of-the-art NER performance in seven languages with the same basic model using only labeled NER data and no hand-engineered features or other external resources like syntactic taggers or Gazetteers.

December 13, 2016

Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition

Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret. 2016. COLING. Osaka. (PDF,Poster)

Abstract

Learning word categories is a fundamental task in language acquisition. Previous studies show that co-occurrence patterns of preceding and following words are essential to group words into categories. However, the neighboring words, or frames, are rarely repeated exactly in the data. This creates data sparsity and hampers learning for frame based models. In this work, we propose a paradigmatic representation of word context which uses probable substitutes instead of frames. Our experiments on child-directed speech show that models based on probable substitutes learn more accurate categories with fewer examples compared to models based on frames.

December 10, 2016

Knet: beginning deep learning with 100 lines of Julia (NIPS workshop)

Deniz Yuret. 2016. Machine Learning Systems Workshop at NIPS 2016. Barcelona. (PDF,Slide,Poster)

Abstract

Knet (pronounced "kay-net") is the Koç University machine learning framework implemented in Julia, a high-level, high-performance, dynamic programming language. Unlike gradient generating compilers like Theano and TensorFlow which restrict users into a modeling mini-language, Knet allows models to be defined by just describing their forward computation in plain Julia, allowing the use of loops, conditionals, recursion, closures, tuples, dictionaries, array indexing, concatenation and other high level language features. High performance is achieved by combining automatic differentiation of most of Julia with efficient GPU kernels and memory management. Several examples and benchmarks are provided to demonstrate that GPU support and automatic differentiation of a high level language are sufficient for concise definition and efficient training of sophisticated models.

November 01, 2016

Transfer Learning for Low-Resource Neural Machine Translation

Zoph, Barret and Yuret, Deniz and May, Jonathan and Knight, Kevin. 2016. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing pp 1568--1575, Austin, Texas. (PDF)

Abstract

The encoder-decoder framework for neural machine translation (NMT) has been shown effective in large data scenarios, but is much less effective for low-resource languages. We present a transfer learning method that significantly improves BLEU scores across a range of low-resource languages. Our key idea is to first train a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training. Using our transfer learning method we improve baseline NMT models by an average of 5.6 BLEU on four low-resource language pairs. Ensembling and unknown word replacement add another 2 BLEU which brings the NMT performance on low-resource machine translation close to a strong syntax based machine translation (SBMT) system, exceeding its performance on one language pair. Additionally, using the transfer learning model for re-scoring, we can improve the SBMT system by an average of 1.3 BLEU, improving the state-of-the-art on low-resource machine translation.

Why Neural Translations are the Right Length

Shi, Xing and Knight, Kevin and Yuret, Deniz. 2016. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing pp 2278--2282, Austin, Texas. (PDF)

Abstract

We investigate how neural, encoder-decoder translation systems output target strings of appropriate lengths, finding that a collection of hidden units learns to explicitly implement this functionality.

September 20, 2016

Introducing Knet8: beginning deep learning with 100 lines of Julia

It has been a year and a half since I wrote the first version of this tutorial and it is time for an update.

Knet (pronounced “kay-net”) is the Koç University deep learning framework implemented in Julia by Deniz Yuret and collaborators. It supports construction of high-performance deep learning models in plain Julia by combining automatic differentiation with efficient GPU kernels and memory management. Models can be defined and trained using arbitrary Julia code with helper functions, loops, conditionals, recursion, closures, array indexing and concatenation. The training can be performed on the GPU by simply using KnetArray instead of Array for parameters and data. Check out the full documentation and the examples directory for more information.

Installation
Examples
Under the hood
Benchmarks
Contributing

Installation

You can install Knet using Pkg.add("Knet"). Some of the examples use additional packages such as ArgParse, GZip, and JLD. These are not required by Knet and can be installed when needed using additional Pkg.add() commands. See the detailed installation instructions as well as the section on using Amazon AWS to experiment with GPU machines on the cloud with pre-installed Knet images.

Examples

In Knet, a machine learning model is defined using plain Julia code. A typical model consists of a prediction and a loss function. The prediction function takes model parameters and some input, returns the prediction of the model for that input. The loss function measures how bad the prediction is with respect to some desired output. We train a model by adjusting its parameters to reduce the loss. In this section we will see the prediction, loss, and training functions for five models: linear regression, softmax classification, fully-connected, convolutional and recurrent neural networks.

Linear regression

Here is the prediction function and the corresponding quadratic loss function for a simple linear regression model:

predict(w,x) = w[1]*x .+ w[2]

loss(w,x,y) = sumabs2(y - predict(w,x)) / size(y,2)

The variable w is a list of parameters (it could be a Tuple, Array, or Dict), x is the input and y is the desired output. To train this model, we want to adjust its parameters to reduce the loss on given training examples. The direction in the parameter space in which the loss reduction is maximum is given by the negative gradient of the loss. Knet uses the higher-order function grad from AutoGrad.jl to compute the gradient direction:

using Knet

lossgradient = grad(loss)

Note that grad is a higher-order function that takes and returns other functions. The lossgradient function takes the same arguments as loss, e.g. dw = lossgradient(w,x,y). Instead of returning a loss value, lossgradient returns dw, the gradient of the loss with respect to its first argument w. The type and size of dw is identical to w, each entry in dw gives the derivative of the loss with respect to the corresponding entry in w. See @doc grad for more information.

Given some training data = [(x1,y1),(x2,y2),...], here is how we can train this model:

function train(w, data; lr=.1)
    for (x,y) in data
        dw = lossgradient(w, x, y)
        for i in 1:length(w)
            w[i] -= lr * dw[i]
        end
    end
    return w
end

We simply iterate over the input-output pairs in data, calculate the lossgradient for each example, and move the parameters in the negative gradient direction with a step size determined by the learning rate lr.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/housing.jpeg?raw=true

Let’s train this model on the Housing dataset from the UCI Machine Learning Repository.

julia> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
julia> rawdata = readdlm(download(url))
julia> x = rawdata[:,1:13]'
julia> x = (x .- mean(x,2)) ./ std(x,2)
julia> y = rawdata[:,14:14]'
julia> w = Any[ 0.1*randn(1,13), 0 ]
julia> for i=1:10; train(w, [(x,y)]); println(loss(w,x,y)); end
366.0463078055053
...
29.63709385230451

The dataset has housing related information for 506 neighborhoods in Boston from 1978. Each neighborhood is represented using 13 attributes such as crime rate or distance to employment centers. The goal is to predict the median value of the houses given in $1000’s. After downloading, splitting and normalizing the data, we initialize the parameters randomly and take 10 steps in the negative gradient direction. We can see the loss dropping from 366.0 to 29.6. See housing.jl for more information on this example.

Note that grad was the only function used that is not in the Julia standard library. This is typical of models defined in Knet.

Softmax classification

In this example we build a simple classification model for the MNIST handwritten digit recognition dataset. MNIST has 60000 training and 10000 test examples. Each input x consists of 784 pixels representing a 28x28 image. The corresponding output indicates the identity of the digit 0..9.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/firsteightimages.jpg?raw=true

Classification models handle discrete outputs, as opposed to regression models which handle numeric outputs. We typically use the cross entropy loss function in classification models:

function loss(w,x,ygold)
    ypred = predict(w,x)
    ynorm = ypred .- log(sum(exp(ypred),1))
    -sum(ygold .* ynorm) / size(ygold,2)
end

Other than the change of loss function, the softmax model is identical to the linear regression model. We use the same predict, same train and set lossgradient=grad(loss) as before. To see how well our model classifies let’s define an accuracy function which returns the percentage of instances classified correctly:

function accuracy(w, data)
    ncorrect = ninstance = 0
    for (x, ygold) in data
        ypred = predict(w,x)
        ncorrect += sum(ygold .* (ypred .== maximum(ypred,1)))
        ninstance += size(ygold,2)
    end
    return ncorrect/ninstance
end

Now let’s train a model on the MNIST data:

julia> include(Pkg.dir("Knet/examples/mnist.jl"))
julia> using MNIST: xtrn, ytrn, xtst, ytst, minibatch
julia> dtrn = minibatch(xtrn, ytrn, 100)
julia> dtst = minibatch(xtst, ytst, 100)
julia> w = Any[ -0.1+0.2*rand(Float32,10,784), zeros(Float32,10,1) ]
julia> println((:epoch, 0, :trn, accuracy(w,dtrn), :tst, accuracy(w,dtst)))
julia> for epoch=1:10
           train(w, dtrn; lr=0.5)
           println((:epoch, epoch, :trn, accuracy(w,dtrn), :tst, accuracy(w,dtst)))
       end
(:epoch,0,:trn,0.11761667f0,:tst,0.121f0)
(:epoch,1,:trn,0.9005f0,:tst,0.9048f0)
...
(:epoch,10,:trn,0.9196f0,:tst,0.9153f0)

Including mnist.jl loads the MNIST data, downloading it from the internet if necessary, and provides a training set (xtrn,ytrn), test set (xtst,ytst) and a minibatch utility which we use to rearrange the data into chunks of 100 instances. After randomly initializing the parameters we train for 10 epochs, printing out training and test set accuracy at every epoch. The final accuracy of about 92% is close to the limit of what we can achieve with this type of model. To improve further we must look beyond linear models.

Multi-layer perceptron

A multi-layer perceptron, i.e. a fully connected feed-forward neural network, is basically a bunch of linear regression models stuck together with non-linearities in between.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/neural_net2.jpeg?raw=true

We can define a MLP by slightly modifying the predict function:

function predict(w,x)
    for i=1:2:length(w)-2
        x = max(0, w[i]*x .+ w[i+1])
    end
    return w[end-1]*x .+ w[end]
end

Here w[2k-1] is the weight matrix and w[2k] is the bias vector for the k’th layer. max(0,a) implements the popular rectifier non-linearity. Note that if w only has two entries, this is equivalent to the linear and softmax models. By adding more entries to w, we can define multi-layer perceptrons of arbitrary depth. Let’s define one with a single hidden layer of 64 units:

w = Any[ -0.1+0.2*rand(Float32,64,784), zeros(Float32,64,1),
         -0.1+0.2*rand(Float32,10,64),  zeros(Float32,10,1) ]

The rest of the code is the same as the softmax model. We use the same cross-entropy loss function and the same training script. The code for this example is available in mnist.jl. The multi-layer perceptron does significantly better than the softmax model:

(:epoch,0,:trn,0.10166667f0,:tst,0.0977f0)
(:epoch,1,:trn,0.9389167f0,:tst,0.9407f0)
...
(:epoch,10,:trn,0.9866f0,:tst,0.9735f0)

Convolutional neural network

To improve the performance further, we can use convolutional neural networks. We will implement the LeNet model which consists of two convolutional layers followed by two fully connected layers.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/le_net.png?raw=true

Knet provides the conv4(w,x) and pool(x) functions for the implementation of convolutional nets (see @doc conv4 and @doc pool for more information):

function predict(w,x0)
    x1 = pool(max(0, conv4(w[1],x0) .+ w[2]))
    x2 = pool(max(0, conv4(w[3],x1) .+ w[4]))
    x3 = max(0, w[5]*mat(x2) .+ w[6])
    return w[7]*x3 .+ w[8]
end

The weights for the convolutional net can be initialized as follows:

w = Any[ -0.1+0.2*rand(Float32,5,5,1,20),  zeros(Float32,1,1,20,1),
         -0.1+0.2*rand(Float32,5,5,20,50), zeros(Float32,1,1,50,1),
         -0.1+0.2*rand(Float32,500,800),   zeros(Float32,500,1),
         -0.1+0.2*rand(Float32,10,500),    zeros(Float32,10,1) ]

Currently convolution and pooling are only supported on the GPU for 4-D and 5-D arrays. So we reshape our data and transfer it to the GPU along with the parameters by converting them into KnetArrays (see @doc KnetArray for more information):

dtrn = map(d->(KnetArray(reshape(d[1],(28,28,1,100))), KnetArray(d[2])), dtrn)
dtst = map(d->(KnetArray(reshape(d[1],(28,28,1,100))), KnetArray(d[2])), dtst)
w = map(KnetArray, w)

The training proceeds as before giving us even better results. The code for the LeNet example can be found in lenet.jl.

(:epoch,0,:trn,0.12215f0,:tst,0.1263f0)
(:epoch,1,:trn,0.96963334f0,:tst,0.971f0)
...
(:epoch,10,:trn,0.99553335f0,:tst,0.9879f0)

Recurrent neural network

In this section we will see how to implement a recurrent neural network (RNN) in Knet. An RNN is a class of neural network where connections between units form a directed cycle, which allows them to keep a persistent state over time. This gives them the ability to process sequences of arbitrary length one element at a time, while keeping track of what happened at previous elements.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/RNN-unrolled.png?raw=true

As an example, we will build a character-level language model inspired by “The Unreasonable Effectiveness of Recurrent Neural Networks” from the Andrej Karpathy blog. The model can be trained with different genres of text, and can be used to generate original text in the same style.

It turns out simple RNNs are not very good at remembering things for a very long time. Currently the most popular solution is to use a more complicated unit like the Long Short Term Memory (LSTM). An LSTM controls the information flow into and out of the unit using gates similar to digital circuits and can model long term dependencies. See Understanding LSTM Networks by Christopher Olah for a good overview of LSTMs.

https://github.com/denizyuret/Knet.jl/blob/master/docs/images/LSTM3-chain.png?raw=true

The code below shows one way to define an LSTM in Knet. The first two arguments are the parameters, the weight matrix and the bias vector. The next two arguments hold the internal state of the LSTM: the hidden and cell arrays. The last argument is the input. Note that for performance reasons we lump all the parameters of the LSTM into one matrix-vector pair instead of using separate parameters for each gate. This way we can perform a single matrix multiplication, and recover the gates using array indexing. We represent input, hidden and cell as row vectors rather than column vectors for more efficient concatenation and indexing. sigm and tanh are the sigmoid and the hyperbolic tangent activation functions. The LSTM returns the updated state variables hidden and cell.

function lstm(weight,bias,hidden,cell,input)
    gates   = hcat(input,hidden) * weight .+ bias
    hsize   = size(hidden,2)
    forget  = sigm(gates[:,1:hsize])
    ingate  = sigm(gates[:,1+hsize:2hsize])
    outgate = sigm(gates[:,1+2hsize:3hsize])
    change  = tanh(gates[:,1+3hsize:end])
    cell    = cell .* forget + ingate .* change
    hidden  = outgate .* tanh(cell)
    return (hidden,cell)
end

The LSTM has an input gate, forget gate and an output gate that control information flow. Each gate depends on the current input value, and the last hidden state hidden. The memory value cell is computed by blending a new value change with the old cell value under the control of input and forget gates. The output gate decides how much of the cell is shared with the outside world.

If an input gate element is close to 0, the corresponding element in the new input will have little effect on the memory cell. If a forget gate element is close to 1, the contents of the corresponding memory cell can be preserved for a long time. Thus the LSTM has the ability to pay attention to the current input, or reminisce in the past, and it can learn when to do which based on the problem.

To build a language model, we need to predict the next character in a piece of text given the current character and recent history as encoded in the internal state. The predict function below implements a multi-layer LSTM model. s[2k-1:2k] hold the hidden and cell arrays and w[2k-1:2k] hold the weight and bias parameters for the k’th LSTM layer. The last three elements of w are the embedding matrix and the weight/bias for the final prediction. predict takes the current character encoded in x as a one-hot row vector, multiplies it with the embedding matrix, passes it through a number of LSTM layers, and converts the output of the final layer to the same number of dimensions as the input using a linear transformation. The state variable s is modified in-place.

function predict(w, s, x)
    x = x * w[end-2]
    for i = 1:2:length(s)
        (s[i],s[i+1]) = lstm(w[i],w[i+1],s[i],s[i+1],x)
        x = s[i]
    end
    return x * w[end-1] .+ w[end]
end

To train the language model we will use Backpropagation Through Time (BPTT) which basically means running the network on a given sequence and updating the parameters based on the total loss. Here is a function that calculates the total cross-entropy loss for a given (sub)sequence:

function loss(param,state,sequence,range=1:length(sequence)-1)
    total = 0.0; count = 0
    atype = typeof(getval(param[1]))
    input = convert(atype,sequence[first(range)])
    for t in range
        ypred = predict(param,state,input)
        ynorm = logp(ypred,2) # ypred .- log(sum(exp(ypred),2))
        ygold = convert(atype,sequence[t+1])
        total += sum(ygold .* ynorm)
        count += size(ygold,1)
        input = ygold
    end
    return -total / count
end

Here param and state hold the parameters and the state of the model, sequence and range give us the input sequence and a possible range over it to process. We convert the entries in the sequence to inputs that have the same type as the parameters one at a time (to conserve GPU memory). We use each token in the given range as an input to predict the next token. The average cross-entropy loss per token is returned.

To generate text we sample each character randomly using the probabilities predicted by the model based on the previous character:

function generate(param, state, vocab, nchar)
    index_to_char = Array(Char, length(vocab))
    for (k,v) in vocab; index_to_char[v] = k; end
    input = oftype(param[1], zeros(1,length(vocab)))
    index = 1
    for t in 1:nchar
        ypred = predict(param,state,input)
        input[index] = 0
        index = sample(exp(logp(ypred)))
        print(index_to_char[index])
        input[index] = 1
    end
    println()
end

Here param and state hold the parameters and state variables as usual. vocab is a Char->Int dictionary of the characters that can be produced by the model, and nchar gives the number of characters to generate. We initialize the input as a zero vector and use predict to predict subsequent characters. sample picks a random index based on the normalized probabilities output by the model.

At this point we can train the network on any given piece of text (or other discrete sequence). For efficiency it is best to minibatch the training data and run BPTT on small subsequences. See charlm.jl for details. Here is a sample run on ‘The Complete Works of William Shakespeare’:

$ cd .julia/Knet/examples
$ wget http://www.gutenberg.org/files/100/100.txt
$ julia charlm.jl --data 100.txt --epochs 10 --winit 0.3 --save shakespeare.jld
... takes about 10 minutes on a GPU machine
$ julia charlm.jl --load shakespeare.jld --generate 1000

    Pand soping them, my lord, if such a foolish?
  MARTER. My lord, and nothing in England's ground to new comp'd.
    To bless your view of wot their dullst. If Doth no ape;
    Which with the heart. Rome father stuff
    These shall sweet Mary against a sudden him
    Upon up th' night is a wits not that honour,
    Shouts have sure?
  MACBETH. Hark? And, Halcance doth never memory I be thou what
    My enties mights in Tim thou?
  PIESTO. Which it time's purpose mine hortful and
    is my Lord.
  BOTTOM. My lord, good mine eyest, then: I will not set up.
  LUCILIUS. Who shall

Under the hood

Coming soon...

Benchmarks

Coming soon...

Contributing

Knet is an open-source project and we are always open to new contributions: bug reports and fixes, feature requests and contributions, new machine learning models and operators, inspiring examples, benchmarking results are all welcome. If you need help or would like to request a feature, please consider joining the knet-users mailing list. If you find a bug, please open a GitHub issue. If you would like to contribute to Knet development, check out the knet-dev mailing list and tips for developers. If you use Knet in your own work, the suggested citation is:

@misc{knet,
  author={Yuret, Deniz},
  title={Knet: Ko\c{c} University deep learning framework.},
  year={2016},
  howpublished={\url{https://github.com/denizyuret/Knet.jl}}
}

August 31, 2016

Onur Kuru, M.S. 2016

Current position: Senior Data Scientist at Delivery Hero, Berlin. (Linkedin)
M.S. Thesis: Character-level Tagging. Koç University, Department of Computer Engineering. August, 2016. (PDF, Presentation, Code)

Abstract:

I describe and evaluate a language-independent character-level tagger for sequence labeling problems: Named Entity Recognition (NER), Part-of-Speech (POS) tagging and Chunking. Instead of words, a sentence is represented as a sequence of characters. The model consists of stacked bidirectional LSTMs which input characters and output tag probabilities for each character. These probabilities are then converted to consistent word level phrase tags using a Viterbi decoder. The model uses only labeled data and does not rely on hand-engineered features or other external resources like syntactic taggers or Gazetteers. The model is able to achieve close to state-of-the-art NER performance in seven languages, performs as well as or better than previous work in four languages for POS tagging and yields competitive results for English Chunking dataset.

August 23, 2016

AutoGrad.jl

AutoGrad.jl is an automatic differentiation package for Julia. It is a Julia port of the popular Python autograd package. It can differentiate regular Julia code that includes loops, conditionals, helper functions, closures etc. by keeping track of the primitive operations and using this execution trace to compute gradients. It uses reverse mode differentiation (a.k.a. backpropagation) so it can efficiently handle functions with array inputs and scalar outputs. It can compute gradients of gradients to handle higher order derivatives. Please see the comments in core.jl for a description of how the code works in detail.

Installation

You can install AutoGrad in Julia using:

julia> Pkg.add("AutoGrad")

In order to use it in your code start with:

using AutoGrad

Example

Here is a linear regression example simplified from housing.jl:

using AutoGrad

function loss(w)
    global xtrn,ytrn
    ypred = w[1]*xtrn .+ w[2]
    sum(abs2(ypred - ytrn)) / size(ypred,2)
end

function train(w; lr=.1, epochs=20)
    gradfun = grad(loss)
    for epoch=1:epochs
        g = gradfun(w)
        for i in 1:length(w)
            w[i] -= lr * g[i]
        end
    end
    return w
end

The loss function takes parameters as input and returns the loss to be minimized. The parameter w for this example is a pair: w[1] is a weight matrix, and w[2] is a bias vector. The training data xtrn,ytrn are in global variables. ypred is the predicted output, and the last line computes the quadratic loss. The loss function is implemented in regular Julia.

The train function takes initial parameters and returns optimized parameters. grad is the only AutoGrad function used: it creates a function gradfun that takes the same arguments as loss, but returns the gradient instead. The returned gradient will have the same type and shape as the input argument. The for loop implements gradient descent, where we calculate the gradient and subtract a scaled version of it from the weights.

See the examples directory for more examples, and the extensively documented core.jl for details.

Extending AutoGrad

AutoGrad can only handle a function if the primitives it uses have known gradients. You can add your own primitives with gradients as described in detail in core.jl or using the @primitive and @zerograd macros in util.jl Here is an example:

@primitive hypot(x1::Number,x2::Number)::y  (dy->dy*x1/y)  (dy->dy*x2/y)

The @primitive macro marks the hypot(::Number,::Number) method as a new primitive and the next two expressions define gradient functions wrt the first and second argument. The gradient expressions can refer to the parameters and the return variable (indicated after the final ::) of the method declaration.

Note that Julia supports multiple-dispatch, i.e. a function may have multiple methods each supporting different argument types. For example hypot(x1::Array,x2::Array) is another hypot method. In AutoGrad.jl each method can independently be defined as a primitive and can have its own specific gradient.

Code structure

core.jl implements the main functionality and acts as the main documentation source. util.jl has some support functions to define and test new primitives. interfaces.jl sets up support for common data structures including Arrays, Tuples, and Dictionaries. The numerical gradients are defined in files such as base/math.jl, special/trig.jl that mirror the organization under julia/base.

Current status and future work

The gradient coverage is spotty, I am still adding more gradients to cover the Julia base. Next steps are to make models faster by providing support for GPU operations and overwriting functions (to avoid memory allocation). I should also find out about the efficiency of closures and untyped functions in Julia which are used extensively in the code.

Acknowledgments and references

AutoGrad.jl was written by Deniz Yuret. Large parts of the code are directly ported from the Python autograd package. I'd like to thank autograd author Dougal Maclaurin for his support. See (Baydin et al. 2015) for a general review of automatic differentiation, autograd tutorial for some Python examples, and Dougal's PhD thesis for design principles. JuliaDiff has alternative differentiation tools for Julia. I would like to thank my students Ozan Arkan Can and Emre Yolcu for helpful contributions.

Also see: A presentation, A demo.

June 14, 2016

Natural language communication with robots

Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016) pp 751--761, San Diego, California. (PDF, Slides)

Abstract

We propose a framework for devising empirically testable algorithms for bridging the communication gap between humans and robots. We instantiate our framework in the context of a problem setting in which humans give instructions to robots using unrestricted natural language commands, with instruction sequences being subservient to building complex goal configurations in a blocks world. We show how one can collect meaningful training data and we propose three neural architectures for interpreting contextually grounded natural language commands. The proposed architectures allow us to correctly understand/ground the blocks that the robot should move when instructed by a human who uses unrestricted language. The architectures have more diffi- culty in correctly understanding/grounding the spatial relations required to place blocks correctly, especially when the blocks are not easily identifiable.

June 06, 2016

Saman Zia, M.S. 2016

Current position: Senior Software Engineer at Microsoft, Seattle (Email, Linkedin)
M.S. Thesis: RGB-D Object Recognition using Deep Convolutional Neural Networks. Koç University, Department of Computer Engineering. June, 2016. (PDF, Presentation, Code)

Abstract:

Recent availability of low cost RGB-D sensors has led to an increased interest in object recognition combining both color and depth modalities. Object recognition from RGB-D images is particularly important in robotic tasks and the inclusion of depth has been proven to increase the performance. The problem of combining depth and color information is being widely researched. This thesis addresses this problem by initializing a 2-D Convolutional Neural Network (CNN) for RGB information via transfer learning and 3-D Convolutional Neural Network for encoding depth infor- mation. The obtained feature representations are fused to report performance over the RGB-D object recognition task. The transferred weights are from CNNs that are trained on large ImageNet classification challenge dataset and produces meaningful features. The depth information is encoded along with the color information in a 3-D voxel and learns joint features from scratch using a 3-D CNN. The approach is evaluated on the Washington RGB-D dataset and the performance for RGB category recognition exceeds the state-of-the-art, while the RGB-D performance is on par with it for category recognition. Due to good features learnt by the 3-D CNN, the po- tential of transfer learning from 2-D pre-trained CNN to 3-D CNN to include depth information is also addressed.

March 19, 2016

How to write a technical paper

This is the evolving set of recommendations I share with my graduate students for technical writing...

Empathy: This is the single most important principle of technical writing. Try reading what you write from the perspective of somebody who has not spent the last few months working on your problem. Better yet, find such a person and see if they understand everything you are talking about. Don’t just take their word for it, ask them to tell you what they understand in their own words. See where they struggle and debug your paper: Do they get lost in too much detail and miss the main point? Do they get disoriented because you jump around too much? Are there terms they do not understand? Fix the paper using the following techniques until a dedicated freshman can understand all the important points.
Winston’s Onion Rule: The document should state the most important points first, and expand on them gradually. It is a mistake to keep any important points until the end of the paper. Only details and supporting material should be left to the end. If I stop reading the document at any point, everything I haven't read so far should be less important than everything I have read up to this point:

The title should be descriptive of the main point.
The first sentence should state the main point.
The first paragraph should expand on the first sentence.
The first section should expand on the first paragraph.
The first chapter should expand on the first section.
The whole paper/thesis should expand on the first chapter, etc.

Yuret’s Fractal Rule: Parts at every level of your document, down to each paragraph, should have their own introduction / conclusion to keep the reader oriented (i.e. stop them from asking “What is this person talking about now, and why?”):

The first chapter of a paper/thesis should state the topic of the paper/thesis and the last chapter should summarize its point.
The first section of a chapter should state the topic of the chapter and the last section should summarize its point.
The first paragraph of a section should state the topic of the section and the last paragraph should summarize its point.
The first sentence of a paragraph should state the topic of the paragraph and the last sentence should summarize its point.

No undefined terms: Any technical term your nine year old niece would not understand should be defined before first use. Any acronym should first be given in parentheses next to its long form before first use. All variables in equations, all axes in graphs should be explained at the first opportunity. Tables and Figures should have descriptive captions that can be understood stand-alone. Technical terms and mathematical notation should be used consistently, no confusing variations allowed (i.e. calling the same thing context vector somewhere and word context vector elsewhere will confuse the reader into thinking these are two separate things).
Replicability: Science is based on replicable results. Your paper should provide enough detail (possibly in the appendices), and links to its code and data, to replicate each of its results. In particular, for each set of experiments you should have:

Data table: e.g. in a natural language processing experiment, things like number of words and sentences in train, dev, test; vocabulary size, tagset size, tag frequencies, out-of-vocabulary rate, average sentence length, i.e. any data statistic relevant to the task should go to a data table.
Parameter table: things like the model structure, the training algorithm used, the hyperparameters used, number of training epochs, and any other details related to experimental replication should go to a table.
Result table: the results (table or plot) should clearly indicate the evaluation metric, sensible lower bound baselines, upper bounds (e.g. inter-annotator agreement) if available, and current state of the art in published work to put your results in perspective.

February 01, 2016

Learning Navigational Language from Linguistic and Visual Cues (2016-2018)

TUBITAK 1001 Project 114E628. "Dilbilimsel ve Görsel İpuçlarını Birlikte Kullanarak Gezinim Dilinin Öğrenilmesi." (2016-02-01 -- 2018-08-01)

December 19, 2016

December 14, 2016

December 13, 2016

December 10, 2016

November 01, 2016

September 20, 2016

Contents

Installation

Examples

Linear regression

Softmax classification

Multi-layer perceptron

Convolutional neural network

Recurrent neural network

Under the hood

Benchmarks

Contributing

August 31, 2016

August 23, 2016

June 14, 2016

June 06, 2016

March 19, 2016

February 01, 2016