## December 01, 2015

### ReGROUND: Relational symbol grounding through affordance learning (2015-2018)

Full post...

## November 19, 2015

### Osman Baskaya, M.S. 2015

**Current position:**Senior Data Scientist, Twilio Inc., San Francisco (Linkedin)

**M.S. Thesis:**Analysis of Context Embeddings in Word Sense Induction. Koç University, Department of Computer Engineering. November, 2015. (PDF, Presentation, Code)

**Publications:**bibtex.php

**Abstract**

There exist several drawbacks of representing the word senses with a fixed list of definitions of a manually constructed lexical database. There is no guarantee that they reflect the exact meaning of a target word in a given context, since they usually contain definitions that are too general. More so, lexical databases often include many rare senses while missing corpus/domain-specific senses. Word Sense Induction (WSI) focuses on discriminating the usages of a polysemous word without using a fixed list of definitions or any hand-crafted resources.

In contrast to the most common approach in WSI, which is to apply clustering or graph partitioning on a representation of first- or second-order co-occurrences of a word, my method obtains a probability distribution for each context suggested by a statistical model. This distribution helps to create context embeddings using the co-occurrence framework that represents the context with low-dimensional, dense vectors in Euclidean space. Then, these context embeddings are clustered by k-means clustering algorithm to discriminate usages (senses) of a word. This method proved its usefulness in Unsupervised Part-of-Speech Induction, and supervised tasks such as Multilingual Dependency Parsing. I examine this method on SemEval 2010 and SemEval 2013 Word Sense Induction lexical sample tasks, and the dataset I created using OntoNotes 5.0. This new lexical sample dataset has high inter-annotator agreement (IAA) (>90%) and number of instances for each word type is more than any previous lexical sample tasks (>500 instances).

The contributions in this thesis are as follows: (1) I suggest a method to attack the Word Sense Induction problem. (2) I provide a comprehensive analysis (a) in embedding step by comparing other popular word embeddings by transforming each of them to context embeddings using substitute word distributions for each context, and (b) in clustering step by comparing different clustering algorithms (kmeans, Spectral Clustering, DBSCAN) and different clustering approaches (local approach where instances of each word type clustered separately, and part-of-speech based approach where instances tagged with same-part-of-speech clusters independently).

The code to replicate the results in this thesis can be found at https://github.com/osmanbaskaya/wsid.

Full post...

## July 10, 2015

## June 16, 2015

## May 27, 2015

### What is wrong with p-values?

Earlier this year, editors of the journal Basic and Applied Social Psychology announced that the journal would no longer publish papers containing p-values. The latest American Psychological Association Publication Manual states that researchers should "wherever possible, base discussion and interpretation of results on point and interval estimates," i.e. not p-values. FDA has been encouraging Bayesian analysis. What is wrong with p-values?

**What is a p-value?** In the classical statistical procedure known as "significance testing", we have a default hypothesis, usually called the null hypothesis and denoted by H0, and we wish to determine whether or not to reject H0 based on some observations X. We choose a statistic S=f(X) (a scalar function of X) that summarizes our data. The p-value is the probability of observing a value at least as extreme as S under H0. We reject H0 if the p-value is below some specified small threshold like α=0.05 and we say something like "H0 is rejected at 0.05 significance level." This threshold or significance level (α) upper bounds the probability of false rejection, i.e. rejecting H0 when it is correct.

**Example:** We toss a coin 1000 times and observe 532 heads, 468 tails. Is this a fair coin? In this example the null hypothesis H0 is that the coin is fair, observation X is the sequence of heads and tails, and the statistic S is the number of heads. The p-value, or probability of S ∉ [469,531] under H0, can be calculated as:
\[ 1 - \sum_{k=469}^{531} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.04629 \]
We can reject the null hypothesis at 0.05 significance level and decide the coin is biased. But should we?

**Objection 1:** (MacKay 2003, pp.63) What we would actually like to know is the probability of H0 given that we observed 532 heads. Unfortunately the p-value 0.04629 is not that probability (although this is a common confusion). We can't calculate a probability for H0 unless we specify some alternatives. Come to think of it, how can we reject a hypothesis if we don't look at what the alternatives are? What if the alternatives are worse? So let's specify a "biased coin" alternative (H1) which assumes that the head probability of the coin (θ) is distributed uniformly between 0 and 1 (other ways of specifying H1 are possible and do not effect the conclusion). We have:
\[ P(S=532 \mid H_0) = {1000 \choose 532} \left(\frac{1}{2}\right)^{1000} = 0.003256 \]
\[ P(S=532 \mid H_1) = \int_0^1 {1000 \choose 532} \theta^{532} (1-\theta)^{468} d\theta = 0.001 \]
So H0 makes our data 3.2 times more likely than H1! And here the p-value almost made us think the data was 1:20 in favor of the "biased" hypothesis.

**Objection 2:** (Berger 1982, pp.13) Well, now that we understand p-value is not the probability of H0, does it tell us anything useful? According to the definition it limits the false rejection rate, i.e. if we always use significance tests with a p-value threshold α=0.01, we can be assured of incorrectly rejecting only 1% of correct hypotheses in the long run. So does that mean when I reject a null hypothesis I am only mistaken 1% of the time? Of course not! P(reject|correct) is 1%, P(correct|reject) can be anything! Here is an example:

X=1 | X=2 | |

H0 | .01 | .99 |

H1 | .01001 | .98999 |

The table gives the probabilities the two hypotheses H0 and H1 assign to different outcomes X=1 or X=2. Say we observe X=1. We reject H0 at α=0.01 significance level. But there is very little evidence against H0, the likelihood ratio P(X|H1)/P(X|H0) is very close to 1, so the chance of being in error is about 1/2 (assuming H0 and H1 are a-priori equally likely). Thus α=0.01 is providing a very misleading and false sense of security when rejection actually occurs.

**Objection 3:** (Murphy 2013, pp.213) Consider two experiments. In the first one we toss a coin 1000 times and observe 474 tails. Using T=474 as our statistic the one sided p-value is P(T≤474|H0):
\[ \sum_{k=0}^{474} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.05337 \]
So at a significance level of α=0.05 we do not reject the null hypothesis of an unbiased coin.

In the second experiment we toss the coin until we observe 474 tails, and it happens to take us 1000 trials. Different intention, same data. This time N=1000 is the natural test statistic and the one sided p-value is P(N≥1000|H0): \[ \sum_{n=1000}^\infty {n-1 \choose 473} \left(\frac{1}{2}\right)^n = 0.04994 \] Suddenly we are below the magical α=0.05 threshold and we can reject the null hypothesis. The observed data, thus the likelihoods of any hypotheses for this data have not changed. The p-value is based not just on what actually happened, but what could have happened. This is clearly absurd.

**Objection 4:** (Cumming 2012) If we base the fate of our hypotheses on p-values computed from experiments, at the very least we should expect the p-values (thus our critical decisions) to change very little when we replicate the experiments. Unfortunately p-values do not even give us stability, as this wonderful video "Dance of the p values" by Geoff Cumming illustrates:

**Conclusion:** (Jaynes 2003, pp.524) expressed the absurdity of significance testing best:

In order to argue for an hypothesis H1 that some effect exists, one does it indirectly: invent a "null hypothesis" H0 that denies any such effect, then argue against H0 in a way that makes no reference to H1 at all (that is, using only probabilities conditional on H0). To see how far this procedure takes us from elementary logic, suppose we decide that the effect exists; that is, we reject H0. Surely, we must also reject probabilities conditional on H0; but then what was the logical justification for the decision? Orthodox logic saws off its own limb.

Harold Jeffreys (1939, p. 316) expressed his astonishment at such limb-sawing reasoning by looking at a different side of it: "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."

Full post...

## May 17, 2015

### Beginning deep learning with 500 lines of Julia (Version 0.1)

Click here for a newer version (Knet7) of this tutorial. The code used in this version (KUnet) has been deprecated.

Click here for an older version (v0.0) of this tutorial.

OK, first a disclaimer: this version of KUnet.jl, my deep learning code for Julia, is a bit more than 500 lines, but it is still under 1000 lines and it supports convolution and pooling, new activation and loss functions, arrays of arbitrary dimensionality with 32 and 64 bit floats etc. See here for installation instructions. We will use the MNIST dataset to illustrate basic usage of KUnet:

```
julia> include(Pkg.dir("KUnet/test/mnist.jl"))
```

This may take a bit the first time you run to download the data.

Next we tell Julia we intend to use KUnet, and some variables from MNIST:

```
julia> using KUnet
julia> using MNIST: xtrn, ytrn, xtst, ytst
```

The MNIST variables are Float32 matrices. The x matrices have pixel values scaled to [0.0:1.0] for a 28x28 image on each column. The y matrices have 10 rows indicating the 10 classes with a single nonzero entry for the correct class in each column.

```
julia> xtrn, ytrn, xtst, ytst
(
784x60000 Array{Float32,2}: ...
10x60000 Array{Float32,2}: ...
784x10000 Array{Float32,2}: ...
10x10000 Array{Float32,2}: ...
)
```

Before using KUnet, we should specify the array type and the element type we want to use. The array type determines whether KUnet uses the GPU, and the element type should match that of the data.

```
julia> KUnet.atype(CudaArray) # CudaArray or Array
julia> KUnet.ftype(Float32) # Float32 or Float64
```

Let's construct a neural net with a single layer of 64 hidden units using the relu activation function and the cross entropy loss function.

```
julia> net = [ Mmul(64,784), Bias(64), Relu(),
Mmul(10,64), Bias(10), XentLoss() ]
```

Each element of the net array represents an operation, e.g. Mmul multiplies its input with a weight matrix, Bias adds a bias vector, Relu applies the rectified linear transformation to each element etc. They are subtypes of an abstract type called Layer. The full list of Layers currently implemented are: Bias, Conv, Drop, Logp, Mmul, Pool, Relu, Sigm, Soft, Tanh, LogpLoss, QuadLoss, SoftLoss, XentLoss.

A Net is simply a 1-D array of Layers. Here are the definitions from net.jl and bias.jl:

```
abstract Layer
typealias Net Array{Layer,1}
type Bias <: Layer; b::Param; Bias(b::Param)=new(b); end
```

If you are not happy with the default Layer constructors, you can specify your own parameters. For example, the Mmul(64,784) constructor fills a (64,784) weight matrix with random weights from a Gaussian distribution with std=0.01. If we want a different initialization, we could create a weight matrix any way we want and pass it to the Mmul constructor instead. Note that the weight matrix for an Mmul layer with 784 inputs and 64 outputs has size (64, 784).

```
julia> w1 = randn(64, 784) * 0.05
julia> l1 = Mmul(w1)
```

Training parameters like the learning rate (lr) can be specified at layer construction, or using setparam! on the whole network or on individual layers.

```
julia> l1 = Mmul(64,784; lr=0.01)
julia> setparam!(l1; lr=0.01)
julia> setparam!(net; lr=0.01)
```

It is also possible to save nets to
JLD files using
`savenet(fname::String, n::Net)`

and read them using
`loadnet(fname::String)`

. Let's save our initial random network for
replicatibility.

```
julia> savenet("net0.jld", net)
```

OK, now that we have some data and a network, let's proceed with training. Here is a convenience function to measure the classification accuracy:

```
julia> accuracy(y,z)=mean(findmax(y,1)[2] .== findmax(z,1)[2])
```

Let's do 100 epochs of training:

```
@time for i=1:100
train(net, xtrn, ytrn; batch=128)
println((i, accuracy(ytst, predict(net, xtst)),
accuracy(ytrn, predict(net, xtrn))))
end
```

If you take a look at the definition of `train`

in
net.jl,
you will see that it takes a network net, the input x, and the desired
output y, and after splitting the data into minibatches it just calls
`backprop`

and `update`

. Here is a simplified version:

```
train(net, x, y)=(backprop(net, x, y); update(net))
```

The `backprop`

function calls `forw`

which computes the network
output, and `back`

which computes the gradients of the network
parameters with respect to the loss function:

```
backprop(net, x, y)=(forw(net, x); back(net, y))
```

The `forw`

and `back`

functions for a Net simply call the `forw`

and
`back`

functions of each layer in order, feeding the output of one to
the input of the next:

```
forw(n::Net, x)=(for i=1:length(n); x=forw(n[i], x); end)
back(n::Net, y)=(for i=length(n):-1:1; y=back(n[i], y); end)
```

The `forw`

function for a layer takes the layer input x, and returns
the layer output y. The `back`

function for a layer takes the loss
gradient wrt its output dy and returns the loss gradient wrt its input
dx. If the layer has a parameter p, `back`

also computes the loss
gradient p.diff wrt its current value p.data. You can take a look at
individual layer definitions (e.g. in
mmul.jl,
bias.jl,
relu.jl,
etc.) to see how this is done for each layer.

The final layer of the network
(XentLoss
in our case) is a subtype of LossLayer. LossLayer is a special type
of layer: its forw does nothing but record the network output y. Its
back expects the desired output z (not a gradient) and computes the
loss gradient wrt the network output dy. A LossLayer also implements
the function `loss(l::LossLayer,z)`

which returns the actual loss
value given the desired output z. KUnet currently implements the
following LossLayers:
LogpLoss,
QuadLoss,
SoftLoss,
XentLoss.

The `update`

function for a net calls the `update`

function for each
of its layers, which in turn calls the `update`

function on layer
parameters:

```
update(n::Net)=(for l in n; update(l); end)
update(l::Bias)=update(l.b)
```

The `update`

function for a parameter p is used to update its values
(p.data) given the loss gradients (p.diff). Its behavior is
controlled by the following parameters: lr, l1reg, l2reg, adagrad,
momentum, nesterov. Here is a simplified definition of `update`

from
param.jl
(p.ada, p.mom, and p.nes are temporary arrays initialized to 0):

```
function update(p::Param; o...)
isdefined(p,:l1reg) && (p.diff += p.l1reg * sign(p.data))
isdefined(p,:l2reg) && (p.diff += p.l2reg * p.data)
isdefined(p,:adagrad) && (p.ada += p.diff.^2; p.diff /= p.adagrad + sqrt(p.ada))
isdefined(p,:momentum) && (p.diff += p.momentum * p.mom; p.mom[:] = p.diff)
isdefined(p,:nesterov) && (p.nes *= p.nesterov; p.nes += p.diff; p.diff += p.nesterov * p.nes)
isdefined(p,:lr) && (p.diff *= p.lr)
p.data -= p.diff
end
```

Our training should print out the test set and training set accuracy at the end of every epoch.

```
(1,0.3386,0.3356)
(2,0.7311,0.7226666666666667)
(3,0.821,0.8157333333333333)
...
(99,0.9604,0.9658166666666667)
(100,0.9604,0.96605)
elapsed time: 39.738191211 seconds (1526525108 bytes allocated, 3.05% gc time)
```

Note that for actual research we should not be looking at the test set accuracy at this point. We should instead split the training set into a training and a development portion and do all our playing around with those. We should also run each experiment 10 times with different random seeds and measure standard errors, etc. But, this is just a KUnet tutorial.

It seems the training set accuracy is not that great. Maybe increasing the learning rate may help:

```
net = loadnet("net0.jld")
setparam!(net, lr=0.5)
# same for loop...
(1,0.9152,0.9171833333333334)
(2,0.9431,0.9440333333333333)
(3,0.959,0.9611666666666666)
...
(59,0.9772,0.9999833333333333)
(60,0.9773,1.0)
...
(100,0.9776,1.0)
```

Wow! We got 100% training set accuracy in 60 epochs. This should drive home the importance of setting a good learning rate.

But the test set is still lagging behind. What if we try increasing the number of hidden units:

```
for h in (128, 256, 512, 1024)
net = [Mmul(h,784), Bias(h), Relu(), Mmul(10,h), Bias(10), XentLoss()]
setparam!(net; lr=0.5)
for i=1:100
train(net, xtrn, ytrn; batch=128)
println((i, accuracy(ytst, predict(net, xtst)),
accuracy(ytrn, predict(net, xtrn))))
end
end
# Number of epochs and test accuracy when training accuracy reaches 1.0:
# 128: (43,0.9803,1.0)
# 256: (42,0.983,1.0)
# 512: (36,0.983,1.0)
# 1024: (30,0.9833,1.0)
```

This improvement is unexpected, we were already overfitting with 64 hidden units, and common wisdom is not to increase the capacity of the network by increasing the hidden units in that situation. Maybe we should try dropout:

```
net = [Drop(0.2), Mmul(1024,784), Bias(1024), Relu(),
Drop(0.5), Mmul(10,1024), Bias(10), XentLoss()]
# lr=0.5, same for loop
...
(100,0.9875,0.9998166666666667)
elapsed time: 122.898730432 seconds (1667849932 bytes allocated, 0.96% gc time)
```

Or bigger and bigger nets:

```
net = [Drop(0.2), Mmul(4096,784), Bias(4096), Relu(),
Drop(0.5), Mmul(4096,4096), Bias(4096), Relu(),
Drop(0.5), Mmul(10,4096), Bias(10), XentLoss()]
# lr=0.5, same for loop
...
(100,0.9896,0.9998166666666667)
elapsed time: 804.242212488 seconds (1080 MB allocated, 0.02% gc time in 49 pauses with 0 full sweep)
```

Or maybe we should try convolution. Here is an implementation of LeNet:

```
net = [Conv(5,5,1,20), Bias(20), Relu(), Pool(2),
Conv(5,5,20,50), Bias(50), Relu(), Pool(2),
Mmul(500,800), Bias(500), Relu(),
Mmul(10,500), Bias(10), XentLoss()]
setparam!(net; lr=0.1)
# Need to reshape the input arrays for convolution:
xtrn2 = reshape(xtrn, 28, 28, 1, size(xtrn, 2))
xtst2 = reshape(xtst, 28, 28, 1, size(xtst, 2))
# same for loop
...
(100,0.9908,1.0)
elapsed time: 360.722851006 seconds (5875158944 bytes allocated, 1.95% gc time)
```

OK, that's enough fiddling around. I hope this gave you enough to get your hands dirty. We are already among the better results on the MNIST website. I am sure you can do better playing around with the learning rate, dropout probabilities, momentum, adagrad, regularization, and numbers, sizes, types of layers etc. But be careful, it could become addictive :)

Full post...

## May 07, 2015

### CUDNN Julia port

Full post...

## April 24, 2015

### Big questions

- Physics does not have X. Is X something real and missing in physics, or is X an illusion, a modeling tool, a result of our limited information, agent based perspective, or stupidity? Examples: consciousness, free will, causality and directionality of time.
- Physics does have X which doesn't make sense. Is X something real, or a modeling tool? Examples: randomness, entropy, various sorts of quantum weirdness.

- I was trying to find out what fellow MIT AI Lab alum Gary Drescher is up to after writing Good and Real (which covers most of the big questions mentioned above). I found this video of a talk on causality and choice he gave at the Singularity Summit in 2009. He argues convincingly that determinism and choice are not at odds with each other.
- Speaking of causality, Michael Nielsen (author of Quantum Computation and Quantum Information, as well as a free online book on Neural Networks and Deep Learning) has a nice article summarizing the key points of Judea Pearl's Causality book. The epilogue of the book (well worth the read if you don't have the time or patience for the whole book) is a lecture Pearl gave some years ago where he covers the history of the confusion about causality, why physics (if we model the whole universe) does not require it, how economics and other social sciences (where lack of controlled experiments make causality detection difficult) still do not adequately model it, and how it is actually possible in some cases to derive causality from purely observational data without resorting to controlled experiments.
- Another regular at the Singularity Summit is Eliezer Yudkowsky. I first discovered his posts on Less Wrong, which has a lot of good material that identifies common sources of confusion when thinking about the big questions.

Full post...

## April 01, 2015

### Natural language processing applications of unsupervised lexical category induction (2013-2015)

Full post...

## March 06, 2015

### Alec Radford's animations for optimization algorithms

Alec Radford has created some great animations comparing optimization algorithms SGD, Momentum, NAG, Adagrad, Adadelta, RMSprop (unfortunately no Adam) on low dimensional problems. Also check out his presentation on RNNs.

"Noisy moons:This is logistic regression on noisy moons dataset from sklearn which shows the smoothing effects of momentum based techniques (which also results in over shooting and correction). The error surface is visualized as an average over the whole dataset empirically, but the trajectories show the dynamics of minibatches on noisy data. The bottom chart is an accuracy plot."

"Beale's function:Due to the large initial gradient, velocity based techniques shoot off and bounce around - adagrad almost goes unstable for the same reason. Algos that scale gradients/step sizes like adadelta and RMSProp proceed more like accelerated SGD and handle large gradients with more stability."

"Long valley:Algos without scaling based on gradient information really struggle to break symmetry here - SGD gets no where and Nesterov Accelerated Gradient / Momentum exhibits oscillations until they build up velocity in the optimization direction. Algos that scale step size based on the gradient quickly break symmetry and begin descent."

"Saddle point:Behavior around a saddle point. NAG/Momentum again like to explore around, almost taking a different path. Adadelta/Adagrad/RMSProp proceed like accelerated SGD."

Full post...

### Parsing the Penn Treebank in 60 seconds

Parsers (as well as many other natural language processing algorithms) work by (1) extracting features for the current state, and (2) using a machine learning model to predict the best action / structure based on those features. The feature extraction code is typically messy and irregular and best performed on (possibly multiple) CPUs. The machine learning models can typically be accelerated significantly using GPUs. In this post I will use a greedy transition based parser with a neural network model and figure out how to use both the GPU and the multiple CPU cores effectively. We will take the parser speed from 55 ms/word (with a single CPU) to 0.055 ms/word (using 20 CPU cores and two K40 GPUs). At this speed we can parse the whole Penn Treebank (approx 1M words) in less than 60 seconds.

Some related links:

The code used in this post

Parallel processing for natural language (same idea, Matlab version)

Beginning deep learning with 500 lines of Julia

A greedy transition based parser parses a sentence using the following steps:

function gparse(s::Sentence, n::Net, f::Features) p = ArcHybrid(wcnt(s)) # initialize parser state while (v = valid(p); any(v)) # while we have valid moves x = features(p, s, f) # extract features y = predict(n, x) # score moves y[!v] = -Inf # ignore invalid moves move!(p, indmax(y)) # make the max score move end return p.head end

Here n is a machine learning model, f is a specification of what features to extract, and p represents the parser state. The parser works by extracting features representing the sentence and the current parser state, using the model to score possible moves, and executing the highest scoring move until no valid moves are left.

To parse a whole corpus (array of sentences), we just map gparse to each sentence. Julia can distinguish which gparse we mean by looking at the types of arguments.

typealias Corpus AbstractVector{Sentence} gparse(c::Corpus, n::Net, f::Features)=map(s->gparse(s,n,f), c)

For our first experiment we will parse some sentences on a single CPU core (`Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz`):

julia> nworkers() # we use a single core 1 julia> blas_set_num_threads(1) # including blas julia> using KUnet julia> KUnet.gpu(false) # no gpu yet julia> using KUparser julia> @time KUparser.gparse(dev, net, feats); elapsed time: 2244.3076923076924 seconds

The corpus, dev, is the Penn Treebank WSJ section 22 (1700 sentences, 40117 words); net is a standard feed forward neural network with 1326 input units, 20000 hidden units in a single layer, and 3 output units; feats is a specification of features to be extracted. The parsing speed is 55.944 ms/word. More than 99% of that time is spent on "predict".

In order to speed up "predict", we will use the GPU (`NVIDIA Tesla K20m`):

julia> gnet=copy(net,:gpu) julia> @time KUparser.gparse(dev, gnet, feats); elapsed time: 148.56374417550305 seconds

This gives us 3.704 ms/word, a 15x speed-up. However the GPU can be better utilized if we somehow manage to batch our feature vectors and compute scores for multiple instances in parallel. The problem is parsing a sentence is a serial process, you need the state resulting from the last move in order to compute the features for the next move. The solution is to parse multiple sentences in parallel (thanks to Alkan Kabakcioglu for suggesting this). Different sentences have no dependencies on each other and we can keep track of their states and predict their moves in bulk. The second version of gparse takes an additional "batchsize" argument specifying how many sentences to parse in parallel. This needs some bookkeeping (requiring 80 lines of code for gparse v2 instead of the 10 line beauty you see above), so I won't cut-and-paste it here, you can see the source code if you wish. Here are some experiments with the batched gparse:

julia> @time KUparser.gparse(dev, gnet, feats, 1); elapsed time: 148.725787323 seconds julia> @time KUparser.gparse(dev, gnet, feats, 10); elapsed time: 48.573996933 seconds julia> @time KUparser.gparse(dev, gnet, feats, 100); elapsed time: 25.502507879 seconds julia> @time KUparser.gparse(dev, gnet, feats, 1700); elapsed time: 22.079269825 seconds

As we increase the number of sentences processed in parallel (doing all 1700 sentences in the corpus in parallel in the last example), we get 0.550 ms/word, a 100x speedup from where we started. At this point the time spent on prediction is about a third of the time spent on feature extraction, so let's take another look at features. We will use Julia's parallel computing primitives to group the sentences to be processed on separate cores. The third version of gparse takes yet another argument specifying the number of cpu cores:

function gparse(corpus::Corpus, net::Net, fmat::Features, batch::Integer, ncpu::Integer) d = distribute(corpus, workers()[1:ncpu]) n = copy(net, :cpu) p = pmap(procs(d)) do x gparse(localpart(d), copy(n, :gpu), fmat, batch) end end

The distribute command distributes the corpus equally among ncpu workers, and localpart gives each worker its own subset. We copy the net back and forth between the CPU and the GPU because I couldn't figure out how to pass GPU pointers between different workers. Finally pmap is the parallel map which calls gparse v2 on each worker for the appropriate subset of the corpus, pmerge merges the results. This time we will run the parser on the training set (Sections 02-21, ~40k sentences, ~950k words)

julia> addprocs(20) julia> require("CUDArt") julia> @everywhere CUDArt.device((myid()-1)%CUDArt.devcount()) julia> require("KUparser") julia> @time KUparser.gparse(trn, gnet, feats, 2000, 20); elapsed time: 52.13701401 seconds

The server has 20 CPU cores and 2 GPUs. We create 20 workers, and assign equal numbers to each GPU. Parsing 950k words takes 52 seconds (0.055 ms/word), a 1000x speedup from where we started.

Full post...

## February 28, 2015

### Beginning deep learning with 500 lines of Julia (20150228)

Click here for a newer version (Knet7) of this tutorial. The code used in this version (KUnet) has been deprecated.

There are a number of deep learning packages out there. However most sacrifice readability for efficiency. This has two disadvantages: (1) It is difficult for a beginner student to understand what the code is doing, which is a shame because sometimes the code can be a lot simpler than the underlying math. (2) Every other day new ideas come out for optimization, regularization, etc. If the package used already has the trick implemented, great. But if not, it is difficult for a researcher to test the new idea using impenetrable code with a steep learning curve. So I started writing KUnet.jl which currently implements backprop with basic units like relu, standard loss functions like softmax, dropout for generalization, L1-L2 regularization, and optimization using SGD, momentum, ADAGRAD, Nesterov's accelerated gradient etc. in less than 500 lines of Julia code. Its speed is competitive with the fastest GPU packages (here is a benchmark). For installation and usage information, please refer to the GitHub repo. The remainder of this post will present (a slightly cleaned up version of) the code as a beginner's neural network tutorial (modeled after Honnibal's excellent parsing example).#### Contents

Layers and netsThe layer function

Prediction

Loss functions

Backpropagation Training

Activation functions

Preprocessing functions

Why Julia?

Future work

#### Layers and nets

A feed forward neural network consists of layers.typealias Net Array{Layer,1}

type Layer w; b; f; fx; ... end

#### The layer function (forw)

Given its parameters and functions, here is how a layer computes its output from its input:function forw(l::Layer, x) l.x = l.fx(l, x) # preprocess the input l.y = l.w * l.x # multiply with weight matrix l.y = l.y .+ l.b # add the bias vector (to every column) l.y = l.f(l,l.y) # apply the activation fn to the output end

#### Prediction

Once we have forw for a single layer, calculating a prediction for the whole network is literally a one-liner:forw(n::Net, x)=(for l=n x=forw(l,x) end; x)

#### Loss functions

In order to train the network, we need a loss function. A loss function takes y, the network output, and dy, the desired output, and returns a loss value which we try to minimize during training. In addition, it calculates the gradient of the loss with respect to y, which tells us how small changes in the network output will effect the loss. As an example, here is the implementation of softmaxloss (albeit a bit low level) in func.jl:function softmaxloss(y, dy) yrows,ycols = size(y) loss = zero(eltype(y)) prob = similar(y, yrows) for j=1:ycols ymax = y[1,j] for i=2:yrows y[i,j] > ymax && (ymax = y[i,j]) end psum = zero(ymax) for i=1:yrows yij = y[i,j] - ymax prob[i] = exp(yij) psum += prob[i] dy[i,j] == 1 && (loss += yij) end loss -= log(psum) for i=1:yrows prob[i] /= psum dy[i,j] = (prob[i] - dy[i,j]) / ycols end end return loss end

`dy[i,j] = (prob[i] - dy[i,j]) / ycols`

. We should note a couple of important design decisions here: (1) y and dy should have the same dimensionality, so use one-of-k encoding for desired classification outputs. (2) The loss function overwrites dy with the gradient of the loss with respect to y.#### Backpropagation

Backpropagation is the algorithm that calculates the gradient of the loss with respect to network weights, given the gradient of the loss with respect to the network output. This can be accomplished taking a backward pass through the layers after the forward pass and the loss calculation:function backprop(net::Net, x, dy, loss=softmaxloss) y = forw(net, x) # y: network output loss(y, dy) # dy: desired output -> gradient back(net, dy) # calculate derivatives end

back(n::Net, dy) = (for i=length(n):-1:1 dy=back(n[i],dy) end)

function back(l::Layer, dy) dy = l.f(l,l.y,dy) l.dw = dy * l.x' l.db = sum!(l.db, dy) l.dx = l.w' * dy l.dx = l.fx(l,l.x,l.dx) end

#### Training

The gradients calculated by backprop, l.dw and l.db, tell us how much small changes in corresponding entries in l.w and l.b will effect the loss (for the last instance, or minibatch). Small steps in the gradient direction will increase the loss, steps in the opposite direction will decrease the loss. This suggests the following update rule:w = w - dw

Over the years, people have noted many subtle problems with this approach and suggested improvements:

**Step size:**If the step sizes are too small, the SGD algorithm will take too long to converge. If they are too big it will overshoot the optimum and start to oscillate. So we scale the gradients with an adjustable parameter called the learning rate:

w = w - learningRate * dw

**Step direction:**More importantly, it turns out the gradient (or its opposite) is often NOT the direction you want to go in order to minimize loss. Let us illustrate with a simple picture:

The two axes are w1 and w2, two parameters of our network, and the contour plot represents the loss with a minimum at x. If we start at x0, the Newton direction (in red) points almost towards the minimum, whereas the gradient (in green), perpendicular to the contours, points to the right.

Unfortunately Newton's direction is expensive to compute. However, it is also probably unnecessary for several reasons: (1) Newton gives us the ideal direction for second degree objective functions, which our neural network loss almost certainly is not, (2) The loss function whose gradient backprop calculated is

*the loss for the last minibatch/instance only*, which at best is a very noisy version of the real loss function, so we shouldn't spend too much effort getting it exactly right.

So people have come up with various approximate methods to improve the step direction. Instead of multiplying each component of the gradient with the same learning rate, these methods scale them separately using their running average (momentum, Nesterov), or RMS (Adagrad, Rmsprop). I realize this necessarily short summary barely covers what has been implemented in KUnet and doesn't do justice to the literature or cover most of the important ideas. The interested reader can start with a standard textbook on numerical optimization, and peruse the latest papers on optimization in deep learning.

**Minimize what?**The final problem with gradient descent, other than not telling us the ideal step size or direction, is that it is not even minimizing the right objective! We want small loss on never before seen test data, not just on the training data. The truth is, a sufficiently large neural network with a good optimization algorithm can get arbitrarily low loss on any finite training data (e.g. by just memorizing the answers). And it can typically do so in many different ways (typically many different local minima for training loss in weight space exist). Some of those ways will generalize well to unseen data, some won't. And unseen data is (by definition) not seen, so how will we ever know which weight settings will do well on it? There are at least three ways people deal with this problem: (1) Bayes tells us that we should use all possible networks and weigh their answers by how well they do on training data (see Radford Neal's fbm), (2) New methods like dropout or adding distortions and noise to inputs and weights during training seem to help generalization, (3) Pressuring the optimization to stay in one corner of the weight space (e.g. L1, L2, maxnorm regularization) helps generalization.

KUnet views dropout (and other distortion methods) as a preprocessing step of each layer. The other techniques (learning rate, momentum, regularization etc.) are declared as UpdateParam's for l.w in l.pw and for l.b in l.pb for each layer:

type UpdateParam learningRate; l1reg; l2reg; maxnorm; adagrad; momentum; nesterov; ... end

setparam!(p::UpdateParam,k,v) setparam!(l::Layer,k,v) setparam!(net::Net,k,v)

function update(w, dw, o::UpdateParam) initupdate(w, dw, o) nz(o,:l1reg) && l1reg!(o.l1reg, w, dw) nz(o,:l2reg) && l2reg!(o.l2reg, w, dw) nz(o,:adagrad) && adagrad!(o.adagrad, o.ada, dw) nz(o,:learningRate,1f0) && (dw = dw .* o.learningRate) nz(o,:momentum) && momentum!(o.momentum, o.mom, dw) nz(o,:nesterov) && nesterov!(o.nesterov, o.nes, dw) w = w .- dw nz(o,:maxnorm) && maxnorm!(o.maxnorm, w) end nz(o,n,v=0f0)=(isdefined(o,n) && (o.(n) != v))

The "train" function in net.jl is just a wrapper around "backprop" and "update". It goes through the training set once, splitting it into minibatches, and feeding each minibatch through backprop and update.

#### Activation functions

The activation function follows the linear transformation (wx+b) in a layer and is typically a non-linear element-wise function. Without an activation function, multiple layers in a neural network would be useless because the composition of several linear functions is still a linear function. As an example activation function l.f, here is the definition of the relu (rectified linear unit) from func.jl:relu(l,y)=for i=1:length(y) (y[i]<0) && (y[i]=0) end relu(l,y,dy)=for i=1:length(y) (y[i]==0) && (dy[i]=0) end

relu(l,y::CudaArray)=ccall((:reluforw,libkunet), Void,(Cint,Cmat),length(y),y) relu(l,y::CudaArray,dy::CudaArray)=ccall((:reluback,libkunet), Void,(Cint,Cmat,Cmat),length(dy),y,dy)

#### Preprocessing functions

A preprocessing function precedes the linear transformation (wx+b) modifying the input of the layer. Preprocessing functions typically add some noise to the input to improve generalization. For example dropout can be implemented as a preprocessing function of a layer where each input is dropped (replaced by 0) with a given probability. Adding Gaussian noise, or elastic transformations of image inputs can also be implemented as preprocessing functions. Here is the (simplified) dropout implementation from func.jl:function drop(l, x) rand!(l.xdrop) drop(x, l.xdrop, l.dropout, 1/(1-l.dropout)) end function drop(l, x, dx) drop(dx, l.xdrop, l.dropout, 1/(1-l.dropout)) end function drop(x, xdrop, dropout, scale) for i=1:length(x) x[i] = (xdrop[i] < dropout ? 0 : scale * x[i]) end end function drop(x::CudaArray, xdrop::CudaArray, dropout, scale) ccall((:drop,libkunet),Void,(Cint,Cmat,Cmat,Cfloat,Cfloat), length(x),x,xdrop,dropout,scale) end

#### Why Julia?

I wanted to write something that is concise, easy to understand, easy to extend, and reasonably efficient. There is a subtle trade-off between conciseness and extensibility: If we use a very high level language that already has a "neural_network_train" function, we can write very concise code but we lose the ability to change the training algorithm. If we use a very low level language that only provides primitive arithmetic operations, all the algorithm details are exposed and modifiable but the code is bulky and difficult to understand. For a happy medium, the code should reflect the level at which I think of the problem: e.g. I should be able to say A*B if I want to multiply two matrices, and I shouldn't constantly worry about the location and element types of my arrays. That restricts the playing field to a handful of languages. Efficiency requires being able to work with a GPU. Ideally the code should be generic, i.e. algorithms should be expressed once and work whether the data is on the GPU or the CPU memory. Julia has fledgling GPU support but it does have concise matrix arithmetic and it excels at generic operations. So I struggled a bit with trying to get generic matrix operations to work on the GPU (which is nowhere nearly complete but currently good enough to run the KUnet code), and to express each algorithm in as mathematical and concise a manner as possible (which is still work in progress, largely due to the lack of GPU generics and a standard syntax for in-place operations). But the progress so far (and the invaluable support I got from the Julia community) convinced me that this is a worthwhile endeavor.#### Future work

With very little extra effort (and lines of code) one should be able to extend KUnet to implement convolutional and recurrent nets and new tricks like maxout units, maxnorm regularization, rmsprop optimization etc. You can send me suggestions for improvement (both in coding style and new functionality) using comments to this blog post, or using issues or pull requests on GitHub.Full post...

## January 21, 2015

### Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms

*IEEE Transactions on Audio, Speech and Language Processing*, vol 23, no 2, pp 339--350, February. IEEE. (URL, PDF, code)

**Abstract**:
We introduce FDA5 for efficient parameterization,
optimization, and implementation of feature decay algorithms
(FDA), a class of instance selection algorithms that use feature
decay. FDA increase the diversity of the selected training set
by devaluing features (i.e. n-grams) that have already been
included. FDA5 decides which instances to select based on three
functions used for initializing and decaying feature values and
scaling sentence scores controlled with 5 parameters. We present
optimization techniques that allow FDA5 to adapt these functions
to in-domain and out-of-domain translation tasks for different
language pairs. In a transductive learning setting, selection of
training instances relevant to the test set can improve the
final translation quality. In machine translation experiments
performed on the 2 million sentence English-German section of
the Europarl corpus, we show that a subset of the training set
selected by FDA5 can gain up to 3.22 BLEU points compared to
a randomly selected subset of the same size, can gain up to 0.41
BLEU points compared to using all of the available training data
using only 15% of it, and can reach within 0.5 BLEU points to the
full training set result by using only 2.7% of the full training data.
FDA5 peaks at around 8M words or 15% of the full training set.
In an active learning setting, FDA5 minimizes the human effort
by identifying the most informative sentences for translation and
FDA gains up to 0.45 BLEU points using 3/5 of the available
training data compared to using all of it and 1.12 BLEU points
compared to random training set. In translation tasks involving
English and Turkish, a morphologically rich language, FDA5 can
gain up to 11.52 BLEU points compared to a randomly selected
subset of the same size, can achieve the same BLEU score using
as little as 4% of the data compared to random instance selection,
and can exceed the full dataset result by 0.78 BLEU points. FDA5
is able to reduce the time to build a statistical machine translation
system to about half with 1M words using only 3% of the space
for the phrase table and 8% of the overall space when compared
with a baseline system using all of the training data available yet
still obtain only 0.58 BLEU points difference with the baseline
system in out-of-domain translation.

Full post...

## January 20, 2015

### Parallel processing for natural language

Here is the serial version of the main loop. The language is matlab, but I hope it is clear enough as pseudo-code. The specifics of the model, the parser and the features are not all that important. As a baseline, this code takes 10.9 ms/word for parsing, and most of that time is spent in "getfeatures" and "predict".

To speed up "predict", the simplest trick is to perform the matrix operations on the gpu. Many common machine learning models including the neural network, kernel perceptron, svm etc. can be applied using a few matrix operations. In my case declaring the weights of my neural net model as gpuArrays instead of regular arrays improves the speed to 6.24 ms/word without any change in the code.

To speed up "getfeatures" the gpu is useless: feature calculation typically consists of ad-hoc code that tries to summarize the parser state, the sentence and the model in a vector. However we can parse multiple sentences in parallel using multiple cores. Replacing the "for" in line 2 with "parfor" and using a pool of 12 cores improves the performance to 5.03 ms/word with the gpu and 3.70 ms/word without the gpu (here the single gpu in the machine creates a bottleneck for the parallel processes).

A common trick for speeding up machine learning models is to use mini-batches instead of computing the answers one at a time. Consider a common operation: multiplying a weight matrix, representing support vectors or neural network weights, with a column vector, representing a single instance. If you want to perform this operation on 100 instances, we can do this one at a time in a for loop, or we can concatenate all instances into a 100 column matrix and perform a single matrix multiplication. Here are some comparisons, each variation measures the time for processing 10K instances:

This is almost a 100x speed-up going from single instances on the cpu to mini-batches on the gpu! Unfortunately it is not trivial to use mini-batches with history based models, i.e. models where the features of the next instance depend on your answers to the previous instances. In that case it is impossible to ask for "the next 100 instances" before you start providing answers. However typically the sentences are independent of one another and nothing prevents us from asking for "the instances representing the initial states of the next 100 sentences" and concatenate these together in a matrix. Then we can calculate 100 answers in parallel and use them to give us the 100 next states etc. The sentence lengths are different, and they will reach their final states at different times, but we can handle that with some bookkeeping. The following version of the code groups sentences into minibatches and processes them in parallel:

This code runs at 2.80 ms/word with the cpu and 1.67 ms/word with the gpu. If we replace the for loop with parfor we get 1.08 ms/word with the cpu and 0.46 ms/word with the gpu.

Here is a summary of the results:

baseline | 10.9 |

gpu | 6.24 |

parfor | 3.70 |

gpu+parfor | 5.03 |

minibatch | 2.80 |

minibatch+gpu | 1.67 |

minibatch+parfor | 1.08 |

minibatch+gpu+parfor | 0.46 |

Full post...