I am an associate professor of Computer Engineering at Koç University in Istanbul working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later cofounded Inquira, Inc. My research is in natural language processing and machine learning.
For prospective students here are some research topics, papers, classes, blog posts and past students.

Koç Üniversitesi Bilgisayar Mühendisliği Bölümü'nde öğretim üyesiyim ve Yapay Zeka Laboratuarı'nda çalışıyorum. Bundan önce MIT Yapay Zeka Laboratuarı'nda çalıştım ve Inquira, Inc. şirketini kurdum. Araştırma konularım doğal dil işleme ve yapay öğrenmedir. İlgilenen öğrenciler için araştırma konuları, makaleler, verdiğim dersler, Türkçe yazılarım, ve mezunlarımız.

July 10, 2015
June 16, 2015
May 27, 2015
What is wrong with pvalues?
Earlier this year, editors of the journal Basic and Applied Social Psychology announced that the journal would no longer publish papers containing pvalues. The latest American Psychological Association Publication Manual states that researchers should "wherever possible, base discussion and interpretation of results on point and interval estimates," i.e. not pvalues. FDA has been encouraging Bayesian analysis. What is wrong with pvalues?
What is a pvalue? In the classical statistical procedure known as "significance testing", we have a default hypothesis, usually called the null hypothesis and denoted by H0, and we wish to determine whether or not to reject H0 based on some observations X. We choose a statistic S=f(X) (a scalar function of X) that summarizes our data. The pvalue is the probability of observing a value at least as extreme as S under H0. We reject H0 if the pvalue is below some specified small threshold like α=0.05 and we say something like "H0 is rejected at 0.05 significance level." This threshold or significance level (α) upper bounds the probability of false rejection, i.e. rejecting H0 when it is correct.
Example: We toss a coin 1000 times and observe 532 heads, 468 tails. Is this a fair coin? In this example the null hypothesis H0 is that the coin is fair, observation X is the sequence of heads and tails, and the statistic S is the number of heads. The pvalue, or probability of S ∉ [469,531] under H0, can be calculated as: \[ 1  \sum_{k=469}^{531} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.04629 \] We can reject the null hypothesis at 0.05 significance level and decide the coin is biased. But should we?
Objection 1: (MacKay 2003, pp.63) What we would actually like to know is the probability of H0 given that we observed 532 heads. Unfortunately the pvalue 0.04629 is not that probability (although this is a common confusion). We can't calculate a probability for H0 unless we specify some alternatives. Come to think of it, how can we reject a hypothesis if we don't look at what the alternatives are? What if the alternatives are worse? So let's specify a "biased coin" alternative (H1) which assumes that the head probability of the coin (θ) is distributed uniformly between 0 and 1 (other ways of specifying H1 are possible and do not effect the conclusion). We have: \[ P(S=532 \mid H_0) = {1000 \choose 532} \left(\frac{1}{2}\right)^{1000} = 0.003256 \] \[ P(S=532 \mid H_1) = \int_0^1 {1000 \choose 532} \theta^{532} (1\theta)^{468} d\theta = 0.001 \] So H0 makes our data 3.2 times more likely than H1! And here the pvalue almost made us think the data was 1:20 in favor of the "biased" hypothesis.
Objection 2: (Berger 1982, pp.13) Well, now that we understand pvalue is not the probability of H0, does it tell us anything useful? According to the definition it limits the false rejection rate, i.e. if we always use significance tests with a pvalue threshold α=0.01, we can be assured of incorrectly rejecting only 1% of correct hypotheses in the long run. So does that mean when I reject a null hypothesis I am only mistaken 1% of the time? Of course not! P(rejectcorrect) is 1%, P(correctreject) can be anything! Here is an example:
X=1  X=2  
H0  .01  .99 
H1  .01001  .98999 
The table gives the probabilities the two hypotheses H0 and H1 assign to different outcomes X=1 or X=2. Say we observe X=1. We reject H0 at α=0.01 significance level. But there is very little evidence against H0, the likelihood ratio P(XH1)/P(XH0) is very close to 1, so the chance of being in error is about 1/2 (assuming H0 and H1 are apriori equally likely). Thus α=0.01 is providing a very misleading and false sense of security when rejection actually occurs.
Objection 3: (Murphy 2013, pp.213) Consider two experiments. In the first one we toss a coin 1000 times and observe 474 tails. Using T=474 as our statistic the one sided pvalue is P(T≤474H0): \[ \sum_{k=0}^{474} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.05337 \] So at a significance level of α=0.05 we do not reject the null hypothesis of an unbiased coin.
In the second experiment we toss the coin until we observe 474 tails, and it happens to take us 1000 trials. Different intention, same data. This time N=1000 is the natural test statistic and the one sided pvalue is P(N≥1000H0): \[ \sum_{n=1000}^\infty {n1 \choose 473} \left(\frac{1}{2}\right)^n = 0.04994 \] Suddenly we are below the magical α=0.05 threshold and we can reject the null hypothesis. The observed data, thus the likelihoods of any hypotheses for this data have not changed. The pvalue is based not just on what actually happened, but what could have happened. This is clearly absurd.
Objection 4: (Cumming 2012) If we base the fate of our hypotheses on pvalues computed from experiments, at the very least we should expect the pvalues (thus our critical decisions) to change very little when we replicate the experiments. Unfortunately pvalues do not even give us stability, as this wonderful video "Dance of the p values" by Geoff Cumming illustrates:
Conclusion: (Jaynes 2003, pp.524) expressed the absurdity of significance testing best:
In order to argue for an hypothesis H1 that some effect exists, one does it indirectly: invent a "null hypothesis" H0 that denies any such effect, then argue against H0 in a way that makes no reference to H1 at all (that is, using only probabilities conditional on H0). To see how far this procedure takes us from elementary logic, suppose we decide that the effect exists; that is, we reject H0. Surely, we must also reject probabilities conditional on H0; but then what was the logical justification for the decision? Orthodox logic saws off its own limb.
Harold Jeffreys (1939, p. 316) expressed his astonishment at such limbsawing reasoning by looking at a different side of it: "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."
Full post...
May 17, 2015
Beginning deep learning with 500 lines of Julia (Version 0.1)
Click here for an older version (v0.0) of this tutorial. OK, first a disclaimer: this version of KUnet.jl, my deep learning code for Julia, is a bit more than 500 lines, but it is still under 1000 lines and it supports convolution and pooling, new activation and loss functions, arrays of arbitrary dimensionality with 32 and 64 bit floats etc. See here for installation instructions. We will use the MNIST dataset to illustrate basic usage of KUnet:
julia> include(Pkg.dir("KUnet/test/mnist.jl"))
This may take a bit the first time you run to download the data.
Next we tell Julia we intend to use KUnet, and some variables from MNIST:
julia> using KUnet
julia> using MNIST: xtrn, ytrn, xtst, ytst
The MNIST variables are Float32 matrices. The x matrices have pixel values scaled to [0.0:1.0] for a 28x28 image on each column. The y matrices have 10 rows indicating the 10 classes with a single nonzero entry for the correct class in each column.
julia> xtrn, ytrn, xtst, ytst
(
784x60000 Array{Float32,2}: ...
10x60000 Array{Float32,2}: ...
784x10000 Array{Float32,2}: ...
10x10000 Array{Float32,2}: ...
)
Before using KUnet, we should specify the array type and the element type we want to use. The array type determines whether KUnet uses the GPU, and the element type should match that of the data.
julia> KUnet.atype(CudaArray) # CudaArray or Array
julia> KUnet.ftype(Float32) # Float32 or Float64
Let's construct a neural net with a single layer of 64 hidden units using the relu activation function and the cross entropy loss function.
julia> net = [ Mmul(64,784), Bias(64), Relu(),
Mmul(10,64), Bias(10), XentLoss() ]
Each element of the net array represents an operation, e.g. Mmul multiplies its input with a weight matrix, Bias adds a bias vector, Relu applies the rectified linear transformation to each element etc. They are subtypes of an abstract type called Layer. The full list of Layers currently implemented are: Bias, Conv, Drop, Logp, Mmul, Pool, Relu, Sigm, Soft, Tanh, LogpLoss, QuadLoss, SoftLoss, XentLoss.
A Net is simply a 1D array of Layers. Here are the definitions from net.jl and bias.jl:
abstract Layer
typealias Net Array{Layer,1}
type Bias <: Layer; b::Param; Bias(b::Param)=new(b); end
If you are not happy with the default Layer constructors, you can specify your own parameters. For example, the Mmul(64,784) constructor fills a (64,784) weight matrix with random weights from a Gaussian distribution with std=0.01. If we want a different initialization, we could create a weight matrix any way we want and pass it to the Mmul constructor instead. Note that the weight matrix for an Mmul layer with 784 inputs and 64 outputs has size (64, 784).
julia> w1 = randn(64, 784) * 0.05
julia> l1 = Mmul(w1)
Training parameters like the learning rate (lr) can be specified at layer construction, or using setparam! on the whole network or on individual layers.
julia> l1 = Mmul(64,784; lr=0.01)
julia> setparam!(l1; lr=0.01)
julia> setparam!(net; lr=0.01)
It is also possible to save nets to
JLD files using
savenet(fname::String, n::Net)
and read them using
loadnet(fname::String)
. Let's save our initial random network for
replicatibility.
julia> savenet("net0.jld", net)
OK, now that we have some data and a network, let's proceed with training. Here is a convenience function to measure the classification accuracy:
julia> accuracy(y,z)=mean(findmax(y,1)[2] .== findmax(z,1)[2])
Let's do 100 epochs of training:
@time for i=1:100
train(net, xtrn, ytrn; batch=128)
println((i, accuracy(ytst, predict(net, xtst)),
accuracy(ytrn, predict(net, xtrn))))
end
If you take a look at the definition of train
in
net.jl,
you will see that it takes a network net, the input x, and the desired
output y, and after splitting the data into minibatches it just calls
backprop
and update
. Here is a simplified version:
train(net, x, y)=(backprop(net, x, y); update(net))
The backprop
function calls forw
which computes the network
output, and back
which computes the gradients of the network
parameters with respect to the loss function:
backprop(net, x, y)=(forw(net, x); back(net, y))
The forw
and back
functions for a Net simply call the forw
and
back
functions of each layer in order, feeding the output of one to
the input of the next:
forw(n::Net, x)=(for i=1:length(n); x=forw(n[i], x); end)
back(n::Net, y)=(for i=length(n):1:1; y=back(n[i], y); end)
The forw
function for a layer takes the layer input x, and returns
the layer output y. The back
function for a layer takes the loss
gradient wrt its output dy and returns the loss gradient wrt its input
dx. If the layer has a parameter p, back
also computes the loss
gradient p.diff wrt its current value p.data. You can take a look at
individual layer definitions (e.g. in
mmul.jl,
bias.jl,
relu.jl,
etc.) to see how this is done for each layer.
The final layer of the network
(XentLoss
in our case) is a subtype of LossLayer. LossLayer is a special type
of layer: its forw does nothing but record the network output y. Its
back expects the desired output z (not a gradient) and computes the
loss gradient wrt the network output dy. A LossLayer also implements
the function loss(l::LossLayer,z)
which returns the actual loss
value given the desired output z. KUnet currently implements the
following LossLayers:
LogpLoss,
QuadLoss,
SoftLoss,
XentLoss.
The update
function for a net calls the update
function for each
of its layers, which in turn calls the update
function on layer
parameters:
update(n::Net)=(for l in n; update(l); end)
update(l::Bias)=update(l.b)
The update
function for a parameter p is used to update its values
(p.data) given the loss gradients (p.diff). Its behavior is
controlled by the following parameters: lr, l1reg, l2reg, adagrad,
momentum, nesterov. Here is a simplified definition of update
from
param.jl
(p.ada, p.mom, and p.nes are temporary arrays initialized to 0):
function update(p::Param; o...)
isdefined(p,:l1reg) && (p.diff += p.l1reg * sign(p.data))
isdefined(p,:l2reg) && (p.diff += p.l2reg * p.data)
isdefined(p,:adagrad) && (p.ada += p.diff.^2; p.diff /= p.adagrad + sqrt(p.ada))
isdefined(p,:momentum) && (p.diff += p.momentum * p.mom; p.mom[:] = p.diff)
isdefined(p,:nesterov) && (p.nes *= p.nesterov; p.nes += p.diff; p.diff += p.nesterov * p.nes)
isdefined(p,:lr) && (p.diff *= p.lr)
p.data = p.diff
end
Our training should print out the test set and training set accuracy at the end of every epoch.
(1,0.3386,0.3356)
(2,0.7311,0.7226666666666667)
(3,0.821,0.8157333333333333)
...
(99,0.9604,0.9658166666666667)
(100,0.9604,0.96605)
elapsed time: 39.738191211 seconds (1526525108 bytes allocated, 3.05% gc time)
Note that for actual research we should not be looking at the test set accuracy at this point. We should instead split the training set into a training and a development portion and do all our playing around with those. We should also run each experiment 10 times with different random seeds and measure standard errors, etc. But, this is just a KUnet tutorial.
It seems the training set accuracy is not that great. Maybe increasing the learning rate may help:
net = loadnet("net0.jld")
setparam!(net, lr=0.5)
# same for loop...
(1,0.9152,0.9171833333333334)
(2,0.9431,0.9440333333333333)
(3,0.959,0.9611666666666666)
...
(59,0.9772,0.9999833333333333)
(60,0.9773,1.0)
...
(100,0.9776,1.0)
Wow! We got 100% training set accuracy in 60 epochs. This should drive home the importance of setting a good learning rate.
But the test set is still lagging behind. What if we try increasing the number of hidden units:
for h in (128, 256, 512, 1024)
net = [Mmul(h,784), Bias(h), Relu(), Mmul(10,h), Bias(10), XentLoss()]
setparam!(net; lr=0.5)
for i=1:100
train(net, xtrn, ytrn; batch=128)
println((i, accuracy(ytst, predict(net, xtst)),
accuracy(ytrn, predict(net, xtrn))))
end
end
# Number of epochs and test accuracy when training accuracy reaches 1.0:
# 128: (43,0.9803,1.0)
# 256: (42,0.983,1.0)
# 512: (36,0.983,1.0)
# 1024: (30,0.9833,1.0)
This improvement is unexpected, we were already overfitting with 64 hidden units, and common wisdom is not to increase the capacity of the network by increasing the hidden units in that situation. Maybe we should try dropout:
net = [Drop(0.2), Mmul(1024,784), Bias(1024), Relu(),
Drop(0.5), Mmul(10,1024), Bias(10), XentLoss()]
# lr=0.5, same for loop
...
(100,0.9875,0.9998166666666667)
elapsed time: 122.898730432 seconds (1667849932 bytes allocated, 0.96% gc time)
Or bigger and bigger nets:
net = [Drop(0.2), Mmul(4096,784), Bias(4096), Relu(),
Drop(0.5), Mmul(4096,4096), Bias(4096), Relu(),
Drop(0.5), Mmul(10,4096), Bias(10), XentLoss()]
# lr=0.5, same for loop
...
(100,0.9896,0.9998166666666667)
elapsed time: 804.242212488 seconds (1080 MB allocated, 0.02% gc time in 49 pauses with 0 full sweep)
Or maybe we should try convolution. Here is an implementation of LeNet:
net = [Conv(5,5,1,20), Bias(20), Relu(), Pool(2),
Conv(5,5,20,50), Bias(50), Relu(), Pool(2),
Mmul(500,800), Bias(500), Relu(),
Mmul(10,500), Bias(10), XentLoss()]
setparam!(net; lr=0.1)
# Need to reshape the input arrays for convolution:
xtrn2 = reshape(xtrn, 28, 28, 1, size(xtrn, 2))
xtst2 = reshape(xtst, 28, 28, 1, size(xtst, 2))
# same for loop
...
(100,0.9908,1.0)
elapsed time: 360.722851006 seconds (5875158944 bytes allocated, 1.95% gc time)
OK, that's enough fiddling around. I hope this gave you enough to get your hands dirty. We are already among the better results on the MNIST website. I am sure you can do better playing around with the learning rate, dropout probabilities, momentum, adagrad, regularization, and numbers, sizes, types of layers etc. But be careful, it could become addictive :)
Full post...
May 07, 2015
CUDNN Julia port
Full post...