Deeplearning.net and homepages of Bengio, Schmidhuber have further information, background and links.
November 12, 2014
Some starting points for deep learning and RNNs
Deeplearning.net and homepages of Bengio, Schmidhuber have further information, background and links.
September 30, 2014
Mode coupling points to functionally important residues in myosin II.
Abstract: Relevance of mode coupling to energy/information transfer during protein function, particularly in the context of allosteric interactions is widely accepted. However, existing evidence in favor of this hypothesis comes essentially from model systems. We here report a novel formal analysis of the near-native dynamics of myosin II, which allows us to explore the impact of the interaction between possibly non-Gaussian vibrational modes on fluctuational dynamics. We show that an information-theoretic measure based on mode coupling alone yields a ranking of residues with a statistically significant bias favoring the functionally critical locations identified by experiments on myosin II.
August 23, 2014
Unsupervised Instance-Based Part of Speech Induction Using Probable Substitutes
August 08, 2014
Volkan Cirik, M.S. 2014
M.S. Thesis: Analysis of SCODE Word Embeddings based on Substitute Distributions in Supervised Tasks. Koç University, Department of Computer Engineering. August, 2014. (PDF, Presentation, word vectors (github), word vectors (dropbox))
Publications: bibtex.php
Abstract
One of the interests of the Natural Language Processing (NLP) community is to find representations for lexical items using large amount of unlabeled data. Inducing low-dimensional, continuous, dense word vectors, or word embeddings, have become the principal technique to find representations for words. Word embeddings address the issues of the classical categorical representation of words by capturing syntactic and semantic information of words in the dimensions of a vector. These representations are shown to be successful across NLP tasks including Named Entity Recognition, Part-of-speech Tagging, Parsing, and Semantic Role Labeling.
In this work, I analyze a word embedding method in supervised Natural Language Processing (NLP) tasks. The framework maps words on a sphere such that words co-occurring in similar contexts lie closely. The similarity of contexts is measured by the distribution of substitutes that can fill them. I compared word embeddings, including more recent representations, in Named Entity Recognition (NER), Chunking, and Dependency Parsing. I examine the framework in a multilingual setup as well. The results show that the examined method achieves as good as or better results compared to the other word embeddings. The framework is consistent in improving the baseline systems across languages and achieves state-of-the-art results in multilingual dependency parsing.
June 26, 2014
Probabilistic Modeling of Joint-context in Distributional Similarity
Abstract: Most traditional distributional similarity models fail to capture syntagmatic patterns that group together multiple word features within the same joint context. In this work we introduce a novel generic distributional similarity scheme under which the power of probabilistic models can be leveraged to effectively model joint contexts. Based on this scheme, we implement a concrete model which utilizes probabilistic n-gram language models. Our evaluations suggest that this model is particularly well-suited for measuring similarity for verbs, which are known to exhibit richer syntagmatic patterns, while maintaining comparable or better performance with respect to competitive baselines for nouns. Following this, we propose our scheme as a framework for future semantic similarity models leveraging the substantial body of work that exists in probabilistic language modeling.
June 11, 2014
Online kernel methods for parsing
June 08, 2014
SVMs are not sparse
The SVM algorithm solves the following constrained optimization problem: \[ \min_{w,b} \frac{1}{2}|w|^2 + C\sum_i \xi_i\; \mbox{ s.t. }\; y_i(w^\top x_i+b)\geq 1-\xi_i,\; \xi_i \geq 0 \] which is minimizing a combination of the L2 norm of \(w\) and the total hinge loss \(\xi_i=(1-y_i(w^\top x_i+b))_+\). We know the L2 regularization drives \(w_i\) toward zero but not the whole way, so \(w\) won't be sparse. To see about sparsity of support vectors we need to look at the dual form. The Lagrangian for the above constrained optimization is: \[ L = \frac{1}{2}|w|^2 + C\sum_i \xi_i - \sum_i \alpha_i (y_i(w^\top x_i+b)-1+\xi_i) - \sum_i \beta_i \xi_i \] where \(\alpha_i \geq 0, \beta_i \geq 0\) are called the dual variables, and by setting various derivatives to zero we learn what the solution looks like: \[\begin{eqnarray*} \nabla_w L &=& w - \sum_i \alpha_i y_i x_i = 0 \\ \partial_b L &=& \sum_i \alpha_i y_i = 0 \\ \partial_{\xi_i} L &=& C - \alpha_i - \beta_i = 0 \end{eqnarray*}\] The first equation tells us that we can write \(w\) as a linear combination of training instances with weights \(\alpha_i\), and the last equation tells us that \(\alpha_i + \beta_i = C\) for each training instance. In fact there are three types of instances:
- \(\alpha_i = 0,\; \beta_i = C,\; \xi_i = 0,\; y_i(w^\top x_i+b)\gt 1 \): These are the points that satisfy the margin.
- \(0\lt\alpha_i\lt C,\; 0\lt\beta_i\lt C,\; \xi_i = 0,\; y_i(w^\top x_i+b)=1 \): These are the points exactly at the margin.
- \(\alpha_i = C,\; \beta_i = 0,\; \xi_i \gt 0,\; y_i(w^\top x_i+b)\lt 1 \): These are the points that violate the margin.
It turns out the optimal value of the original minimization problem and the optimal value of the dual maximization problem are the same. The dual always lower bounds the original problem, and in this case the bound is tight. So the following holds at the optimum \(w, \alpha\): \[ \frac{1}{2}|w|^2 + C\sum_i \xi_i = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j) \] The last sum is equal to \(|w|^2\), so rearranging gives us: \[ \|\alpha\|_1 = \|w\|_2^2 + C \sum_i \xi_i \] which means for a separable problem where all \(\xi_i=0\), we have \(\|\alpha\|_1 = \|w\|_2^2\)! If we are minimizing the L2 norm of \(w\), we are also minimizing the L1 norm of the corresponding \(\alpha\)! So how come the perceptron solves the 9vs4 on MNIST with half the number of support vectors?
I think it has to do with the fact that the SVM support vectors are always at (or on the wrong side of) the margin. Sometimes instances on the correct side of the margin may let us express the same hypothesis more compactly. The following picture may help:
We see a bunch of points at the margin, (at least some of) which will be picked by the SVM as support vectors. We also see two points at the poles. A perceptron which encounters one of the two pole points first will never make a mistake again, so it will express the same hypothesis more compactly. RVMs are also known to pick interior points as support vectors and generate more compact models.
Questions for the curious reader:- Can the projectron solve the problem given in the figure compactly given a bad ordering of points?
- Can we maximize the margin while minimizing the number of support vectors?
- Can we prove learning bounds on hypotheses representable by a small number of support vectors?
- Does L1 SVM sparsify the dual and does it lead to fewer support vectors?
- Can we have nonparametric algorithms which stop growing when no longer necessary and still provide good bounds?
May 18, 2014
Reddit AMA with Yann Lecun
- If you are an undergrad, take as many math and physics course as you can, and learn to program. If you are an aspiring grad student: apply to schools where there is someone you want to work with. It's much more important that the ranking of the school (as long as the school is in the top 50).
- Read, learn from on-line material, try things for yourself. As Feynman said: don't read everything about a topic before starting to work on it. Think about the problem for yourself, figure out what's important, then read the literature. This will allow you to interpret the literature and tell what's good from what's bad.
- Don't get fooled by people who claim to have a solution to Artificial General Intelligence, who claim to have AI systems that work "just like the human brain", or who claim to have figured out how the brain works (well, except if it's Geoff Hinton making the claim). Ask them what error rate they get on MNIST or ImageNet.
- We are using Torch for most of our research projects (and some of our development projects) at Facebook. Deep Mind is also using Torch in a big way (largely because my former student and Torch-co-maintainer Koray Kavukcuoglu sold them on it). Since the Deep Mind acquisition, folks in the Google Brain group in Mountain View have also started to use it.
- Kernel methods are great for many purposes, but they are merely glorified template matching. Despite the beautiful math, a kernel machine is nothing more than one layer of template matchers (one per training sample) where the templates are the training samples, and one layer of linear combinations on top.
- There is nothing magical about margin maximization. It's just another way of saying "L2 regularization" (despite the cute math).
- There is no opposition between deep learning and graphical models. Many deep learning approaches can be seen as factor graphs. I posted about this in the past.
- (On most promising current areas of research): Representation learning (the current crop of deep learning methods is just one way of doing it); learning long-term dependencies; marrying representation learning with structured prediction and/or reasoning; unsupervised representation learning, particularly prediction-based methods for temporal/sequential signals; marrying representation learning and reinforcement learning; using learning to speed up the solution of complex inference problems; theory: do theory (any theory) on deep learning/representation learning; understanding the landscape of objective functions in deep learning; in terms of applications: natural language understanding (e.g. for machine translation), video understanding; learning complex control.
- (On HTM): Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it's the very idea of deep learning) and will use temporal prediction. But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?). I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice. As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.
- Do you think having a PhD is important if you want to work in a good research team in the industry?
Yes.
May 17, 2014
How to learn about deep learning
- Neural Networks for Machine Learning: Hinton's course at Coursera is a wonderful introduction to deep learning. I recommend lectures 11-15 that cover Hopfield nets, Boltzmann machines, sigmoid belief nets and sparse auto-encoders.
- UFLDL tutorial: Tutorial on unsupervised feature learning and deep learning by Andrew Ng. Includes lecture notes and programming exercises. The first programming exercise shows how sparse auto-encoders can self-learn features found in the visual cortex, and the rest shows how these unsupervised features improve performance on supervised visual recognition tasks like MNIST and STL10.
- IPAM graduate summer school on deep learning and feature learning: has video lectures by many of the leaders of the field.
- Torch7: is a Matlab-like environment for state-of-the-art machine learning algorithms used at the IPAM summer school and several research groups on deep learning.
- deeplearning.net lists these and many other resources.
May 15, 2014
That pesky Z: multinomial vs log-linear generative models
Let us start with the log-linear model. Let \(x\) be the data we are trying to model, \(\phi(x)\) a vector of features, and \(w\) a vector of model parameters. (Here I use \(x\) to denote the whole training set, more typically denoted as \(\{(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\}\), to reduce notational clutter). The generative log-linear model defines the probability of a particular dataset we observed, \(x_o\), as: \[ \begin{eqnarray*} p(x_o) &=& (1/Z) \exp w^\top \phi(x_o) \\ Z &=& \sum_{x} \exp w^\top \phi(x) \end{eqnarray*} \] where \(Z\), the normalization constant, or as physicists call it the partition function, is computed as a sum over all possible observations \(x\). To maximize the log likelihood of the observed data \(x_o\) we compute its gradient with respect to coefficients \(w_k\) that correspond to features \(\phi_k(x)\): \[ \begin{eqnarray*} \log p(x_o) &=& w^\top \phi(x_o) - \log Z \\ \frac{\partial}{\partial w_k} w^\top \phi(x_o) &=& \phi_k(x_o) \\ \frac{\partial}{\partial w_k}\log Z &=& (1/Z) \sum_{x} (\exp w^\top \phi(x)) \, \phi_k(x) \\ &=& \sum_{x} p(x) \phi_k(x) = \langle\phi_k\rangle \\ \frac{\partial}{\partial w_k}\log p(x_o) &=& \phi_k(x_o) - \langle\phi_k\rangle \end{eqnarray*} \] which shows that at the maximum, where the derivative is zero, the model expectation of the k'th feature \(\langle\phi_k\rangle\) should be equal to its value in the observed data \(\phi_k(x_o)\). That is very nice, but it does not tell us how to set \(w_k\) to achieve that maximum, for that we still need to run an optimizer. To compute the gradient for the optimizer, or even the likelihood of a particular \(w\) we need to compute a sum over all possible \(x\), which is completely impractical. The typical solution is to sample a random \(x\) from the distribution \(p(x)\) and use \(\phi_k(x)\) as an unbiased estimate to replace \(\langle\phi_k\rangle\). But sampling from a distribution with an unknown \(Z\) is no easy matter either. Now you know why generative log-linear models are not very popular.
Let us now turn to multinomial models like HMM. How do they define a distribution over all possible observations, yet avoid computing very large sums? They (very carefully) get rid of the big \(Z\). In particular, they come up with a generative story where the observed data is built piece by piece, and the choice for each piece depends only on a subset of the previously selected pieces. For example, an HMM is a simple machine which picks a current state based on its previous state (or last few states), and its current output based on the current state. To compute the probability of a particular state-output sequence we just multiply the conditional probabilities for each of the choices.
It is difficult to find a notation to express multinomial models that is simple, general, and that makes their relation to log-linear models obvious. Here is an attempt: let \(d\) be a decision, \(c\) be a condition, \(p(d|c)\) the probability of making decision \(d\) under condition \(c\), and \(n_o(c,d)\) the number of times we made decision \(d\) under condition \(c\) while constructing the observed data \(x_o\). For example in an HMM, \(d\) would be a member of states and outputs, \(c\) would be a member of states. We can then express the probability of the observed data \(x_o\) as: \[ \begin{eqnarray*} p(x_o) &=& \prod_{c,d} p(d|c)^{n_o(c,d)} \\ \log p(x_o) &=& \sum_{c,d} n_o(c,d) \log p(d|c) \end{eqnarray*} \] If we define \(\phi\) and \(w\) vectors indexed by \((c,d)\) pairs as: \[ \begin{eqnarray*} \phi_{c,d}(x_o) &=& n_o(c,d)\\ w_{c,d} &=& \log p(d|c) \end{eqnarray*} \] we can write the log likelihood as: \[ \log p(x_o) = w^\top \phi(x_o) \] which looks exactly like the log-linear model except for the missing \(Z\) term! Of course that doesn't mean we can pick any old \(w\) vector we want. In particular the probability of our decisions under each condition should sum to 1: \[ \forall c \sum_d p(d|c) = \sum_d \exp w_{c,d} = 1 \] To find the \(w\) that satisfies these constraints and maximizes log-likelihood, we use Lagrange multipliers: \[ \begin{eqnarray*} L(w,\lambda) &=& w^\top \phi(x_o) + \sum_c \lambda_c (1 - \sum_d \exp w_{c,d}) \\ \frac{\partial}{\partial w_{c,d}} L &=& \phi_{c,d}(x_o) - \lambda_c \exp w_{c,d} \\ &=& n_o(c,d) - \lambda_c p(d|c) = 0 \\ p(d|c) &=& n_o(c,d) / \lambda_c \end{eqnarray*} \] which shows that the maximum likelihood estimates for conditional probabilities \(p(d|c)\) are proportional to the observed counts \(n_o(c,d)\). To satisfy the sum to 1 constraint, we obtain \(p(d|c) = n_o(c,d)/n_o(c)\).
So, multinomial models are specific cases of log-linear models after all. However they avoid the big \(Z\) calculation by carefully defining the features and the corresponding parameters so as to ensure \(Z=1\). They do this by decomposing the observation into many small parts and making sure the decisions that generate each part are individually normalized. If each decision involves one of 10 choices and we need to make 20 decisions to construct the whole observation the set of all possible observations would have \(10^{20}\) elements. Normalizing 20 decisions with 10 elements each is easier than normalizing a set with \(10^{20}\) elements.
Questions for the interested reader:
- Show that normalization of each part ensures the normalization of the full observation.
- The log-linear solution above used an explicit Z, whereas the multinomial solution used constrained optimization. Can you use constrained optimization for log-linear models or an explicit Z for multinomial models, and does that make a difference?
- LSP mentions disadvantages of locally normalized conditional models compared to globally normalized conditional models. Do the same arguments apply to globally normalized generative log-linear models vs locally normalized generative multinomial models?
- Unsupervised learning can be thought of as hiding part of x from view and learning weights that maximize the likelihood given the visible portion. Is inference with multinomial models still easy in the unsupervised case? How about log-linear models?
- In the world of neural networks, log-linear models are analogous to "energy based", "undirected" or "symmetric" models like Hopfield nets and Boltzmann machines whereas multinomial models are analogous to "causal", "directed" models like feed-forward neural nets or sigmoid belief nets. Show that analogous arguments apply to these models.
- In the world of probabilistic graphical models, log-linear models are instances of "undirected" or "Markov" networks, whereas multinomial models are instances of "directed, acyclic" or "Bayesian" networks. Show that analogous arguments apply to these networks.
April 14, 2014
On the emergence of visual cortex receptive field properties
The light sensitive neurons of the retina (120 million rods and 6 million cones) pass their electrical signals through several intermediate layers to ganglion cells, which extend about 1.5 million cell fibers (the optic nerve) into the brain for further processing. Passing through the LGN in mid-brain, the signals end up in the visual cortex at the back where their processing eventually lead to our perception of visual objects and events.
Now, the rods and cones fire when light falls on them, they are simple photoreceptors. Ganglions, on the other hand, receive inputs from a large number of photoreceptors, so we'd like to know their receptive field. The receptive field of a ganglion is the region of the retina that effects its firing when stimulated. Kuffler (1953) showed that cat ganglion cells have a receptive field with a "center surround" pattern:
Note that these cells will respond strongly to a patch of light with the right size and location, but unlike rods and cones, they will not respond strongly to a uniformly bright surface because the excitatory (+) and the inhibitory (-) areas will approximately cancel out.
Hubel and Wiesel (1959) went further and recorded from the neurons of the cat visual cortex to identify their receptive fields. Their attempts at trying to elicit response from cortical neurons with spots of light were unsuccessful in the beginning. A couple of hours into the experiment they accidentally discovered a neuron that "went off like a machine gun" when inserting a glass slide into the projector. It turned out the neuron liked the straight edge of the slide that was moving on the screen. They discovered cortical neurons (which they called simple cells) respond best not to spots, but to bars of light oriented in a particular direction.
So far we have covered the "what?" part of our story. In the following decades research on computer vision took off at the MIT AI Lab and elsewhere. The engineering approach of trying to build systems that could actually see helped us understand why nature may have "designed" the receptive fields the way it has. For example Marr and Hildreth (1980) argue that the center surround receptive fields can be thought of as convolution with the second derivative of a Gaussian and the simple cells detect the zero-crossings of this convolution which facilitate edge detection.
Engineering approaches may answer the "why" but still leave open the question of how neurons know to connect in this intricate pattern. Embryo development is still not completely understood and while it may be reasonable that the same developmental processes that can build "fingers and toes" may create some functional regions in the brain, it is probably not reasonable to expect instructions for neuron #134267542 to connect to neuron # 49726845. At this point in our story came Olshausen and Field (1996) who showed that if you train a neural network with patches of natural images and bias it to preserve information and promote sparseness, you automagically get receptive fields similar to those of Hubel and Wiesel's simple cells:
So the poor neurons had no choice in the matter. It turns out to be inevitable that under some simple conditions random networks exposed to natural images would connect up in a way that an engineer would want them to connect and perform functions similar to that of the neurons in mammalian brains!April 12, 2014
Our Mathematical Universe by Max Tegmark
That brings us to the next natural question: do they need the simulation, the DVD, do they need us at all? Don't they just "exist" independently whether or not we happen to be watching, recording, interacting or anybody has ever thought of them? Max Tegmark advances the Mathematical Universe Hypothesis, which says (1) our universe at the most basic level IS (not just described by) a mathematical structure (an electron or a photon can be defined exactly by a handful of numbers and there is nothing left over), and that (2) any mathematically consistent (computable?) structure must exist in the same sense as we do!
Needless to say this has led to quite a bit of spirited discussion around the web, I would recommend Scott Aaronson's blog post and the comments therein.
April 06, 2014
March 25, 2014
Emre Ünal, M.S. 2014
M.S. Thesis: A Language Visualization System. Koç University Department of Computer Engineering, March 2014. (PDF, Presentation, Related video, paper)
Publications: bibtex.php
Abstract: In this thesis, a novel language visualization system is presented that converts natural language text into 3D scenes. The system is capable of understanding some concrete nouns, visualizable adjectives and spatial prepositions from full natural language sentences and generating 3D static scenes using these sentences. It is a rule based system that uses natural language processing tools, 3D model galleries and language resources during the process. Several techniques are shown that deals with the generality and ambiguity of the language in order to visualize the natural language text. A question answering module is built as well to answer certain types of spatial inference questions after the scene generation process is completed. The system demonstrates a new way of solving spatial inference problems by not only using the language itself but with the extra information provided by the visualization process.
February 14, 2014
Machine learning in 10 pictures
1. Test and training error: Why lower training error is not always a good thing: ESL Figure 2.11. Test and training error as a function of model complexity.
2. Under and overfitting: PRML Figure 1.4. Plots of polynomials having various orders M, shown as red curves, fitted to the data set generated by the green curve.
3. Occam's razor: ITILA Figure 28.3. Why Bayesian inference embodies Occam’s razor. This figure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes’ theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. This probability of the data given model Hi, P (D | Hi), is called the evidence for Hi. A simple model H1 makes only a limited range of predictions, shown by P(D|H1); a more powerful model H2, that has, for example, more free parameters than H1, is able to predict a greater variety of data sets. This means, however, that H2 does not predict the data sets in region C1 as strongly as H1. Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C1, the less powerful model H1 will be the more probable model.
4. Feature combinations: (1) Why collectively relevant features may look individually irrelevant, and also (2) Why linear methods may fail. From Isabelle Guyon's feature extraction slides.
5. Irrelevant features: Why irrelevant features hurt kNN, clustering, and other similarity based methods. The figure on the left shows two classes well separated on the vertical axis. The figure on the right adds an irrelevant horizontal axis which destroys the grouping and makes many points nearest neighbors of the opposite class.
6. Basis functions: How non-linear basis functions turn a low dimensional classification problem without a linear boundary into a high dimensional problem with a linear boundary. From SVM tutorial slides by Andrew Moore: a one dimensional non-linear classification problem with input x is turned into a 2-D problem z=(x, x^2) that is linearly separable.
7. Discriminative vs. Generative: Why discriminative learning may be easier than generative: PRML Figure 1.27. Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.
8. Loss functions: Learning algorithms can be viewed as optimizing different loss functions: PRML Figure 7.5. Plot of the ‘hinge’ error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.
9. Geometry of least squares: ESL Figure 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection yˆ represents the vector of the least squares predictions.
10. Sparsity: Why Lasso (L1 regularization or Laplacian prior) gives sparse solutions (i.e. weight vectors with more zeros): ESL Figure 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β12 + β22 ≤ t2, respectively, while the red ellipses are the contours of the least squares error function.
February 12, 2014
Mehmet Ali Yatbaz, Ph.D. 2014
PhD Thesis: Linguistic Category Induction and Tagging Using the Paradigmatic Context Representations with Substitute Words. Koç University Department of Computer Engineering, February 2014. (PDF, Presentation, Keynote)
M.S. Thesis: Stretch: A Feature Weighting Method for The k Nearest Neighbor Algorithms. Koç University Department of Computer Engineering, October 2007. (PDF, Presentation).
Publications: bibtex.php
Linguistic Category Induction and Tagging Using the Paradigmatic Context Representations with Substitute Words
This thesis introduces a new paradigmatic representation of word contexts. Paradigmatic representations of word context are constructed from the potential substitutes of a word, in contrast to syntagmatic representations, which are constructed from the properties of neighboring words. The potential substitutes are calculated by using a statistical language model that is trained on raw text without any annotation or supervision. Thus, each context is represented as a distribution of substitute words. This thesis introduces models with different properties that can incorporate the new paradigmatic representation, and discusses the applications of these models to the tagging task in natural language processing (NLP).
In a standard NLP tagging task, the goal is to build a model in which the input is a sequence of observed words, and the output, depending on the level of supervision, is a sequence of cluster-ids or predefined tags. We define 5 different models with different properties and supervision requirements. The first model ignores the identity of the word, and clusters the substitute distributions without requiring supervision at any level. The second model represents the co-occurrences of words with their substitute words, and thus incorporates the word identity and context information at the same time. To construct the co-occurrence representation, this model discretizes the substitute distribution. The third model uses probabilistic voting to estimate the distribution of tags in a given context. Unlike the first and second models, this model requires the availability of a word-tag dictionary which can provide all possible tags of each given word. The fourth model proposes two extensions to the standard HMM-based tagging models in which both the word identity and the dependence between consecutive tags are taken into consideration. The last one introduces a generative probabilistic model, the noisy channel model, for the taggin tasks in which the word-tag frequencies are available. In this model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular tag T using a possibly ambiguous word W. To reconstruct the intended message (T), the hearer uses the distribution of possible tags in the given context Pr(TjC) and the possible words that can express each tag Pr(WjT).
The models are applied and analyzed on NLP tagging tasks with different characteristics. The first two models are tested on unsupervised part-of-speech (POS) induction in which the objective is to cluster syntactically similar words into the same group. The probabilistic voting model is tested on the morphological disambiguation of Turkish, with the objective of disambiguating the correct morphological parse of a word, given the available parses. The HMM-based model is applied to the part-of-speech tagging of English, with the objective of determining the correct POS tag of a word, given the available tags. Finally, the last model is tested on the word-sense disambiguation of English, with the objective of determining the correct sense of a word, given the word-sense frequencies.
January 27, 2014
Turkish Language Resources
- Turkish Data Depository (TDD)
- Our new website that we hope will act as a central location for all computational resources and models for Turkish. Most resources listed below and more will be found there.
- ITU Turkish Natural Language Processing Pipeline
- This webpage provides the Turkish NLP Tools and API developed in Istanbul Technical University by our Natural Language Processing group led by Asst. Prof. Dr. Gülşen Eryiğit including a Tokenizer,
Normalization tools,
Deasciifier,
Vocalizer,
Spelling Corrector,
Turkish word detector,
Morphological Analyzer,
Morphological Disambiguator,
Named Entity Recognizer, and
Dependency Parser.
- TS Corpus
- Taner Sezer's TS Corpus is a 491M token general purpose Turkish corpus. See comments below for details.
- BounWebCorpus
- Hasim Sak's page contains some useful Turkish language resources and code in addition to a large web corpus. (2021-08-26: link broken)
- Bibliography
- Özgür Yılmazel's Bibliography on Turkish Information Retrieval and Natural Language Processing.
- tr-disamb.tgz
- Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.
- correctparses_03.txt.gz, train.merge.gz
- Turkish morphology training files. Semi-automatically tagged, has limited accuracy. Two files have the same data except the second file also includes the ambiguous parses (the first parse on each line is correct). A newer version can be found in TrMor2006.
- test.1.2.dis.gz, test.merge.gz
- Turkish morphology test files, second one includes ambiguous parses (the first parse on each line is correct). The data is hand tagged, it has good accuracy. A newer version can be found in TrMor2006.
- tr-tagger.tgz
- Turkish morphological tagger, includes Oflazer's finite state machines for Turkish. From Kemal Oflazer. Please use with permission. Requires the publically available Xerox Finite State software. For a better tagger use Morse.
- turklex.tgz, pc_kimmo.tgz
- Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu
- Milliyet.clean.bz2
- Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 encoded, originally was in three 11MB parts. From Kemal Oflazer. Please use with permission.
- Turkish wordnet
- From Kemal Oflazer. Please use with permission. (2021-08-26: link broken)
- METU-Sabanci Turkish Treebank
- Turkish treebank with dependency annotations. Please use with permission.
- sozluk.txt.gz
- English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.
- sozluk-boun.txt.gz
- Turkish word list (25822 words, 73K) Originally from www.cmpe.boun.edu.tr/courses/cmpe230
- Avrupa Birliği Temel Terimler Sözlüğü
- (Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)
- BilisimSozlugu.zip
- Bilişim Sözlüğü by Bülent Sankur (Originally from: www.bilisimsozlugu.com, Oct 9, 2006)
- turkish.el
- Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.
- en-tr.zip, lm.tr.gz
- Turkish English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 -- Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to utf8). The Turkish part of the dataset is "selectively split", i.e. some suffixes are separated from their stems, some are not. lm.tr.gz is the Turkish text used to develop the language model.
- Zemberek (blog) (github)
- Zemberek doğal dil işleme kütüphanesi.
January 20, 2014
dlib: Deniz's C Library
(Download the latest version from Github)
Abstract: The main design goal of dlib is to make C programming as practical and concise as Perl, and yet to retain its fine grained control over memory use. For years I have been starting my research projects in Perl, because I can get results quickly. Then at some point I rewrite everything in C, because I run out of memory (I usually work with large statistical models). The tiny memory overheads for each string, array, hash, or object add up when you have billions of them. Yet, I cringe at the prospect of expressing the same ideas in C using many more lines of code than Perl. Why should that be the case?
Contents
Introduction
The main design goal of dlib is to make C programming as practical and concise as Perl, and yet to retain its fine grained control over memory use. For years I have been starting my research projects in Perl, because I can get results quickly. Then at some point I rewrite everything in C, because I run out of memory (I usually work with large statistical models). The tiny memory overheads for each string, array, hash, or object add up when you have billions of them. Yet, I cringe at the prospect of expressing the same ideas in C using many more lines of code than Perl. Why should that be the case?
As a motivational example, let's say we want to "count the number of times each word appears in a file". It takes 11 words to express this program in English (7 if you don't count the function words). Perl is just as concise (7 tokens excluding punctuation):
while(<>) {
for (split) {
$cnt{$_}++;
}
}
Imagine what a nightmare this tiny program is in standard C (or especially Java). The main feature that distinguishes Perl among high level languages and enables a Perl programmer to write concise code is that it provides reasonable defaults for everything, so that things you express most often can be written concisely (human languages and compression algorithms use the same trick). And of course as with every modern language these days Perl has built in dynamic arrays, hashes, and regexps and makes them practical to use with its concise syntax. This is my feeble, incomplete, and evolving attempt to import some of these ideas into C. Among other things, dlib introduces some iteration constructs for file I/O and string tokenization and built-in hashes to make programs of this sort almost as concise:
forline (str, NULL) {
fortok (tok, str) {
cnt(tok)++;
}
}
Compilation
To compile your C code with dlib, #include "dlib.h"
at the beginning
of your files and add dlib.c
to the files to be compiled. My typical
gcc options are:
-O3 -D_GNU_SOURCE -std=c99 -pedantic -Wall
I tried to stick with the C99 standard but used some extensions that
can be turned off. Use the -D_GNU_SOURCE
compiler flag if you want
to compile with these extensions without warnings. Define the
following flags with -D
compiler options if you don't have, or don't
want these extensions:
_NO_POPEN Do not use pipes in File I/O.
_NO_GETLINE Do not use GNU getline.
_NO_PROC Do not use the proc filesystem for memory reporting.
_NO_MUSABLE Do not use GNU malloc_usable_size for memory reporting.
NDEBUG Turn off debug output and assert checks (from assert.h).
File input
forline(s, f)
, is an iteration construct which executes the
statements in its body with the undeclared string variable s
set to
each line in file f
. All the file handle creation, opening and
closing, allocating and freeing of string buffers etc. are taken care
of behind the scenes. The following example prints the length of each
line in "file.txt"
:
forline (str, "file.txt") {
printf("%d\n", strlen(str));
}
There are some special f
arguments:
- If
f==NULL
, stdin is read. - If pipes are available, and
f
starts with<
, as inf=="< cmd args"
, thecmd
is run withargs
and its stdout is read. - If pipes are available, compressed files with .gz, .xz, and .bz2 extensions are automatically handled.
- Otherwise a regular file with path
f
is read.
Tokenization
fortok(t, s)
is an iteration construct which executes the statements
in its body with the undeclared string variable t
set to each
whitespace separated token in string s
. Empty tokens are skipped.
It modifies and tokenizes s
the same way strtok
does, but unlike
strtok
it is reentry safe (i.e. multiple nested fortok
loops are
ok). fortok3(t, s, d)
takes an additional character array d
to
specify delimiter characters any one of which will act as a delimiter.
fortok(t, s)
is equivalent to fortok3(t, s, " \f\n\r\t\v")
.
Examples:
char *str = strdup(" To be or not");
// need strdup because fortok won't work with constant strings
fortok (tok, str) {
printf("[%s]", tok); // prints "[To][be][or][not]"
}
char *pwd = strdup(":root::/root:/bin/bash:");
fortok3 (tok, pwd, ":/") {
printf("[%s]", tok); // prints "[root][root][bin][bash]"
}
split(char *str, const char *delim, char **argv, size_t argv_len)
returns the tokens of a string in an array. This is useful because
one often needs to refer to the n'th token in a string rather than
iterating over them. split
splits the str
into tokens delimited
by single characters from delim
(including empty tokens) and sets
the pointers in the argv
array to successive tokens. argv
should
have enough space to hold argv_len
pointers. Split stops when
argv_len
tokens are reached or str
runs out. It modifies str
by
replacing delimiter characters with '\0'
. Returns the number of
tokens placed in argv
. Example:
char *pwd = strdup(":root::/root:/bin/bash:");
char **toks = malloc(10 * sizeof(char *));
int n = split(pwd, ":/", toks, 10);
for (int i = 0; i < n; i++) {
printf("[%s]", toks[i]); // prints "[][root][][][root][][bin][bash][]"
}
Note the difference in delimiter handling between split
and
fortok
. fortok
and fortok3
follow strtok
and see multiple
delimiter characters in a row as a single delimiter, whereas split
,
following strsep
, will perceive empty fields. fortok
is intended
for free text, split
for structured data.
NOTES: Neither split
nor fortok
can handle a multi-character
delimiter like "::"
. Not sure if using different delimiter handling
strategies in the two will be confusing for the user. Probably need
something like chop
or trim
eventually.
Dynamic arrays
darr_t
(pronounced "dar-ty") is a general purpose dynamic array
type in dlib.
-
darr_t darr(size_t n, type t)
returns a new array that has an initial capacity of at leastn
and element typet
.n==0
is ok. The elements are not initialized. -
void darr_free(darr_t a)
frees the space allocated fora
. -
size_t len(darr_t a)
gives the number of elements ina
. A new array haslen(a) == 0
. -
size_t cap(darr_t a)
gives the current capacity ofa
. It will grow as needed, up to a maximum of(1<<58)
elements.
A regular C array reference, such as a[i]
, can be used as an
l-value (a[i] = 10)
, an r-value (x = a[i])
, or both (a[i]++)
. I
think this type of access makes the code more readable and I wanted
the same flexibility with darr_t
. So instead of the usual set
,
get
, add
etc. functions we have a single macro val(darr_t a,
size_t i, type t)
which gives a reference to the i
'th element of
a
that can be used as an l-value or an r-value. All of the
following are valid expressions and do what they look like they are
supposed to:
darr_t a = darr(0, int); // initialize array
val(a, i, int) = 10; // set an element
int x = val(a, i, int); // get an element
val(a, i, int)++; // increment an element
int *p = &val(a, i, int); // get a pointer to an element
val(a, len(a), int) = 5; // add a new element, increments len(a)
The user can request or set any index of a darr_t
from 0
to
(1<<58-1)
. The darr_t
will never complain and resize itself to
permit the access (if memory allows). len(a)
will be one plus the
largest index accessed (read or write). Some may think this gives too
much power to the user to shoot themselves in the foot. A read access
to a never initialized element will return a random value. An
accidental read or write to a very large index may blow up the memory.
Oh well, don't do it.
Hash tables
Hash tables are implemented as dynamic arrays (darr_t
) with some
search logic. You can initialize and free a hash table using darr
and darr_free
exactly like you do with dynamic arrays. The macro
D_HASH
defines an inline function xget
(the prefix x
is user
specified) that searches the array for an element matching a given key
and returns a pointer to it:
etype *xget(darr_t htable, ktype key, bool insert);
The etype
and ktype
are user specified types for array elements
and their keys respectively. The boolean argument insert
determines
what happens when a matching element cannot be found. If insert ==
true
one will be created (in a user specified manner) and added to
the array and xget
will return its pointer. If insert == false
,
xget
will return NULL
.
forhash(etype, eptr, htable, isnull)
is an iteration construct for hash tables which executes the
statements in its body with etype *eptr
bound to each element of
darr_t htable
for which the macro or function isnull
returns
false
.
In order to define the xget
function, the macro D_HASH
takes
the following nine arguments (I am open to suggestions to reduce this
number). The last six can be macros (preferrable) or functions.
- prefix: The prefix added to
get
to allow multiple hash table types (e.g.x
forxget
). - etype: Type of array element.
- ktype: Type of key.
- keyof:
ktype keyof(etype e)
returns the key for elemente
. - kmatch:
bool kmatch(ktype a, ktype b)
returns true if two keys match. - khash:
size_t khash(ktype k)
is a hash function for keys. - einit:
etype einit(ktype k)
returns a new element with key matchingk
. - isnull:
bool isnull(etype e)
returns true if array elemente
is empty. - mknull:
void mknull(etype e)
modifies array elemente
to become empty.
NOTE: This horribly convoluted interface is necessary to have a
general enough implementation that can support sets (where the key and
the element are one and the same) and maps (where the key is a
function of the element); element types that are primitive
(uint32_t
, char*
) or compound (structs, bit-fields), keys
that can be components of compound elements (e.key
) or
arbitrary functions of primitive ones (e & mask
), arrays that
store the elements themselves or their pointers etc. It is not
intended for daily use. Think of it as your hash table code
generator. Once you correctly generate the code for the few hash
table types you use, you will hopefully never need D_HASH
again.
Here is an example hash table for counting strings:
#include <stdio.h>
#include "dlib.h"
typedef struct strcnt_s { char *key; size_t cnt; } strcnt_t;
#define keyof(e) ((e).key)
extern size_t fnv1a(const char *k); // string hash fn defined in dlib
#define strmatch(a,b) (!strcmp((a),(b)))
#define newcnt(k) ((strcnt_t) { strdup(k), 0 })
#define keyisnull(e) ((e).key == NULL)
#define keymknull(e) ((e).key = NULL)
D_HASH(s, strcnt_t, char *, keyof, strmatch, fnv1a, newcnt, keyisnull, keymknull)
Given the function sget
defined by D_HASH
above, we can now write
our word counting example from the introduction.
#define cnt(k) sget(htable, (k), true)->cnt
int main() {
darr_t htable = darr(0, strcnt_t);
forline (str, NULL) {
fortok (tok, str) {
cnt(tok)++;
}
}
And print the counts:
forhash (strcnt_t, e, htable, keyisnull) {
printf("%s\t%lu\n", e->key, e->cnt);
}
}