Deeplearning.net and homepages of Bengio, Schmidhuber have further information, background and links.
Full post...
Abstract: Most traditional distributional similarity models fail to capture syntagmatic patterns that group together multiple word features within the same joint context. In this work we introduce a novel generic distributional similarity scheme under which the power of probabilistic models can be leveraged to effectively model joint contexts. Based on this scheme, we implement a concrete model which utilizes probabilistic n-gram language models. Our evaluations suggest that this model is particularly well-suited for measuring similarity for verbs, which are known to exhibit richer syntagmatic patterns, while maintaining comparable or better performance with respect to competitive baselines for nouns. Following this, we propose our scheme as a framework for future semantic similarity models leveraging the substantial body of work that exists in probabilistic language modeling.
The SVM algorithm solves the following constrained optimization problem: \[ \min_{w,b} \frac{1}{2}|w|^2 + C\sum_i \xi_i\; \mbox{ s.t. }\; y_i(w^\top x_i+b)\geq 1-\xi_i,\; \xi_i \geq 0 \] which is minimizing a combination of the L2 norm of \(w\) and the total hinge loss \(\xi_i=(1-y_i(w^\top x_i+b))_+\). We know the L2 regularization drives \(w_i\) toward zero but not the whole way, so \(w\) won't be sparse. To see about sparsity of support vectors we need to look at the dual form. The Lagrangian for the above constrained optimization is: \[ L = \frac{1}{2}|w|^2 + C\sum_i \xi_i - \sum_i \alpha_i (y_i(w^\top x_i+b)-1+\xi_i) - \sum_i \beta_i \xi_i \] where \(\alpha_i \geq 0, \beta_i \geq 0\) are called the dual variables, and by setting various derivatives to zero we learn what the solution looks like: \[\begin{eqnarray*} \nabla_w L &=& w - \sum_i \alpha_i y_i x_i = 0 \\ \partial_b L &=& \sum_i \alpha_i y_i = 0 \\ \partial_{\xi_i} L &=& C - \alpha_i - \beta_i = 0 \end{eqnarray*}\] The first equation tells us that we can write \(w\) as a linear combination of training instances with weights \(\alpha_i\), and the last equation tells us that \(\alpha_i + \beta_i = C\) for each training instance. In fact there are three types of instances:
It turns out the optimal value of the original minimization problem and the optimal value of the dual maximization problem are the same. The dual always lower bounds the original problem, and in this case the bound is tight. So the following holds at the optimum \(w, \alpha\): \[ \frac{1}{2}|w|^2 + C\sum_i \xi_i = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j) \] The last sum is equal to \(|w|^2\), so rearranging gives us: \[ \|\alpha\|_1 = \|w\|_2^2 + C \sum_i \xi_i \] which means for a separable problem where all \(\xi_i=0\), we have \(\|\alpha\|_1 = \|w\|_2^2\)! If we are minimizing the L2 norm of \(w\), we are also minimizing the L1 norm of the corresponding \(\alpha\)! So how come the perceptron solves the 9vs4 on MNIST with half the number of support vectors?
I think it has to do with the fact that the SVM support vectors are always at (or on the wrong side of) the margin. Sometimes instances on the correct side of the margin may let us express the same hypothesis more compactly. The following picture may help:
We see a bunch of points at the margin, (at least some of) which will be picked by the SVM as support vectors. We also see two points at the poles. A perceptron which encounters one of the two pole points first will never make a mistake again, so it will express the same hypothesis more compactly. RVMs are also known to pick interior points as support vectors and generate more compact models.
Questions for the curious reader:Let us start with the log-linear model. Let \(x\) be the data we are trying to model, \(\phi(x)\) a vector of features, and \(w\) a vector of model parameters. (Here I use \(x\) to denote the whole training set, more typically denoted as \(\{(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\}\), to reduce notational clutter). The generative log-linear model defines the probability of a particular dataset we observed, \(x_o\), as: \[ \begin{eqnarray*} p(x_o) &=& (1/Z) \exp w^\top \phi(x_o) \\ Z &=& \sum_{x} \exp w^\top \phi(x) \end{eqnarray*} \] where \(Z\), the normalization constant, or as physicists call it the partition function, is computed as a sum over all possible observations \(x\). To maximize the log likelihood of the observed data \(x_o\) we compute its gradient with respect to coefficients \(w_k\) that correspond to features \(\phi_k(x)\): \[ \begin{eqnarray*} \log p(x_o) &=& w^\top \phi(x_o) - \log Z \\ \frac{\partial}{\partial w_k} w^\top \phi(x_o) &=& \phi_k(x_o) \\ \frac{\partial}{\partial w_k}\log Z &=& (1/Z) \sum_{x} (\exp w^\top \phi(x)) \, \phi_k(x) \\ &=& \sum_{x} p(x) \phi_k(x) = \langle\phi_k\rangle \\ \frac{\partial}{\partial w_k}\log p(x_o) &=& \phi_k(x_o) - \langle\phi_k\rangle \end{eqnarray*} \] which shows that at the maximum, where the derivative is zero, the model expectation of the k'th feature \(\langle\phi_k\rangle\) should be equal to its value in the observed data \(\phi_k(x_o)\). That is very nice, but it does not tell us how to set \(w_k\) to achieve that maximum, for that we still need to run an optimizer. To compute the gradient for the optimizer, or even the likelihood of a particular \(w\) we need to compute a sum over all possible \(x\), which is completely impractical. The typical solution is to sample a random \(x\) from the distribution \(p(x)\) and use \(\phi_k(x)\) as an unbiased estimate to replace \(\langle\phi_k\rangle\). But sampling from a distribution with an unknown \(Z\) is no easy matter either. Now you know why generative log-linear models are not very popular.
Let us now turn to multinomial models like HMM. How do they define a distribution over all possible observations, yet avoid computing very large sums? They (very carefully) get rid of the big \(Z\). In particular, they come up with a generative story where the observed data is built piece by piece, and the choice for each piece depends only on a subset of the previously selected pieces. For example, an HMM is a simple machine which picks a current state based on its previous state (or last few states), and its current output based on the current state. To compute the probability of a particular state-output sequence we just multiply the conditional probabilities for each of the choices.
It is difficult to find a notation to express multinomial models that is simple, general, and that makes their relation to log-linear models obvious. Here is an attempt: let \(d\) be a decision, \(c\) be a condition, \(p(d|c)\) the probability of making decision \(d\) under condition \(c\), and \(n_o(c,d)\) the number of times we made decision \(d\) under condition \(c\) while constructing the observed data \(x_o\). For example in an HMM, \(d\) would be a member of states and outputs, \(c\) would be a member of states. We can then express the probability of the observed data \(x_o\) as: \[ \begin{eqnarray*} p(x_o) &=& \prod_{c,d} p(d|c)^{n_o(c,d)} \\ \log p(x_o) &=& \sum_{c,d} n_o(c,d) \log p(d|c) \end{eqnarray*} \] If we define \(\phi\) and \(w\) vectors indexed by \((c,d)\) pairs as: \[ \begin{eqnarray*} \phi_{c,d}(x_o) &=& n_o(c,d)\\ w_{c,d} &=& \log p(d|c) \end{eqnarray*} \] we can write the log likelihood as: \[ \log p(x_o) = w^\top \phi(x_o) \] which looks exactly like the log-linear model except for the missing \(Z\) term! Of course that doesn't mean we can pick any old \(w\) vector we want. In particular the probability of our decisions under each condition should sum to 1: \[ \forall c \sum_d p(d|c) = \sum_d \exp w_{c,d} = 1 \] To find the \(w\) that satisfies these constraints and maximizes log-likelihood, we use Lagrange multipliers: \[ \begin{eqnarray*} L(w,\lambda) &=& w^\top \phi(x_o) + \sum_c \lambda_c (1 - \sum_d \exp w_{c,d}) \\ \frac{\partial}{\partial w_{c,d}} L &=& \phi_{c,d}(x_o) - \lambda_c \exp w_{c,d} \\ &=& n_o(c,d) - \lambda_c p(d|c) = 0 \\ p(d|c) &=& n_o(c,d) / \lambda_c \end{eqnarray*} \] which shows that the maximum likelihood estimates for conditional probabilities \(p(d|c)\) are proportional to the observed counts \(n_o(c,d)\). To satisfy the sum to 1 constraint, we obtain \(p(d|c) = n_o(c,d)/n_o(c)\).
So, multinomial models are specific cases of log-linear models after all. However they avoid the big \(Z\) calculation by carefully defining the features and the corresponding parameters so as to ensure \(Z=1\). They do this by decomposing the observation into many small parts and making sure the decisions that generate each part are individually normalized. If each decision involves one of 10 choices and we need to make 20 decisions to construct the whole observation the set of all possible observations would have \(10^{20}\) elements. Normalizing 20 decisions with 10 elements each is easier than normalizing a set with \(10^{20}\) elements.
Questions for the interested reader:
The light sensitive neurons of the retina (120 million rods and 6 million cones) pass their electrical signals through several intermediate layers to ganglion cells, which extend about 1.5 million cell fibers (the optic nerve) into the brain for further processing. Passing through the LGN in mid-brain, the signals end up in the visual cortex at the back where their processing eventually lead to our perception of visual objects and events.
Now, the rods and cones fire when light falls on them, they are simple photoreceptors. Ganglions, on the other hand, receive inputs from a large number of photoreceptors, so we'd like to know their receptive field. The receptive field of a ganglion is the region of the retina that effects its firing when stimulated. Kuffler (1953) showed that cat ganglion cells have a receptive field with a "center surround" pattern:
Note that these cells will respond strongly to a patch of light with the right size and location, but unlike rods and cones, they will not respond strongly to a uniformly bright surface because the excitatory (+) and the inhibitory (-) areas will approximately cancel out.
Hubel and Wiesel (1959) went further and recorded from the neurons of the cat visual cortex to identify their receptive fields. Their attempts at trying to elicit response from cortical neurons with spots of light were unsuccessful in the beginning. A couple of hours into the experiment they accidentally discovered a neuron that "went off like a machine gun" when inserting a glass slide into the projector. It turned out the neuron liked the straight edge of the slide that was moving on the screen. They discovered cortical neurons (which they called simple cells) respond best not to spots, but to bars of light oriented in a particular direction.
So far we have covered the "what?" part of our story. In the following decades research on computer vision took off at the MIT AI Lab and elsewhere. The engineering approach of trying to build systems that could actually see helped us understand why nature may have "designed" the receptive fields the way it has. For example Marr and Hildreth (1980) argue that the center surround receptive fields can be thought of as convolution with the second derivative of a Gaussian and the simple cells detect the zero-crossings of this convolution which facilitate edge detection.
Engineering approaches may answer the "why" but still leave open the question of how neurons know to connect in this intricate pattern. Embryo development is still not completely understood and while it may be reasonable that the same developmental processes that can build "fingers and toes" may create some functional regions in the brain, it is probably not reasonable to expect instructions for neuron #134267542 to connect to neuron # 49726845. At this point in our story came Olshausen and Field (1996) who showed that if you train a neural network with patches of natural images and bias it to preserve information and promote sparseness, you automagically get receptive fields similar to those of Hubel and Wiesel's simple cells:
So the poor neurons had no choice in the matter. It turns out to be inevitable that under some simple conditions random networks exposed to natural images would connect up in a way that an engineer would want them to connect and perform functions similar to that of the neurons in mammalian brains!That brings us to the next natural question: do they need the simulation, the DVD, do they need us at all? Don't they just "exist" independently whether or not we happen to be watching, recording, interacting or anybody has ever thought of them? Max Tegmark advances the Mathematical Universe Hypothesis, which says (1) our universe at the most basic level IS (not just described by) a mathematical structure (an electron or a photon can be defined exactly by a handful of numbers and there is nothing left over), and that (2) any mathematically consistent (computable?) structure must exist in the same sense as we do!
Needless to say this has led to quite a bit of spirited discussion around the web, I would recommend Scott Aaronson's blog post and the comments therein.
1. Test and training error: Why lower training error is not always a good thing: ESL Figure 2.11. Test and training error as a function of model complexity.
2. Under and overfitting: PRML Figure 1.4. Plots of polynomials having various orders M, shown as red curves, fitted to the data set generated by the green curve.
3. Occam's razor: ITILA Figure 28.3. Why Bayesian inference embodies Occam’s razor. This figure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes’ theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. This probability of the data given model Hi, P (D | Hi), is called the evidence for Hi. A simple model H1 makes only a limited range of predictions, shown by P(D|H1); a more powerful model H2, that has, for example, more free parameters than H1, is able to predict a greater variety of data sets. This means, however, that H2 does not predict the data sets in region C1 as strongly as H1. Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C1, the less powerful model H1 will be the more probable model.
4. Feature combinations: (1) Why collectively relevant features may look individually irrelevant, and also (2) Why linear methods may fail. From Isabelle Guyon's feature extraction slides.
5. Irrelevant features: Why irrelevant features hurt kNN, clustering, and other similarity based methods. The figure on the left shows two classes well separated on the vertical axis. The figure on the right adds an irrelevant horizontal axis which destroys the grouping and makes many points nearest neighbors of the opposite class.
6. Basis functions: How non-linear basis functions turn a low dimensional classification problem without a linear boundary into a high dimensional problem with a linear boundary. From SVM tutorial slides by Andrew Moore: a one dimensional non-linear classification problem with input x is turned into a 2-D problem z=(x, x^2) that is linearly separable.
7. Discriminative vs. Generative: Why discriminative learning may be easier than generative: PRML Figure 1.27. Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.
8. Loss functions: Learning algorithms can be viewed as optimizing different loss functions: PRML Figure 7.5. Plot of the ‘hinge’ error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.
9. Geometry of least squares: ESL Figure 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection yˆ represents the vector of the least squares predictions.
10. Sparsity: Why Lasso (L1 regularization or Laplacian prior) gives sparse solutions (i.e. weight vectors with more zeros): ESL Figure 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1^{2} + β2^{2} ≤ t2, respectively, while the red ellipses are the contours of the least squares error function.
Linguistic Category Induction and Tagging Using the Paradigmatic Context Representations with Substitute Words
This thesis introduces a new paradigmatic representation of word contexts. Paradigmatic representations of word context are constructed from the potential substitutes of a word, in contrast to syntagmatic representations, which are constructed from the properties of neighboring words. The potential substitutes are calculated by using a statistical language model that is trained on raw text without any annotation or supervision. Thus, each context is represented as a distribution of substitute words. This thesis introduces models with different properties that can incorporate the new paradigmatic representation, and discusses the applications of these models to the tagging task in natural language processing (NLP).
In a standard NLP tagging task, the goal is to build a model in which the input is a sequence of observed words, and the output, depending on the level of supervision, is a sequence of cluster-ids or predefined tags. We define 5 different models with different properties and supervision requirements. The first model ignores the identity of the word, and clusters the substitute distributions without requiring supervision at any level. The second model represents the co-occurrences of words with their substitute words, and thus incorporates the word identity and context information at the same time. To construct the co-occurrence representation, this model discretizes the substitute distribution. The third model uses probabilistic voting to estimate the distribution of tags in a given context. Unlike the first and second models, this model requires the availability of a word-tag dictionary which can provide all possible tags of each given word. The fourth model proposes two extensions to the standard HMM-based tagging models in which both the word identity and the dependence between consecutive tags are taken into consideration. The last one introduces a generative probabilistic model, the noisy channel model, for the taggin tasks in which the word-tag frequencies are available. In this model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular tag T using a possibly ambiguous word W. To reconstruct the intended message (T), the hearer uses the distribution of possible tags in the given context Pr(TjC) and the possible words that can express each tag Pr(WjT).
The models are applied and analyzed on NLP tagging tasks with different characteristics. The first two models are tested on unsupervised part-of-speech (POS) induction in which the objective is to cluster syntactically similar words into the same group. The probabilistic voting model is tested on the morphological disambiguation of Turkish, with the objective of disambiguating the correct morphological parse of a word, given the available parses. The HMM-based model is applied to the part-of-speech tagging of English, with the objective of determining the correct POS tag of a word, given the available tags. Finally, the last model is tested on the word-sense disambiguation of English, with the objective of determining the correct sense of a word, given the word-sense frequencies.
(Download the latest version from Github)
Abstract: The main design goal of dlib is to make C programming as practical and concise as Perl, and yet to retain its fine grained control over memory use. For years I have been starting my research projects in Perl, because I can get results quickly. Then at some point I rewrite everything in C, because I run out of memory (I usually work with large statistical models). The tiny memory overheads for each string, array, hash, or object add up when you have billions of them. Yet, I cringe at the prospect of expressing the same ideas in C using many more lines of code than Perl. Why should that be the case?
The main design goal of dlib is to make C programming as practical and concise as Perl, and yet to retain its fine grained control over memory use. For years I have been starting my research projects in Perl, because I can get results quickly. Then at some point I rewrite everything in C, because I run out of memory (I usually work with large statistical models). The tiny memory overheads for each string, array, hash, or object add up when you have billions of them. Yet, I cringe at the prospect of expressing the same ideas in C using many more lines of code than Perl. Why should that be the case?
As a motivational example, let's say we want to "count the number of times each word appears in a file". It takes 11 words to express this program in English (7 if you don't count the function words). Perl is just as concise (7 tokens excluding punctuation):
while(<>) {
for (split) {
$cnt{$_}++;
}
}
Imagine what a nightmare this tiny program is in standard C (or especially Java). The main feature that distinguishes Perl among high level languages and enables a Perl programmer to write concise code is that it provides reasonable defaults for everything, so that things you express most often can be written concisely (human languages and compression algorithms use the same trick). And of course as with every modern language these days Perl has built in dynamic arrays, hashes, and regexps and makes them practical to use with its concise syntax. This is my feeble, incomplete, and evolving attempt to import some of these ideas into C. Among other things, dlib introduces some iteration constructs for file I/O and string tokenization and built-in hashes to make programs of this sort almost as concise:
forline (str, NULL) {
fortok (tok, str) {
cnt(tok)++;
}
}
To compile your C code with dlib, #include "dlib.h"
at the beginning
of your files and add dlib.c
to the files to be compiled. My typical
gcc options are:
-O3 -D_GNU_SOURCE -std=c99 -pedantic -Wall
I tried to stick with the C99 standard but used some extensions that
can be turned off. Use the -D_GNU_SOURCE
compiler flag if you want
to compile with these extensions without warnings. Define the
following flags with -D
compiler options if you don't have, or don't
want these extensions:
_NO_POPEN Do not use pipes in File I/O.
_NO_GETLINE Do not use GNU getline.
_NO_PROC Do not use the proc filesystem for memory reporting.
_NO_MUSABLE Do not use GNU malloc_usable_size for memory reporting.
NDEBUG Turn off debug output and assert checks (from assert.h).
forline(s, f)
, is an iteration construct which executes the
statements in its body with the undeclared string variable s
set to
each line in file f
. All the file handle creation, opening and
closing, allocating and freeing of string buffers etc. are taken care
of behind the scenes. The following example prints the length of each
line in "file.txt"
:
forline (str, "file.txt") {
printf("%d\n", strlen(str));
}
There are some special f
arguments:
f==NULL
, stdin is read.f
starts with <
, as in f=="< cmd
args"
, the cmd
is run with args
and its stdout is read.f
is read.fortok(t, s)
is an iteration construct which executes the statements
in its body with the undeclared string variable t
set to each
whitespace separated token in string s
. Empty tokens are skipped.
It modifies and tokenizes s
the same way strtok
does, but unlike
strtok
it is reentry safe (i.e. multiple nested fortok
loops are
ok). fortok3(t, s, d)
takes an additional character array d
to
specify delimiter characters any one of which will act as a delimiter.
fortok(t, s)
is equivalent to fortok3(t, s, " \f\n\r\t\v")
.
Examples:
char *str = strdup(" To be or not");
// need strdup because fortok won't work with constant strings
fortok (tok, str) {
printf("[%s]", tok); // prints "[To][be][or][not]"
}
char *pwd = strdup(":root::/root:/bin/bash:");
fortok3 (tok, pwd, ":/") {
printf("[%s]", tok); // prints "[root][root][bin][bash]"
}
split(char *str, const char *delim, char **argv, size_t argv_len)
returns the tokens of a string in an array. This is useful because
one often needs to refer to the n'th token in a string rather than
iterating over them. split
splits the str
into tokens delimited
by single characters from delim
(including empty tokens) and sets
the pointers in the argv
array to successive tokens. argv
should
have enough space to hold argv_len
pointers. Split stops when
argv_len
tokens are reached or str
runs out. It modifies str
by
replacing delimiter characters with '\0'
. Returns the number of
tokens placed in argv
. Example:
char *pwd = strdup(":root::/root:/bin/bash:");
char **toks = malloc(10 * sizeof(char *));
int n = split(pwd, ":/", toks, 10);
for (int i = 0; i < n; i++) {
printf("[%s]", toks[i]); // prints "[][root][][][root][][bin][bash][]"
}
Note the difference in delimiter handling between split
and
fortok
. fortok
and fortok3
follow strtok
and see multiple
delimiter characters in a row as a single delimiter, whereas split
,
following strsep
, will perceive empty fields. fortok
is intended
for free text, split
for structured data.
NOTES: Neither split
nor fortok
can handle a multi-character
delimiter like "::"
. Not sure if using different delimiter handling
strategies in the two will be confusing for the user. Probably need
something like chop
or trim
eventually.
darr_t
(pronounced "dar-ty") is a general purpose dynamic array
type in dlib.
darr_t darr(size_t n, type t)
returns a new array that has an
initial capacity of at least n
and element type t
. n==0
is
ok. The elements are not initialized.void darr_free(darr_t a)
frees the space allocated for a
.size_t len(darr_t a)
gives the number of elements in a
. A new
array has len(a) == 0
.size_t cap(darr_t a)
gives the current capacity of a
. It will
grow as needed, up to a maximum of (1<<58)
elements.A regular C array reference, such as a[i]
, can be used as an
l-value (a[i] = 10)
, an r-value (x = a[i])
, or both (a[i]++)
. I
think this type of access makes the code more readable and I wanted
the same flexibility with darr_t
. So instead of the usual set
,
get
, add
etc. functions we have a single macro val(darr_t a,
size_t i, type t)
which gives a reference to the i
'th element of
a
that can be used as an l-value or an r-value. All of the
following are valid expressions and do what they look like they are
supposed to:
darr_t a = darr(0, int); // initialize array
val(a, i, int) = 10; // set an element
int x = val(a, i, int); // get an element
val(a, i, int)++; // increment an element
int *p = &val(a, i, int); // get a pointer to an element
val(a, len(a), int) = 5; // add a new element, increments len(a)
The user can request or set any index of a darr_t
from 0
to
(1<<58-1)
. The darr_t
will never complain and resize itself to
permit the access (if memory allows). len(a)
will be one plus the
largest index accessed (read or write). Some may think this gives too
much power to the user to shoot themselves in the foot. A read access
to a never initialized element will return a random value. An
accidental read or write to a very large index may blow up the memory.
Oh well, don't do it.
Hash tables are implemented as dynamic arrays (darr_t
) with some
search logic. You can initialize and free a hash table using darr
and darr_free
exactly like you do with dynamic arrays. The macro
D_HASH
defines an inline function xget
(the prefix x
is user
specified) that searches the array for an element matching a given key
and returns a pointer to it:
etype *xget(darr_t htable, ktype key, bool insert);
The etype
and ktype
are user specified types for array elements
and their keys respectively. The boolean argument insert
determines
what happens when a matching element cannot be found. If insert ==
true
one will be created (in a user specified manner) and added to
the array and xget
will return its pointer. If insert == false
,
xget
will return NULL
.
forhash(etype, eptr, htable, isnull)
is an iteration construct for hash tables which executes the
statements in its body with etype *eptr
bound to each element of
darr_t htable
for which the macro or function isnull
returns
false
.
In order to define the xget
function, the macro D_HASH
takes
the following nine arguments (I am open to suggestions to reduce this
number). The last six can be macros (preferrable) or functions.
get
to allow multiple hash table types (e.g. x
for xget
).ktype keyof(etype e)
returns the key for element e
.bool kmatch(ktype a, ktype b)
returns true if two keys match.size_t khash(ktype k)
is a hash function for keys.etype einit(ktype k)
returns a new element with key matching k
.bool isnull(etype e)
returns true if array element e
is empty.void mknull(etype e)
modifies array element e
to become empty.NOTE: This horribly convoluted interface is necessary to have a
general enough implementation that can support sets (where the key and
the element are one and the same) and maps (where the key is a
function of the element); element types that are primitive
(uint32_t
, char*
) or compound (structs, bit-fields), keys
that can be components of compound elements (e.key
) or
arbitrary functions of primitive ones (e & mask
), arrays that
store the elements themselves or their pointers etc. It is not
intended for daily use. Think of it as your hash table code
generator. Once you correctly generate the code for the few hash
table types you use, you will hopefully never need D_HASH
again.
Here is an example hash table for counting strings:
#include <stdio.h>
#include "dlib.h"
typedef struct strcnt_s { char *key; size_t cnt; } strcnt_t;
#define keyof(e) ((e).key)
extern size_t fnv1a(const char *k); // string hash fn defined in dlib
#define strmatch(a,b) (!strcmp((a),(b)))
#define newcnt(k) ((strcnt_t) { strdup(k), 0 })
#define keyisnull(e) ((e).key == NULL)
#define keymknull(e) ((e).key = NULL)
D_HASH(s, strcnt_t, char *, keyof, strmatch, fnv1a, newcnt, keyisnull, keymknull)
Given the function sget
defined by D_HASH
above, we can now write
our word counting example from the introduction.
#define cnt(k) sget(htable, (k), true)->cnt
int main() {
darr_t htable = darr(0, strcnt_t);
forline (str, NULL) {
fortok (tok, str) {
cnt(tok)++;
}
}
And print the counts:
forhash (strcnt_t, e, htable, keyisnull) {
printf("%s\t%lu\n", e->key, e->cnt);
}
}