## May 15, 2014

### That pesky Z: multinomial vs log-linear generative models

I was thinking about when and how unsupervised learning is possible and practical. First of all we seem to need a generative model (can we even define a discriminative unsupervised model?). A generative model defines a joint probability distribution for all your data, in contrast to conditional models which only define p(y|x) where x is the "input" and y is the "output". LSP defines two types (parametrizations) of generative model: multinomial and log-linear. Multinomial models (Naive Bayes, NGRAM, HMM, PCFG, etc.), are hard to define: you need to imagine an exact sequence of choices that construct each example. Log-linear models are easier to define, they focus more on description than construction: you can pick any property you like as a "feature" and make the probability of each example be defined by the state of its features. When it comes to generating instances or learning parameters though, multinomial models are easier: they can be learnt by simple counting, log-linear models require convex optimization. On the other hand, didn't I show that Naive Bayes, for example, can be seen both as a multinomial model and a generative log-linear model in a previous post? What is going on?

Let us start with the log-linear model. Let $$x$$ be the data we are trying to model, $$\phi(x)$$ a vector of features, and $$w$$ a vector of model parameters. (Here I use $$x$$ to denote the whole training set, more typically denoted as $$\{(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\}$$, to reduce notational clutter). The generative log-linear model defines the probability of a particular dataset we observed, $$x_o$$, as: $\begin{eqnarray*} p(x_o) &=& (1/Z) \exp w^\top \phi(x_o) \\ Z &=& \sum_{x} \exp w^\top \phi(x) \end{eqnarray*}$ where $$Z$$, the normalization constant, or as physicists call it the partition function, is computed as a sum over all possible observations $$x$$. To maximize the log likelihood of the observed data $$x_o$$ we compute its gradient with respect to coefficients $$w_k$$ that correspond to features $$\phi_k(x)$$: $\begin{eqnarray*} \log p(x_o) &=& w^\top \phi(x_o) - \log Z \\ \frac{\partial}{\partial w_k} w^\top \phi(x_o) &=& \phi_k(x_o) \\ \frac{\partial}{\partial w_k}\log Z &=& (1/Z) \sum_{x} (\exp w^\top \phi(x)) \, \phi_k(x) \\ &=& \sum_{x} p(x) \phi_k(x) = \langle\phi_k\rangle \\ \frac{\partial}{\partial w_k}\log p(x_o) &=& \phi_k(x_o) - \langle\phi_k\rangle \end{eqnarray*}$ which shows that at the maximum, where the derivative is zero, the model expectation of the k'th feature $$\langle\phi_k\rangle$$ should be equal to its value in the observed data $$\phi_k(x_o)$$. That is very nice, but it does not tell us how to set $$w_k$$ to achieve that maximum, for that we still need to run an optimizer. To compute the gradient for the optimizer, or even the likelihood of a particular $$w$$ we need to compute a sum over all possible $$x$$, which is completely impractical. The typical solution is to sample a random $$x$$ from the distribution $$p(x)$$ and use $$\phi_k(x)$$ as an unbiased estimate to replace $$\langle\phi_k\rangle$$. But sampling from a distribution with an unknown $$Z$$ is no easy matter either. Now you know why generative log-linear models are not very popular.

Let us now turn to multinomial models like HMM. How do they define a distribution over all possible observations, yet avoid computing very large sums? They (very carefully) get rid of the big $$Z$$. In particular, they come up with a generative story where the observed data is built piece by piece, and the choice for each piece depends only on a subset of the previously selected pieces. For example, an HMM is a simple machine which picks a current state based on its previous state (or last few states), and its current output based on the current state. To compute the probability of a particular state-output sequence we just multiply the conditional probabilities for each of the choices.

It is difficult to find a notation to express multinomial models that is simple, general, and that makes their relation to log-linear models obvious. Here is an attempt: let $$d$$ be a decision, $$c$$ be a condition, $$p(d|c)$$ the probability of making decision $$d$$ under condition $$c$$, and $$n_o(c,d)$$ the number of times we made decision $$d$$ under condition $$c$$ while constructing the observed data $$x_o$$. For example in an HMM, $$d$$ would be a member of states and outputs, $$c$$ would be a member of states. We can then express the probability of the observed data $$x_o$$ as: $\begin{eqnarray*} p(x_o) &=& \prod_{c,d} p(d|c)^{n_o(c,d)} \\ \log p(x_o) &=& \sum_{c,d} n_o(c,d) \log p(d|c) \end{eqnarray*}$ If we define $$\phi$$ and $$w$$ vectors indexed by $$(c,d)$$ pairs as: $\begin{eqnarray*} \phi_{c,d}(x_o) &=& n_o(c,d)\\ w_{c,d} &=& \log p(d|c) \end{eqnarray*}$ we can write the log likelihood as: $\log p(x_o) = w^\top \phi(x_o)$ which looks exactly like the log-linear model except for the missing $$Z$$ term! Of course that doesn't mean we can pick any old $$w$$ vector we want. In particular the probability of our decisions under each condition should sum to 1: $\forall c \sum_d p(d|c) = \sum_d \exp w_{c,d} = 1$ To find the $$w$$ that satisfies these constraints and maximizes log-likelihood, we use Lagrange multipliers: $\begin{eqnarray*} L(w,\lambda) &=& w^\top \phi(x_o) + \sum_c \lambda_c (1 - \sum_d \exp w_{c,d}) \\ \frac{\partial}{\partial w_{c,d}} L &=& \phi_{c,d}(x_o) - \lambda_c \exp w_{c,d} \\ &=& n_o(c,d) - \lambda_c p(d|c) = 0 \\ p(d|c) &=& n_o(c,d) / \lambda_c \end{eqnarray*}$ which shows that the maximum likelihood estimates for conditional probabilities $$p(d|c)$$ are proportional to the observed counts $$n_o(c,d)$$. To satisfy the sum to 1 constraint, we obtain $$p(d|c) = n_o(c,d)/n_o(c)$$.

So, multinomial models are specific cases of log-linear models after all. However they avoid the big $$Z$$ calculation by carefully defining the features and the corresponding parameters so as to ensure $$Z=1$$. They do this by decomposing the observation into many small parts and making sure the decisions that generate each part are individually normalized. If each decision involves one of 10 choices and we need to make 20 decisions to construct the whole observation the set of all possible observations would have $$10^{20}$$ elements. Normalizing 20 decisions with 10 elements each is easier than normalizing a set with $$10^{20}$$ elements.