May 18, 2014
Reddit AMA with Yann LeCun
- If you are an undergrad, take as many math and physics courses as you can, and learn to program. If you are an aspiring grad student: apply to schools where there is someone you want to work with. That is much more important than the ranking of the school (as long as the school is in the top 50).
- Read, learn from on-line material, try things for yourself. As Feynman said: don't read everything about a topic before starting to work on it. Think about the problem for yourself, figure out what's important, then read the literature. This will allow you to interpret the literature and tell what's good from what's bad.
- Don't get fooled by people who claim to have a solution to Artificial General Intelligence, who claim to have AI systems that work "just like the human brain", or who claim to have figured out how the brain works (well, except if it's Geoff Hinton making the claim). Ask them what error rate they get on MNIST or ImageNet.
- We are using Torch for most of our research projects (and some of our development projects) at Facebook. Deep Mind is also using Torch in a big way (largely because my former student and Torch-co-maintainer Koray Kavukcuoglu sold them on it). Since the Deep Mind acquisition, folks in the Google Brain group in Mountain View have also started to use it.
- Kernel methods are great for many purposes, but they are merely glorified template matching. Despite the beautiful math, a kernel machine is nothing more than one layer of template matchers (one per training sample), where the templates are the training samples, and one layer of linear combinations on top (see the sketch at the end of this list).
- There is nothing magical about margin maximization. It's just another way of saying "L2 regularization" (despite the cute math).
- There is no opposition between deep learning and graphical models. Many deep learning approaches can be seen as factor graphs. I posted about this in the past.
- (On most promising current areas of research): Representation learning (the current crop of deep learning methods is just one way of doing it); learning long-term dependencies; marrying representation learning with structured prediction and/or reasoning; unsupervised representation learning, particularly prediction-based methods for temporal/sequential signals; marrying representation learning and reinforcement learning; using learning to speed up the solution of complex inference problems; theory: do theory (any theory) on deep learning/representation learning; understanding the landscape of objective functions in deep learning; in terms of applications: natural language understanding (e.g. for machine translation), video understanding; learning complex control.
- (On HTM): Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it's the very idea of deep learning) and will use temporal prediction. But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?). I think Jeff Hawkins, Dileep George and others greatly underestimated the difficulty of reducing these conceptual ideas to practice. As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task.
- Do you think having a PhD is important if you want to work in a good research team in the industry?
Yes.
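To make the "one layer of template matchers plus a linear combination" picture above concrete, here is a minimal sketch, not from the AMA: an RBF kernel and made-up templates and coefficients stand in for what a trained SVM would obtain from its support vectors.

```python
import numpy as np

# A kernel machine's decision function, written to make the "template matching"
# structure explicit: one matching score per training sample, then a linear layer.
def rbf_kernel(x, t, gamma=0.5):
    # Similarity of input x to one stored template t (a training sample).
    return np.exp(-gamma * np.sum((x - t) ** 2))

def kernel_machine(x, templates, alphas, bias=0.0):
    # Layer 1: match x against every training sample (the "templates").
    matches = np.array([rbf_kernel(x, t) for t in templates])
    # Layer 2: a single linear combination of the matching scores.
    return matches @ alphas + bias

# Made-up templates and coefficients, for illustration only.
templates = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alphas = np.array([0.7, -1.2, 0.4])
print(kernel_machine(np.array([0.9, 1.1]), templates, alphas))
```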
May 17, 2014
How to learn about deep learning
- Neural Networks for Machine Learning: Hinton's course at Coursera is a wonderful introduction to deep learning. I recommend lectures 11-15 that cover Hopfield nets, Boltzmann machines, sigmoid belief nets and sparse auto-encoders.
- UFLDL tutorial: Tutorial on unsupervised feature learning and deep learning by Andrew Ng. Includes lecture notes and programming exercises. The first programming exercise shows how sparse auto-encoders can learn, without supervision, features similar to those found in the visual cortex, and the rest show how these unsupervised features improve performance on supervised visual recognition tasks like MNIST and STL10.
- IPAM graduate summer school on deep learning and feature learning: has video lectures by many of the leaders of the field.
- Torch7: is a Matlab-like environment for state-of-the-art machine learning algorithms, used at the IPAM summer school and by several deep learning research groups.
- deeplearning.net lists these and many other resources.
May 15, 2014
That pesky Z: multinomial vs log-linear generative models
Let us start with the log-linear model. Let \(x\) be the data we are trying to model, \(\phi(x)\) a vector of features, and \(w\) a vector of model parameters. (Here I use \(x\) to denote the whole training set, more typically denoted as \(\{(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\}\), to reduce notational clutter). The generative log-linear model defines the probability of a particular dataset we observed, \(x_o\), as: \[ \begin{eqnarray*} p(x_o) &=& (1/Z) \exp w^\top \phi(x_o) \\ Z &=& \sum_{x} \exp w^\top \phi(x) \end{eqnarray*} \] where \(Z\), the normalization constant, or, as physicists call it, the partition function, is computed as a sum over all possible observations \(x\). To maximize the log likelihood of the observed data \(x_o\), we compute its gradient with respect to the coefficients \(w_k\) that correspond to features \(\phi_k(x)\): \[ \begin{eqnarray*} \log p(x_o) &=& w^\top \phi(x_o) - \log Z \\ \frac{\partial}{\partial w_k} w^\top \phi(x_o) &=& \phi_k(x_o) \\ \frac{\partial}{\partial w_k}\log Z &=& (1/Z) \sum_{x} (\exp w^\top \phi(x)) \, \phi_k(x) \\ &=& \sum_{x} p(x) \phi_k(x) = \langle\phi_k\rangle \\ \frac{\partial}{\partial w_k}\log p(x_o) &=& \phi_k(x_o) - \langle\phi_k\rangle \end{eqnarray*} \] which shows that at the maximum, where the derivative is zero, the model expectation of the k'th feature, \(\langle\phi_k\rangle\), should be equal to its value in the observed data, \(\phi_k(x_o)\). That is very nice, but it does not tell us how to set \(w_k\) to achieve that maximum; for that we still need to run an optimizer. To compute the gradient for the optimizer, or even the likelihood of a particular \(w\), we need to compute a sum over all possible \(x\), which is completely impractical. The typical solution is to sample a random \(x\) from the distribution \(p(x)\) and use \(\phi_k(x)\) as an unbiased estimate to replace \(\langle\phi_k\rangle\). But sampling from a distribution with an unknown \(Z\) is no easy matter either. Now you know why generative log-linear models are not very popular.
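To make the role of \(Z\) concrete, here is a minimal numerical sketch (not from the original post): a log-linear model over a tiny discrete space, with \(Z\) and the model expectation \(\langle\phi_k\rangle\) computed by brute-force enumeration. The feature function and weights are invented for illustration.

```python
import itertools
import numpy as np

# Tiny observation space: binary strings of length 4 (only 16 possible x's).
space = list(itertools.product([0, 1], repeat=4))

def phi(x):
    # Made-up feature vector: [number of ones, number of adjacent equal pairs].
    x = np.array(x)
    return np.array([x.sum(), np.sum(x[:-1] == x[1:])], dtype=float)

def log_Z(w):
    # The partition function, computed by summing over every possible observation.
    return np.log(sum(np.exp(w @ phi(x)) for x in space))

def log_p(x, w):
    # log p(x) = w . phi(x) - log Z
    return w @ phi(x) - log_Z(w)

def grad_log_p(x_o, w):
    # phi_k(x_o) - <phi_k>, where <phi_k> is the expectation under the model.
    Z = np.exp(log_Z(w))
    expectation = sum(np.exp(w @ phi(x)) / Z * phi(x) for x in space)
    return phi(x_o) - expectation

w = np.array([0.5, -0.3])   # arbitrary parameters
x_o = (1, 1, 0, 1)          # the "observed" data
print(log_p(x_o, w), grad_log_p(x_o, w))
```

With 16 possible observations the enumeration is trivial; for realistic \(x\) the number of terms grows exponentially, which is exactly the problem described above.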
Let us now turn to multinomial models like HMM. How do they define a distribution over all possible observations, yet avoid computing very large sums? They (very carefully) get rid of the big \(Z\). In particular, they come up with a generative story where the observed data is built piece by piece, and the choice for each piece depends only on a subset of the previously selected pieces. For example, an HMM is a simple machine which picks a current state based on its previous state (or last few states), and its current output based on the current state. To compute the probability of a particular state-output sequence we just multiply the conditional probabilities for each of the choices.
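As a sketch of this piece-by-piece generative story (with made-up numbers, not from the original post), here is the probability of one state-output sequence under a toy two-state HMM, computed as a product of the individual conditional choices:

```python
# Toy HMM with made-up parameters: two states, two output symbols.
init  = {'A': 0.6, 'B': 0.4}                       # p(first state)
trans = {'A': {'A': 0.7, 'B': 0.3},                # p(next state | state)
         'B': {'A': 0.2, 'B': 0.8}}
emit  = {'A': {'x': 0.9, 'y': 0.1},                # p(output | state)
         'B': {'x': 0.3, 'y': 0.7}}

def sequence_prob(states, outputs):
    # Multiply the conditional probability of every individual choice.
    p = init[states[0]] * emit[states[0]][outputs[0]]
    for prev, cur, out in zip(states, states[1:], outputs[1:]):
        p *= trans[prev][cur] * emit[cur][out]
    return p

print(sequence_prob(['A', 'A', 'B'], ['x', 'y', 'y']))
# 0.6 * 0.9 * 0.7 * 0.1 * 0.3 * 0.7  (each factor is one local decision)
```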
It is difficult to find a notation for multinomial models that is simple, general, and makes their relation to log-linear models obvious. Here is an attempt: let \(d\) be a decision, \(c\) be a condition, \(p(d|c)\) the probability of making decision \(d\) under condition \(c\), and \(n_o(c,d)\) the number of times we made decision \(d\) under condition \(c\) while constructing the observed data \(x_o\). For example, in an HMM \(d\) would range over states and outputs, and \(c\) over states. We can then express the probability of the observed data \(x_o\) as: \[ \begin{eqnarray*} p(x_o) &=& \prod_{c,d} p(d|c)^{n_o(c,d)} \\ \log p(x_o) &=& \sum_{c,d} n_o(c,d) \log p(d|c) \end{eqnarray*} \] If we define \(\phi\) and \(w\) vectors indexed by \((c,d)\) pairs as: \[ \begin{eqnarray*} \phi_{c,d}(x_o) &=& n_o(c,d)\\ w_{c,d} &=& \log p(d|c) \end{eqnarray*} \] we can write the log likelihood as: \[ \log p(x_o) = w^\top \phi(x_o) \] which looks exactly like the log-linear model except for the missing \(Z\) term! Of course that doesn't mean we can pick any old \(w\) vector we want. In particular, the probabilities of our decisions under each condition should sum to 1: \[ \forall c: \sum_d p(d|c) = \sum_d \exp w_{c,d} = 1 \] To find the \(w\) that satisfies these constraints and maximizes the log likelihood, we use Lagrange multipliers: \[ \begin{eqnarray*} L(w,\lambda) &=& w^\top \phi(x_o) + \sum_c \lambda_c (1 - \sum_d \exp w_{c,d}) \\ \frac{\partial}{\partial w_{c,d}} L &=& \phi_{c,d}(x_o) - \lambda_c \exp w_{c,d} \\ &=& n_o(c,d) - \lambda_c p(d|c) = 0 \\ p(d|c) &=& n_o(c,d) / \lambda_c \end{eqnarray*} \] which shows that the maximum likelihood estimates of the conditional probabilities \(p(d|c)\) are proportional to the observed counts \(n_o(c,d)\). To satisfy the sum-to-1 constraint we must have \(\lambda_c = n_o(c) = \sum_d n_o(c,d)\), which gives \(p(d|c) = n_o(c,d)/n_o(c)\).
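To connect the two views numerically, here is a small sketch (the condition/decision alphabet and counts are invented): the maximum likelihood conditionals are just normalized counts, and with \(w_{c,d} = \log p(d|c)\) and \(\phi_{c,d}(x_o) = n_o(c,d)\) the log likelihood is a plain inner product \(w^\top \phi(x_o)\), with no \(Z\) term in sight.

```python
import math
from collections import Counter

# Invented (condition, decision) events recorded while "building" x_o.
events = [('s1', 'a'), ('s1', 'a'), ('s1', 'b'),
          ('s2', 'b'), ('s2', 'b'), ('s2', 'c')]

n = Counter(events)                       # n_o(c, d)
n_c = Counter(c for c, _ in events)       # n_o(c)

# Maximum likelihood estimates: p(d|c) = n_o(c,d) / n_o(c).
p = {(c, d): n[c, d] / n_c[c] for (c, d) in n}

# Log likelihood two ways: product over individual decisions vs. w . phi(x_o).
loglik_product = sum(math.log(p[e]) for e in events)      # log prod p(d|c)
w = {cd: math.log(p[cd]) for cd in n}                     # w_{c,d} = log p(d|c)
phi = dict(n)                                             # phi_{c,d}(x_o) = n_o(c,d)
loglik_linear = sum(w[cd] * phi[cd] for cd in n)          # w^T phi(x_o)

print(loglik_product, loglik_linear)      # equal (up to floating point), no Z anywhere
```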
So, multinomial models are specific cases of log-linear models after all. However, they avoid the big \(Z\) calculation by carefully defining the features and the corresponding parameters so as to ensure \(Z=1\). They do this by decomposing the observation into many small parts and making sure that the decisions generating each part are individually normalized. If each decision involves one of 10 choices and we need 20 decisions to construct the whole observation, the set of all possible observations has \(10^{20}\) elements. Normalizing 20 decisions with 10 choices each is a lot easier than normalizing a set with \(10^{20}\) elements.
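As a brute-force sanity check of the \(Z=1\) claim, here is a sketch (a toy model with made-up conditionals, not from the original post) that enumerates every possible observation of a small locally normalized model and confirms that the probabilities already sum to one:

```python
import itertools

# Toy locally normalized model: 3 decisions, each choosing one of 3 options,
# with made-up conditionals p(d | previous decision); 'start' is the initial condition.
choices = ['a', 'b', 'c']
p = {'start': {'a': 0.5, 'b': 0.3, 'c': 0.2},
     'a':     {'a': 0.1, 'b': 0.6, 'c': 0.3},
     'b':     {'a': 0.4, 'b': 0.4, 'c': 0.2},
     'c':     {'a': 0.3, 'b': 0.3, 'c': 0.4}}

def prob(seq):
    # Product of locally normalized decisions (each row of p sums to 1).
    total, cond = 1.0, 'start'
    for d in seq:
        total *= p[cond][d]
        cond = d
    return total

# 3^3 = 27 possible observations; local normalization makes them sum to 1, so Z = 1.
Z = sum(prob(seq) for seq in itertools.product(choices, repeat=3))
print(Z)   # 1.0 (up to floating point)
```

The check visits \(3^3 = 27\) sequences, but the model itself only had to normalize four tables of three entries each.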
Questions for the interested reader:
- Show that normalization of each part ensures the normalization of the full observation.
- The log-linear solution above used an explicit Z, whereas the multinomial solution used constrained optimization. Can you use constrained optimization for log-linear models or an explicit Z for multinomial models, and does that make a difference?
- LSP mentions disadvantages of locally normalized conditional models compared to globally normalized conditional models. Do the same arguments apply to globally normalized generative log-linear models vs locally normalized generative multinomial models?
- Unsupervised learning can be thought of as hiding part of \(x\) from view and learning weights that maximize the likelihood of the visible portion. Is inference with multinomial models still easy in the unsupervised case? How about log-linear models?
- In the world of neural networks, log-linear models are analogous to "energy based", "undirected" or "symmetric" models like Hopfield nets and Boltzmann machines whereas multinomial models are analogous to "causal", "directed" models like feed-forward neural nets or sigmoid belief nets. Show that analogous arguments apply to these models.
- In the world of probabilistic graphical models, log-linear models are instances of "undirected" or "Markov" networks, whereas multinomial models are instances of "directed, acyclic" or "Bayesian" networks. Show that analogous arguments apply to these networks.