Naive Bayes and Maximum Entropy models are both popular in NLP applications. In this post I will show that Naive Bayes is really a type of joint Maximum Entropy model (and much easier to compute). Maximum Entropy models aim to find the distribution that maximizes entropy while still obeying certain feature expectation constraints. Maximizing entropy with feature expectation constraints turns out to be equivalent to maximizing likelihood in a log-linear (aka exponential) model. Logistic regression is a specific type of conditional maxent model. Naive Bayes is a specific type of joint maxent model.
Consider a classification problem with a D-dimensional input vector x and a discrete output variable y. Naive Bayes models assume that the individual components of x are independent given the output y:
\[ p(\mathbf{x}, y) = p(y) \prod_{d=1}^D p(x_d|y) \]
The factors on the right-hand side are then estimated using maximum likelihood, which boils down to counting frequencies if x and y are both discrete. In comparison, here is a joint maximum entropy model:
\[ p(\mathbf{x}, y) = \frac{1}{Z} \exp(\sum_{m=1}^M \lambda_m f_m(\mathbf{x}, y)) \]
where the fm are arbitrary real-valued "feature functions" of the input and the output, and Z is a normalization constant. This looks nothing like Naive Bayes at first glance, but if we choose feature functions that each look at the output y together with at most one component of the input x:
\[ p(\mathbf{x}, y) = \frac{1}{Z} \exp(\lambda_0 f_0(y)) \prod_{d=1}^D \exp(\lambda_d f_d(x_d, y)) \]
the similarity becomes more apparent. In fact if the feature functions are binary, this expression will have exactly the same form as the Naive Bayes expression, and maximizing likelihood will give exactly the same answers. Let us illustrate with an example:
\[ \begin{array}{cccc}
x_1 & x_2 & y & n \\
1 & 1 & 1 & 3 \\
1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 \\
0 & 0 & 0 & 3 \\
\end{array} \]
Here x1 and x2 are inputs, y is the output, and n is the number of times a particular combination has been observed. The Naive Bayes maximum likelihood estimates will give:
\[ \begin{array}{l}
p(y=1) = p(y=0) = 1/2 \\
p(x_1=1|y=1) = p(x_2=1|y=1) = 3/4 \\
p(x_1=1|y=0) = p(x_2=1|y=0) = 1/4 \\
\end{array} \]
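To make the counting concrete, here is a minimal Python sketch (the variable names are my own) that recovers these estimates directly from the frequency table above:

```python
from collections import Counter

# The frequency table above: ((x1, x2, y), count).
data = [((1, 1, 1), 3), ((1, 1, 0), 1), ((0, 0, 1), 1), ((0, 0, 0), 3)]
N = sum(n for _, n in data)  # 8 observations in total

# p(y): relative frequency of each label.
count_y = Counter()
for (x1, x2, y), n in data:
    count_y[y] += n
p_y = {y: count_y[y] / N for y in (0, 1)}
print(p_y)  # {0: 0.5, 1: 0.5}

# p(x_d=1 | y): relative frequency of x_d=1 among rows with label y.
def p_xd1_given_y(d, y):
    hits = sum(n for (x1, x2, yy), n in data if yy == y and (x1, x2)[d] == 1)
    return hits / count_y[y]

print(p_xd1_given_y(0, 1), p_xd1_given_y(1, 1))  # 0.75 0.75
print(p_xd1_given_y(0, 0), p_xd1_given_y(1, 0))  # 0.25 0.25
```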
These estimates in turn lead to the following not-so-good conditional probabilities:
\[ \begin{array}{lll}
x_1 & x_2 & p(y=1|x_1, x_2) \\
1 & 1 & 0.9 \\
1 & 0 & 0.5 \\
0 & 1 & 0.5 \\
0 & 0 & 0.1 \\
\end{array} \]
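These numbers can be reproduced by plugging the estimates into Bayes' rule; a small sketch under the same assumptions:

```python
# Plug the Naive Bayes estimates into Bayes' rule to get p(y=1 | x1, x2).
# From above: p(y) = 1/2, p(x_d=1|y=1) = 3/4, p(x_d=1|y=0) = 1/4.
def p_x_given_y(x, y):
    p1 = 0.75 if y == 1 else 0.25  # p(x_d = 1 | y)
    return p1 if x == 1 else 1.0 - p1

def nb_posterior(x1, x2):
    joint = {y: 0.5 * p_x_given_y(x1, y) * p_x_given_y(x2, y) for y in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, nb_posterior(x1, x2))  # 0.9, 0.5, 0.5, 0.1
```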
The equivalent joint maximum entropy model has 10 feature functions. The following table specifies when each feature function returns 1, with a star indicating "don't care":
\[ \begin{array}{lllll}
f & y & x_1 & x_2 & \lambda \\
f_0 & 0 & * & * & \lambda_0 \\
f_1 & 1 & * & * & \lambda_0 \\
f_2 & 0 & 0 & * & \lambda_1 \\
f_3 & 0 & 1 & * & \lambda_2 \\
f_4 & 1 & 0 & * & \lambda_2 \\
f_5 & 1 & 1 & * & \lambda_1 \\
f_6 & 0 & * & 0 & \lambda_1 \\
f_7 & 0 & * & 1 & \lambda_2 \\
f_8 & 1 & * & 0 & \lambda_2 \\
f_9 & 1 & * & 1 & \lambda_1 \\
\end{array} \]
Because of the symmetry of this example, not all lambda coefficients need to be distinct; in fact only three distinct lambdas are necessary, as indicated in the last column of the table. The maximum likelihood estimates of the lambdas will satisfy the expectation constraints: the empirical frequency of each feature must equal its expected frequency under the model:
\[ \frac{1}{N} \sum_{n=1}^N f(\mathbf{x}_n, y_n) = \sum_{\mathbf{x},y} p(\mathbf{x}, y) f(\mathbf{x}, y) \]
Here the sum on the left is over the instances in the dataset, whereas the sum on the right is over all possible x, y pairs. This equation can be derived by setting the derivative of the log likelihood expression to zero. The following parameters satisfy these constraints and thus maximize the likelihood:
\[ \lambda_0 = \lambda_2 = 0, \lambda_1 = \log(3), Z = 32 \]
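To check this numerically, here is a sketch that encodes the feature table with the tied lambdas and verifies the expectation constraints and the value of Z (the data structures and helper names are my own):

```python
from itertools import product
from math import exp, isclose, log

# Tied weights from the text: lambda0 = lambda2 = 0, lambda1 = log(3).
lam = {0: 0.0, 1: log(3), 2: 0.0}

# Feature table: (required y, required x1 or None, required x2 or None, lambda index).
features = [
    (0, None, None, 0), (1, None, None, 0),
    (0, 0, None, 1), (0, 1, None, 2), (1, 0, None, 2), (1, 1, None, 1),
    (0, None, 0, 1), (0, None, 1, 2), (1, None, 0, 2), (1, None, 1, 1),
]

def f(feat, x1, x2, y):
    fy, fx1, fx2, _ = feat
    return int(y == fy and (fx1 is None or x1 == fx1) and (fx2 is None or x2 == fx2))

def score(x1, x2, y):
    return sum(lam[feat[3]] * f(feat, x1, x2, y) for feat in features)

states = list(product((0, 1), repeat=3))  # all (x1, x2, y) combinations
Z = sum(exp(score(*s)) for s in states)
print(Z)  # 32.0 (up to floating-point rounding)

data = [((1, 1, 1), 3), ((1, 1, 0), 1), ((0, 0, 1), 1), ((0, 0, 0), 3)]
N = sum(n for _, n in data)

# Empirical frequency of each feature equals its expectation under the model.
for feat in features:
    empirical = sum(n * f(feat, *xy) for xy, n in data) / N
    expected = sum(exp(score(*s)) / Z * f(feat, *s) for s in states)
    assert isclose(empirical, expected)
```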
This joint maximum entropy model gives exactly the same results as the Naive Bayes model. For example:
\[ p(x_1=1, x_2=1, y=1) = \frac{1}{Z} \exp(\lambda_0 + 2 \lambda_1) = \frac{9}{32} \]
\[ p(x_1=1, x_2=1, y=0) = \frac{1}{Z} \exp(\lambda_0 + 2 \lambda_2) = \frac{1}{32} \]
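As a further sanity check, the following sketch compares the two models over all eight combinations of x1, x2 and y, using the fact that under the tying in the table each input contributes lambda1 when it matches y and lambda2 otherwise (helper names are mine):

```python
from itertools import product
from math import exp, isclose, log

lam0, lam1, lam2 = 0.0, log(3), 0.0  # weights from the text; Z comes out to 32

def maxent_joint(x1, x2, y, Z=32.0):
    # Each input contributes lambda1 when it matches y and lambda2 when it
    # does not; lambda0 is the bias feature on y.
    s = lam0 + (lam1 if x1 == y else lam2) + (lam1 if x2 == y else lam2)
    return exp(s) / Z

def naive_bayes_joint(x1, x2, y):
    # p(y) = 1/2, and p(x_d|y) is 3/4 on a match and 1/4 on a mismatch.
    return 0.5 * (0.75 if x1 == y else 0.25) * (0.75 if x2 == y else 0.25)

for x1, x2, y in product((0, 1), repeat=3):
    assert isclose(maxent_joint(x1, x2, y), naive_bayes_joint(x1, x2, y))
print(maxent_joint(1, 1, 1), maxent_joint(1, 1, 0))  # ~9/32 and ~1/32
```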
This example is from Chris Manning's NLP class (lecture 7), where he compares the Naive Bayes model to a conditional Maximum Entropy model:
\[ p(y | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp(\sum_{m=1}^M \lambda_m f_m(\mathbf{x}, y)) \]
A conditional maxent model satisfies conditional expectation constraints and maximizes conditional likelihood (the constraints can again be derived by setting the derivative of the log conditional likelihood to zero):
\[ \sum_{n=1}^N f(\mathbf{x}_n, y_n) = \sum_{n=1}^N \sum_y p(y|\mathbf{x}_n) f(\mathbf{x}_n, y) \]
Using the same set of feature functions and the same parameter tying, one can derive the values that satisfy the constraints:
\[ \lambda_0 = \lambda_2 = 0, \lambda_1 = \frac{1}{2} \log(3) \]
The conditional model will give probabilities that match the data perfectly:
\[ \begin{array}{lll}
x_1 & x_2 & p(y=1|x_1, x_2) \\
1 & 1 & 3/4 \\
1 & 0 & 1/2 \\
0 & 1 & 1/2 \\
0 & 0 & 1/4 \\
\end{array} \]
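The same check for the conditional model, normalizing over y separately for each input (again a sketch with my own names):

```python
from math import exp, log

lam0, lam1, lam2 = 0.0, 0.5 * log(3), 0.0  # conditional maxent weights from the text

def score(x1, x2, y):
    # Same ten features as before: a bias on y plus one match/mismatch
    # feature per input dimension.
    return lam0 + (lam1 if x1 == y else lam2) + (lam1 if x2 == y else lam2)

def p_cond(y, x1, x2):
    # Normalize over y only; the normalizer Z(x) depends on the input.
    Zx = sum(exp(score(x1, x2, yy)) for yy in (0, 1))
    return exp(score(x1, x2, y)) / Zx

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, p_cond(1, x1, x2))  # ~3/4, 1/2, 1/2, 1/4
```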
However, it is important to understand that this model handles feature interactions better than Naive Bayes not because of any magic associated with maxent models, but as a result of maximizing conditional instead of joint likelihood. When you maximize joint likelihood, maxent gives the same results as Naive Bayes. Of course, with maxent one always has the option of defining features with cross terms (more than one input variable) to handle feature interactions.