The SVM algorithm solves the following constrained optimization problem: \[ \min_{w,b} \frac{1}{2}|w|^2 + C\sum_i \xi_i\; \mbox{ s.t. }\; y_i(w^\top x_i+b)\geq 1-\xi_i,\; \xi_i \geq 0 \] which is minimizing a combination of the L2 norm of \(w\) and the total hinge loss \(\xi_i=(1-y_i(w^\top x_i+b))_+\). We know the L2 regularization drives \(w_i\) toward zero but not the whole way, so \(w\) won't be sparse. To see about sparsity of support vectors we need to look at the dual form. The Lagrangian for the above constrained optimization is: \[ L = \frac{1}{2}|w|^2 + C\sum_i \xi_i - \sum_i \alpha_i (y_i(w^\top x_i+b)-1+\xi_i) - \sum_i \beta_i \xi_i \] where \(\alpha_i \geq 0, \beta_i \geq 0\) are called the dual variables, and by setting various derivatives to zero we learn what the solution looks like: \[\begin{eqnarray*} \nabla_w L &=& w - \sum_i \alpha_i y_i x_i = 0 \\ \partial_b L &=& \sum_i \alpha_i y_i = 0 \\ \partial_{\xi_i} L &=& C - \alpha_i - \beta_i = 0 \end{eqnarray*}\] The first equation tells us that we can write \(w\) as a linear combination of training instances with weights \(\alpha_i\), and the last equation tells us that \(\alpha_i + \beta_i = C\) for each training instance. In fact there are three types of instances:

- \(\alpha_i = 0,\; \beta_i = C,\; \xi_i = 0,\; y_i(w^\top x_i+b)\gt 1 \): These are the points that satisfy the margin.
- \(0\lt\alpha_i\lt C,\; 0\lt\beta_i\lt C,\; \xi_i = 0,\; y_i(w^\top x_i+b)=1 \): These are the points exactly at the margin.
- \(\alpha_i = C,\; \beta_i = 0,\; \xi_i \gt 0,\; y_i(w^\top x_i+b)\lt 1 \): These are the points that violate the margin.

It turns out the optimal value of the original minimization problem and the optimal value of the dual maximization problem are the same. The dual always lower bounds the original problem, and in this case the bound is tight. So the following holds at the optimum \(w, \alpha\): \[ \frac{1}{2}|w|^2 + C\sum_i \xi_i = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i.x_j) \] The last sum is equal to \(|w|^2\), so rearranging gives us: \[ \|\alpha\|_1 = \|w\|_2^2 + C \sum_i \xi_i \] which means for a separable problem where all \(\xi_i=0\), we have \(\|\alpha\|_1 = \|w\|_2^2\)! If we are minimizing the L2 norm of \(w\), we are also minimizing the L1 norm of the corresponding \(\alpha\)! So how come the perceptron solves the 9vs4 on MNIST with half the number of support vectors?

I think it has to do with the fact that the SVM support vectors are always at (or on the wrong side of) the margin. Sometimes instances on the correct side of the margin may let us express the same hypothesis more compactly. The following picture may help:

We see a bunch of points at the margin, (at least some of) which will be picked by the SVM as support vectors. We also see two points at the poles. A perceptron which encounters one of the two pole points first will never make a mistake again, so it will express the same hypothesis more compactly. RVMs are also known to pick interior points as support vectors and generate more compact models.

**Questions for the curious reader:**

- Can the projectron solve the problem given in the figure compactly given a bad ordering of points?
- Can we maximize the margin while minimizing the number of support vectors?
- Can we prove learning bounds on hypotheses representable by a small number of support vectors?
- Does L1 SVM sparsify the dual and does it lead to fewer support vectors?
- Can we have nonparametric algorithms which stop growing when no longer necessary and still provide good bounds?

## No comments:

Post a Comment