Soft Margin Classification
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]  # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])  # unlike Logistic Regression, LinearSVC does not output probabilities
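Since LinearSVC has no predict_proba, a confidence score can still be obtained from the pipeline's decision_function (delegated to the final LinearSVC), which returns the signed score relative to the decision boundary. A minimal usage sketch with the same illustrative sample point:

svm_clf.decision_function([[5.5, 1.7]])  # positive values fall on the Iris-Virginica (class 1) side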
# The same linear classifier can also be built with SVC(kernel='linear') or with
# SGDClassifier using hinge loss; SGD converges more slowly but supports out-of-core learning.
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

svc_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1)),
])

m = len(X)
C = 1
sgd_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', alpha=1/(m*C))),
])

svc_clf.fit(X, y)
sgd_clf.fit(X, y)
print(svc_clf.predict([[5.5, 1.7]]), sgd_clf.predict([[5.5, 1.7]]))
# Nonlinear classification: add polynomial features, then fit a linear SVM.
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge')),
])
polynomial_svm_clf.fit(X, y)
import matplotlib.pyplot as plt

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

def plot_datasets(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], 'bs')
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], 'g^')
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r'$x_1$', fontsize=10)
    plt.ylabel(r'$x_2$', fontsize=10, rotation=0)

plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()
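An alternative to explicitly adding polynomial features is the polynomial kernel, which achieves a similar effect through the kernel trick without actually creating the extra features. A sketch with illustrative hyperparameters (coef0 controls how much the model is influenced by high-degree versus low-degree terms):

from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)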
Adding Similarity Features
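A minimal sketch of the idea (not from the original notes; the landmark and gamma are chosen arbitrarily): each instance is mapped to its Gaussian RBF similarity to a landmark, and these similarities become new features that a linear SVM may be able to separate.

def gaussian_rbf(X, landmark, gamma):
    # similarity of each instance to the landmark, in [0, 1]
    return np.exp(-gamma * np.linalg.norm(X - landmark, axis=1) ** 2)

X1D = np.linspace(-4, 4, 9).reshape(-1, 1)               # toy 1-D data set
sim_feature = gaussian_rbf(X1D, landmark=np.array([-2.0]), gamma=0.3)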
Gaussian RBF Kernel
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)
%matplotlib inline
from sklearn.svm import SVC

# Effect of the hyperparameters: a small gamma widens the Gaussian bell and smooths the
# decision boundary, so gamma (like C) acts as a regularization hyperparameter.
gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
    ])
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(14, 9))
for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=10)
plt.show()
Computational Complexity
Class | Time Complexity | Out-of-core Support | Scaling Required | Kernel Trick
---|---|---|---|---
LinearSVC | $O(m \times n)$ | No | Yes | No
SGDClassifier | $O(m \times n)$ | Yes | Yes | No
SVC | $O(m^2 \times n)$ to $O(m^3 \times n)$ | No | Yes | Yes
# SVM Regression: LinearSVR tries to fit as many instances as possible
# inside an epsilon-wide margin around the predictions.
from sklearn.svm import LinearSVR
import numpy.random as rnd

rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

# Nonlinear regression with a kernelized SVM.
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)

def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)  # indices where the condition holds (nonzero entries)

svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)  # these are actually the points outside the margin
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])  # used below to annotate epsilon at this x position
def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)
plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
    '', xy=(eps_x1, eps_y_pred), xycoords='data',
    xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon),
    textcoords='data', arrowprops={'arrowstyle': '<->', 'linewidth': 1.5},
)
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
plt.show()
Decision Function and Predictions
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]  # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # Iris-Virginica
from mpl_toolkits.mplot3d import Axes3D

def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    ax.plot_surface(x1, x2, np.zeros_like(x1), color="b", alpha=0.2, cstride=100, rstride=100)
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^T \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)
svm_clf2 = LinearSVC(C=10, loss='hinge')
svm_clf2.fit(X, y)

fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])
plt.show()
Training Objective
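A brief recap of the objective (the standard formulation, summarized here rather than copied from the notes): the hard-margin problem minimizes $\frac{1}{2}w^T \cdot w$ subject to the margin constraints, and the soft-margin version adds slack variables $\zeta^{(i)}$ weighted by $C$:

$$
\begin{align*}
\underset{w,b,\zeta}{\text{minimize}}\ \ & \frac{1}{2}w^T \cdot w + C\sum_{i=1}^{m}\zeta^{(i)} \\
\text{subject to}\ \ & t^{(i)}\left(w^T \cdot x^{(i)} + b\right) \ge 1 - \zeta^{(i)} \ \ \text{and}\ \ \zeta^{(i)} \ge 0,\ \ i=1,\dots,m
\end{align*}
$$

where $t^{(i)}=-1$ for negative instances and $t^{(i)}=1$ for positive instances.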
Quadratic Programming
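Both objectives above are quadratic programs: a convex quadratic objective with linear constraints. As usually stated, the general QP form is:

$$
\underset{p}{\text{minimize}}\ \ \frac{1}{2}p^T \cdot H \cdot p + f^T \cdot p \ \ \ \ \text{subject to}\ \ A \cdot p \le b
$$

where $p$ is an $n_p$-dimensional parameter vector, $H$ an $n_p \times n_p$ matrix, $f$ an $n_p$-dimensional vector, $A$ an $n_c \times n_p$ matrix and $b$ an $n_c$-dimensional vector ($n_c$ being the number of constraints).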
The Dual Problem
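For the hard-margin case the dual problem is (standard result, added here as a reminder):

$$
\underset{\alpha}{\text{minimize}}\ \ \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha^{(i)}\alpha^{(j)}t^{(i)}t^{(j)}{x^{(i)}}^T \cdot x^{(j)} - \sum_{i=1}^{m}\alpha^{(i)} \ \ \ \ \text{subject to}\ \ \alpha^{(i)} \ge 0,\ i=1,\dots,m
$$

Once the minimizing $\hat{\alpha}$ is found, $\hat{w}=\sum_{i=1}^{m}\hat{\alpha}^{(i)}t^{(i)}x^{(i)}$ and $\hat{b}$ can be computed from any support vector. The dual is what makes the kernel trick possible, because the training instances appear only through dot products.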
Kernelized SVM
A kernel is a function capable of computing the dot product $\phi(a)^T \cdot \phi(b)$ based only on the original vectors $a$ and $b$, without having to compute (or even to know about) the transformation $\phi$.
$$
\begin{align*}
\text{Linear} &: \ \ \ K(a,b)=a^T \cdot b \\
\text{Polynomial} &: \ \ \ K(a,b)=(\gamma a^T \cdot b + r)^d \\
\text{Gaussian RBF} &: \ \ \ K(a,b)=\exp(-\gamma \left\| a-b \right\|^2) \\
\text{Sigmoid} &: \ \ \ K(a,b)=\tanh(\gamma a^T \cdot b + r)
\end{align*}
$$
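A small numerical check of this (not in the original notes): for a 2nd-degree polynomial kernel $K(a,b)=(a^T \cdot b)^2$, the explicit mapping $\phi(x)=(x_1^2,\ \sqrt{2}x_1x_2,\ x_2^2)$ yields exactly the same dot product.

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([2.0, 3.0])
b = np.array([-1.0, 0.5])
print(phi(a).dot(phi(b)))   # dot product in the transformed space -> 0.25
print(a.dot(b) ** 2)        # kernel on the original vectors       -> 0.25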
Online SVMs
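One way to train a linear SVM online in scikit-learn is SGDClassifier with hinge loss, which supports incremental updates through partial_fit. A minimal sketch (not from the original notes; the batching and alpha value are illustrative, and feature scaling is omitted for brevity):

from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
X_stream = iris['data'][:, (2, 3)]
y_stream = (iris['target'] == 2).astype(np.float64)

online_svm = SGDClassifier(loss='hinge', alpha=1/len(X_stream), random_state=42)
for X_batch, y_batch in zip(np.array_split(X_stream, 5), np.array_split(y_stream, 5)):
    online_svm.partial_fit(X_batch, y_batch, classes=np.array([0., 1.]))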
Notes: *Hands-On Machine Learning with Scikit-Learn & TensorFlow*, Chapter 5
Source: http://www.cnblogs.com/yaoz/p/6954262.html