机器学习：线性回归

时间：2015-05-21 22:46:38 阅读：235 评论：0 收藏：0 [点我收藏+]

Supervised Learning

首先给出一些基本定义， $x^{(i)}$ 表示输入变量或者输入特征， $y^{(i)}$ 表示输出变量或者目标值。 $(x^{(i)}, y^{(i})$ 称为一对样本，一组样本 $\{(x^{(i)}, y^{(i)}); i=1,2,...,m\}$ 称为训练集，其中 $\mathbf{X}$ 表示输入变量的值域， $\mathbf{Y}$ 表示输出变量的值域，一般来说，我们都只考虑实数域。

The goal of supervised learning can be described as follows: given a training set, to learn a function $h: \mathbf{X} \rightarrow \mathbf{Y}$ so that $h(x)$ can predict the output value as close as the corresponding value of $y$ . The function $h$ is called hypothesis.

基于目标值的形式不同，Supervised Learning 主要分成两大类.

Regression Problem（回归问题）: 输出变量是连续值.
Classification Problem（分类问题）: 输出变量是离散值.

线性回归

我们先考虑一个简单的线性回归问题，我们假设输入变量 $\boldsymbol{x}$ 是一个在二维空间 $\mathbb{R}^2$ 的向量，我们有以下的表达式：

h θ (x) = θ 0 + θ 1 x 1 + θ 2 x 2

$\begin{equation*} h_\theta(\boldsymbol{x})=\theta_0+\theta_1x_1+\theta_2x_2 \end{equation*}$

θi $\theta_i$ 是参数 (有时称为权值)，这些参数可以将输入域

X $\mathbf{X}$ 线性映射到输出域

Y $\mathbf{Y}$ . 我们引入截距的概念，令

x0=1 $x_0=1$ , 上述表达式可以改写成：

h θ (x) = \sum i = 0 n θ i x i = θ T x

$\begin{equation*} h_\theta(\boldsymbol{x})=\sum_{i=0}^{n}\theta_ix_i=\boldsymbol{\theta}^T\boldsymbol{x} \end{equation*}$
现在，给出一组有

m $m$ 个样本的训练集，我们希望找到合适的参数

θ $\theta$ ，使得预测值

hθ(x) $h_\theta(x)$ 与目标值尽可能接近。为了估计参数

θ $\theta$ ，我们定义如下的 cost function:

J (θ) = 1 2 \sum i = 1 m (h θ (x i) ? y i) 2

$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(\boldsymbol{x}^i)-y^i)^2$
这个函数与最小均方误差函数很像，接下来，我们讨论如何求解参数

θ $\theta$ 。

LMS Algorithm

为了求得最优的参数 $\theta$ 使得函数的均方误差最小，我们可以采用梯度下降算法进行求解，参数 $\theta$ 的更新可以
表示成如下：

θ j : = θ j ? α ? ? θ j J (θ)

$\begin{equation*} \theta_j: =\theta_j-\alpha \frac{\partial}{\partial \theta_j} J(\theta) \end{equation*}$
其中，

α $\alpha$ 称为 Learning rate. 我们先给出函数的偏导数，为了简便起见，先考虑只有一对样本的情况

(x,y) $(\boldsymbol{x},y)$ .

? ? θ j J (θ) = ? ? θ j 1 2 (h θ (x) ? y) 2 = 2 ? 1 2 ? (h θ (x) ? y) ? ? ? θ j (h θ (x) ? y) = (h θ (x) ? y) ? ? ? θ j (\sum i = 0 n θ i x i ? y) = (h θ (x) ? y) x j

$\begin{equation*} \begin{split} \frac{\partial}{\partial \theta_j} J(\theta) & = \frac{\partial}{\partial \theta_j}\frac{1}{2}( h_\theta(\boldsymbol{x})-y)^2 \& =2\cdot\frac{1}{2}\cdot( h_\theta(\boldsymbol{x})-y)\cdot\frac{\partial}{\partial \theta_j}( h_\theta(\boldsymbol{x})-y) \& = (h_\theta(\boldsymbol{x})-y)\cdot \frac{\partial}{\partial \theta_j}(\sum_{i=0}^{n}\theta_i x_i-y) \& = (h_\theta(\boldsymbol{x})-y)x_j \\end{split} \end{equation*}$
进而，我们可以得到只有一对样本时的更新准则：

θ j : = θ j + α (y i ? h θ (x i)) x i j

$\begin{equation} \theta_j: =\theta_j+\alpha (y^i-h_\theta(\boldsymbol{x}^i))x_{j}^{i} \end{equation}$
这就是有名的LMS更新原则，也叫Widrow-Hoff学习准则，参数

θ $\theta$ 更新的幅度取决于误差项的大小。从一对样本的情况，我们推导出参数

θ $\theta$ 如何更新使得函数可以收敛。事实上，对于含有多个训练样本的情况，有两个方法可以对参数

θ $\theta$ 进行更新，一个是 batch model，另外一个是 stochastic model。

batch model的更新原则是每一次更新要遍历所有的样本，如下所示：

Repeat until convergence {θ j : = θ j ? α \sum i = 1 m (y i ? h θ (x i)) x i j for every j}

$\begin{equation*} \begin{split} & \text{Repeat until convergence} \qquad \{ \& \theta_j: =\theta_j-\alpha \sum_{i=1}^{m} (y^i-h_\theta(\boldsymbol{x}^i))x_{j}^{i} \qquad \text{for every $j$} \& \} \end{split} \end{equation*}$

而stochastic model每遇到一个样本就会进行一次更新，如下所示：

Loop {for i = 1 to m, {θ j : = θ j ? α (y i ? h θ (x i)) x i j for every j}}

$\begin{equation*} \begin{split} & \text{Loop} \qquad \{ \& \qquad \text{for $i=1$ to $m$,} \qquad \{ \& \qquad \theta_j: =\theta_j-\alpha (y^i-h_\theta(\boldsymbol{x}^i))x_{j}^{i} \qquad \text{for every $j$} \& \qquad \} \& \} \end{split} \end{equation*}$

从上面两种方式可以看出，batch model 每做一次更新要遍历所有的样本，这是非常耗时的，特别是在样本数非常多的时候，而stochastic model 是即时更新，每遇到一个样本就更新一次，通常情况下，stochastic model会下降地比batch model 快，不过stochastic model的一个缺陷是有可能无法收敛到全局最小值，而是在最小值附近来回扰动，这大概也是称之为stochastic model的原因，最终收敛的值带有一定的随机性，然而虽然只是收敛到全局最小值附近，很多时候这种近似已经非常靠近全局最小值，而且stochastic model的效率要更高，所以stochastic model一般会作为优先考虑的方法。

更通常的一种做法是结合两种方式，将整个训练集分割成很多个小的batch，然后利用stochastic model进行更新，每遇到一个小的batch，进行一次更新，这样做利用了stochastic model的高效，也在一定程度上减轻了在全局最小值附近的扰动。