Stanford Machine Learning Open Course --- 3. Linear Regression with Multiple Variables

Date: 2015-05-27 14:00:37

3. Linear Regression with Multiple Variables

3.1 Multiple Features

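n is the number of features.
$x^{(i)}$ is the i-th training example, i.e. the i-th row of the feature matrix, and is a vector.
$x^{(i)}_j$ is the j-th feature in the i-th row of the feature matrix, i.e. the j-th feature of the i-th training example.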

The multivariate linear hypothesis is:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

This formula has n+1 parameters but only n variables. To simplify it, introduce $x_0 = 1$, so that the parameter vector $\theta$ and the feature vector $x$ of a training example are both (n+1)-dimensional vectors:
$\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}$
$x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix}$

The hypothesis then simplifies to:

$h_\theta(x) = \theta^T x$
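As a minimal sketch of the vectorized hypothesis, the following plain Python function prepends $x_0 = 1$ and takes the dot product with $\theta$. The function name `predict` and the coefficient values are hypothetical, chosen only for illustration.

```python
def predict(theta, features):
    """Return h_theta(x) = theta^T x for a single example.

    `features` holds x_1..x_n; the constant x_0 = 1 is prepended here.
    """
    x = [1.0] + list(features)
    if len(theta) != len(x):
        raise ValueError("theta must have n + 1 entries")
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical parameters: base price 80, 0.25 per square foot, 5 per room.
theta = [80.0, 0.25, 5.0]
price = predict(theta, [2000.0, 3.0])
print(price)
```

With these made-up numbers the prediction is 80 + 0.25 * 2000 + 5 * 3.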


3.2 Gradient Descent for Multiple Variables

Cost function:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
In Octave, with X the matrix whose i-th row is the training example $x^{(i)}$, this is written as: J = sum((X * theta - y).^2)/(2*m);
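The same cost can be sketched in plain Python without any matrix library. This is an illustrative translation of the formula, not the course's code; `compute_cost` and the tiny data set are assumptions.

```python
def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared prediction errors.

    X is a list of m rows; each row already starts with x_0 = 1.
    """
    m = len(y)
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(t * xj for t, xj in zip(theta, row))
        total += (prediction - target) ** 2
    return total / (2 * m)


# Made-up data lying exactly on the line y = x.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
perfect = compute_cost(X, y, [0.0, 1.0])  # theta that fits the data exactly
zero = compute_cost(X, y, [0.0, 0.0])     # theta = 0 predicts 0 everywhere
print(perfect, zero)
```

For the all-zero parameters the squared errors are 1, 4 and 9, so the cost is 14 / 6.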

Gradient descent update (repeat until convergence, updating all $\theta_j$ for j = 0, ..., n simultaneously):
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) = \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$
In Octave, this is written as:
theta = theta - alpha / m * X' * (X * theta - y);
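The update rule can be sketched in plain Python as follows. The errors are computed once from the current $\theta$ before any component is changed, which is what makes the update simultaneous. Function names, the data and the values of `alpha` and `num_iters` are assumptions for illustration.

```python
def gradient_step(X, y, theta, alpha):
    """Apply one simultaneous update to every theta_j."""
    m = len(y)
    # Prediction errors h_theta(x^(i)) - y^(i), all using the old theta.
    errors = [
        sum(t * xj for t, xj in zip(theta, row)) - target
        for row, target in zip(X, y)
    ]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum(e * row[j] for e, row in zip(errors, X)) / m
        new_theta.append(theta[j] - alpha * grad_j)
    return new_theta


def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat the update a fixed number of times."""
    for _ in range(num_iters):
        theta = gradient_step(X, y, theta, alpha)
    return theta


# Made-up data generated from y = 1 + 2x; each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = gradient_descent(X, y, [0.0, 0.0], alpha=0.1, num_iters=2000)
print(theta)
```

On this data the parameters approach $\theta_0 = 1$, $\theta_1 = 2$.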


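3.3 Feature Scaling

Take the house-price problem as an example. Suppose we use two features, the size of the house and the number of rooms, where size ranges from 0 to 2000 square feet and the number of rooms from 0 to 5. The contour plot of the cost function is then very elongated, so gradient descent descends slowly and may oscillate back and forth before it converges.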

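Mean normalization

The fix is to bring all features onto a similar scale, roughly between -1 and 1. The simplest approach is to replace $x_i$ with $x_i - \mu_i$ so that each feature has a mean close to 0 (except $x_0$), and then divide by a measure of its spread:

$x_n := \frac{x_n - \mu_n}{s_n}$
where $\mu_n$ is the mean of feature n and $s_n$ is either its standard deviation or its range $\max(x_n) - \min(x_n)$.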
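Mean normalization might be sketched like this in plain Python. The sketch assumes the population standard deviation (`statistics.pstdev`) for $s_n$; the sample standard deviation or the range would work equally well. It also assumes no feature is constant, since that would give $s_n = 0$. The function name and data are hypothetical.

```python
import statistics


def normalize_features(rows):
    """Scale each feature column to zero mean and unit spread.

    rows: list of examples, each a list of x_1..x_n (x_0 is not included).
    Returns the scaled rows plus the mu and sigma used, so that the same
    transformation can be applied to new inputs later.
    """
    columns = list(zip(*rows))
    mu = [statistics.mean(col) for col in columns]
    sigma = [statistics.pstdev(col) for col in columns]  # assumes sigma > 0
    scaled = [
        [(value - m) / s for value, m, s in zip(row, mu, sigma)]
        for row in rows
    ]
    return scaled, mu, sigma


# Made-up houses: (size in square feet, number of rooms).
rows = [[2000.0, 5.0], [1000.0, 3.0], [0.0, 1.0]]
scaled, mu, sigma = normalize_features(rows)
print(mu)
print(scaled)
```

Returning `mu` and `sigma` matters in practice: a new house has to be scaled with the training-set values before it is passed to the hypothesis.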


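3.4 Learning Rate

1. Making sure gradient descent is working correctly
Plot the cost function against the number of iterations to see when the algorithm converges. If $J(\theta)$ decreases after every iteration, gradient descent is working properly.

If $J(\theta)$ increases or oscillates, $\alpha$ is probably too large.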
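Instead of plotting, the same check can be sketched in code by recording $J(\theta)$ after every iteration and testing whether it keeps decreasing. The helper names, the data and the two learning rates below are assumptions picked to show one well-behaved run and one run with $\alpha$ too large.

```python
def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, row)) - target) ** 2
        for row, target in zip(X, y)
    ) / (2 * m)


def step(X, y, theta, alpha):
    """One simultaneous gradient descent update of every theta_j."""
    m = len(y)
    errors = [
        sum(t * xj for t, xj in zip(theta, row)) - target
        for row, target in zip(X, y)
    ]
    return [
        theta[j] - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j in range(len(theta))
    ]


def cost_history(X, y, alpha, num_iters):
    """Run gradient descent from theta = 0 and record J after each step."""
    theta = [0.0] * len(X[0])
    history = [cost(X, y, theta)]
    for _ in range(num_iters):
        theta = step(X, y, theta, alpha)
        history.append(cost(X, y, theta))
    return history


def decreases_every_iteration(history):
    return all(later < earlier for earlier, later in zip(history, history[1:]))


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
good = cost_history(X, y, alpha=0.1, num_iters=50)
bad = cost_history(X, y, alpha=1.0, num_iters=50)
print(decreases_every_iteration(good))
print(decreases_every_iteration(bad))
```

On this small data set the run with `alpha=0.1` decreases steadily, while the run with `alpha=1.0` ends with a far larger cost than it started with.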


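2. How to choose $\alpha$
First try values that are a factor of 10 apart; once a suitable range is found, refine within it in steps of about 3x (log scale).
We recommend trying values of the learning rate α on a log-scale, at multiplicative steps of about 3 times the previous value.
α = …, 0.001, 0.01, 0.1, 1, …
α = …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …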
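One way to act on this advice is to run a fixed number of iterations for each candidate $\alpha$ and compare the final costs. This is only a sketch: the selection rule (lowest cost after 100 iterations), the data and the helper names are assumptions, and in practice one would also look at the whole cost curve.

```python
def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, row)) - target) ** 2
        for row, target in zip(X, y)
    ) / (2 * m)


def step(X, y, theta, alpha):
    """One simultaneous gradient descent update of every theta_j."""
    m = len(y)
    errors = [
        sum(t * xj for t, xj in zip(theta, row)) - target
        for row, target in zip(X, y)
    ]
    return [
        theta[j] - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j in range(len(theta))
    ]


def final_cost(X, y, alpha, num_iters):
    """Cost reached after num_iters updates starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(num_iters):
        theta = step(X, y, theta, alpha)
    return cost(X, y, theta)


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

# Candidates spaced roughly 3x apart on a log scale.
candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
final_costs = {a: final_cost(X, y, a, num_iters=100) for a in candidates}
best_alpha = min(final_costs, key=final_costs.get)
print(best_alpha)
```

Very small values barely reduce the cost in 100 iterations, and the largest value diverges on this data, so the best candidate sits just below the point where divergence starts.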

3.5 Features and Polynomial Regression

House-price prediction
Given x1 = frontage and x2 = depth, the hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$.
If we instead use the single feature x = frontage * depth = area, then $h_\theta(x) = \theta_0 + \theta_1 x$ gives a more meaningful regression equation.

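A straight line does not fit every data set; sometimes a curve is needed, such as a quadratic or cubic model. A quadratic eventually reaches a maximum, after which the predicted price would fall as size increases, which is unreasonable, so a cubic is the better fit here:
[Formula image not preserved]

Alternatively, a second form can be used:

[Formula image not preserved]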

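Feature normalization is important here: it brings the different features (for example size, size squared and size cubed) onto comparable scales.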
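To see why scaling matters for polynomial features, the sketch below builds the columns size, size^2 and size^3 and then normalizes them. It assumes the cubic model has the usual form $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$; since the original formula image is missing, that form is an assumption, as are the function names and the sample sizes.

```python
import statistics


def polynomial_features(sizes, degree=3):
    """Map each size x to the row [x, x^2, ..., x^degree]."""
    return [[s ** p for p in range(1, degree + 1)] for s in sizes]


def scale_columns(rows):
    """Mean-normalize every column: (value - mean) / standard deviation."""
    columns = list(zip(*rows))
    mu = [statistics.mean(col) for col in columns]
    sigma = [statistics.pstdev(col) for col in columns]  # assumes sigma > 0
    return [
        [(value - m) / s for value, m, s in zip(row, mu, sigma)]
        for row in rows
    ]


sizes = [10.0, 100.0, 1000.0]      # made-up house sizes
raw = polynomial_features(sizes)   # columns span very different ranges
scaled = scale_columns(raw)        # every column now has a similar range
print(raw[-1])
print(scaled[-1])
```

Before scaling, the cubic column is six orders of magnitude larger than the linear one; afterwards all three columns lie within a couple of units of zero.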


3.6 Normal Equation

So far we have used gradient descent, but for some linear regression problems the normal equation is a better method.
To find the θ that minimizes the cost function $J(\theta)$, find the parameters at which every partial derivative is zero (for every j = 0, ..., n):

$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j = 0$
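Stacking these n+1 conditions into one vector equation shows where the closed-form solution comes from. In the sketch below, X denotes the matrix whose i-th row is $(x^{(i)})^T$ and y the vector of targets, so the j-th component of $X^T(X\theta - y)$ is exactly the sum in the condition above. The last step assumes that $X^TX$ is invertible.

```latex
\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0
\;\Longrightarrow\; X^T X\,\theta = X^T y
\;\Longrightarrow\; \theta = (X^T X)^{-1} X^T y
```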

With X an m×(n+1) matrix and y an m×1 vector, the normal equation is:

$\theta = (X^TX)^{-1}X^Ty$
In Octave, the normal equation is written as:
pinv(X'*X)*X'*y
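For comparison, here is a plain Python sketch that arrives at the same $\theta$. Note the swapped-in technique: rather than forming a pseudo-inverse as `pinv` does, it solves the linear system $X^TX\theta = X^Ty$ by Gaussian elimination with partial pivoting, and it raises an error when $X^TX$ is singular. The function name, the tolerance `1e-12` and the data are assumptions.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination."""
    n = len(X[0])
    # A = X^T X (n x n) and b = X^T y (length n).
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * target for row, target in zip(X, y)) for i in range(n)]

    # Forward elimination on the augmented matrix [A | b].
    aug = [A[i] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution.
    theta = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (aug[i][n] - known) / aug[i][i]
    return theta


# Made-up data generated from y = 1 + 2x; each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = normal_equation(X, y)
print(theta)
```

No learning rate and no iteration count appear anywhere: the parameters come out of a single solve.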


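Note: when $X^TX$ is not invertible (usually because the features are not linearly independent, or because there are more features than training examples), the inverse in the normal equation does not exist, so the formula cannot be applied as written. Octave's pinv computes a pseudo-inverse and still returns a solution in that case.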
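A small sketch makes the first cause concrete. In the made-up data below, the third column is exactly twice the second, so the features are linearly dependent and the determinant of $X^TX$ is zero. Integer entries are used so the arithmetic is exact; the helper names are hypothetical.

```python
def gram_matrix(X):
    """Return X^T X for a matrix given as a list of rows."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (
        M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
        - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
        + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])
    )


# Columns: x_0 = 1, x_1, and x_2 = 2 * x_1 (a redundant feature).
dependent = [[1, 1, 2], [1, 2, 4], [1, 3, 6]]
# Columns: x_0 = 1, x_1, and x_2 = x_1 squared (no linear dependence).
independent = [[1, 1, 1], [1, 2, 4], [1, 3, 9]]

det_dependent = det3(gram_matrix(dependent))
det_independent = det3(gram_matrix(independent))
print(det_dependent, det_independent)
```

Dropping the redundant feature restores invertibility, which is the usual remedy for this cause.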

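Gradient descent	Normal equation
Needs a learning rate α to be chosen	No learning rate needed
Needs many iterations	Computed in a single step
Works well even when the number of features n is large	Expensive when n is large, because computing $(X^TX)^{-1}$ takes $O(n^3)$ time (still acceptable for n < 10000)
Applies to many kinds of models	Only applies to linear models; not suitable for other models such as logistic regression
Needs feature scaling	No feature scaling needed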