Machine Learning Notes 04: Logistic Regression and Classification

Date: 2016-04-28 01:56:56

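Earlier notes covered the basics of using linear regression to solve prediction problems. See:
1. "Machine Learning Notes 01: Linear Regression and Gradient Descent"
2. "Machine Learning Notes 02: Multivariate Linear Regression, Gradient Descent, and the Normal Equation"
3. "Machine Learning Notes 03: The Normal Equation and How It Compares with Gradient Descent"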

Note: all figures in this article come from the Stanford machine learning course. Please credit the source when reposting.

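For regression-type problems we can fit a function with linear regression and use it to predict, but its output is continuous. Sometimes we need a method that returns a decision instead, such as "agree" or "disagree". Problems of this kind are called classification problems. As another example, if we want to predict whether a car is good or bad, there are only two possible outcomes. A problem whose output is 0 or 1 is a (binary) classification problem, and the method we use for it here is logistic regression.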

1. Classification and Representation

i. Classification

The introduction already described what a classification problem is. Here are a few more examples:

Examples	Purposes
Email	Spam / Not Spam?
Online Transaction	Fraudulent (Yes / No?)
Tumor	Malignant / Benign?

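The first example is spam detection: given an email, decide whether it is spam. The second is online transactions: decide whether a transaction looks fraudulent. The last is tumor assessment: analyze a patient's condition and decide whether the tumor is malignant or benign.

Let us look at tumor assessment in detail. We have the samples shown in the figure below, where the horizontal axis is the tumor size and the vertical axis is its nature (benign or malignant):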

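Suppose we fit these data with a straight line $h_\theta(x)=\theta^T x$. The result might look roughly like this:

[Figure: the tumor samples with the fitted purple line $h_\theta(x)$]

In the figure above, $h_\theta(x)$ is the purple line. Suppose we use $0.5$ as the threshold for deciding whether a tumor is benign or malignant:

$\text{If } h_\theta(x) \ge 0.5, \text{ predict } y=1$

$\text{If } h_\theta(x) < 0.5, \text{ predict } y=0$

For the data above this seems to work well. But now add an extra group of samples:

[Figure: the same data with the extra samples and the refitted blue line]

As the figure shows, after adding the new data, linear regression gives the blue line, which is less satisfactory: for example, several malignant tumors are now classified as benign. So linear regression is usually not a good choice for classification problems, and we use logistic regression instead. Logistic regression is a classification algorithm, and in logistic regression we require $0\le h_\theta(x) \le 1$. Next we look at its hypothesis function.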
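The failure of the thresholded straight line described above can be reproduced with a few lines of Python. This is only a sketch: the tumor sizes and labels are made-up numbers, and `fit_line` and `classify` are illustrative helper names, not anything from the course.

```python
def fit_line(sizes, labels):
    """Ordinary least-squares fit of labels ~ theta0 + theta1 * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(labels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, labels))
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def classify(sizes, theta0, theta1, threshold=0.5):
    """Predict y = 1 when the fitted line is at or above the threshold."""
    return [1 if theta0 + theta1 * x >= threshold else 0 for x in sizes]


# Hypothetical tumor sizes; label 1 = malignant, 0 = benign.
sizes = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_before = classify(sizes, *fit_line(sizes, labels))

# Add one very large malignant tumor and refit the line.
sizes_extra = sizes + [40]
labels_extra = labels + [1]
pred_after = classify(sizes_extra, *fit_line(sizes_extra, labels_extra))

print("labels:      ", labels_extra)
print("before extra:", pred_before)
print("after extra: ", pred_after)
```

With these numbers the first fit classifies all eight samples correctly. After the size-40 sample is added, the line flattens and the malignant tumor of size 6 falls below the 0.5 threshold, so it is predicted benign.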

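ii. Hypothesis

As noted above, in a classification problem with only two outcomes the output is either $0$ or $1$, so we want to keep the classifier's output within $[0,1]$. In linear regression the hypothesis is $h_\theta(x) = \theta^T x$, whose output is clearly not confined to $[0,1]$, so that hypothesis is unsuitable for logistic regression. Instead we define the hypothesis as:

$h_\theta(x) =g( \theta^T x)$

where the function $g$ (the sigmoid, or logistic, function) is:

$g(z)=\frac{1}{1+e^{-z}}$

Its graph is an S-shaped curve that approaches $0$ as $z\to-\infty$ and $1$ as $z\to+\infty$, and it crosses the $y$-axis at $(0, 0.5)$. The hypothesis is therefore:

$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$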

Now let us see what the logistic regression hypothesis actually means. The value $h_\theta(x)$ is the estimated probability that $y=1$ for the input $x$. For example, suppose there are two features:

$\left[ \begin{matrix} x_0 \\ x_1 \end{matrix} \right]=\left[ \begin{matrix} 1 \\ tumorSize \end{matrix} \right]$

where $x_0$ is fixed at $1$, following the convention from the earlier notes listed at the top, and $x_1$ is the tumor size. If $h_\theta(x) = 0.7$, the patient's tumor has a probability of $0.7$ of being malignant. More formally, the hypothesis can be written as:

$h_\theta(x) = P(y=1|x;\theta)$

that is, the probability that $y=1$ given the input $x$, parameterized by $\theta$. Since $y$ can only be $0$ or $1$, it also follows that:

$P(y=0|x;\theta)+P(y=1|x;\theta)=1$

$P(y=0|x;\theta)=1-P(y=1|x;\theta)$

That covers the form of the hypothesis. Next is the decision boundary.
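As a small numeric illustration of the hypothesis described above, the sketch below evaluates the sigmoid and reads $h_\theta(x)$ as a probability. The parameter values and the tumor size are made up for the example, and `sigmoid` and `hypothesis` are just illustrative names.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


# Hypothetical parameters and input: x[0] is the constant 1, x[1] the tumor size.
theta = [-6.0, 1.0]
x = [1.0, 7.0]

p_malignant = hypothesis(theta, x)  # P(y = 1 | x; theta)
p_benign = 1.0 - p_malignant        # P(y = 0 | x; theta)

print(f"g(0) = {sigmoid(0.0)}")
print(f"P(y=1|x) = {p_malignant:.3f}, P(y=0|x) = {p_benign:.3f}")
```

Here $\theta^T x = 1$, so the model assigns a probability of about 0.73 to the tumor being malignant and about 0.27 to it being benign.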

iii. Decision Boundary

We have $h_\theta(x) = P(y=1|x;\theta)$, so when do we predict $y=1$ and when $y=0$? The usual rule is:

$y=\begin{cases} 1 & \text{if } h_\theta(x)\ge 0.5 \\ 0 & \text{if } h_\theta(x)< 0.5 \end{cases}$

For the function $g(z)$, we have $g(z)\ge 0.5$ when $z\ge0$ and $g(z)< 0.5$ when $z<0$. So $h_\theta(x) =g( \theta^T x) \ge 0.5$ exactly when $\theta^T x \ge 0$, and likewise $h_\theta(x) =g( \theta^T x) < 0.5$ exactly when $\theta^T x < 0$.

Now for the decision boundary itself. Suppose we have the sample set shown in the figure, the hypothesis function is $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$, and $\theta_0=-3, \theta_1=\theta_2=1$. Then:

$z=-3+x_1+x_2$

From the discussion above, predicting $y=1$ requires $z\ge0$, which here means:

$-3+x_1+x_2\ge0$

or equivalently:

$x_1+x_2 \ge 3$

Adding the line $x_1+x_2 = 3$ to the sample plot gives the following picture:

[Figure: the sample plot with the line $x_1+x_2=3$]

As with the linear inequalities taught in high school, every point on or to the upper right of this line satisfies $-3+x_1+x_2\ge0$, that is, $z\ge0$. This line is the decision boundary. Note that the line is determined by the parameters $\theta$ alone, not by the sample set; the samples are only used to choose $\theta$.
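The linear decision boundary above can be checked numerically. The sketch below uses the parameters from the example, $\theta_0=-3, \theta_1=\theta_2=1$; the test points and the function names are assumptions chosen for illustration.

```python
import math

THETA = (-3.0, 1.0, 1.0)  # theta_0, theta_1, theta_2 from the example above


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def score(x1, x2, theta=THETA):
    """z = theta_0 + theta_1 * x1 + theta_2 * x2."""
    return theta[0] + theta[1] * x1 + theta[2] * x2


def predict(x1, x2, theta=THETA):
    """Predict y = 1 on or above the line x1 + x2 = 3, else y = 0."""
    return 1 if score(x1, x2, theta) >= 0 else 0


# Hypothetical points: below the line, exactly on it, and above it.
for point in [(1.0, 1.0), (1.0, 2.0), (3.0, 3.0)]:
    z = score(*point)
    print(point, "z =", z, "h =", round(sigmoid(z), 3), "predict", predict(*point))
```

Thresholding $z$ at $0$ gives the same answer as thresholding $h_\theta(x)$ at $0.5$ for these points, which is why the boundary can be read directly from $\theta^T x = 0$.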

Now consider a nonlinear case, with the following sample set:

Let the hypothesis function be $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)$, and suppose

$\theta=\left[\begin{matrix} -1\\0\\0\\1\\1 \end{matrix}\right]$

Then $h_\theta(x)\ge 0.5$ (that is, predicting $y=1$) requires:

$-1+x_1^2+x_2^2\ge0$

that is:

$x_1^2+x_2^2\ge1$

Adding the curve $x_1^2+x_2^2=1$ to the sample plot gives the following picture:

[Figure: the sample plot with the circle $x_1^2+x_2^2=1$]

The purple curve in the figure is the decision boundary of $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)$. With a more complex hypothesis the decision boundary can take stranger shapes. It is also not limited to two or three dimensions: it can be a surface in a higher-dimensional space, which we simply cannot draw. Next we discuss the cost function.
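The circular boundary above works the same way once the squared terms are treated as extra features. A minimal sketch, using the $\theta$ from the example; the test points and the helper names `features` and `predict` are made up for illustration.

```python
THETA = (-1.0, 0.0, 0.0, 1.0, 1.0)  # theta_0 .. theta_4 from the example above


def features(x1, x2):
    """Map (x1, x2) to the feature vector (1, x1, x2, x1^2, x2^2)."""
    return (1.0, x1, x2, x1 ** 2, x2 ** 2)


def predict(x1, x2, theta=THETA):
    """Predict y = 1 when theta^T features >= 0, i.e. x1^2 + x2^2 >= 1."""
    z = sum(t * f for t, f in zip(theta, features(x1, x2)))
    return 1 if z >= 0 else 0


# Hypothetical points: two inside the unit circle, one on it, one outside.
for point in [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (2.0, -1.0)]:
    print(point, "->", predict(*point))
```

The model is still linear in $\theta$; only the features are nonlinear in $x_1$ and $x_2$, which is what bends the boundary into a circle.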

2. Logistic Regression Model

i. Cost Function

As with linear regression, we need a cost function to help us choose the best parameters $\theta$. Suppose the training set has $m$ examples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$, where

$x=\left[\begin{matrix} x_0\\x_1\\x_2\\ \vdots \\x_n \end{matrix}\right], \quad x_0=1,\quad y\in\{0,1\}$

and the hypothesis is:

$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$

How do we find the best $\theta$? The first step is to change the form of the cost function.

In linear regression the cost function is:

$J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$

Moving the factor $\frac{1}{2}$ inside the sum gives:

$J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{m} \sum_{i=1}^m \frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$

We give the summand $\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$ its own name:

$Cost(h_\theta(x^{(i)}),y^{(i)})=\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$

Dropping the superscript $i$ gives:

$Cost(h_\theta(x),y)=\frac{1}{2}(h_\theta(x)-y)^2$

Unfortunately, if we substitute the logistic hypothesis $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$ into $Cost(h_\theta(x),y)$, and then substitute $Cost(h_\theta(x),y)$ into the cost function $J(\theta)$, the resulting $J(\theta)$ is not convex in $\theta$. It can then have local optima, so gradient descent is no longer guaranteed to reach the global minimum.

What we want instead is a cost function shaped like this:

[Figure: the desired shape of $J(\theta)$, with a single minimum]

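To be able to use gradient descent to find the best $\theta$, we change the cost function. We introduce a new per-example cost:

$Cost(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & \text{if } y=1 \\ -\log(1-h_\theta(x)) & \text{if } y=0 \end{cases}$

Why use this piecewise function as the cost? When $y=1$, its graph is:

[Figure: $Cost$ as a function of $h_\theta(x)$ when $y=1$]

The figure shows that during training, if the label is $y=1$ and the prediction $h_\theta(x)$ is also $1$, the cost is $Cost = 0$. If the label is $y=1$ but the prediction $h_\theta(x)$ approaches $0$, the cost grows without bound, $Cost \to \infty$. A confident wrong prediction is penalized very heavily, which is what we want from a cost function.

When $y=0$, the graph is:

[Figure: $Cost$ as a function of $h_\theta(x)$ when $y=0$]

By the same reasoning, if the label is $y=0$ and the prediction $h_\theta(x)$ approaches $1$, then $Cost \to \infty$, and if the label is $y=0$ and the prediction $h_\theta(x)$ is also $0$, then $Cost = 0$. Moreover, the cost function $J(\theta)$ built from this $Cost$ is convex, so it has no local optima other than the global one and we can run gradient descent on it.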
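The convexity claim can be probed numerically. The sketch below is a simplification, not a proof: it takes one training example with $x=1$, $y=0$ and a scalar $\theta$ (all assumptions made for illustration) and estimates the curvature of both costs with a second finite difference. A convex function never has negative curvature.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def squared_cost(theta, x=1.0, y=0.0):
    """Squared-error cost 1/2 * (h - y)^2 with the logistic hypothesis."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2


def log_cost(theta, x=1.0, y=0.0):
    """Logistic cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    h = sigmoid(theta * x)
    return -y * math.log(h) - (1.0 - y) * math.log(1.0 - h)


def curvature(cost, theta, step=1e-3):
    """Second finite difference, an estimate of the second derivative."""
    return (cost(theta + step) - 2.0 * cost(theta) + cost(theta - step)) / step ** 2


for theta in (0.0, 3.0):
    print(
        f"theta = {theta}: "
        f"squared-error curvature = {curvature(squared_cost, theta):+.4f}, "
        f"log-cost curvature = {curvature(log_cost, theta):+.4f}"
    )
```

For this example the squared-error cost has positive curvature at $\theta=0$ but negative curvature at $\theta=3$, so it is not convex, while the logarithmic cost has positive curvature at both points.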

ii. Simplified Cost Function and Gradient Descent

Simplified Cost Function

Earlier we introduced the cost function:

$J(\theta)=\frac{1}{m} \sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})$

$Cost(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & \text{if } y=1 \\ -\log(1-h_\theta(x)) & \text{if } y=0 \end{cases}$

Note that $y$ is always either $1$ or $0$. The piecewise form is inconvenient for calculations such as taking partial derivatives, so we rewrite $Cost(h_\theta(x),y)$ as a single expression:

$Cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

Because one of the two terms always vanishes, this expression reduces to the piecewise form:

If	Then
$y=1$	$Cost(h_\theta(x),y)=-\log(h_\theta(x))$
$y=0$	$Cost(h_\theta(x),y)=-\log(1-h_\theta(x))$

So the cost function can be rewritten as:

$\begin{aligned} J(\theta) &=\frac{1}{m} \sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)}) \\ &=-\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) +(1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] \end{aligned}$

This form of the cost function is convenient for gradient descent.
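The equivalence of the piecewise and single-expression costs is easy to confirm in code. The following sketch uses made-up prediction values, and the function names are illustrative only.

```python
import math


def cost_piecewise(h, y):
    """-log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def cost_simplified(h, y):
    """-y * log(h) - (1 - y) * log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)


def total_cost(predictions, labels):
    """J = (1/m) * sum of the per-example costs."""
    m = len(labels)
    return sum(cost_simplified(h, y) for h, y in zip(predictions, labels)) / m


# Hypothetical predictions h_theta(x) strictly between 0 and 1.
for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        print(h, y, cost_piecewise(h, y), cost_simplified(h, y))

print("J =", total_cost([0.9, 0.2, 0.6], [1, 0, 1]))
```

Both functions assume $0 < h < 1$. A prediction of exactly $0$ or $1$ would make the logarithm undefined, which mirrors the unbounded cost discussed above.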

Gradient Descent

Just as in linear regression, we use gradient descent to solve for $\theta$, and the general form is the same:

$\begin{aligned} & \text{Repeat } \lbrace \\ & \quad \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\ & \rbrace \end{aligned}$

Working out the partial derivative with respect to $\theta_j$ gives the update below, which is repeated until convergence, updating all $\theta_j$ simultaneously:

$\begin{aligned} & \text{Repeat } \lbrace \\ & \quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ & \rbrace \end{aligned}$

This looks identical to the linear regression update, but $h_\theta$ is now the sigmoid of $\theta^T x$. To vectorize it, stack the predictions $h_\theta(x^{(i)})$ for $i=1,\dots,m$ into the vector $g(X \theta )$ (with $g$ applied elementwise) and stack the labels $y^{(i)}$ into $\vec{y}$. Multiplying by $X^T$ then carries out the sum over $i$ weighted by $x_j^{(i)}$ for every $j$ at once. The vectorized update is:

$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$

where

$X=\left[\begin{matrix} x_0^{(1)}&x_1^{(1)}&x_2^{(1)}&\cdots&x_n^{(1)}\\ x_0^{(2)}&x_1^{(2)}&x_2^{(2)}&\cdots&x_n^{(2)}\\ x_0^{(3)}&x_1^{(3)}&x_2^{(3)}&\cdots&x_n^{(3)}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ x_0^{(m)}&x_1^{(m)}&x_2^{(m)}&\cdots&x_n^{(m)} \end{matrix} \right]$

so

$X^T=\left[\begin{matrix} x_0^{(1)}&x_0^{(2)}&x_0^{(3)}&\cdots&x_0^{(m)}\\ x_1^{(1)}&x_1^{(2)}&x_1^{(3)}&\cdots&x_1^{(m)}\\ x_2^{(1)}&x_2^{(2)}&x_2^{(3)}&\cdots&x_2^{(m)}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ x_n^{(1)}&x_n^{(2)}&x_n^{(3)}&\cdots&x_n^{(m)} \end{matrix} \right]$

Note that $(g(X \theta ) - \vec{y})$ is an $m$-dimensional column vector and $X^T$ is an $(n+1)\times m$ matrix, so the product is an $(n+1)$-dimensional vector, matching $\theta$. Its $j$-th entry is exactly $\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$, which confirms the vectorized form.
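The vectorized update translates almost directly into NumPy. The sketch below is one possible implementation under stated assumptions: the data are made-up, mean-centered tumor sizes, the learning rate and iteration count are arbitrary choices, and a fixed number of iterations stands in for a real convergence test.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function g(z)."""
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat theta := theta - (alpha / m) * X^T (g(X theta) - y)."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(iterations):
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta


def predict(theta, X):
    """Predict y = 1 where h_theta(x) >= 0.5."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)


# Hypothetical, mean-centered tumor sizes (size minus 5); label 1 = malignant.
centered_sizes = np.array([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(centered_sizes), centered_sizes])  # x_0 = 1
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print("theta:", theta)
print("cost at theta = 0:", cost(np.zeros(2), X, y))
print("cost after training:", cost(theta, X, y))
print("predictions:", predict(theta, X))
```

Each row of `X` is one training example, matching the layout of $X$ above, so `X.T @ (...)` has one entry per parameter. This toy data set is perfectly separable, so the cost keeps shrinking as $\theta$ grows, and the loop simply stops after the chosen number of iterations.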

One might ask why, with all the nesting in the cost function above, the partial derivative still comes out as $\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. Let us work it out (warning: the computation is long).

1. To simplify what follows, first differentiate $g(z)=\frac{1}{1+e^{-z}}$:

$\begin{aligned} g'(z) &=\left(\frac{1}{1+e^{-z}}\right)' =\frac{-(1+e^{-z})'}{(1+e^{-z})^2} =\frac{-(e^{-z})'}{(1+e^{-z})^2} =\frac{-(-1)e^{-z}}{(1+e^{-z})^2} =\frac{e^{-z}}{(1+e^{-z})^2} \\ &=\left(\frac{1}{1+e^{-z}}\right)\left(\frac{e^{-z}}{1+e^{-z}}\right) =g(z)\left(\frac{1 + e^{-z} - 1}{1+e^{-z}}\right) =g(z)\left(\frac{1 + e^{-z}}{1+e^{-z}} - \frac{1}{1+e^{-z}}\right) =g(z)(1 - g(z)) \end{aligned}$
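The identity $g'(z) = g(z)(1-g(z))$ can be sanity-checked against a numerical derivative. This is a quick sketch; the evaluation points and the step size are arbitrary choices.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed form derived above: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)


def central_difference(f, z, step=1e-6):
    """Numerical derivative (f(z + step) - f(z - step)) / (2 * step)."""
    return (f(z + step) - f(z - step)) / (2.0 * step)


for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, sigmoid_derivative(z), central_difference(sigmoid, z))
```

The two columns agree to many decimal places, and the derivative is largest at $z=0$, where it equals $0.25$.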

2. Now for the partial derivative of $J(\theta)$, using $h_\theta(x^{(i)}) = g(\theta^T x^{(i)})$, the result $g'(z)=g(z)(1-g(z))$ from step 1, and $\frac{\partial}{\partial \theta_j} \theta^T x^{(i)} = x^{(i)}_j$:

$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \frac{\partial}{\partial \theta_j} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} \log (1 - h_\theta(x^{(i)})) \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})} \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} g(\theta^T x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - g(\theta^T x^{(i)}))}{1 - h_\theta(x^{(i)})} \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} g(\theta^T x^{(i)}) (1 - g(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} + \frac{- (1-y^{(i)}) g(\theta^T x^{(i)}) (1 - g(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1-y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} (1 - h_\theta(x^{(i)})) x^{(i)}_j - (1-y^{(i)}) h_\theta(x^{(i)}) x^{(i)}_j \right] \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} (1 - h_\theta(x^{(i)})) - (1-y^{(i)}) h_\theta(x^{(i)}) \right] x^{(i)}_j \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x^{(i)}_j \\
&= - \frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x^{(i)}_j \\
&= \frac{1}{m}\sum_{i=1}^m \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x^{(i)}_j
\end{aligned}$

So there is no need to doubt it: the partial derivative really does take this form. That concludes the discussion of the cost function.
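The derived gradient can also be checked numerically by comparing it with finite differences of $J(\theta)$. The sketch below uses a small made-up data set and an arbitrary $\theta$; the function names are illustrative.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function g(z)."""
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))


def analytic_gradient(theta, X, y):
    """Derived result: (1/m) * X^T (g(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def numeric_gradient(theta, X, y, step=1e-6):
    """Central-difference estimate of each partial derivative of J."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[j] = step
        grad[j] = (cost(theta + bump, X, y) - cost(theta - bump, X, y)) / (2.0 * step)
    return grad


# Hypothetical data: the first column is x_0 = 1.
X = np.array([
    [1.0, 0.5, 1.5],
    [1.0, -1.0, 2.0],
    [1.0, 2.0, -0.5],
    [1.0, 0.0, 1.0],
])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.1, -0.2, 0.3])

print("analytic:", analytic_gradient(theta, X, y))
print("numeric: ", numeric_gradient(theta, X, y))
```

If the derivation is right, the two vectors should agree up to the small error of the finite-difference approximation. This kind of gradient check is a common way to catch sign or indexing mistakes before running gradient descent.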