Notes: CS231n + assignment2 (Part 2)

I. Parameter update rules

1. SGD

  SGD is stochastic gradient descent. The simplest form of update changes the parameters along the negative gradient direction (the gradient points in the direction of increase, while we usually want to minimize the loss). Given a parameter vector x and its gradient dx, the simplest update is:

x += - learning_rate * dx

Here learning_rate is a hyperparameter that controls the size of each update, and it matters a lot: a learning rate that is too large can make the loss behave abnormally or blow up, while one that is too small makes training painfully slow. Some tricks for adjusting the learning rate are covered later.
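
For completeness, here is what this looks like in the same (w, dw, config) style used by the update rules below; this is a minimal sketch added for illustration, not code copied verbatim from the assignment (numpy is assumed as np throughout the snippets in this post):

import numpy as np

def sgd(w, dw, config=None):
    """
    Vanilla stochastic gradient descent.
    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)

    # step along the negative gradient direction
    next_w = w - config['learning_rate'] * dw
    return next_w, config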

  2. Momentum

    Also called the momentum method. I feel many explanations of it are not very clear. My own simple intuition: when updating, we keep some information from the previous gradients. It is reasonable to believe that a gradient that has been consistently pointing downhill in one direction will not suddenly swing off in another, so adding momentum smooths the update. This really shines on ravine-like loss surfaces (I am not sure that is the exact term), where the loss keeps decreasing along one direction but the gradient oscillates strongly from side to side; momentum damps the side-to-side oscillation while accelerating along the consistent direction.


def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.
    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    v = config['momentum'] * v - config['learning_rate'] * dw  # accumulate velocity
    next_w = w + v                                             # step along the velocity
    config['velocity'] = v

    return next_w, config
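
A quick usage sketch (the shapes and the random "gradient" are made up purely for illustration) showing that the velocity is carried from step to step through the returned config:

w = np.random.randn(3, 4)              # hypothetical parameter matrix
config = None
for step in range(10):
    dw = np.random.randn(*w.shape)     # stand-in for a real gradient
    w, config = sgd_momentum(w, dw, config)
# config['velocity'] now holds the accumulated velocity for w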

 

 

 3. Nesterov momentum

  This is an upgraded version of momentum: since we are going to use the information from previous gradients anyway, why not evaluate the gradient at the point the accumulated velocity is about to carry us to, and take that final direction directly?

x_ahead = x + mu * v
# compute dx_ahead, the gradient at x_ahead rather than at x
v = mu * v - learning_rate * dx_ahead
x += v

     In actual code we use the form below instead, because it fits the same interface as the previous update rules: in general we only ever have dx, the gradient at the parameter value we are storing:

v_prev = v # back up the old velocity
v = mu * v - learning_rate * dx # the velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # the position update changes form

     Why is this allowed? The key is the change of variables x_ahead = x + mu * v: treat x_ahead as the parameter we actually store, so the dx we compute is automatically the gradient at the look-ahead point, and substituting x = x_ahead - mu * v into the original update gives exactly the position update above. In other words, the rewritten form is an exact re-parameterization, not just something that happens to work similarly in practice.
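
For reference, here is the same rule written in the config-dict style of the other update functions. The name nesterov_momentum and its exact interface are my own choices for illustration; this particular function is not part of the code shown in this post:

def nesterov_momentum(w, dw, config=None):
    """
    Nesterov momentum, with w interpreted as the look-ahead variable,
    so dw is the gradient evaluated at the look-ahead point.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    mu = config['momentum']
    v_prev = config.get('velocity', np.zeros_like(w))

    v = mu * v_prev - config['learning_rate'] * dw   # velocity update is unchanged
    next_w = w - mu * v_prev + (1 + mu) * v          # position update in look-ahead form
    config['velocity'] = v

    return next_w, config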

4. RMSProp and Adam

  These two rules adapt the effective learning rate of each parameter based on the history of its gradients.

def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.
    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the
      squared gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    cache = config['cache']
    decay_rate = config['decay_rate']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']
    # leaky moving average of squared gradients: parameters with consistently
    # large gradients end up with a smaller effective step size
    cache = decay_rate * cache + (1 - decay_rate) * (dx**2)
    x += -learning_rate * dx / (np.sqrt(cache) + epsilon)
    config['cache'] = cache
    next_x = x

    return next_x, config

def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.
    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)

    m = config['m']
    v = config['v']
    beta1 = config['beta1']
    beta2 = config['beta2']
    learning_rate = config['learning_rate']
    epsilon = config['epsilon']
    t = config['t'] + 1

    m = beta1 * m + (1 - beta1) * dx          # first moment: moving average of the gradient
    v = beta2 * v + (1 - beta2) * (dx**2)     # second moment: moving average of the squared gradient
    m_bias = m / (1 - beta1**t)               # bias correction for the warm-up phase
    v_bias = v / (1 - beta2**t)
    x += -learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)
    next_x = x
    config['m'] = m
    config['v'] = v
    config['t'] = t

    return next_x, config
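
All of these rules share the same (w, dw, config) -> (next_w, config) interface, which is what lets the assignment's Solver swap them freely. Roughly, a training loop uses them like this (the quadratic loss_and_grad below is a toy stand-in, purely for illustration, for a real forward/backward pass over a mini-batch):

def loss_and_grad(w):
    # toy quadratic loss with gradient w; a real model would do forward + backward here
    return 0.5 * np.sum(w**2), w

def train(w, update_rule, num_steps=1000):
    config = None
    for step in range(num_steps):
        loss, dw = loss_and_grad(w)              # compute loss and gradient
        w, config = update_rule(w, dw, config)   # e.g. sgd_momentum, rmsprop or adam
    return w

w = train(np.random.randn(3, 4), adam)           # w should end up close to 0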

II. Batch Normalization

  Next are two very useful techniques for training neural networks. The first is batch normalization. A short explanation: we know the network input is usually preprocessed to have zero mean and unit variance, which helps training, but after the data passes through a few layers it generally no longer keeps this property. BN simply repeats this normalization at the start of each layer. Since normalizing may throw away some information, BN also adds the parameters gamma and beta to scale and shift the normalized values back; both are learnable, so the layer can even recover the original activations if that turns out to be best.

      

# initializing the BN parameters inside the fully connected net
if self.use_batchnorm and i < len(hidden_dims):
    self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))
    self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))

The key pieces of BN are the forward pass, the backward pass, and how the layer behaves when the network is actually used at test time.

(Figures: derivation of the BN forward and backward formulas; images not preserved.)
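
As a stand-in for the lost figures, these are the standard formulas (as in the Batch Normalization paper) that the implementation below follows; this is a reconstruction, not the original images:

\begin{aligned}
\mu_B &= \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_B)^2, \qquad
\hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta \\[4pt]
\frac{\partial L}{\partial \hat{x}_i} &= \frac{\partial L}{\partial y_i}\,\gamma, \qquad
\frac{\partial L}{\partial \sigma_B^2} = -\frac{1}{2}\sum_{i}\frac{\partial L}{\partial \hat{x}_i}\,(x_i-\mu_B)\,(\sigma_B^2+\epsilon)^{-3/2} \\[4pt]
\frac{\partial L}{\partial \mu_B} &= -\sum_{i}\frac{\partial L}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma_B^2+\epsilon}}
 - \frac{\partial L}{\partial \sigma_B^2}\,\frac{2}{N}\sum_{i}(x_i-\mu_B) \\[4pt]
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma_B^2+\epsilon}}
 + \frac{\partial L}{\partial \sigma_B^2}\,\frac{2(x_i-\mu_B)}{N}
 + \frac{\partial L}{\partial \mu_B}\,\frac{1}{N}, \qquad
\frac{\partial L}{\partial \gamma} = \sum_{i}\frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i}\frac{\partial L}{\partial y_i}
\end{aligned}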

Following these formulas, the BN layer can be built as follows:

def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']  # train and test behave differently
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        sample_mean = np.mean(x, axis=0, keepdims=True)       # [1,D]
        sample_var = np.var(x, axis=0, keepdims=True)         # [1,D]
        x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)    # [N,D]
        out = gamma * x_normalized + beta
        cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)
        # keep exponential moving averages of the batch statistics for test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':
        # at test time, normalize with the accumulated running statistics
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normalized + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
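
A quick sanity check (made-up shapes and scaling) that in train mode the output of this layer has roughly zero mean and unit variance per feature when gamma = 1 and beta = 0:

N, D = 200, 5
x = 10 * np.random.randn(N, D) + 3               # deliberately badly scaled input
gamma, beta = np.ones((1, D)), np.zeros((1, D))
out, _ = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
print(out.mean(axis=0))   # approximately 0 for every feature
print(out.std(axis=0))    # approximately 1 for every feature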

def batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N, D = x.shape
    dx_normalized = dout * gamma       # [N,D]
    x_mu = x - sample_mean             # [N,D]
    sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]
    dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
    dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \
                   2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
    dx1 = dx_normalized * sample_std_inv
    dx2 = 2.0/N * dsample_var * x_mu
    dx = dx1 + dx2 + 1.0/N * dsample_mean
    dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)

    return dx, dgamma, dbeta
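
To convince yourself the backward pass is right, a rough numerical gradient check on dx can be done like this (the helper num_grad is written here just for illustration; the assignment ships its own eval_numerical_gradient_array utility):

def num_grad(f, x, dout, h=1e-5):
    # centered finite differences of f at x, contracted with the upstream gradient dout
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h; pos = f(x).copy()
        x[idx] = old - h; neg = f(x).copy()
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

x = np.random.randn(4, 3)
gamma, beta = np.random.randn(1, 3), np.random.randn(1, 3)
dout = np.random.randn(4, 3)
_, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
dx_num = num_grad(lambda x: batchnorm_forward(x, gamma, beta, {'mode': 'train'})[0], x, dout)
print(np.max(np.abs(dx - dx_num)))   # should be very small, around 1e-8 or less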

One important problem that Batch Normalization tackles is gradient saturation; combined with ReLU, it can be said to largely eliminate the saturation problem.

 

III. Dropout

  Dropout is very easy to understand: during training, the neurons in each layer are dropped with some probability, as in the figure below:

  (Figure illustrating dropout; image not preserved.)

  My own understanding: applying dropout at every training step helps prevent overfitting. Why? Because each step effectively trains a differently shaped sub-network, while all of these sub-networks share their parameters. You can loosely think of it as bagging, and bagging is famously a weapon against overfitting in machine learning. Capping the number of active neurons at each step also keeps an over-large network from overfitting the dataset.

  Dropout can also be viewed as a regularizer: at every training step it forces some features to be zero, which improves the network's ability to form sparse representations.

  

def dropout_forward(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None
    if mode == 'train':
        # inverted dropout: keep each unit with probability p and scale by 1/p,
        # so the expected activation is unchanged and test time needs no rescaling
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        out = x

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache
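
A tiny check (illustrative numbers) that, thanks to the division by p, the train-mode output has roughly the same expected value as the untouched test-mode output:

x = np.random.randn(500, 500) + 10
for p in [0.3, 0.6, 0.9]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})
    print(p, out_train.mean(), out_test.mean())   # the two means should be close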


def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']
    dx = None

    if mode == 'train':
        dx = dout * mask   # only the units that were kept receive gradient
    elif mode == 'test':
        dx = dout

    return dx

IV. Summary

  In short, many of these tricks are very useful when training a model, although some of them, like BN, are gradually being replaced or dropped, which shows how important it is to keep following the latest conferences. Also, some of the derivations are worth redoing by hand; I plan to go through them again when I revisit the CS231n lecture notes.

Original post: http://www.cnblogs.com/daihengchen/p/5769999.html
