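(2) It can form only short-term memory, not long-term memory. The gradient shrinks at every layer (time step) it is propagated back through, so only nearby steps receive gradients of comparable magnitude. As a result, the network remembers recent information well and distant information poorly.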
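The shrinking gradient can be made concrete with a small NumPy experiment. This is a minimal sketch under stated assumptions, not part of the original notes: the function name `gradient_norms_through_time`, the recurrence $h_t = \tanh(W_h h_{t-1} + x_t)$, and the choice of a recurrent matrix with spectral norm 0.5 are all illustrative. With that choice the gradient norm is guaranteed to at least halve at every step back in time.

```python
import numpy as np


def gradient_norms_through_time(T=20, H=8, scale=0.5, seed=0):
    """Backpropagate through T tanh-RNN steps and record the gradient norm.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # An orthogonal matrix times `scale` has spectral norm exactly `scale`.
    Q, _ = np.linalg.qr(rng.standard_normal((H, H)))
    Wh = scale * Q

    # Forward pass: h_t = tanh(Wh h_{t-1} + x_t) with random inputs x_t.
    h = np.zeros(H)
    hs = []
    for _ in range(T):
        h = np.tanh(Wh @ h + rng.standard_normal(H))
        hs.append(h)

    # Backward pass: start from a unit-norm gradient on the last hidden state.
    grad = np.ones(H) / np.sqrt(H)
    norms = []
    for h_t in reversed(hs):
        # dL/dh_{t-1} = Wh^T ((1 - h_t^2) * dL/dh_t)
        grad = Wh.T @ ((1.0 - h_t ** 2) * grad)
        norms.append(np.linalg.norm(grad))
    return norms


norms = gradient_norms_through_time()
print("one step back:", norms[0])
print("twenty steps back:", norms[-1])
```

Because the tanh derivative is at most 1 and the recurrent matrix contracts by `scale`, the norm after `k` steps is bounded by `scale ** k`, which is why distant time steps contribute almost nothing to the update.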
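All RNNs take the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.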
In an RNN, $x_t\in \mathbb{R}^{D}, h_t\in \mathbb{R}^H, W_x\in\mathbb{R}^{H\times D}, W_h\in\mathbb{R}^{H\times H}, b\in\mathbb{R}^{H}$.
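A single step of such a standard RNN can be sketched as follows. This is an illustrative sketch, not the author's code: the function name `rnn_step_forward` and the sizes `N, D, H = 2, 3, 4` are assumptions. It uses the batched row-vector convention, so `Wx` has shape `(D, H)`, the transpose of the $W_x$ defined above.

```python
import numpy as np


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """One vanilla RNN step: next_h = tanh(x Wx + prev_h Wh + b).

    x: (N, D), prev_h: (N, H), Wx: (D, H), Wh: (H, H), b: (H,)
    """
    return np.tanh(x @ Wx + prev_h @ Wh + b)


rng = np.random.default_rng(0)
N, D, H = 2, 3, 4                      # illustrative sizes
x = rng.standard_normal((N, D))
prev_h = np.zeros((N, H))
Wx = rng.standard_normal((D, H))
Wh = rng.standard_normal((H, H))
b = np.zeros(H)

next_h = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)  # (2, 4)
```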
In an LSTM, $x_t\in \mathbb{R}^{D}, h_t\in \mathbb{R}^H, W_x\in\mathbb{R}^{4H\times D}, W_h\in\mathbb{R}^{4H\times H}, b\in\mathbb{R}^{4H}$.
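The first step is the same in both: compute $a=W_xx_t + W_hh_{t-1}+b$. In an RNN, $a\in\mathbb{R}^{H}$ is passed through the activation and used directly as the next state. In an LSTM, $a\in\mathbb{R}^{4H}$ is split into four parts $a_i, a_f, a_o, a_g\in\mathbb{R}^{H}$, which give four outputs.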
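The shapes involved can be checked with a few lines of NumPy. This is a sketch under assumptions: the sizes `D = 3` and `H = 4` are arbitrary, and the order of the four blocks (input, forget, output, candidate) is taken to match the slicing used in the implementation later in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 4                             # illustrative sizes
x_t = rng.standard_normal(D)
h_prev = rng.standard_normal(H)
Wx = rng.standard_normal((4 * H, D))    # W_x in R^{4H x D}
Wh = rng.standard_normal((4 * H, H))    # W_h in R^{4H x H}
b = np.zeros(4 * H)

# One affine map produces all four pre-activations at once.
a = Wx @ x_t + Wh @ h_prev + b
# Split into four equal blocks of length H.
a_i, a_f, a_o, a_g = np.split(a, 4)
print(a.shape, a_i.shape)  # (16,) (4,)
```

Stacking the four weight blocks into one matrix means a single matrix product per step instead of four.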
$$
i = \sigma(a_i) \hspace{2pc}
f = \sigma(a_f) \hspace{2pc}
o = \sigma(a_o) \hspace{2pc}
g = \tanh(a_g)
$$

$$
c_{t} = f\odot c_{t-1} + i\odot g \hspace{4pc}
h_t = o\odot\tanh(c_t)
$$
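The next step is to decide what new information is stored in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" ($i$) decides which values to update. Then, a tanh layer called the "block gate layer" ($g$) creates a vector of new candidate values that can be added to the state. In the following step, these two pieces of information are combined to produce the update to the state.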
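A hand-picked numerical example shows how the gates combine in the cell update $c_t = f\odot c_{t-1} + i\odot g$. The values below are made up for illustration, with the gates set to exactly 0 or 1 so the effect is easy to read. In a trained network they would be sigmoid outputs strictly between 0 and 1.

```python
import numpy as np

c_prev = np.array([2.0, -3.0])   # previous cell state
f = np.array([1.0, 0.0])         # forget gate: keep unit 0, erase unit 1
i = np.array([0.0, 1.0])         # input gate: block unit 0, write unit 1
g = np.array([0.5, 0.5])         # candidate values from the tanh layer

c_t = f * c_prev + i * g
print(c_t.tolist())  # [2.0, 0.5]
```

Unit 0 carries its old value forward unchanged, while unit 1 discards its old value and takes the candidate instead.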
Forward computation and backward gradients

    import numpy as np


    def sigmoid(x):
        """Numerically stable sigmoid.

        Stand-in for the helper the original code assumes is defined earlier.
        """
        z = np.exp(-np.abs(x))  # always in (0, 1], so exp never overflows
        return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))


    def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
        """
        Forward pass for a single timestep of an LSTM.

        The input data has dimension D, the hidden state has dimension H, and
        we use a minibatch size of N.

        Inputs:
        - x: Input data, of shape (N, D)
        - prev_h: Previous hidden state, of shape (N, H)
        - prev_c: Previous cell state, of shape (N, H)
        - Wx: Input-to-hidden weights, of shape (D, 4H)
        - Wh: Hidden-to-hidden weights, of shape (H, 4H)
        - b: Biases, of shape (4H,)

        Returns a tuple of:
        - next_h: Next hidden state, of shape (N, H)
        - next_c: Next cell state, of shape (N, H)
        - cache: Tuple of values needed for backward pass.
        """
        H = Wh.shape[0]
        a = np.dot(x, Wx) + np.dot(prev_h, Wh) + b   # (N, 4H) pre-activations
        i = sigmoid(a[:, 0:H])                       # input gate
        f = sigmoid(a[:, H:2 * H])                   # forget gate
        o = sigmoid(a[:, 2 * H:3 * H])               # output gate
        g = np.tanh(a[:, 3 * H:4 * H])               # candidate values
        next_c = f * prev_c + i * g
        next_h = o * np.tanh(next_c)
        cache = (i, f, o, g, x, Wx, Wh, prev_c, prev_h, next_c)
        return next_h, next_c, cache


    def lstm_step_backward(dnext_h, dnext_c, cache):
        """
        Backward pass for a single timestep of an LSTM.

        Inputs:
        - dnext_h: Gradients of next hidden state, of shape (N, H)
        - dnext_c: Gradients of next cell state, of shape (N, H)
        - cache: Values from the forward pass

        Returns a tuple of:
        - dx: Gradient of input data, of shape (N, D)
        - dprev_h: Gradient of previous hidden state, of shape (N, H)
        - dprev_c: Gradient of previous cell state, of shape (N, H)
        - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
        - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
        - db: Gradient of biases, of shape (4H,)
        """
        i, f, o, g, x, Wx, Wh, prev_c, prev_h, next_c = cache
        tanh_c = np.tanh(next_c)

        do = dnext_h * tanh_c
        # Total gradient on c_t: the part from the next timestep plus the part
        # flowing through h_t. Computed out of place so the caller's dnext_c
        # is not modified.
        dc = dnext_c + o * (1 - tanh_c ** 2) * dnext_h
        di = dc * g
        df = dc * prev_c
        dg = dc * i
        dprev_c = dc * f

        # Local derivatives of sigmoid and tanh, written in terms of outputs.
        da = np.hstack([i * (1 - i) * di,
                        f * (1 - f) * df,
                        o * (1 - o) * do,
                        (1 - g * g) * dg])

        dx = np.dot(da, Wx.T)
        dWx = np.dot(x.T, da)
        dprev_h = np.dot(da, Wh.T)
        dWh = np.dot(prev_h.T, da)
        db = np.sum(da, axis=0)
        return dx, dprev_h, dprev_c, dWx, dWh, db


    def lstm_forward(x, h0, Wx, Wh, b):
        """
        Forward pass for an LSTM over an entire sequence of data.

        We assume an input sequence composed of T vectors, each of dimension D.
        The LSTM uses a hidden size of H, and we work over a minibatch
        containing N sequences. After running the LSTM forward, we return the
        hidden states for all timesteps.

        Note that the initial hidden state is passed as input, but the initial
        cell state is set to zero. Also note that the cell state is not
        returned; it is an internal variable to the LSTM and is not accessed
        from outside.

        Inputs:
        - x: Input data of shape (N, T, D)
        - h0: Initial hidden state of shape (N, H)
        - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
        - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
        - b: Biases of shape (4H,)

        Returns a tuple of:
        - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
        - cache: Values needed for the backward pass.
        """
        N, T, D = x.shape
        H = h0.shape[1]
        h = np.zeros((N, T, H))
        cache = {}
        prev_h = h0
        prev_c = np.zeros((N, H))
        for t in range(T):
            prev_h, prev_c, cache[t] = lstm_step_forward(
                x[:, t, :], prev_h, prev_c, Wx, Wh, b)
            h[:, t, :] = prev_h
        return h, cache


    def lstm_backward(dh, cache):
        """
        Backward pass for an LSTM over an entire sequence of data.

        Inputs:
        - dh: Upstream gradients of hidden states, of shape (N, T, H)
        - cache: Values from the forward pass

        Returns a tuple of:
        - dx: Gradient of input data of shape (N, T, D)
        - dh0: Gradient of initial hidden state of shape (N, H)
        - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
        - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
        - db: Gradient of biases, of shape (4H,)
        """
        N, T, H = dh.shape
        D = cache[0][4].shape[1]          # cache[t][4] is the input x at step t
        dprev_h = np.zeros((N, H))
        dprev_c = np.zeros((N, H))
        dx = np.zeros((N, T, D))
        dWx = np.zeros((D, 4 * H))
        dWh = np.zeros((H, 4 * H))
        db = np.zeros((4 * H,))
        for t in reversed(range(T)):
            # Gradient on h_t: upstream loss gradient plus the recurrent path.
            dnext_h = dh[:, t, :] + dprev_h
            dx[:, t, :], dprev_h, dprev_c, dWxt, dWht, dbt = lstm_step_backward(
                dnext_h, dprev_c, cache[t])
            # The weights are shared across timesteps, so gradients accumulate.
            dWx += dWxt
            dWh += dWht
            db += dbt
        dh0 = dprev_h
        return dx, dh0, dWx, dWh, db
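The single-step backward pass can be read off from the forward equations by the chain rule. The derivation below is a sketch in the column-vector convention of $a=W_xx_t + W_hh_{t-1}+b$. The notation $\delta v = \partial L/\partial v$ is introduced here for brevity and does not appear in the original notes. $\delta h_t$ and $\delta c_t$ are the gradients arriving from later time steps (`dnext_h` and `dnext_c` in the code).

$$
\begin{aligned}
\delta o &= \delta h_t \odot \tanh(c_t), &
\delta c_t &\leftarrow \delta c_t + \delta h_t \odot o \odot \left(1-\tanh^2(c_t)\right),\\
\delta i &= \delta c_t \odot g, &
\delta f &= \delta c_t \odot c_{t-1},\\
\delta g &= \delta c_t \odot i, &
\delta c_{t-1} &= \delta c_t \odot f,\\
\delta a_i &= i\odot(1-i)\odot\delta i, &
\delta a_f &= f\odot(1-f)\odot\delta f,\\
\delta a_o &= o\odot(1-o)\odot\delta o, &
\delta a_g &= (1-g\odot g)\odot\delta g.
\end{aligned}
$$

Stacking $\delta a = [\delta a_i;\ \delta a_f;\ \delta a_o;\ \delta a_g]\in\mathbb{R}^{4H}$ and differentiating the affine map gives

$$
\delta x_t = W_x^{\top}\delta a, \hspace{2pc}
\delta h_{t-1} = W_h^{\top}\delta a, \hspace{2pc}
\delta W_x = \delta a\, x_t^{\top}, \hspace{2pc}
\delta W_h = \delta a\, h_{t-1}^{\top}, \hspace{2pc}
\delta b = \delta a.
$$

The code works on a minibatch stored row-wise, so its weight matrices are the transposes of $W_x$ and $W_h$, and the bias gradient is summed over the batch dimension.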