CS224N Lecture 7 Notes

Published 2020-06-20

Vanishing Gradients and Fancy RNNs


  • Vanishing gradient

  • Gradient clipping
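Gradient clipping rescales the gradient whenever its norm exceeds a threshold, keeping the update direction but bounding its size. A minimal NumPy sketch (the threshold of 5.0 is illustrative):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]))  # norm 50 -> rescaled to norm 5
```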

  • Long Short-Term Memory (LSTM)

    • hidden state h(t) and cell state c(t)
    • the cell stores long-term information
    • gates:
      • also vectors of length n;
      • on each timestep, each element of a gate can be open (1), closed (0), or somewhere in between;
      • the gates are dynamic: their values are computed based on the current context;
    • Slide 23 ★
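The gate equations on that slide can be sketched as a single NumPy timestep. The stacked parameter layout (four gate blocks in `W`, `U`, `b`) is an assumption for compactness, not the slide's exact notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W (4n x m), U (4n x n), b (4n,) stack the
    parameters for the input, forget, and output gates and the
    candidate cell content along axis 0 (assumed layout)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: what to write
    f = sigmoid(z[n:2*n])        # forget gate: what to keep from c_prev
    o = sigmoid(z[2*n:3*n])      # output gate: what to expose in h
    g = np.tanh(z[3*n:4*n])      # candidate cell content
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

The key point for vanishing gradients: the cell update is additive (`f * c_prev + i * g`), so when the forget gate is near 1 information can flow across many timesteps without repeated squashing.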
  • Gated Recurrent Units (GRU)

    • hidden state h(t) (no cell state)
    • Update gate; Reset gate;
    • Slide 28 ★
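A corresponding sketch of one GRU timestep, again with an assumed stacked parameter layout (update, reset, and candidate blocks in `W`, `U`, `b`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU timestep; no separate cell state. W (3n x m), U (3n x n),
    b (3n,) stack update, reset, and candidate parameters (assumed layout)."""
    n = h_prev.shape[0]
    u = sigmoid(W[0:n] @ x + U[0:n] @ h_prev + b[0:n])          # update gate
    r = sigmoid(W[n:2*n] @ x + U[n:2*n] @ h_prev + b[n:2*n])    # reset gate
    h_tilde = np.tanh(W[2*n:] @ x + U[2*n:] @ (r * h_prev) + b[2*n:])
    h = (1 - u) * h_prev + u * h_tilde   # interpolate old state and candidate
    return h
```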
  • LSTM vs GRU

    • GRU is quicker to compute and has fewer parameters
    • LSTM is a good default choice
    • Rule of thumb: start with LSTM, but switch to GRU for efficiency
  • Networks that are deep in space (many stacked layers), not just deep in time

    • ResNet (skip connections)
    • DenseNet
    • HighwayNet
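The common idea behind these architectures is an identity path that lets gradients bypass the nonlinear transformation. A minimal residual-block sketch (the two-layer form and weight names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Residual (skip) connection: output = x + F(x).
    The identity term passes gradients straight through,
    so stacking many blocks does not shrink them multiplicatively."""
    return x + W2 @ relu(W1 @ x)
```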


  • Other remedies for vanishing gradients
    • start from an identity-matrix initialization of the recurrent weights
    • use ReLU instead of sigmoid
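These two remedies can be combined in a plain RNN step: initialize the recurrent weight matrix to the identity and use ReLU, so early in training the (non-negative) hidden state is copied forward almost unchanged instead of being repeatedly squashed. A sketch with assumed shapes:

```python
import numpy as np

n = 4
W_hh = np.eye(n)                      # identity init of recurrent weights
W_xh = 0.01 * np.random.randn(n, n)   # small random input weights

def rnn_step(x, h_prev):
    # With W_hh = I and ReLU, a non-negative h_prev passes through
    # unchanged when the input contribution is small, so gradients
    # do not vanish at the start of training.
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x)
```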

Suggested Readings

  1. Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.3, 10.5, 10.7-10.12)

  2. Learning long-term dependencies with gradient descent is difficult
    one of the original vanishing gradient papers; describes the vanishing gradient problem in rigorous mathematical language, fair enough... analyzes the problem and, from the angle of how the loss gradient is propagated, proposes several alternatives; I didn't work through the math in detail

  3. On the difficulty of training Recurrent Neural Networks
    proof of the vanishing gradient problem; describes vanishing and exploding gradients in terms of dynamical systems, which I couldn't quite follow; uses a regularization term to constrain the gradient dx(t)/dx(t-1) to stay close to 1 in the relevant directions

  4. Vanishing Gradients Jupyter Notebook
    demo for feed-forward networks

  5. Understanding LSTM Networks
    blog post overview