CS224N Lecture 7 Notes

Published 2020-06-20

Vanishing Gradients and Fancy RNNs


  • Vanishing gradient

  • Gradient clipping
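Gradient clipping rescales the gradient whenever its norm exceeds a threshold, keeping the update direction but bounding its size. A minimal NumPy sketch (the threshold of 5.0 is illustrative):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]))  # norm 50 -> rescaled to norm 5
```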

  • Long Short-Term Memory (LSTM)

    • hidden state h(t) and cell state c(t)
    • the cell stores long-term information
    • gates:
      • also vectors of length n;
      • on each timestep, each element of a gate can be open (1), closed (0), or somewhere in between;
      • the gates are dynamic: their values are computed based on the current context;
    • Slide 23 ★
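The gate equations on that slide can be sketched as a single NumPy timestep. The stacked parameter layout (four gate blocks in `W`, `U`, `b`) is an assumption for compactness, not the slide's exact notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W (4n x m), U (4n x n), b (4n,) stack the
    parameters for the input, forget, and output gates and the
    candidate cell content along axis 0 (assumed layout)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: what to write
    f = sigmoid(z[n:2*n])        # forget gate: what to keep from c_prev
    o = sigmoid(z[2*n:3*n])      # output gate: what to expose in h
    g = np.tanh(z[3*n:4*n])      # candidate cell content
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

The key point for vanishing gradients: the cell update is additive (`f * c_prev + i * g`), so when the forget gate is near 1 information can flow across many timesteps without repeated squashing.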
  • Gated Recurrent Units (GRU)

    • hidden state h(t) (no cell state)
    • Update gate; Reset gate;
    • Slide 28 ★
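A corresponding sketch of one GRU timestep, again with an assumed stacked parameter layout (update, reset, and candidate blocks in `W`, `U`, `b`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU timestep; no separate cell state. W (3n x m), U (3n x n),
    b (3n,) stack update, reset, and candidate parameters (assumed layout)."""
    n = h_prev.shape[0]
    u = sigmoid(W[0:n] @ x + U[0:n] @ h_prev + b[0:n])          # update gate
    r = sigmoid(W[n:2*n] @ x + U[n:2*n] @ h_prev + b[n:2*n])    # reset gate
    h_tilde = np.tanh(W[2*n:] @ x + U[2*n:] @ (r * h_prev) + b[2*n:])
    h = (1 - u) * h_prev + u * h_tilde   # interpolate old state and candidate
    return h
```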
  • LSTM vs GRU

    • GRU is quicker to compute and has fewer parameters
    • LSTM is a good default choice
    • Rule of thumb: start with LSTM, but switch to GRU for efficiency
  • Networks that are deep in space (many stacked layers), not just deep in time

    • ResNet (skip connections)
    • DenseNet
    • HighwayNet
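The common idea behind these architectures is an identity path that lets gradients bypass the nonlinear transformation. A minimal residual-block sketch (the two-layer form and weight names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Residual (skip) connection: output = x + F(x).
    The identity term passes gradients straight through,
    so stacking many blocks does not shrink them multiplicatively."""
    return x + W2 @ relu(W1 @ x)
```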


  • Other remedies for vanishing gradients
    • start from an identity-matrix initialization of the recurrent weights
    • use ReLU instead of sigmoid
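These two remedies can be combined in a plain RNN step: initialize the recurrent weight matrix to the identity and use ReLU, so early in training the (non-negative) hidden state is copied forward almost unchanged instead of being repeatedly squashed. A sketch with assumed shapes:

```python
import numpy as np

n = 4
W_hh = np.eye(n)                      # identity init of recurrent weights
W_xh = 0.01 * np.random.randn(n, n)   # small random input weights

def rnn_step(x, h_prev):
    # With W_hh = I and ReLU, a non-negative h_prev passes through
    # unchanged when the input contribution is small, so gradients
    # do not vanish at the start of training.
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x)
```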

Suggested Readings

  1. Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.3, 10.5, 10.7-10.12)

  2. Learning long-term dependencies with gradient descent is difficult
    one of the original vanishing gradient papers; describes the vanishing gradient problem in rigorous mathematical language, fair enough... analyzes the problem and, from the angle of how the loss gradient is propagated, proposes several alternatives; I didn't work through the math in detail

  3. On the difficulty of training Recurrent Neural Networks
    proof of the vanishing gradient problem; describes vanishing and exploding gradients in terms of dynamical systems, which I couldn't quite follow; uses a regularization term to constrain the gradient dx(t)/dx(t-1) to stay close to 1 in the relevant directions

  4. Vanishing Gradients Jupyter Notebook
    demo for feed-forward networks

  5. Understanding LSTM Networks
    blog post overview