CS224N Lecture 14 Notes


Transformers and Self-Attention


  • Self-Attention
    • Constant "path length" between any two positions
    • Unbounded memory
    • Trivial to parallelize (per layer)
    • Models self-similarity
    • Relative attention provides expressive timing, equivariance, and extends naturally to graphs
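To make the first two bullets concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (names and shapes are illustrative, not from the lecture). Because every position attends to every other position in one matrix product, the "path length" between any two tokens is constant, and all positions are computed in parallel within a layer.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every position scores every other position in one step: constant path length.
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (seq_len, seq_len)
    # Row-wise softmax over all positions (unbounded "memory" of the sequence).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (seq_len, d_k)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                     # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                    # (5, 4)
```

Note there is no recurrence: the whole `(seq_len, seq_len)` score matrix is one matmul, which is why self-attention parallelizes per layer where an RNN cannot.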

Suggested Readings

  1. Image Transformer
  2. Music Transformer: Generating Music with Long-Term Structure