CS224N Lecture 13 Notes

Published 2020-07-17 · 571 views

Contextual Word Representations and Pre-training


  • Tips for handling <UNK> with word vectors

    • If a word that is <UNK> at training time appears in your unsupervised word embeddings, use that pretrained vector as is at test time.
    • Additionally, for the remaining unknown words, just assign them a random vector and add them to your vocabulary.
    • Collapse words into classes (like unknown number, capitalized thing, etc.) and have an <UNK-class> token for each.
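The tips above can be sketched as a small lookup routine; the function name and the vector dimension are my own, illustrative choices:

```python
import numpy as np

def lookup_with_unk_tips(word, train_vocab, pretrained, dim=300,
                         rng=np.random.default_rng(0)):
    """Resolve a test-time word following the <UNK> tips (hypothetical helper)."""
    if word in train_vocab:
        return train_vocab[word]
    if word in pretrained:                       # tip 1: reuse the unsupervised vector as is
        vec = pretrained[word]
    else:                                        # tip 2: fall back to a random vector
        vec = rng.normal(scale=0.1, size=dim)
    train_vocab[word] = vec                      # ...and add it to the vocabulary
    return vec
```

A production system would also apply the class-collapsing tip (numbers, capitalization patterns) before falling back to a random vector.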
  • Problems with word2vec, GloVe, and fastText

    • They always produce the same representation for a word type, regardless of the context in which a word token occurs.
    • Each word gets only one representation, but words have different aspects, including semantics, syntactic behavior, and register/connotations.
  • TagLM-"Pre-ELMo"

  • CoVe

    • Uses NMT as the pre-training objective
  • ELMo

    • 2 biLSTM layers; character CNN; residual connections
    • The lower layer is better for lower-level syntax
    • The higher layer is better for higher-level semantics
  • ULMFiT

    • Same general idea of transferring NLM knowledge
    • Train the LM on a big general-domain corpus (use a biLM)
    • Tune LM on target task data
    • Fine-tune as classifier on target task
  • Transformer

    • Motivation: we want parallelization, but RNNs are inherently sequential
    • Dot-product attention
    • Scaled dot-product attention slide 42
    • Self-attention in the encoder
    • Multi-head attention
    • Residual connection and LayerNorm, LayerNorm(x + SubLayer(x))
    • LayerNorm normalizes the input to mean 0 and variance 1, per layer and per training point (and adds two learnable parameters)
    • Positional encoding
    • Decoder
      • Masked decoder self-attention on previously generated outputs
      • Encoder-Decoder Attention, where queries come from previous decoder layer and keys and values come from output of encoder
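The attention pieces above fit in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention with an optional boolean mask (the decoder's causal mask is the lower-triangular case); the function name and mask convention are my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,
    with an optional mask blocking disallowed positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~0 weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

For masked decoder self-attention, `mask = np.tril(np.ones((n, n), dtype=bool))`, so position i can only attend to positions ≤ i. Multi-head attention runs this in parallel over several learned projections of Q, K, V and concatenates the results.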
  • Tips and tricks of the Transformer

    • Byte-pair encodings
    • Checkpoint averaging
    • Adam optimizer
    • Label smoothing
    • Dropout during training at every layer just before adding residual
    • Auto-regressive decoding with beam search and length penalties
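Byte-pair encoding builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. A toy sketch (my own simplified version, operating on a word-frequency dictionary):

```python
from collections import Counter

def bpe_merges(word_counts, n_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent symbol pair across the corpus."""
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for syms, c in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for syms, c in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1]); i += 2
                else:
                    out.append(syms[i]); i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges, vocab
```

Real implementations also mark word boundaries and apply the learned merge list to new text at inference time.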
  • BERT

    • Masked LM
    • Next sentence prediction
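BERT's masked-LM corruption picks roughly 15% of positions; of those, 80% are replaced by [MASK], 10% by a random token, and 10% left unchanged. A sketch with my own helper name:

```python
import random

def mask_tokens(tokens, vocab, p=0.15, rng=random.Random(0)):
    """BERT-style masked-LM corruption (illustrative sketch).
    Returns the corrupted sequence and a map index -> original token."""
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok                 # the model must predict this
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"            # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)   # 10%: replace with a random token
            # else 10%: keep the original token
    return out, targets
```

The loss is computed only at the selected positions, which is what makes the objective a cloze task rather than left-to-right language modeling.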

Suggested Readings

  1. Contextual Word Representations: A Contextual Introduction
    • WordNet
      a lexical database that stores words and the relationships among them, such as synonymy (when two words can mean the same thing) and hyponymy (when one word's meaning is a more specific case of another's)
    • Cautionary Notes
      • Word vectors are biased
      • Language is a lot more than words
      • NLP is not a single problem