CS224N Lecture 12 Notes

Published 2020-07-14  575 views

Information from parts of words: Subword Models


  • Human language sounds: Phonetics and Phonology

  • Morphology: Parts of words

    • traditionally, morphemes are treated as the smallest semantic unit
    • an easy alternative is to work with character n-grams
  • Subword models: two trends

    • same architecture as for word-level models, but with smaller units: "word pieces"
    • Hybrid architecture: main model has words; something else for characters
  • Byte Pair Encoding

    • originally a compression algorithm:
      Most frequent byte pair → a new byte
    • A word segmentation algorithm:
      • start with a vocabulary of characters
      • most frequent n-gram pairs → a new n-gram
    • Set a target vocabulary size and stop merging once it is reached
    • Do deterministic longest piece segmentation of words
    • Segmentation happens only within words identified by some prior tokenizer (commonly the Moses tokenizer for MT)
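
The BPE steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`count_pairs`, `merge_pair`, `learn_bpe`, `segment`) and the toy word counts are assumptions made for the example, ties between equally frequent pairs are broken by first occurrence, and no end-of-word marker is used. Segmentation follows the "longest piece" description in these notes; practical toolkits often replay the learned merges in order instead.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, words):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, target_vocab_size):
    """Start from characters; merge the most frequent pair until the target size."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    while len(vocab) < target_vocab_size:
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        words = merge_pair(best, words)
        merges.append(best)
        if best[0] + best[1] not in vocab:
            vocab.append(best[0] + best[1])
    return vocab, merges


def segment(word, vocab):
    """Greedy left-to-right longest-piece segmentation of a single word."""
    known = set(vocab)
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in known:
                break
        else:
            j = i + 1  # unseen character: emit it as its own piece
        pieces.append(word[i:j])
        i = j
    return pieces


# Toy word counts, assumed to come from an already tokenized corpus.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab, merges = learn_bpe(corpus, target_vocab_size=13)
print(merges)
print(segment("lowest", vocab))
```

With ten distinct characters in the toy corpus, a target size of 13 allows three merges. The unseen word "lowest" is then split into pieces that were all learned from other words.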
  • Wordpiece/Sentencepiece model

    • use a greedy approximation to maximizing language model log likelihood to choose the pieces
    • add n-gram that maximally reduces perplexity
    • Wordpiece model tokenizes inside words
    • Sentencepiece model works from raw text
  • FastText embedding

    • a next-generation, efficient word2vec-like embedding that works better for rare words and for languages with rich morphology
    • an extension of the word2vec skip-gram model with character n-grams
    • represent a word as character n-grams augmented with boundary symbols, plus the whole word:
      where = <wh, whe, her, ere, re>, <where>
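
The n-gram decomposition above can be reproduced with a short helper. This is a sketch for illustration only: `char_ngrams` and its parameters `n_min` and `n_max` are hypothetical names, not part of the fastText API, and the default range of 3 to 6 is an assumption for the example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return boundary-marked character n-grams plus the whole-word token."""
    wrapped = f"<{word}>"  # "<" and ">" mark the word boundaries
    grams = [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]
    if wrapped not in grams:
        grams.append(wrapped)  # the whole word is kept as its own unit
    return grams


# Trigrams only, matching the "where" example in the notes.
print(char_ngrams("where", n_min=3, n_max=3))
```

Because of the boundary symbols, the trigram "her" taken from inside "where" stays distinct from the whole-word token "<her>".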