Information from parts of words: Subword Models
Slides
-
Human language sounds: Phonetics and Phonology
-
Morphology: Parts of words
- traditionally, morphemes are the smallest semantic units
- an easy alternative is to work with character n-grams (see the sketch below)
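A minimal sketch of extracting such n-grams; the choice of n = 3 here is illustrative, not from the slides:

```python
def char_ngrams(word, n=3):
    """All character n-grams of a word (no boundary marks yet)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("morphology"))
# ['mor', 'orp', 'rph', 'pho', 'hol', 'olo', 'log', 'ogy']
```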
-
Subword models: two trends
- same architecture as a word-level model, but using smaller units: "word pieces"
- Hybrid architecture: main model has words; something else for characters
-
Byte Pair Encoding
- originally a compression algorithm: most frequent byte pair → a new byte
- as a word segmentation algorithm:
- start with a vocabulary of characters
- most frequent n-gram pair → a new n-gram
- have a target vocabulary size and stop when it is reached
- do deterministic longest-piece segmentation of words (see the sketch after this list)
- segmentation happens only within words identified by some prior tokenizer (commonly the Moses tokenizer for MT)
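A minimal sketch of both steps, learning merges and then segmenting; the toy word-frequency table and the merge count are invented for illustration:

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(pair, vocab):
    """Rewrite every word, fusing the chosen pair into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

def segment(word, pieces):
    """Deterministic longest-piece-first segmentation of one word."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # unknown character: keep it as-is
            i += 1
    return out

# toy word-frequency table; words are pre-split into characters
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
num_merges = 8  # stand-in for "stop at the target vocabulary size"
merges = []
for _ in range(num_merges):
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = apply_merge(best, vocab)
    merges.append(best)

pieces = {"".join(m) for m in merges} | set("lowernwidst")
print(segment("lowest", pieces))  # e.g. ['low', 'est']
```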
-
Wordpiece/Sentencepiece model
- uses a greedy approximation to maximizing the language-model log-likelihood to choose the pieces
- i.e., add the n-gram that maximally reduces perplexity
- Wordpiece model tokenizes inside words
- Sentencepiece model works from raw text; whitespace is retained as a special marker (see the usage sketch below)
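A hedged usage sketch with Google's sentencepiece library; the corpus file, model prefix, and vocab_size are placeholder choices, and the printed pieces depend entirely on the training corpus:

```python
import sentencepiece as spm

# train on a plain-text corpus; 'corpus.txt' and vocab_size=8000 are
# placeholders ('unigram' is the library's default model type)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_demo",
    vocab_size=8000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("subword models work from raw text", out_type=str))
# e.g. ['▁sub', 'word', '▁models', '▁work', '▁from', '▁raw', '▁text']
# '▁' marks where a word started, so the raw text is recoverable
```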
-
FastText embedding
- a next-generation efficient word2vec-like embedding: better for rare words and for languages with rich morphology
- an extension of the word2vec skip-gram model with character n-grams
- represent a word as its character n-grams, augmented with boundary symbols, plus the whole word itself (see the sketch below):
  where = <wh, whe, her, ere, re>, <where>
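A minimal sketch of that representation; the random embedding table is a stand-in for learned skip-gram vectors (real fastText uses n-gram lengths 3 to 6 and hashes n-grams into a fixed number of buckets):

```python
import numpy as np

def fasttext_ngrams(word, n=3):
    """Boundary-marked character n-grams plus the whole marked word."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(fasttext_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

# toy embedding table standing in for learned vectors; the word vector
# is the sum of the vectors of its n-grams and of the whole word
rng = np.random.default_rng(0)
table = {g: rng.normal(size=8) for g in fasttext_ngrams("where")}
where_vec = sum(table[g] for g in fasttext_ngrams("where"))
```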