The probability of a sentence? Recurrent Neural Networks and Language Models
Slides
- Language Model
  - the task of predicting what word comes next; assigns a probability to a piece of text
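A language model assigns a probability to a whole piece of text via the chain rule (a standard formulation; the $x^{(t)}$ notation here is my own reconstruction, not copied from the slides):

$$P\left(x^{(1)}, \dots, x^{(T)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}\right)$$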
- n-gram Language Model (pre-deep-learning)
  - an n-gram is a chunk of n consecutive words
  - to handle the sparsity problem: smoothing; backoff (see the toy example below)
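A minimal count-based trigram model to make the sparsity issue concrete (my own toy sketch; the corpus and names are made up):

```python
from collections import defaultdict, Counter

def train_trigram_lm(tokens):
    """Count trigrams keyed by their two-word prefix."""
    trigram_counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(w1, w2)][w3] += 1
    return trigram_counts

def prob(trigram_counts, w1, w2, w3):
    """P(w3 | w1, w2) by maximum likelihood; zero if the prefix or word was never seen."""
    counts = trigram_counts.get((w1, w2))
    if not counts:
        return 0.0          # sparsity problem: unseen prefix -> no estimate at all
    return counts[w3] / sum(counts.values())

tokens = "the students opened their books the students opened their minds".split()
lm = train_trigram_lm(tokens)
print(prob(lm, "students", "opened", "their"))  # 1.0
print(prob(lm, "opened", "their", "exams"))     # 0.0 -> needs smoothing / backoff
```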
- neural Language Model
  - a fixed-window neural Language Model (like the NER model in lecture 03)
  - RNN (applies the same weights W repeatedly; this symmetry lets it handle any input length)
- Evaluating Language Models
  - the standard evaluation metric is perplexity: equal to the exponential of the cross-entropy loss, exp(J(θ)); lower perplexity is better
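Spelled out (the standard definition consistent with the exp(J(θ)) statement above; the typesetting is mine):

$$\text{perplexity} = \left( \prod_{t=1}^{T} \frac{1}{P_{\mathrm{LM}}\left(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\right)} \right)^{1/T} = \exp\big(J(\theta)\big)$$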
- Recurrent Neural Network
  - takes sequential input of any length
  - applies the same weights on each step
  - can optionally produce output on each step
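A bare-bones RNN language-model forward pass in NumPy, just to show the same weights being reused at every time step (shapes and names are my own toy choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10, 8, 16

# One set of weights, reused at every time step (the "symmetry" of an RNN).
E   = rng.normal(size=(vocab_size, embed_dim))            # word embeddings
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_e = rng.normal(size=(hidden_dim, embed_dim)) * 0.1
U   = rng.normal(size=(vocab_size, hidden_dim)) * 0.1
b1, b2 = np.zeros(hidden_dim), np.zeros(vocab_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_forward(token_ids):
    """Return the predicted next-word distribution at each step."""
    h = np.zeros(hidden_dim)                    # initial hidden state h^(0)
    outputs = []
    for t in token_ids:                         # any sequence length works
        e_t = E[t]
        h = np.tanh(W_h @ h + W_e @ e_t + b1)   # h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b1)
        outputs.append(softmax(U @ h + b2))     # distribution over the vocabulary
    return outputs

print(len(rnn_lm_forward([1, 4, 2])))  # 3 output distributions, one per input token
```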
Notes
- Keyphrases: Language Models; RNN; Bi-directional RNN; GRU; LSTM; Deep RNN
- n-gram Language Models
  - sparsity problems: use smoothing; use backoff; increasing n makes sparsity worse, so typically n <= 5
  - storage problems: increasing n increases model size
- Window-based Neural Language Model
  - example by Bengio (a classic paper)
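A sketch of the fixed-window idea in the spirit of Bengio et al.'s model (my own toy NumPy version, not the paper's exact architecture); note the fixed context size and the lack of weight sharing across positions, in contrast to the RNN above:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, embed_dim, hidden_dim, window = 10, 8, 16, 4

E = rng.normal(size=(vocab_size, embed_dim))               # embedding matrix
W = rng.normal(size=(hidden_dim, window * embed_dim)) * 0.1
U = rng.normal(size=(vocab_size, hidden_dim)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fixed_window_lm(context_ids):
    """Predict a next-word distribution from exactly `window` context words."""
    assert len(context_ids) == window                 # fixed window: longer context is impossible
    e = np.concatenate([E[i] for i in context_ids])   # concatenated embeddings
    h = np.tanh(W @ e)                                 # hidden layer
    return softmax(U @ h)                              # next-word distribution

print(fixed_window_lm([3, 1, 4, 1]).shape)  # (10,)
```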
- Recurrent Neural Network (RNN)
  - RNN loss and perplexity, written as formulas (see below)
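The per-step loss and overall objective behind that perplexity number (the standard cross-entropy formulation; my reconstruction, since the note only names the formulas):

$$J^{(t)}(\theta) = -\log \hat{y}^{(t)}_{x_{t+1}}, \qquad J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta), \qquad \text{perplexity} = \exp\big(J(\theta)\big)$$

where $\hat{y}^{(t)}$ is the RNN's predicted distribution at step $t$ and $x_{t+1}$ is the true next word.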
Suggested Readings
- N-gram Language Models (textbook chapter)
  - extrinsic evaluation (end-to-end)
  - perplexity
  - out-of-vocabulary (OOV) words: open vocabulary with <UNK> for unknown words; note that an LM can achieve artificially low perplexity by choosing a small vocabulary and assigning the unknown word a high probability
  - Smoothing
    - Laplace smoothing (add-one smoothing); discounted probability
    - Add-k smoothing
    - backoff
    - interpolation: mix the probability estimates from all the n-gram estimators (a toy sketch of add-k smoothing and interpolation follows this list)
    - Katz backoff: relies on discounted probabilities so the total probability mass does not exceed 1; often combined with a smoothing method called Good-Turing
    - Kneser-Ney smoothing (the most commonly used): assumes that words that appear in more contexts are more likely to appear in new contexts
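A toy illustration of add-k smoothing and linear interpolation for a bigram model (my own sketch; the corpus, counts, and λ value are made up, and this is neither Katz backoff nor Kneser-Ney):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat lay on the rug".split()
vocab = sorted(set(tokens))
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_unigram(w):
    return unigrams[w] / len(tokens)

def p_bigram_addk(prev, w, k=0.5):
    """Add-k smoothed bigram: every count gets +k, with k*|V| added to the denominator."""
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * len(vocab))

def p_interpolated(prev, w, lam=0.7):
    """Linear interpolation: mix bigram and unigram estimates (lambda chosen arbitrarily)."""
    return lam * p_bigram_addk(prev, w) + (1 - lam) * p_unigram(w)

print(round(p_bigram_addk("the", "cat"), 3))   # seen bigram
print(round(p_bigram_addk("cat", "mat"), 3))   # unseen bigram still gets nonzero probability
print(round(p_interpolated("cat", "mat"), 3))
```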
- The Unreasonable Effectiveness of Recurrent Neural Networks
  - blog post overview; the RNN's ability to capture textual structure is impressive: LaTeX markup, Linux source code, etc.
- Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.1 and 10.2)
  - covers several variants of how an RNN passes the previous state into the next hidden state; did not read in detail
- On Chomsky and the Two Cultures of Statistical Learning
  - some musings on statistical models; did not read closely