N Grams
Corpus:
<s> I like peas </s>
<s> You like peas </s>
Bigram counts (rows: first word, columns: second word):
| | <s> | I | like | peas | You | </s> |
|---|---|---|---|---|---|---|
| <s> | 0 | 1 | 0 | 0 | 1 | 0 | 
| I | 0 | 0 | 1 | 0 | 0 | 0 | 
| like | 0 | 0 | 0 | 2 | 0 | 0 | 
| peas | 0 | 0 | 0 | 0 | 0 | 2 | 
| You | 0 | 0 | 1 | 0 | 0 | 0 | 
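The count table can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the names `corpus`, `count_bigrams`, and `bigram_counts` are illustrative and not part of the original notes.

```python
from collections import Counter

# The two-sentence corpus from above, with sentence boundary markers included.
corpus = ["<s> I like peas </s>", "<s> You like peas </s>"]

def count_bigrams(sentences):
    """Count (first word, second word) pairs in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # zip pairs each token with the token that follows it.
        counts.update(zip(tokens, tokens[1:]))
    return counts

bigram_counts = count_bigrams(corpus)
print(bigram_counts[("like", "peas")])  # 2
print(bigram_counts[("<s>", "I")])      # 1
print(bigram_counts[("I", "peas")])     # 0 (never observed)
```

Only the observed pairs are stored; every other cell of the table is an implicit 0.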
Probabilities (divide each count by the sum of its row):
| | <s> | I | like | peas | You | </s> |
|---|---|---|---|---|---|---|
| <s> | 0 | 1/2 | 0 | 0 | 1/2 | 0 | 
| I | 0 | 0 | 1 | 0 | 0 | 0 | 
| like | 0 | 0 | 0 | 1 | 0 | 0 | 
| peas | 0 | 0 | 0 | 0 | 0 | 1 | 
| You | 0 | 0 | 1 | 0 | 0 | 0 | 
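As a sketch of the row normalization, the snippet below groups counts by first word and divides each count by its row total. It uses `fractions.Fraction` so the results match the table exactly; the function name `bigram_probabilities` is hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction

corpus = ["<s> I like peas </s>", "<s> You like peas </s>"]

def bigram_probabilities(sentences):
    """Return {first: {second: P(second | first)}} using maximum-likelihood estimates."""
    rows = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for first, second in zip(tokens, tokens[1:]):
            rows[first][second] += 1
    probs = {}
    for first, row in rows.items():
        row_total = sum(row.values())  # sum of the row in the count table
        probs[first] = {second: Fraction(c, row_total) for second, c in row.items()}
    return probs

probs = bigram_probabilities(corpus)
print(probs["<s>"]["I"])      # 1/2
print(probs["like"]["peas"])  # 1
```

Each row sums to 1, so every row is a probability distribution over the next word.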
Probability of I like peas =
P(I | <s>) * P(like | I) * P(peas | like) * P(</s> | peas)
= 0.5 * 1 * 1 * 1 = 0.5
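In practice this is computed in log space: sum the log probabilities instead of multiplying the probabilities, which avoids numerical underflow on long sentences. Here the result is log(0.5).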
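The same calculation in log space can be sketched as follows. The dictionary `bigram_prob` is copied by hand from the probability table, and the helper name `sentence_log_prob` is an assumption made for illustration.

```python
import math

# Nonzero entries of the probability table; any pair not listed has probability 0.
bigram_prob = {
    ("<s>", "I"): 0.5, ("<s>", "You"): 0.5,
    ("I", "like"): 1.0, ("You", "like"): 1.0,
    ("like", "peas"): 1.0, ("peas", "</s>"): 1.0,
}

def sentence_log_prob(sentence, probs):
    """Sum log P(second | first) over all bigrams, adding <s> and </s> markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for pair in zip(tokens, tokens[1:]):
        p = probs.get(pair, 0.0)
        if p == 0.0:
            # log(0) is undefined; treat an unseen bigram as log probability -inf.
            return float("-inf")
        total += math.log(p)
    return total

log_p = sentence_log_prob("I like peas", bigram_prob)
print(log_p)            # log(0.5), approximately -0.693
print(math.exp(log_p))  # approximately 0.5
```

A sentence containing any unseen bigram, such as "I like You", gets a log probability of negative infinity under this unsmoothed model.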
Smoothing#
Zero probabilities for unseen n-grams are a problem: a single unseen n-gram in the test set gives the whole test set a probability of 0.
Let's smooth those probabilities!
Laplace Smoothing#
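Add 1 (or a constant k) to every cell of the count table, then renormalize each row to get probabilities.
Bigrams:
P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + k) / (C(w_{n-1}) + k * V)
where C is a count, V is the vocabulary size (the number of columns in the table), and k = 1 gives Laplace (add-one) smoothing.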
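A minimal sketch of add-k smoothing for bigrams on the example corpus follows. It assumes the vocabulary is every token in the corpus, including `<s>` and `</s>`, so that V = 6 matches the six columns of the table; real implementations often treat the boundary markers differently. The names `add_k_bigram_probs` and `prob` are illustrative.

```python
from collections import Counter

corpus = ["<s> I like peas </s>", "<s> You like peas </s>"]

def add_k_bigram_probs(sentences, k=1):
    """Return (prob, V), where prob(first, second) is the add-k smoothed P(second | first)."""
    bigrams = Counter()
    row_totals = Counter()  # how often each word appears as the first word of a bigram
    vocab = set()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        for first, second in zip(tokens, tokens[1:]):
            bigrams[(first, second)] += 1
            row_totals[first] += 1
    V = len(vocab)  # assumption: boundary markers count as vocabulary items

    def prob(first, second):
        # Add k to the cell and k * V to the row sum so each row still sums to 1.
        return (bigrams[(first, second)] + k) / (row_totals[first] + k * V)

    return prob, V

prob, V = add_k_bigram_probs(corpus, k=1)
print(V)                     # 6
print(prob("like", "peas"))  # 0.375, down from 1 before smoothing
print(prob("like", "I"))     # 0.125, up from 0 before smoothing
```

Smoothing moves probability mass from seen bigrams to unseen ones: P(peas | like) drops from 1 to 3/8, and each of the five previously impossible continuations receives 1/8.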
Recover adjusted counts#
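We just modified the probability distribution. What does this say about the effective counts of each n-gram?
Recover the adjusted counts by multiplying each smoothed probability by the original (unsmoothed) sum of its row. For add-one smoothing of bigrams: c* = (c + 1) * C(w_{n-1}) / (C(w_{n-1}) + V).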
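As a worked example for the "like" row of the count table (row sum 2, six columns), the sketch below computes the adjusted counts under add-one smoothing. The function name `adjusted_count` and its parameters are hypothetical.

```python
def adjusted_count(count, row_total, vocab_size, k=1):
    """Adjusted count: smoothed probability times the original (unsmoothed) row sum."""
    smoothed_prob = (count + k) / (row_total + k * vocab_size)
    return smoothed_prob * row_total

# "like" row: the bigram "like peas" was seen twice, all other cells are 0.
print(adjusted_count(2, 2, 6))  # 0.75
print(adjusted_count(0, 2, 6))  # 0.25
```

The seen bigram's effective count shrinks from 2 to 0.75, while each of the five unseen bigrams gains 0.25, so the row still sums to 2.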
Backoff#
If a trigram has not been seen, fall back to the bigram; if the bigram has not been seen either, fall back to the unigram, and so on.
Stupid backoff#
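Use the relative frequency of the n-gram if it has been seen. Otherwise back off to the next shorter context and multiply the result by 0.4, applying the factor again at every further backoff step. The results are scores rather than true probabilities, because they are not normalized to sum to 1.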
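Stupid backoff can be sketched recursively as below, using the example corpus and n-grams up to trigrams. The helper names (`ngram_counts`, `stupid_backoff`) are illustrative, and the unigram base case uses the relative frequency of the word among all tokens, which is one common choice.

```python
from collections import Counter

corpus = ["<s> I like peas </s>", "<s> You like peas </s>"]
ALPHA = 0.4  # fixed backoff factor

def ngram_counts(sentences, max_n=3):
    """Count all n-grams of length 1..max_n, keyed by tuples of tokens."""
    counts = Counter()
    total_tokens = 0
    for sentence in sentences:
        tokens = sentence.split()
        total_tokens += len(tokens)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts, total_tokens

def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Score `word` given a tuple of preceding words. A score, not a probability."""
    if not context:
        # Base case: unigram relative frequency.
        return counts[(word,)] / total_tokens
    ngram = context + (word,)
    if counts[ngram] > 0:
        # Seen n-gram: plain relative frequency, no discounting.
        return counts[ngram] / counts[context]
    # Unseen n-gram: drop the oldest context word and scale by alpha.
    return alpha * stupid_backoff(word, context[1:], counts, total_tokens, alpha)

counts, total_tokens = ngram_counts(corpus)
print(stupid_backoff("peas", ("I", "like"), counts, total_tokens))     # 1.0 (trigram seen)
print(stupid_backoff("peas", ("peas", "like"), counts, total_tokens))  # 0.4 (one backoff to the bigram)
print(stupid_backoff("peas", ("You", "I"), counts, total_tokens))      # approximately 0.032 (two backoffs)
```

In the last call neither the trigram nor the bigram "I peas" was seen, so the score is 0.4 * 0.4 times the unigram frequency of "peas" (2 out of 10 tokens).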