N Grams
Contents
N Grams#
Corpus:
<s> I like peas </s>
<s> You like peas </s>
left: first word
top: second word
<s> |
I |
like |
peas |
You |
</s> |
|
---|---|---|---|---|---|---|
<s> |
0 |
1 |
0 |
0 |
1 |
0 |
I |
0 |
0 |
1 |
0 |
0 |
0 |
like |
0 |
0 |
0 |
2 |
0 |
0 |
peas |
0 |
0 |
0 |
0 |
0 |
2 |
You |
0 |
0 |
1 |
0 |
0 |
0 |
Probabilities (divide each element by sum of row)
<s> |
I |
like |
peas |
You |
</s> |
|
---|---|---|---|---|---|---|
<s> |
0 |
1/2 |
0 |
0 |
1/2 |
0 |
I |
0 |
0 |
1 |
0 |
0 |
0 |
like |
0 |
0 |
0 |
1 |
0 |
0 |
peas |
0 |
0 |
0 |
0 |
0 |
1 |
You |
0 |
0 |
1 |
0 |
0 |
0 |
Probability of I like peas =
P(I | <s>) * P(like | I) * P(peas | like) * P(</s> | peas)
= 0.5 * 1 * 1 * 1 = 0.5
Generally express with logs, log(0.5)
Smoothing#
Having 0 probabilities for some n-grams is bad as they could make the test set occur with 0 probability
Lets smooth those probabilities!
Laplace Smoothing#
Add 1 or k to every probability cell
Bigrams:
Recover adjusted counts#
We just modified the probability distribution. What does this say about the actual counts of each n gram?
Recover counts by multiplying by sum of row.
Backoff#
If trigram doesn’t exist, use bigrams, if they don’t exist, unigrams, etc.
Stupid backoff#
Just backs off and keeps multiplying by 0.4