N Grams

Probabilities: divide each cell of the bigram count table by the sum of its row, which gives P(word | previous word).
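A minimal sketch of the row normalization, assuming a hypothetical two-sentence corpus (not the original data) chosen so that P(I | <s>) comes out as 0.5. The names `bigram_counts` and `row_normalize` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (an assumption, not the original data).
corpus = ["I like peas", "you like peas"]

def bigram_counts(sentences):
    """Count bigrams, padding each sentence with <s> and </s>."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1  # row = previous word, column = word
    return counts

def row_normalize(counts):
    """Divide each cell by the sum of its row to get P(word | prev)."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {word: c / total for word, c in row.items()}
    return probs

probs = row_normalize(bigram_counts(corpus))
print(probs["<s>"])   # distribution over sentence-initial words
print(probs["like"])  # distribution over words following "like"
```

Each row of `probs` sums to 1, so every row is a proper conditional distribution.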

Probability of "I like peas" =

P(I | <s>) * P(like | I) * P(peas | like) * P(</s> | peas)

= 0.5 * 1 * 1 * 1 = 0.5

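Generally work with logs: add the log probabilities instead of multiplying the probabilities, which avoids numerical underflow on long sentences. Here the result is log(0.5).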
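A small sketch of scoring the sentence in log space, assuming the natural log and reusing the four bigram probabilities from the example. The names `bigram_probs` and `sentence_log_prob` are made up for illustration.

```python
import math

# Bigram probabilities taken from the "I like peas" example.
bigram_probs = {
    ("<s>", "I"): 0.5,
    ("I", "like"): 1.0,
    ("like", "peas"): 1.0,
    ("peas", "</s>"): 1.0,
}

def sentence_log_prob(sentence, probs):
    """Sum log P(word | prev) over the padded sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = probs.get((prev, word), 0.0)
        if p == 0.0:
            # An unseen bigram gives the whole sentence probability 0.
            return float("-inf")
        total += math.log(p)
    return total

log_p = sentence_log_prob("I like peas", bigram_probs)
print(log_p)            # equals math.log(0.5)
print(math.exp(log_p))  # back in probability space
```

The `-inf` branch shows why unseen bigrams are a problem: one zero wipes out the whole sentence.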

Smoothing

Having 0 probabilities for some n-grams is bad: if the test set contains even one n-gram that never appeared in training, the whole test set gets probability 0.

Let's smooth those probabilities!

Add 1 or k to every cell of the count table, then renormalize each row.

Bigrams:

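\[P_{add_k}(w_n | w_{n-1}) = \frac{Count(w_{n-1} w_n) + k}{Count(w_{n-1}) + kV}\]

Here V is the vocabulary size. Adding k to each of the V cells in a row adds kV to the row sum, which is why the denominator grows by kV.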
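A minimal sketch of add-k smoothing for bigrams, with k·V added to the denominator so that each row still sums to 1. The counts are hypothetical (an assumption, not the original table), and `add_k_prob` is an illustrative name.

```python
from collections import Counter

def add_k_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """P_add-k(word | prev) = (Count(prev word) + k) / (Count(prev) + k * V)."""
    numerator = bigram_counts[(prev, word)] + k  # unseen bigrams count as 0
    denominator = unigram_counts[prev] + k * vocab_size
    return numerator / denominator

# Hypothetical counts from a toy corpus: "I like peas", "you like peas".
bigram_counts = Counter({
    ("<s>", "I"): 1, ("<s>", "you"): 1,
    ("I", "like"): 1, ("you", "like"): 1,
    ("like", "peas"): 2, ("peas", "</s>"): 2,
})
unigram_counts = Counter({"<s>": 2, "I": 1, "you": 1, "like": 2, "peas": 2})
vocab = ["I", "you", "like", "peas", "</s>"]  # words that can follow a context

seen = add_k_prob("like", "peas", bigram_counts, unigram_counts, len(vocab))
unseen = add_k_prob("like", "I", bigram_counts, unigram_counts, len(vocab))
row_sum = sum(
    add_k_prob("like", w, bigram_counts, unigram_counts, len(vocab)) for w in vocab
)
print(seen, unseen, row_sum)
```

With k = 1 the seen bigram drops from 1 to 3/7, and each unseen bigram after "like" gets 1/7.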

We just modified the probability distribution. What does this say about the effective counts of each n-gram?

Recover the adjusted counts by multiplying each smoothed probability by the original (unsmoothed) sum of its row.
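Written out for add-k bigrams, the adjusted count c* (a symbol introduced here for illustration) is the smoothed probability times the original row sum Count(w_{n-1}):

\[c^{*}(w_{n-1} w_n) = P_{add_k}(w_n | w_{n-1}) \cdot Count(w_{n-1}) = \frac{(Count(w_{n-1} w_n) + k) \cdot Count(w_{n-1})}{Count(w_{n-1}) + kV}\]

As a hypothetical example (numbers assumed, not from the original table), take Count(like peas) = 2, Count(like) = 2, V = 5 and k = 1:

\[c^{*}(like\ peas) = \frac{(2 + 1) \cdot 2}{2 + 1 \cdot 5} = \frac{6}{7} \approx 0.86\]

The observed count of 2 shrinks to about 0.86. The difference is the mass that smoothing hands to the unseen bigrams in that row.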

Backoff: if a trigram was never seen, use the bigram instead; if the bigram was never seen either, use the unigram, and so on.

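Stupid backoff does just this and multiplies the score by a fixed 0.4 every time it backs off. The results are scores rather than true probabilities, since they are no longer normalized.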
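A rough sketch of backing off with a fixed 0.4 weight, assuming all n-gram orders are stored in one `Counter` keyed by word tuples. The token stream is hypothetical, it ignores sentence boundaries for brevity, and `count_ngrams` and `stupid_backoff_score` are illustrative names.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff weight from the notes

def count_ngrams(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a single token stream."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff_score(context, word, counts, total_tokens, alpha=ALPHA):
    """Score `word` after `context` (a tuple), shortening the context when unseen."""
    weight = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Relative frequency at the highest order that was observed.
            return weight * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest word
        weight *= alpha        # pay 0.4 for each backoff step
    # Unigram fallback: relative frequency of the word itself.
    return weight * counts[(word,)] / total_tokens

# Hypothetical token stream (assumption): two toy sentences joined together.
tokens = "I like peas you like peas".split()
counts = count_ngrams(tokens)

trigram_hit = stupid_backoff_score(("I", "like"), "peas", counts, len(tokens))
one_backoff = stupid_backoff_score(("peas", "like"), "peas", counts, len(tokens))
two_backoffs = stupid_backoff_score(("I", "like"), "you", counts, len(tokens))
print(trigram_hit, one_backoff, two_backoffs)
```

The three calls hit the trigram, bigram and unigram level respectively, so their weights are 1, 0.4 and 0.4 * 0.4.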