N-Grams
=========
Corpus::

    I like peas
    You like peas

Bigram counts (left: first word, top: second word):

.. list-table::
   :header-rows: 1
   :stub-columns: 1

   * -
     - <s>
     - I
     - like
     - peas
     - You
     - </s>
   * - <s>
     - 0
     - 1
     - 0
     - 0
     - 1
     - 0
   * - I
     - 0
     - 0
     - 1
     - 0
     - 0
     - 0
   * - like
     - 0
     - 0
     - 0
     - 2
     - 0
     - 0
   * - peas
     - 0
     - 0
     - 0
     - 0
     - 0
     - 2
   * - You
     - 0
     - 0
     - 1
     - 0
     - 0
     - 0
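A minimal sketch of how this count table can be built (``<s>``/``</s>``
are the sentence-boundary markers; the name ``counts`` is illustrative
and reused in the later sketches):

.. code-block:: python

   from collections import defaultdict

   corpus = ["I like peas", "You like peas"]

   # counts[first][second] = number of times the bigram (first, second)
   # occurs, with <s>/</s> padding each sentence.
   counts = defaultdict(lambda: defaultdict(int))
   for sentence in corpus:
       tokens = ["<s>"] + sentence.split() + ["</s>"]
       for first, second in zip(tokens, tokens[1:]):
           counts[first][second] += 1

   print(counts["<s>"]["I"])      # 1
   print(counts["like"]["peas"])  # 2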
Probabilities (divide each element by the sum of its row):

.. list-table::
   :header-rows: 1
   :stub-columns: 1

   * -
     - <s>
     - I
     - like
     - peas
     - You
     - </s>
   * - <s>
     - 0
     - 1/2
     - 0
     - 0
     - 1/2
     - 0
   * - I
     - 0
     - 0
     - 1
     - 0
     - 0
     - 0
   * - like
     - 0
     - 0
     - 0
     - 1
     - 0
     - 0
   * - peas
     - 0
     - 0
     - 0
     - 0
     - 0
     - 1
   * - You
     - 0
     - 0
     - 1
     - 0
     - 0
     - 0
Probability of "I like peas"::

    P(I | <s>) * P(like | I) * P(peas | like) * P(</s> | peas)
    = 0.5 * 1 * 1 * 1 = 0.5

In practice we sum log probabilities instead of multiplying, e.g.
log(0.5) + log(1) + log(1) + log(1) = log(0.5), which avoids underflow
on long sentences.
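Reusing ``counts`` from the sketch above, row normalization and the
log-space sentence score look roughly like this:

.. code-block:: python

   import math

   def bigram_prob(counts, first, second):
       # Divide each count by its row sum.
       return counts[first][second] / sum(counts[first].values())

   def sentence_log_prob(counts, sentence):
       tokens = ["<s>"] + sentence.split() + ["</s>"]
       # Sum log probabilities instead of multiplying raw ones, so long
       # sentences don't underflow to 0. (log(0) for an unseen bigram
       # still blows up -- hence smoothing, below.)
       return sum(math.log(bigram_prob(counts, a, b))
                  for a, b in zip(tokens, tokens[1:]))

   print(math.exp(sentence_log_prob(counts, "I like peas")))  # 0.5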
Smoothing
------------

Zero probabilities for unseen n-grams are bad: a single unseen n-gram
makes the whole test set occur with probability 0. Let's smooth those
probabilities!
Laplace Smoothing
******************

Add 1 (or more generally a pseudocount k) to every count cell. For
bigrams:

.. math::

   P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{\mathrm{Count}(w_{n-1}\,w_n) + k}{\mathrm{Count}(w_{n-1}) + kV}

where V is the vocabulary size.
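A sketch of add-k on the same bigram counts (whether the boundary
markers count toward ``V`` is a modeling choice; here the toy
vocabulary has 6 entries):

.. code-block:: python

   def add_k_prob(counts, vocab, first, second, k=1.0):
       # Every cell in the row gains +k, so the denominator grows by
       # k * |V| and the row still sums to 1 over the vocabulary.
       row_total = sum(counts[first].values())
       return (counts[first][second] + k) / (row_total + k * len(vocab))

   vocab = {"<s>", "I", "like", "peas", "You", "</s>"}
   print(add_k_prob(counts, vocab, "peas", "I"))  # (0 + 1) / (2 + 6) = 0.125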
Recover adjusted counts
*********************************

We just modified the probability distribution. What does this say about
the effective counts of each n-gram? Recover the adjusted counts by
multiplying each smoothed probability by its row's original total
count, as in the sketch below.
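Continuing the sketch, the adjusted count is just the smoothed
probability multiplied back by the original row total:

.. code-block:: python

   def adjusted_count(counts, vocab, first, second, k=1.0):
       # Shows how much mass smoothing took away from seen bigrams.
       row_total = sum(counts[first].values())
       return add_k_prob(counts, vocab, first, second, k) * row_total

   # "like peas" was seen twice, but add-1 shrinks its effective count:
   print(adjusted_count(counts, vocab, "like", "peas"))  # 2 * 3/8 = 0.75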
Backoff
----------

If the trigram doesn't exist, use the bigram; if that doesn't exist,
the unigram, and so on.
Stupid backoff
****************

Doesn't discount at all: just back off to the next lower order and
multiply by a fixed 0.4 at each step. The result is a score rather than
a true probability.
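A sketch of the scoring rule, assuming trigram/bigram/unigram count
tables stored as plain dicts (the names ``trigrams``, ``bigrams``,
``unigrams`` are illustrative):

.. code-block:: python

   def stupid_backoff(trigrams, bigrams, unigrams, w1, w2, w3, alpha=0.4):
       # Returns a score, not a true probability: fall back to the next
       # lower order whenever the count is 0, multiplying by alpha
       # (0.4 in Brants et al., 2007) at each backoff step.
       tri_row = trigrams.get((w1, w2), {})
       if tri_row.get(w3, 0) > 0:
           return tri_row[w3] / sum(tri_row.values())
       bi_row = bigrams.get(w2, {})
       if bi_row.get(w3, 0) > 0:
           return alpha * bi_row[w3] / sum(bi_row.values())
       return alpha * alpha * unigrams.get(w3, 0) / sum(unigrams.values())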