Meta#

Zipf’s Law#

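\[\text{frequency} \times \text{rank} \approx \text{count}\]

\[\text{frequency} \approx \frac{\text{count}}{\text{rank}}\]

Here frequency is how often a word occurs, rank is its position when words are sorted by descending frequency, and count is a roughly constant value for the corpus (about the frequency of the rank-1 word).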
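
As a rough illustration of the relation above, the sketch below ranks words by frequency and reports the rank × frequency product for each. The helper name `rank_frequency` and the toy corpus are hypothetical; the corpus is built by hand so the product is exactly constant, which real text only approximates.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    # most_common() yields (word, frequency) pairs sorted by descending frequency
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Hypothetical toy corpus: frequencies are 12, 6, 4, 3, i.e. 12 / rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

for row in rank_frequency(tokens):
    print(row)
```

On a real corpus the last column drifts rather than staying fixed, so it is better read as a sanity check than as an exact law.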

Extrinsic - plug the model into a downstream system and measure how well that system performs

Intrinsic - evaluate the model's output directly, for example by asking humans to judge it

\[\text{P(label)} = \frac{\text{agree}}{\text{agree + disagree}}\]

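\[\kappa = \frac{\text{P(a)} - \text{E[a]}}{1 - \text{E[a]}}\]

Where P(a) is the agreement actually observed between the annotators and E[a] is the probability of agreeing on the label by chance. The result is the kappa statistic, a chance-corrected agreement score.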
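
A minimal sketch of both agreement formulas for two annotators follows. It assumes Cohen's formulation, in which the chance term E[a] is estimated from each annotator's own label proportions; the notes do not say how E[a] is obtained, so that choice, the function names, and the example labels are all assumptions.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """P(a): fraction of items on which the two annotators agree."""
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)

def expected_agreement(labels_a, labels_b):
    """E[a]: chance agreement from each annotator's label proportions (assumed)."""
    n = len(labels_a)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # A Counter returns 0 for labels the second annotator never used.
    return sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)

def cohen_kappa(labels_a, labels_b):
    """(P(a) - E[a]) / (1 - E[a]); undefined when E[a] equals 1."""
    p_a = observed_agreement(labels_a, labels_b)
    e_a = expected_agreement(labels_a, labels_b)
    return (p_a - e_a) / (1 - e_a)

# Hypothetical annotations for ten items.
ann_1 = ["pos"] * 5 + ["neg"] * 5
ann_2 = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos"]

print(observed_agreement(ann_1, ann_2))
print(cohen_kappa(ann_1, ann_2))
```

Here the annotators agree on 8 of 10 items and chance agreement is 0.5, so kappa comes out at about 0.6, noticeably lower than the raw agreement of 0.8.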

Perplexity is an intrinsic performance measure used for language models

“inverse probability of test set normalized by # of words”

“kind of like weighted branch factor of language”

Should only be used to compare models that share the same vocabulary

Lower perplexity is better

Equals 2 raised to the cross-entropy (measured in bits per word)

For a bigram model, it can be defined as:

\[\text{PP(w)} = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{p(w_i | w_{i-1})}}\]
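
The bigram formula can be sketched in a few lines by working in log space, which also makes the "2 to the cross-entropy" relation explicit. The function name, the `<s>` start token, and the probability tables are hypothetical; the sketch assumes every bigram in the test sequence has a nonzero probability in the table, so smoothing is out of scope.

```python
import math

def bigram_perplexity(tokens, bigram_prob):
    """Perplexity of tokens[1:], given a dict mapping (prev, word) -> probability.

    tokens[0] is treated as the start symbol, so N = len(tokens) - 1.
    """
    log2_sum = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        # Assumes the bigram is present with nonzero probability (no smoothing).
        log2_sum += math.log2(bigram_prob[(prev, word)])
        n += 1
    cross_entropy = -log2_sum / n  # average bits per word
    return 2 ** cross_entropy      # same value as the N-th root of the product

# Hypothetical bigram probabilities.
probs = {("<s>", "a"): 0.5, ("a", "b"): 0.25, ("b", "a"): 0.5}
print(bigram_perplexity(["<s>", "a", "b", "a"], probs))

# If every word is predicted with probability 1/4, the perplexity is 4,
# which matches the "branching factor" reading.
uniform = {("<s>", "a"): 0.25, ("a", "b"): 0.25, ("b", "a"): 0.25}
print(bigram_perplexity(["<s>", "a", "b", "a"], uniform))
```

Summing logarithms instead of multiplying inverse probabilities avoids numeric underflow or overflow on long test sets while giving the same result.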