Basic Terms#

  • normalization
    • scale values into [0,1] via min-max scaling (see the sketch after this list)

  • standardization
    • rescale to mean 0, std 1 by subtracting the mean and dividing by the standard deviation

  • inductive bias
    • the assumptions a model builds in so it can generalize, e.g. “the decision boundary is linear”

  • Entropy

    measures the expected amount of information (in bits) needed to encode the outcome of a random event:

    \[H(X) = - \sum_{x} p(x) \log_2 p(x)\]

    For example, each outcome of a fair coin flip takes -log_2(0.50) = 1 bit to encode and occurs 50% of the time, so it contributes -log_2(0.50) * 0.50 = 0.5 bits to the sum; adding the two outcomes gives an entropy of 1 bit, i.e. the average number of bits needed to encode a flip (see the sketch after this list).

  • Information gain

    Expected reduction in entropy caused by partitioning the data on an attribute a, e.g. “weather”; for each value v of a, data_v denotes the subset of examples with a = v

    High gain means a large reduction in entropy

    \[\text{Gain}(\text{data}, a) = \text{Entropy}(\text{data}) - \sum_{v \in \text{values}(a)} \frac{|\text{data}_v|}{|\text{data}|} \cdot \text{Entropy}(\text{data}_v)\]
  • representation learning

    Learning algorithms that automatically learn useful feature representations

  • Accuracy

    (tp + tn) / (tp + tn + fp + fn)

  • precision

    % of predicted positives that are actually positive

  • recall

    % of actual positives predicted to be positive

  • F-score

    weighted harmonic mean of precision and recall (see the sketch after this list)

    \[F = \frac{1}{a \cdot \frac{1}{P} + (1 - a) \cdot \frac{1}{R}}\]
  • F1-score

    the F-score with a = 0.5, which simplifies to 2PR / (P + R)

  • bias

    determined by the set of possible models your configuration can express; a more restrictive set of models means higher bias

  • k-fold cross validation for hyperparam selection

    for each hyperparameter combination, run k-fold cross-validation and compare the average validation error across folds to pick the best combination (see the sketch after this list)

  • the error can be decomposed into [estimation error] and (approximation error), where the minimum possible error is the lowest error any model in the model class can achieve:

    \[\text{error}(f) = [\text{error}(f) - \text{min possible error}] + (\text{min possible error})\]
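
The sketches below are minimal Python illustrations of several of the terms above; all data, attribute names, and parameter values in them are made-up assumptions, not material from these notes.

First, normalization (min-max scaling) and standardization (z-score):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# normalization (min-max scaling): map values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# standardization (z-score): subtract the mean, divide by the std -> mean 0, std 1
x_std = (x - x.mean()) / x.std()

print(x_norm)                      # [0.   0.25 0.5  1.  ]
print(x_std.mean(), x_std.std())   # ~0.0 and 1.0
```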
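
Entropy and information gain, computed on a toy “weather” dataset (the same computation the exercises below ask for):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_i p_i * log2(p_i) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the whole set minus the size-weighted entropy of each
    partition induced by the attribute's values."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Made-up toy data: play tennis or not, given the weather.
rows = [{"weather": "sunny"}, {"weather": "sunny"},
        {"weather": "rainy"}, {"weather": "overcast"}]
labels = ["no", "no", "yes", "yes"]

print(entropy(labels))                            # 1.0 bit (2 "yes" / 2 "no")
print(information_gain(rows, labels, "weather"))  # 1.0: weather splits the labels perfectly
```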
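
Accuracy, precision, recall, and the F-score, computed directly from confusion-matrix counts:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # fraction of predicted positives that are actually positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positives that are predicted positive
    return tp / (tp + fn)

def f_score(p, r, a=0.5):
    """Weighted harmonic mean of precision and recall; a = 0.5 gives F1."""
    return 1.0 / (a * (1.0 / p) + (1.0 - a) * (1.0 / r))

tp, tn, fp, fn = 40, 30, 10, 20      # made-up confusion-matrix counts
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn))      # 0.7
print(p, r)                          # 0.8, ~0.667
print(f_score(p, r))                 # F1 = 2PR / (P + R) ~= 0.727
```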
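
k-fold cross-validation for hyperparameter selection, here with closed-form ridge regression standing in as the model and a made-up lambda grid:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_mse(X, y, lam, k=5):
    """Average validation MSE over k folds for one hyperparameter value."""
    idx = np.random.RandomState(0).permutation(len(y))  # same folds for every lam
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Made-up regression data.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(100)

# For each hyperparameter value, run k-fold CV and keep the lowest average error.
grid = [0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: kfold_mse(X, y, lam))
print("best lambda:", best_lam)
```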

Exercises#

  • Derive entropy, information gain, compute on sample data

  • Derive F1