Basic Terms
==============

* normalization

  * scale values into [0, 1]

* standardization

  * rescale to mean 0, standard deviation 1

* inductive bias

  * the assumptions the model makes, such as "the decision boundary is linear"

* entropy

  Measures the expected amount of information (in bits) needed to represent an event.
  For example, the outcome "heads" from a fair coin can be encoded with a single bit
  (-log_2(0.50) = 1), and we expect to need it 50% of the time, so it contributes
  -log_2(0.50) * 0.50 = 0.5 bits; summing the same contribution for "tails" gives an
  entropy of 1 bit, i.e. the average number of bits needed to encode a flip.

* information gain

  The expected reduction in entropy caused by partitioning the data on an attribute a
  (e.g. "weather"). High gain means a large reduction in entropy. (A worked computation
  appears in the sketches below.)

  .. math::

     \text{Gain}(\text{data}, a) = \text{Entropy}(\text{data}) - \sum_{v \in \text{values}(a)} \frac{|\text{data}_v|}{|\text{data}|} \, \text{Entropy}(\text{data}_v)

* representation learning

  Learning algorithms that automatically learn useful feature representations.

* accuracy

  (tp + tn) / (tp + tn + fp + fn)

* precision

  % of predicted positives that are actually positive

* recall

  % of actual positives predicted to be positive

* F-score

  The weighted harmonic mean of precision and recall (sketch below):

  .. math::

     F = \frac{1}{a \cdot \frac{1}{P} + (1 - a) \cdot \frac{1}{R}}

* F1-score

  The F-score with a = 0.5, which simplifies to 2PR / (P + R).

* bias

  The set of possible models reachable with your configuration (i.e. the hypothesis space).

* k-fold cross validation for hyperparam selection

  For each hyperparameter combination, run k-fold cross-validation and average the
  validation error across folds; pick the combination with the lowest average error
  (sketch below).

* error decomposition

  The error can be decomposed into estimation error (the bracketed term) and
  approximation error (the parenthesized term):

  .. math::

     \text{Error}(f) = [\text{Error}(f) - \text{min. possible error}] + (\text{min. possible error})

Exercises
-------------

* Derive entropy and information gain; compute both on sample data
* Derive F1
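
Sketches
-------------

The sketches below are minimal Python illustrations of the terms above; all data,
variable names, and parameter choices are invented for illustration. First, the two
feature scalings (normalization into [0, 1] and standardization to mean 0, std 1),
applied column-wise with numpy:

.. code-block:: python

   import numpy as np

   X = np.array([[1.0, 200.0],
                 [2.0, 300.0],
                 [3.0, 400.0]])

   # Normalization: rescale each column into [0, 1].
   X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

   # Standardization: shift and scale each column to mean 0, std 1.
   X_std = (X - X.mean(axis=0)) / X.std(axis=0)

   print(X_norm)
   print(X_std)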
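
A worked sketch of entropy and information gain on a toy "weather" dataset (rows and
labels are made up); entropy here is the usual -sum of p * log2(p) over label values:

.. code-block:: python

   from collections import Counter
   from math import log2

   def entropy(labels):
       # Entropy(D) = -sum_v p_v * log2(p_v) over the label values v in D.
       n = len(labels)
       return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

   def information_gain(rows, labels, attribute):
       # Gain(D, a) = Entropy(D) - sum_v |D_v| / |D| * Entropy(D_v).
       n = len(labels)
       gain = entropy(labels)
       for value in set(row[attribute] for row in rows):
           subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
           gain -= (len(subset) / n) * entropy(subset)
       return gain

   # Toy data, invented for illustration.
   rows = [{"weather": "sunny"}, {"weather": "sunny"},
           {"weather": "rainy"}, {"weather": "rainy"},
           {"weather": "overcast"}, {"weather": "overcast"}]
   labels = ["no", "no", "no", "yes", "yes", "yes"]

   print(entropy(labels))                            # 1.0 bit for a 50/50 label split
   print(information_gain(rows, labels, "weather"))  # 1 - (2/6*0 + 2/6*1 + 2/6*0) = 0.667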
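
Accuracy, precision, recall, and the weighted F-score computed from hypothetical
confusion-matrix counts; setting a = 0.5 recovers F1 = 2PR / (P + R):

.. code-block:: python

   def f_score(precision, recall, a=0.5):
       # Weighted harmonic mean: 1 / (a * 1/P + (1 - a) * 1/R).
       return 1.0 / (a / precision + (1 - a) / recall)

   # Hypothetical confusion-matrix counts.
   tp, fp, fn, tn = 8, 2, 4, 86

   accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 0.94
   precision = tp / (tp + fp)                    # 0.80: predicted positives that are truly positive
   recall    = tp / (tp + fn)                    # ~0.67: actual positives that were found
   f1        = f_score(precision, recall)        # a = 0.5 gives 2PR / (P + R) = ~0.73

   print(accuracy, precision, recall, f1)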
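
Finally, k-fold cross-validation for hyperparameter selection. This sketch assumes
scikit-learn, synthetic regression data, and a ridge model with an assumed alpha grid;
any model and error metric could stand in:

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import Ridge
   from sklearn.metrics import mean_squared_error
   from sklearn.model_selection import KFold

   # Synthetic regression data, invented for illustration.
   rng = np.random.default_rng(0)
   X = rng.normal(size=(100, 5))
   y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

   candidates = [0.01, 0.1, 1.0, 10.0]              # assumed hyperparameter grid
   kfold = KFold(n_splits=5, shuffle=True, random_state=0)

   avg_error = {}
   for alpha in candidates:
       fold_errors = []
       for train_idx, val_idx in kfold.split(X):
           model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
           fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
       avg_error[alpha] = np.mean(fold_errors)      # average validation error over the k folds

   best = min(avg_error, key=avg_error.get)         # combination with the lowest average error
   print(avg_error, "-> best alpha:", best)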