Summary

Machine learning is applied statistics: the goal is to predict outcomes for unseen experiences by extrapolating from a set of observed experiences.

Machine Learning Basics


Machine learning is a form of applied statistics, with a decreased emphasis on confidence intervals and an increased emphasis on statistically estimating complicated functions.

Linear Regression Example

A simple algorithm for solving a regression problem: predict a scalar y from an input vector x ∈ Rⁿ via ŷ = wᵀx, where w ∈ Rⁿ is a vector of n parameters (weights).

The normal equation gives an analytical (closed-form) solution to the linear regression problem under a least-squares cost function: w = (XᵀX)⁻¹Xᵀy, where X is the design matrix of training inputs and y the vector of targets.
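A minimal NumPy sketch of the normal equation (the data and variable names here are illustrative, not from the source):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + noise
X = rng.normal(size=(100, 3))          # 100 examples, n = 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Normal equation: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w                          # predictions y_hat = w^T x for each example
print(w)                               # ~ [2.0, -1.0, 0.5]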

Capacity, Overfitting and Underfitting

In machine learning, we want to perform well on new, previously unseen inputs, not only those we trained on; this ability is called generalization.

Non-parametric models allow for arbitrarily high capacity, since their complexity can grow with the amount of training data (e.g., nearest-neighbor regression); see the sketch below.
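As a sketch of a non-parametric model (my own minimal example, not from the source), nearest-neighbor regression simply looks up the closest training point, so its effective capacity grows with the training set:

import numpy as np

def nn_regress(X_train, y_train, x):
    """Predict y for x as the y of the nearest training example."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(nn_regress(X_train, y_train, np.array([1.8])))  # -> 4.0 (nearest point is x = 2)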

The No Free Lunch Theorem

Logically, inferring general rules from a limited set of examples is not valid. The no free lunch theorem makes this precise: averaged over all possible data-generating distributions, every classification algorithm has the same error rate on previously unobserved points, so no algorithm is universally better than any other.

Regularization

The no free lunch theorem implies that a learning algorithm must be designed well for a specific task. So far the only tool discussed is changing capacity (thus affecting the hypothesis space); regularization instead expresses a preference for some solutions over others, e.g. weight decay adds a penalty λwᵀw to the cost function.
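Weight decay in closed form, as a sketch: adding λ‖w‖² to the least-squares cost changes the normal equation to w = (XᵀX + λI)⁻¹Xᵀy (the function and data below are illustrative):

import numpy as np

def ridge_fit(X, y, lam):
    """Least squares with weight decay (L2 regularization)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

print(ridge_fit(X, y, lam=0.0))   # plain least squares
print(ridge_fit(X, y, lam=10.0))  # larger lambda shrinks the weights toward 0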

Hyperparameters

A hyperparameter is not learned by the algorithm itself; it is set in advance, like the degree (capacity) of a polynomial regression or λ in weight decay, and is typically chosen using a held-out validation set.
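A sketch of choosing λ on a held-out validation set (a minimal illustration under assumed data; the split sizes and candidate values are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, -1.0]) + 0.5 * rng.normal(size=80)

# Hold out part of the data: the validation set never touches training
X_tr, y_tr = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

best = None
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:          # candidate hyperparameter values
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(5), X_tr.T @ y_tr)
    val_mse = np.mean((X_val @ w - y_val) ** 2)  # evaluate on held-out data
    if best is None or val_mse < best[0]:
        best = (val_mse, lam)

print("best lambda:", best[1])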

Estimators, Bias, Variance

Point estimation attempts to provide the single "best" prediction of some quantity of interest, such as a parameter θ; the estimate is denoted θ̂. The bias of an estimator is E[θ̂] − θ, and its variance measures how much the estimate varies across resamplings of the data.
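To make bias concrete, a quick simulation (my own sketch): the sample variance with denominator m is a biased estimator of σ², while denominator m − 1 is unbiased.

import numpy as np

rng = np.random.default_rng(3)
m, trials, sigma2 = 10, 100_000, 4.0

# Draw many datasets of size m from N(0, sigma^2) and average each estimator
samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, m))
biased = samples.var(axis=1, ddof=0)    # divides by m
unbiased = samples.var(axis=1, ddof=1)  # divides by m - 1

print(biased.mean())    # ~ 3.6 = (m-1)/m * sigma^2  -> biased
print(unbiased.mean())  # ~ 4.0                      -> unbiased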

Maximum Likelihood Estimation

The MLE can be simplified to finding the θ that maximizes the average log-likelihood of the training examples under the model: θ_ML = argmax_θ (1/m) Σᵢ log p_model(x⁽ⁱ⁾; θ).

In ML terms, the MLE minimizes the dissimilarity between the empirical distribution of the training set and the model distribution, as measured by the KL divergence; minimizing this KL divergence is equivalent to minimizing the cross-entropy between the two distributions.
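A tiny illustration (my own, not from the source): for coin flips, scanning θ for the maximum average log-likelihood recovers the sample frequency, which is the Bernoulli MLE.

import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # 6 heads out of 8 flips

thetas = np.linspace(0.01, 0.99, 99)
# Average log-likelihood of the data under Bernoulli(theta), for each theta
avg_ll = np.mean(
    data[:, None] * np.log(thetas) + (1 - data[:, None]) * np.log(1 - thetas),
    axis=0,
)

print(thetas[np.argmax(avg_ll)])  # ~ 0.75 = sample mean, the MLE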

Conditional Log-Likelihood and MSE
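If the model is p(y | x) = N(y; ŷ(x; w), σ²) with fixed σ, maximizing the conditional log-likelihood over the training set is the same as minimizing mean squared error. A quick numerical check of this equivalence (my own sketch, with assumed data):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=50)
sigma2 = 1.0

def neg_avg_cond_ll(w):
    """Negative average log N(y; Xw, sigma^2) -- an affine function of the MSE."""
    mse = np.mean((y - X @ w) ** 2)
    return 0.5 * np.log(2 * np.pi * sigma2) + mse / (2 * sigma2)

w_mse = np.linalg.solve(X.T @ X, X.T @ y)  # the MSE minimizer
# Moving w away from the MSE minimizer can only increase the negative log-likelihood
print(neg_avg_cond_ll(w_mse) <= neg_avg_cond_ll(w_mse + 0.1))  # True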

Supervised Learning

Supervised learning is the association of some input with some output, learned from a training set of example (x, y) pairs.

We can generalize linear regression to the classification scenario by defining a different family of probability distributions: squashing the linear output wᵀx into (0, 1) with the logistic sigmoid gives P(y = 1 | x), known as logistic regression.
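A minimal logistic-regression sketch, fit by gradient descent on the average negative log-likelihood (the data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)  # linearly separable labels

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w)              # P(y = 1 | x; w)
    grad = X.T @ (p - y) / len(y)   # gradient of the average negative log-likelihood
    w -= lr * grad

print(np.mean((sigmoid(X @ w) > 0.5) == y))  # training accuracy, ~1.0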

Support Vector Machines

Similar to logistic regression, the SVM is driven by a linear function wᵀx + b; unlike logistic regression, it outputs a class identity rather than a probability (positive class when wᵀx + b is positive, negative class otherwise).

Often the data doesn't divide nicely with a linear boundary. We can approach this by mapping the data into a higher-dimensional space.

We define φ to be a transformation function that maps the dataset X to a higher-dimensional dataset X′ (think R² to R³).

It turns out that SVM training and prediction can be written entirely in terms of pairwise dot products x⁽ⁱ⁾ᵀx⁽ʲ⁾, so we can replace each dot product with a kernel k(x⁽ⁱ⁾, x⁽ʲ⁾) = φ(x⁽ⁱ⁾)ᵀφ(x⁽ʲ⁾) and never explicitly work in the higher-dimensional space during training/testing. This is the kernel trick.
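A quick check of the kernel trick (my own sketch): for x, z ∈ R², the degree-2 polynomial kernel k(x, z) = (xᵀz)² equals φ(x)ᵀφ(z) for the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²), so we get the R³ dot product without ever constructing φ(x).

import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3 for the degree-2 polynomial kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Kernel: dot product in R^2, squared -- no R^3 vectors needed."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))  # 1.0
print(k(x, z))          # 1.0 -- identical, computed without phi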