Summary
Machine learning is applied statistics: predicting outcomes for unseen experiences by extrapolating from a set of observed experiences.
 the goal is to minimize generalization error: the error a model incurs on new test data, having been trained exclusively on a training data set
 overfitting and underfitting are undesirable outcomes of model training, and happen when the capacity of a model is not optimal
 overfitting is controlled by restricting the hypothesis space, dictating the model's capacity
 in parametric models such as linear regression, we have a fixed-length parameter (weight) vector
 nonparametric models allow for extreme fitting of data, since they have no fixed-length parameter vector
 the no free lunch theorem states that the average performance of all classification algorithms over all data sets is equal
 to design a learning algorithm, regularizers are used to specify preference of kinds of functions learned, controlling capacity
 regularization is any modification to the learning algorithm with the intent of reducing generalization error, not training error
Machine Learning Basics
Machine learning is a form of applied statistics, with less emphasis on proving confidence intervals and more on statistically estimating complicated functions.
 Formally, a program learns if its performance, as measured by P on some class of tasks T, improves with experience E
 in a formal sense, learning is not a task, it is our means of attaining the ability to perform a task

tasks describe how we should process examples, vectors of quantitative features
 Classification: specifying which of k categories some input belongs to
 Classification with missing inputs: for n inputs, we may need 2^n functions to be learned, one for each subset of missing inputs
 Regression: based on numeric input, learn a function that can predict outputs outside of the examples
 Transcription: observe relatively unstructured data, and transcribe it into discrete textual form
 Machine Translation: language translation (like English to French)
 Structured Output: any task with vector output (superset of transcription and machine translation)
 Anomaly Detection: the program sifts through a set of events and flags the unusual ones
 Synthesis: generating new examples, based on the training set
 Imputation of Missing Values: fill in missing inputs
 Denoising: for some corrupted example x and clean example y, find the conditional probability distribution p(y | x)
 Density estimation (PMF estimation): learn the probability density (or mass) function the examples were drawn from, revealing where they cluster

performance measure is a quantitative measure of an algorithm's success at task T
 generally we test how well the algorithm performs on data it hasn't seen: the test set

experience is often the dataset, containing features; in the case of supervised learning it also contains annotations with labels or targets
Linear Regression Example
Simple algorithm for solving a regression problem: predict some scalar y from an input vector x in R^{n} via ŷ = w^{T}x, where w is a vector of n parameters
 w is simply the weights of each xi feature
 set aside a design matrix X^{(test)} of m examples with a corresponding target vector y^{(test)} containing correct values of y, not used for training but as performance measure P
 common performance measure is based on the sum of squared deviation
 this is minimized when the sum of squared differences function's gradient is 0
 Mean Squared Error is a function of the sum of squared differences
 MSE_{test} = 1/m • ∑_{i=1}^{m} (ŷ^{(test)}_{i} − y^{(test)}_{i})^{2}
 we want to improve the weights w to minimize MSE_{test}, by minimizing MSE_{train}
 solve ∇_{w}MSE_{train} = 0, which yields the normal equations
Normal Equation is an analytical solution to the linear regression problem, using a least-squares cost function
 starting with: MSE_{train} = 1/m • ∑_{i=1}^{m} (ŷ^{(train)}_{i} − y^{(train)}_{i})^{2}
 ∇_{w}MSE_{train} = 0
 ∇_{w} 1/m • ∑_{i} (ŷ^{(train)}_{i} − y^{(train)}_{i})^{2} = 0
 ∇_{w} ∑_{i} (ŷ^{(train)}_{i} − y^{(train)}_{i})^{2} = 0 (the constant 1/m does not change the minimizer)
 ŷ = w^{T}x is our prediction or hypothesis function for x
 X, the design matrix, replaces our summation term, since it represents the m rows of n+1 features, +1 for the intercept
 ∇_{w} (Xw − y^{(train)})^{T}(Xw − y^{(train)}) = 0 (the transposed matrix multiplication squares and sums each residual)
 expanding and taking the gradient: 2X^{T}Xw − 2X^{T}y^{(train)} = 0
 solving: w = (X^{T}X)^{−1}X^{T}y^{(train)}
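The normal equation can be sketched in a few lines of numpy; the data here is synthetic (hypothetical weights [1, 2, 3], noiseless targets), not from any particular dataset:

```python
import numpy as np

# Synthetic, noiseless data: y = 1 + 2*x1 + 3*x2 (hypothetical weights)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # column of 1s for the intercept
true_w = np.array([1.0, 2.0, 3.0])
y = X @ true_w

# Normal equation: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers [1, 2, 3] up to floating-point error
```

With noisy targets the recovered w would only approximate true_w; here the data is noiseless, so the fit is exact.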
Capacity, Overfitting and Underfitting
In machine learning, we want to perform well on new, previously unseen inputs; this ability is called generalization
 one goal is to minimize generalization error, or test error

generalization error is the expected value of the error on new inputs, drawn from the distribution we expect the system to encounter in practice
 estimate generalization error with a test set
 in linear regression, training error is minimized, but we actually care about test error; however, the test error cannot be used to affect the result of training, since the test set must remain unseen
 in statistical learning theory, an arbitrary collection of examples tells us nothing; instead we assume the data is collected from some probability distribution (the data-generating process)
 we make the i.i.d. assumptions: examples are independent, identically distributed draws
 therefore the test and train examples come from the same data-generating distribution, p_{data}
 for a fixed model, expected train error = expected test error
 in practice, we choose parameters to minimize the training error, so expected test error ≥ expected train error
 we want to reduce training error and reduce the gap between training and test error
 these factors correspond to underfitting and overfitting, and can be mitigated by adjusting capacity
 a model's capacity is its ability to fit a wide variety of functions
 capacity too high for the task leads to overfitting, too low to underfitting
 one way to control capacity is to restrict the hypothesis space
 linear regression models are restricted to the set of linear functions
 introducing an x^{2} term into the linear regression equation, ŷ = b + w_{1}x + w_{2}x^{2}, increases the model capacity (capacity is increased by adding features like x^{2} and their corresponding parameters like w_{2})
 machine learning algorithms work best when they correspond to the appropriate capacity for the true complexity of the task
 in statistical learning theory, model capacity is quantified by the Vapnik-Chervonenkis (VC) dimension
 it measures the capacity of binary classifiers
 the VC dimension is the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily
 simpler functions are more likely to generalize (have small difference between training and test error), still need to have sufficient complexity to keep training error low
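To see capacity at work, a small sketch (synthetic quadratic data with hypothetical coefficients) can compare a degree-1 and a degree-2 polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: 1.0 - 2.0 * x + 3.0 * x**2        # hypothetical true function
x_train = rng.uniform(-1, 1, size=30)
x_test = rng.uniform(-1, 1, size=30)
y_train = true_f(x_train) + rng.normal(scale=0.05, size=30)
y_test = true_f(x_test) + rng.normal(scale=0.05, size=30)

def fit_mse(degree):
    # Least-squares polynomial fit; returns (train MSE, test MSE)
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

train1, test1 = fit_mse(1)  # underfits: a line cannot capture the x^2 term
train2, test2 = fit_mse(2)  # capacity matches the true complexity
print(train1, test1, train2, test2)
```

Pushing the degree much higher would typically drive training error down further while test error rises: overfitting.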
Nonparametric models allow for high capacity (well fit models)
 nonparametric models are not necessarily bounded by a parametric function (some are theoretical abstractions that cannot be implemented in practice)
 linear regression has a fixed length vector of weights, while the nearest neighbour regression (nonparametric) uses X, the training matrix and y, the target vector of corresponding values to each point
 model looks up the closest training entry to the test point x, and returns the regression target
 nonparametric learning algorithms can be constructed through nested parametric learning algorithms
 outer algorithm, for learning the number of parameters needed for the inner regression algorithm
 error in mapping x to y can be seen as inherently stochastic, or as deterministic but dependent on unobserved variables
 the error in a "perfect model" for a given distribution incurs Bayes error
 for nonparametric models, the generalization error decreases with more examples until it reaches the best possible error
 any fixed-parameter model without enough capacity will bound itself to a value greater than the Bayes error
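A minimal sketch of the nearest neighbour regression described above, on toy data with hypothetical values:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    # Nonparametric: no fixed weight vector; the "parameters" are the training data itself.
    # Return the target of the training point closest to x.
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 10.0, 20.0, 30.0])

pred = nearest_neighbor_predict(X_train, y_train, np.array([1.2]))
print(pred)  # 10.0: the closest training point is x = 1.0
```

Because the model memorizes the training set, its capacity grows with the number of examples.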
The No Free Lunch Theorem
Logically, inferring general rules from a limited set of examples is not valid
 learning theory claims that algorithms can generalize well from a finite training set
 must have information about every member of the set, otherwise you can't get something from nothing
 learning is based on probabilistic rules rather than logical reasoning
 no free lunch states that every classification algorithm has the same average error over the set of all tasks
 no learning algorithm is inherently better than any other, without information and assumptions
 it is important to understand the distributions relevant to real-world experiences, and which algorithms work well on data drawn from these data-generating distributions
Regularization
The no free lunch theorem implies that the algorithm must be designed well for a task, the only way so far is by changing its capacity (thus affecting the hypothesis space)
 can have more granularity in algorithm design by introducing more kinds of function to the hypothesis space
 modify what we're trying to minimize by summing the MSE with a criterion predefined by the user (λ controls the strength of the preference for the preferred kind of function)
 weight decay is an example, where we minimize J(w) = MSE_{train} + λw^{T}w; the second term is just λ times the squared L^{2} norm of w
 the result is a weighted tradeoff between fitting the training data and keeping the weights small
 generally, we can regularize a model that learns function f(x;θ) by adding a penalty called a regularizer to the cost function
 in weight decay, the regularizer Ω(w) = w^{T}w
 exclusion from the hypothesis space is infinite preference against it
 to design a learning algorithm, regularizers are used to specify preference of kinds of functions learned, controlling capacity
 regularization is any modification to the learning algorithm with the intent of reducing generalization error, not training error
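Weight decay admits the same closed-form treatment as plain linear regression. A minimal numpy sketch (synthetic data, hypothetical λ values) showing how λ shrinks the weights:

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    # Minimize J(w) = MSE_train + lam * w^T w
    # Closed form: w = (X^T X + lam * I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_plain = weight_decay_fit(X, y, lam=0.0)     # ordinary least squares
w_decay = weight_decay_fit(X, y, lam=100.0)   # strong preference for small weights
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

A larger λ expresses a stronger preference for small weights, trading training fit for a smaller norm of w.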
Hyperparameters
A hyperparameter is not learned by the algorithm, it is predefined like the capacity in a polynomial regression or λ in weight decay
 often these are settings that are not appropriate to learn: if capacity were learned from the training set, the algorithm would always choose the highest possible capacity, resulting in overfitting
 a validation set can be formed (as a partition of the training data, not the test set) to tune hyperparameters
 after validation set optimizes hyperparameters, generalization error may be estimated with the test set
 cross-validation makes use of all examples for both training and testing, across repeated splits
 k-fold cross-validation partitions the dataset into k nonoverlapping subsets, iterates through each subset as the single test subset, and averages the test error across the k trials
 used when the dataset D is too small for a simple train/test split to yield an accurate estimate of generalization error (mean loss on a small test set has high variance)
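A minimal sketch of k-fold cross-validation (synthetic data; the fit/predict callables are placeholders for any learning algorithm):

```python
import numpy as np

def k_fold_mse(X, y, k, fit, predict):
    # Average test MSE over k non-overlapping folds;
    # each fold serves exactly once as the test set.
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(np.mean((predict(model, X[test_idx]) - y[test_idx]) ** 2))
    return np.mean(errors)

# Usage with a least-squares linear model on hypothetical data
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=40)

fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
predict = lambda w, A: A @ w
avg_error = k_fold_mse(X, y, k=5, fit=fit, predict=predict)
print(avg_error)
```

The averaged error is a lower-variance estimate of generalization error than any single small test split.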
Estimators, Bias, Variance
Point Estimation attempts to provide the single best prediction
 generally predicting a parameter (or vector of them) e.g. the weights in linear regression
 θ̂ is used to denote an estimator for a parameter θ
 a point estimator is just a function of the data: θ̂_{m} = g(x^{(1)}, ..., x^{(m)})

Consistency formally describes the probabilistic convergence of a point estimate to the actual value
Maximum Likelihood Estimation

Maximum Likelihood Estimation helps to formally derive what a good estimator function is for a model
 we have an unknown distribution, and some sample
 the likelihood of a specific sample coming up is the product of the individual probabilities of each event in the sample (assuming independent draws)
 now we want to estimate any θ, some parameter of the unknown distribution; let's do the mean
 then θ̂ is our estimate of θ: the selected value of θ for the unknown distribution that maximizes the probability of our sample coming up

 KL Divergence (Kullback-Leibler divergence) is an asymmetric measure of the difference between two probability distributions
The MLE can be simplified to finding the θ that maximizes the average log-likelihood of each event in our sample, the training data.
In ML, the MLE is used to minimize the dissimilarity between the empirical distribution of the training set and the model distribution, as measured by the KL divergence
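A small numerical sketch of MLE for the mean of a Gaussian (synthetic sample, σ assumed known and fixed at 1): scanning candidate θ values shows the average log-likelihood peaks at the sample mean, which is the analytic MLE for the mean:

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(loc=5.0, scale=1.0, size=1000)  # hypothetical true mean 5.0

def avg_log_likelihood(theta, data, sigma=1.0):
    # Average Gaussian log-likelihood of the data under mean theta
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - theta) ** 2 / (2 * sigma**2))

thetas = np.linspace(3.0, 7.0, 401)
lls = [avg_log_likelihood(t, sample) for t in thetas]
theta_hat = thetas[np.argmax(lls)]

# The grid maximizer coincides (up to grid spacing) with the sample mean
print(theta_hat, sample.mean())
```

Maximizing the average log-likelihood here is equivalent to maximizing the product of the individual probabilities, since the log turns the product into a sum.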
Conditional Log-Likelihood and MSE
Supervised Learning
Association of some input with some output
 human often acts as "supervisor" annotating data
 many supervised learning algorithms are probabilistic, estimating a probability distribution p(y | x)
 using MLE to find best parameter vector
We can generalize linear regression to the classification scenario by defining different families of probability distributions
 in binary classification, we have class 0 and class 1, and the probabilities of the two classes must sum to 1 for all domain values
 since they sum to 1, we only calculate one distribution
 logistic regression: squash the output of a linear regression into the range (0,1) using the logistic sigmoid, and interpret it as a probability
 logistic regression is harder: linear regression has optimal weight parameters in closed form, but logistic regression has no closed-form solution
 in logistic regression, we must search for the optimal weights by maximizing the log-likelihood function, or minimizing the negative log-likelihood with gradient descent
 applies generally to almost all supervised learning problems
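A minimal sketch of logistic regression trained by minimizing the negative log-likelihood with gradient descent (tiny hypothetical dataset; the learning rate and step count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    # Batch gradient descent on the mean negative log-likelihood
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)              # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean negative log-likelihood
        w -= lr * grad
    return w

# Hypothetical 1-D data with an intercept column; classes are linearly separable
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = fit_logistic(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(float)
print(preds)  # matches y
```

There is no closed-form solution here, so the search iterates; the same gradient-based recipe applies to most probabilistic supervised learners.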
Support Vector Machines
Similar to logistic regression, also driven by a linear function w^{T}x + b (a weighted vector function)
 the goal is similar to logistic regression: find a hyperplane that divides a dataset into two classes
 support vectors are the data points nearest to the hyperplane, if removed would alter the position of the dividing hyperplane
 the margin is the distance from the hyperplane to the support vectors, and we want to maximize the margin if possible
 while logistic regression provides probabilities, SVMs predict one class for positive output values, and other for negative
Often data doesn't divide nicely. We can approach this by mapping data into a higher dimension:

the kernel trick is used to accomplish this
 applying kernels to improve classification results
 kernel K(v,w) is a function K: R^{N} × R^{N} → R computing the dot product between v and w
 so a linear SVM finds a hyperplane, called the decision boundary, that best separates the data points
 in most datasets, a linear decision boundary cannot be found in the original feature space/dimensionality
 the challenge in training a linear SVM classifier is that we must find a good transformation mapping the dataset to a higher dimension where a linear decision boundary exists
We define φ to be a transformation function that maps the dataset X to a higher-dimensionality dataset X' (think R^{2} to R^{3})
 want to train a linear SVM on transformed dataset X' to get a classifier f_{SVM}
 after training, during testing, every new example x will first be transformed to x' = φ(x)
 so the output is a label for the class; in other words, f_{SVM}(φ(x)) = f_{SVM}(x') ∈ {0,1} for a binary classification problem
 can achieve nonlinear SVMs by applying linear SVMs in higher dimensions
 this is very simple to understand, but it is also computationally expensive to perform
 we can apply different transformations; one example is the polynomial kernel, which maps from R^{2} to R^{5}, and simply transforming the input before performing a linear SVM is extremely expensive in high dimensions
It turns out that SVMs only compute pairwise dot products and thus we don't need to always work in a higher dimensional space during training/testing
 during the training, the optimization problem only uses pairwise dot products of examples
 there exists a closed form method to compute dot products in higher dimensions without needing to transform the vectors in the dataset

kernel functions let us compute this
 we can implicitly transform datasets to higher dimensional R^{M} faster and without extra memory
 in other words, instead of preprocessing data input, then performing dot products, the kernel function can replace dot products and be applied to untransformed vectors with the same effect
 the intuition behind this is that the kernel function K can effectively perform a dot product while working exclusively in R^{N}
 different kernels, or kernel functions, exist to perform dot products in different dimensions with different transformations
 the chosen dimensional transform is important for finding an effective decision boundary
 selecting the correct kernel for a task requires a lot of hyperparameter tuning for the model to get good performance
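The kernel trick can be verified numerically. For a degree-2 polynomial kernel K(v,w) = (v·w)² on R², the kernel value equals a plain dot product after the explicit map φ(v) = (v₁², √2·v₁v₂, v₂²) into R³ (this particular kernel and map are an illustrative choice, not the R²-to-R⁵ example mentioned earlier):

```python
import numpy as np

def poly_kernel(v, w):
    # Degree-2 polynomial kernel, computed entirely in the original space R^2
    return (v @ w) ** 2

def phi(v):
    # Explicit feature map to R^3 whose dot products the kernel reproduces
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

v = np.array([1.0, 2.0])
w = np.array([3.0, 0.5])

lhs = poly_kernel(v, w)   # (1*3 + 2*0.5)^2 = 16
rhs = phi(v) @ phi(w)     # same value via the explicit transformation
print(lhs, rhs)
```

The kernel never materializes φ(v) or φ(w), which is why kernelized SVMs avoid the cost and memory of working in the higher-dimensional space.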

K-Fold Cross-Validation is a technique that might help