Background
 Truncated Taylor series: if the Taylor series of f exists, it can be truncated at any nth-derivative term, and there exists some real z between 0 and x at which the final derivative is evaluated to make the equality exact; e.g. truncating after the first derivative gives f(x) = f(0) + f'(z)x

1 + x ≤ e^x
for all real x. Proving an inequality a ≤ b is usually done by showing that b − a is nonnegative
 Gamma function: for a > 0, Γ(a) = ∫₀^∞ x^(a−1)·e^(−x) dx
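A minimal numerical check of this definition (a sketch: the integration cutoff and step count are arbitrary choices), comparing a midpoint-rule approximation of the integral against Python's built-in math.gamma:

```python
import math

def gamma_integral(a, upper=60.0, steps=200_000):
    # Numerically integrate x^(a-1) * e^(-x) from 0 to `upper`;
    # the tail beyond `upper` is negligible for moderate a.
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx  # midpoint rule avoids evaluating at x = 0
        total += x ** (a - 1) * math.exp(-x) * dx
    return total

print(gamma_integral(5.0))   # close to Γ(5) = 4! = 24
print(math.gamma(5.0))
```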
 Indicator Variable is either 0 or 1 to indicate the presence of some quantity
 Variance is the average squared deviation from the expected value
 covariance between xi and xj is the average product of their deviations from their expected values
 bounds on tail probability, for nonnegative random variables:
P(x > a) ≤ E(x)/a
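A small simulation sketch of this tail bound; the Exp(1) distribution (nonnegative, with E(X) = 1) and the threshold a = 3 are arbitrary illustrative choices:

```python
import random

random.seed(0)
n = 100_000
a = 3.0
# Nonnegative samples from Exp(1), so E(X) = 1
samples = [random.expovariate(1.0) for _ in range(n)]

empirical = sum(x > a for x in samples) / n  # empirical tail probability
bound = 1.0 / a                              # E(X)/a
print(empirical, "<=", bound)
```

The bound is loose here (the true tail is e^(−3) ≈ 0.05 vs. the bound 0.33), which is typical of Markov's inequality.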
 for symmetric matrix A, the number of eigenvalues including multiplicity equals dimension(A)
 m x n matrix A^{T} is the transpose of the n x m matrix A (columns become rows)
 an orthogonal matrix Q is square, with orthonormal unit vectors for rows and columns
 Q^{T}Q = I, so Q^{T} = Q^{−1}
 conjugate transpose of a matrix, A*, is the transpose of A with every entry replaced by its complex conjugate
 Real Spectral Theorem: let A be a real symmetric matrix, then
 the n eigenvalues are real, as are the components of the corresponding n eigenvectors
 (spectral decomposition) A is orthogonally diagonalizable: A = VDV^{T}, where V is the orthogonal matrix whose columns are unit eigenvectors vi (‖vi‖ = 1) and D is the diagonal matrix of the corresponding eigenvalues
 A is symmetric <=> A is orthogonally diagonalizable
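A minimal NumPy sketch verifying the spectral decomposition on a random symmetric matrix (np.linalg.eigh is NumPy's eigensolver specialized for symmetric/Hermitian matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # symmetrize to get a real symmetric matrix

# eigh returns real eigenvalues and orthonormal eigenvectors
eigvals, V = np.linalg.eigh(A)
D = np.diag(eigvals)

assert np.allclose(V @ D @ V.T, A)       # A = V D V^T
assert np.allclose(V.T @ V, np.eye(4))   # columns of V are orthonormal
print(eigvals)
```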
2: High-Dimensional Space
Starting off with some simple high-dimensional intuition:
 for X ~ N(0,1) as each coordinate of a d-dimensional vector x, as d → ∞, the probability that x lands in the unit ball (all vectors x s.t. ‖x‖ ≤ 1) → 0
 this is intuitive because for ‖x‖ to be ≤ 1, the sum of its squared components must be ≤ 1, and since the coordinates are normally distributed, for large d the sum of squared coordinates will almost surely exceed 1
 for a set of n points in d dimensions, also with normally distributed coordinates, as d → ∞, the distance between each pair of points approaches the same value
 this is essentially saying that the sum of d independent squared differences concentrates around a single value as d → ∞
 and the law of large numbers states that: P(|(x1 + ... + xn)/n − E(x)| ≥ ε) ≤ Var(x)/(n·ε^{2})
 this bounds the sample mean for samples X ~ N(μ, σ^2) (or any distribution with finite variance), where we think of ε as the error
 the law of large numbers inequality is proved via Markov's inequality and Chebyshev's inequality

Markov's inequality: for nonnegative X, P(X ≥ α) ≤ E(X)/α

Chebyshev's inequality: P(|Y − E(Y)| ≥ α) ≤ Var(Y)/α^2
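A simulation sketch of the sample-mean bound above; the distribution N(0,1) and the parameters n, ε, and trial count are arbitrary illustrative choices:

```python
import random

random.seed(1)
mu, sigma, n, eps = 0.0, 1.0, 100, 0.3
trials = 10_000

deviations = 0
for _ in range(trials):
    sample_mean = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if abs(sample_mean - mu) >= eps:
        deviations += 1

empirical = deviations / trials
bound = sigma ** 2 / (n * eps ** 2)  # Var(x) / (n·ε²)
print(empirical, "<=", bound)
```

As with Markov's inequality, the bound holds but is far from tight for Gaussian data.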
Properties of the Unit Ball
for a unit ball of d dimensions, volume → 0 as d → ∞.
 most of the volume is concentrated in an outer annulus of width 1/d
 shrinking a d-dimensional ball's radius by a factor f (0 < f < 1) reduces the volume by a factor of f^d
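These facts can be checked with the closed-form volume of the d-dimensional unit ball, V_d = π^(d/2)/Γ(d/2 + 1) (a standard formula, not derived in these notes):

```python
import math

def unit_ball_volume(d):
    # Closed form: V_d = pi^(d/2) / Γ(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in (2, 10, 50, 100):
    print(d, unit_ball_volume(d))   # shrinks toward 0 as d grows

# Fraction of the volume inside radius 1 - 1/d, i.e. NOT in the
# outer annulus of width 1/d: (1 - 1/d)^d, which tends to 1/e.
d = 100
inner_fraction = (1 - 1 / d) ** d
print(inner_fraction)  # roughly 1/e ≈ 0.37
```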
Random Projection and the Johnson–Lindenstrauss Lemma
The Nearest Neighbour Problem is an example of a problem that benefits from dimension reduction with a projection f: R^d → R^k, where k << d
 basis of random projection: take k random d-dimensional vectors u1, ..., uk, then f(v) = (u1 · v, ..., uk · v)
 show that ‖f(v)‖ ≈ √k · ‖v‖ with high probability
 f(v1 − v2) = f(v1) − f(v2) by linearity, allowing us to approximate distances between d-dimensional points in k dimensions
 note that the ui's are not orthogonal, and not necessarily unit length
 the Random Projection Theorem then bounds the probability that ‖f(v)‖ deviates from √k·‖v‖ by more than a factor of (1 ± ε)
 it follows that the probability that the projected length differs substantially is exponentially small in k
 the Johnson–Lindenstrauss Lemma: with high probability, for every pair of points, ‖f(vi) − f(vj)‖ is within a (1 ± ε) factor of √k·‖vi − vj‖
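A NumPy sketch of the construction, taking the ui's to have i.i.d. N(0,1) entries (the dimensions d and k are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 100

# k random d-dimensional vectors u1, ..., uk as the rows of U
U = rng.standard_normal((k, d))

def f(v):
    return U @ v  # (u1 · v, ..., uk · v)

v1 = rng.standard_normal(d)
v2 = rng.standard_normal(d)

# linearity: f(v1 - v2) = f(v1) - f(v2)
assert np.allclose(f(v1 - v2), f(v1) - f(v2))

# length is preserved up to the sqrt(k) scale factor, w.h.p.
ratio = np.linalg.norm(f(v1)) / (np.sqrt(k) * np.linalg.norm(v1))
print(ratio)  # close to 1
```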
Eigendecomposition of a Matrix
Recall an Eigenvector of a linear mapping is a nonzero vector that does not change direction after the mapping is applied
 v is an eigenvector of T if T(v) = λv (scalar multiple)
 λ is the eigenvalue
3: Best Fit Subspaces & SVD
Singular Value Decomposition of a matrix is finding the best-fitting k-dim subspace (k is a natural number)
 minimizing the sum of squared perpendicular distances of points to the subspace
 equivalent to maximizing the sum of squared lengths of projections onto the subspace
The best-fitting subspace algorithm:
 start with the special case, the 1-dim line of best fit: a line through the origin
 perform k applications of the best-fitting line algorithm, where in the i-th iteration we find the best-fitting line perpendicular to each of the previous i−1 lines
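NumPy's SVD returns all of these greedy best-fit directions at once: row i of Vt is the best-fit direction orthogonal to the previous ones. A sketch relating them to projection lengths (the random low-rank test matrix is just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points in 20 dimensions, constructed to have rank 5
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Vk = Vt[:k].T            # basis of the best-fit k-dim subspace
proj = A @ Vk @ Vk.T     # points projected onto that subspace

# sum of squared projection lengths = sum of the top k squared
# singular values (the "maximize projections" formulation above)
assert np.isclose((proj ** 2).sum(), (s[:k] ** 2).sum())
print(s[:k])
```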