Foundations of Data Science


Background

2: High-Dimensional Space

Starting off with some simple high-dimensional intuition:

Properties of the Unit Ball

For the unit ball in d dimensions, the volume -> 0 as d -> infinity.
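This can be checked numerically with the closed-form volume pi^(d/2) / Gamma(d/2 + 1); a minimal sketch (the function name `unit_ball_volume` is just for illustration):

```python
import math

def unit_ball_volume(d):
    # Volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1).
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Volume rises until about d = 5, then decays rapidly towards 0.
for d in [1, 2, 3, 5, 20, 100]:
    print(d, unit_ball_volume(d))
```

For d = 2 and d = 3 this recovers the familiar values pi and 4pi/3, while by d = 100 the volume is astronomically small.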

Random Projection and Johnson Lindenstrauss Lemma

The Nearest Neighbour Problem is an example of a problem that benefits from dimensionality reduction via a projection f: R^d -> R^k with k << d.
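The Johnson–Lindenstrauss idea can be sketched with a random Gaussian projection: scaling by 1/sqrt(k) preserves squared lengths in expectation, so pairwise distances are distorted by only a small factor. A minimal illustration (the dimensions and helper `pairwise_dists` are arbitrary choices, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1000, 100, 20            # original dim, target dim, number of points
X = rng.normal(size=(n, d))

# Random Gaussian projection, scaled by 1/sqrt(k) so that
# E[||f(x)||^2] = ||x||^2 (the Johnson-Lindenstrauss setup).
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_dists(A):
    # All pairwise Euclidean distances between the rows of A.
    diffs = A[:, None, :] - A[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1))

iu = np.triu_indices(n, 1)
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print(ratios.min(), ratios.max())   # concentrated near 1
```

With k = 100 the relative distortion is on the order of 1/sqrt(k), i.e. roughly 10%, even though the dimension dropped by a factor of 10.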

Eigendecomposition of a Matrix

Recall that an eigenvector of a linear mapping is a non-zero vector whose direction is unchanged when the mapping is applied; it is only scaled by the corresponding eigenvalue.
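A quick numerical check of this definition, using NumPy's `eigh` on a small symmetric matrix (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)     # symmetric matrix: real eigenpairs

# Each column v of `vecs` satisfies A v = lambda v:
# the direction is unchanged, only the length is scaled.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)
print(vals)
```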

3: Best Fit Subspaces & SVD

The Singular Value Decomposition (SVD) of a matrix finds, for each natural number k, the best-fitting k-dimensional subspace for the rows of the matrix.

The best fitting subspace algorithm:

  1. Start with the special case: the 1-dimensional line of best fit, a line through the origin.
  2. Perform k applications of the best-fit line algorithm, where in the i-th iteration we find the best-fit line perpendicular to each of the i-1 lines found so far.
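The greedy procedure above can be sketched in NumPy. Here the best-fit line is taken from the top right singular vector, and orthogonality to earlier lines is enforced by deflation (subtracting the component along each chosen line); this is an illustrative sketch, and with generic data the greedy lines match the top-k right singular vectors of the full SVD:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))       # 50 points in 5 dimensions

def best_fit_line(B):
    # Unit vector v through the origin maximizing ||B v||.
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[0]

# Greedy: repeatedly find the best-fit line in the subspace
# perpendicular to the lines found so far.
k = 3
lines = []
B = A.copy()
for _ in range(k):
    v = best_fit_line(B)
    lines.append(v)
    B = B - np.outer(B @ v, v)     # deflate: remove the component along v

V_greedy = np.array(lines)
_, _, Vt = np.linalg.svd(A)
# Up to sign, the greedy lines agree with the top-k right singular vectors.
print(np.abs(V_greedy @ Vt[:k].T))
```

This agreement (for distinct singular values) is exactly the "greedy is optimal" fact that makes the SVD compute best-fit subspaces.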