Background
- Truncated Taylor series: if the Taylor series of f exists, it can be truncated at any n-th derivative term, and there is some real z between 0 and x which, plugged into the next derivative, makes the truncation exact; the simplest case is f(x) = f(0) + f'(z)x (general form after this list)
- 1 + x ≤ e^x for all real x. Proving an inequality a ≤ b is usually done by showing that b - a is non-negative
- Gamma function: for a > 0, Γ(a) = ∫₀^∞ x^(a-1)•e^(-x) dx; it generalizes the factorial, since Γ(n) = (n-1)! for positive integers n
- an indicator variable is either 0 or 1, indicating whether some quantity is present
- Variance is the average squared deviation from the expected value
- covariance between xi and xj is the average product of their deviations from their expected values
- bound on the tail probability of a non-negative random variable: P(x > a) ≤ E(x)/a
- for a symmetric matrix A, the number of eigenvalues, counting multiplicity, equals the dimension of A
- the m x n matrix A^T is the transpose of the n x m matrix A (columns become rows)
- an orthogonal matrix Q is square, with orthonormal vectors for rows and columns
- the conjugate transpose A* of a matrix A is the transpose of A with the complex conjugate taken of every entry
- Real Spectral Theorem: let A be a real symmetric matrix, then
- the n eigenvalues are real, as are the components of the corresponding n eigenvectors
- (spectral decomposition) A is orthogonally diagonalizable: A = VDV^T, where V is the orthogonal matrix whose columns are the unit eigenvectors vi (|vi| = 1) and D is the diagonal matrix of eigenvalues (a numerical check follows this list)
- A is symmetric <=> A is orthogonally diagonalizable
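A quick numerical illustration of the spectral decomposition above (a minimal sketch using numpy's `eigh`, which is specialized for symmetric matrices):

```python
import numpy as np

# Build a random real symmetric matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # symmetrize

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of V)
eigenvalues, V = np.linalg.eigh(A)
D = np.diag(eigenvalues)

print(np.allclose(A, V @ D @ V.T))      # A = V D V^T
print(np.allclose(V.T @ V, np.eye(4)))  # V is orthogonal
```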
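And for reference, the truncated Taylor series bullet at the top of this list is the n = 0 case of Taylor's theorem with the Lagrange remainder (expansion around 0, with z some point between 0 and x):

```latex
% Taylor's theorem with Lagrange remainder, expanded around 0:
% for some z between 0 and x,
f(x) = \sum_{i=0}^{n} \frac{f^{(i)}(0)}{i!} x^i \;+\; \frac{f^{(n+1)}(z)}{(n+1)!} x^{n+1}
% n = 0 gives the bullet's form: f(x) = f(0) + f'(z) x
```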
2: High-Dimensional Space
Starting off with some simple high-dimensional intuition:
- take a d-dimensional vector whose coordinates X are each drawn from N(0,1); as d -> infinity, the volume of the unit ball (all vectors x s.t. |x| ≤ 1) -> 0
- this is intuitive because for |x| to be ≤ 1, the sum of its squared components must be ≤ 1, and since they are normally distributed, we know that for large d, the sum of squared x coordinates will almost surely exceed 1
- for a set of n points in d dimensions, also with normally distributed coordinates, as d -> infinity the distances between all pairs of points approach the same value (see the distance-concentration sketch after this list)
- this is essentially saying that the sum of d independent squared coordinate differences concentrates around its mean as d grows
- and the law of large numbers states that: P(|(x1 + ... + xn)/n - E(x)| ≥ ε) ≤ Var(x)/(n•ε^2)
- this bounds how far the sample mean of x1, ..., xn (e.g. each ~ N(μ, σ^2)) can stray from the true mean E(x), where ε is the allowed error
- law of large numbers inequality is proved by Markov's inequality and Chebyshev's inequality
- Markov's inequality: P(X ≥ α) ≤ E(X)/α
- Chebyshev's inequality: P(|Y - E(Y)| ≥ α) ≤ Var(Y)/α^2
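A small Monte Carlo sketch of this sample-mean bound (assuming numpy; the particular values of μ, σ, n, and ε are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, eps = 0.0, 2.0, 500, 0.2
trials = 20_000

# Sample means of n i.i.d. N(mu, sigma^2) draws, repeated over many trials
samples = rng.normal(mu, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)

empirical = np.mean(np.abs(sample_means - mu) >= eps)
bound = sigma**2 / (n * eps**2)

print(f"empirical P(|mean - mu| >= eps): {empirical:.4f}")
print(f"Chebyshev/LLN bound Var(x)/(n eps^2): {bound:.4f}")
```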
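Relatedly, the earlier claim that pairwise distances between points with Gaussian coordinates concentrate as d grows can be checked the same way (again a sketch assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50  # number of points

for d in (10, 100, 10_000):
    x = rng.standard_normal((n, d))
    # Squared pairwise distances via the Gram matrix trick
    sq = (x**2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    dists = np.sqrt(np.clip(dist2, 0, None))
    upper = dists[np.triu_indices(n, k=1)]
    # Relative spread of distances shrinks as d grows (they cluster near sqrt(2d))
    print(d, round(upper.mean(), 2), round(upper.std() / upper.mean(), 4))
```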
Properties of the Unit Ball
For a unit ball in d dimensions, the volume -> 0 as d -> infinity.
- most of the volume is concentrated in the outer annulus of width 1/d
- shrinking the radius of a d-dimensional ball by a factor f scales the volume by a factor of f^d
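Both claims can be checked with the standard closed form for the volume of the d-dimensional unit ball, V(d) = π^(d/2) / Γ(d/2 + 1); the annulus fraction below follows from the f^d scaling fact in the last bullet. A sketch using only the Python standard library:

```python
import math

for d in (1, 2, 5, 10, 20, 50, 100):
    # Volume of the d-dimensional unit ball
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    # Fraction of that volume in the outer annulus of width 1/d:
    # shrinking the radius to (1 - 1/d) scales volume by (1 - 1/d)^d,
    # so the annulus holds 1 - (1 - 1/d)^d, which approaches 1 - 1/e ≈ 0.63
    annulus_fraction = 1 - (1 - 1 / d) ** d
    print(d, volume, round(annulus_fraction, 3))
```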
Random Projection and the Johnson-Lindenstrauss Lemma
The Nearest Neighbour Problem is an example of a problem that benefits from dimension reduction with a projection f: R^d -> R^k, where k << d
- basis of random projection: take k random d-dimensional vectors, u1, ..., uk, then f(v) = (u1 • v, ..., uk • v)
- show that |f(v)| ~= k^(1/2) * |v| with high probability
- f(v1 - v2) = f(v1) - f(v2), allowing us to approximate distances between d-dimensional points in k dimensions
- note that ui's are not orthogonal, and not necessarily unit length
- the Random Projection Theorem then bounds the probability that |f(v)| deviates from k^(1/2)•|v| by more than a relative error of ε
- it follows that the probability that the projection length differs substantially is exponentially small with respect to k
- the Johnson-Lindenstrauss Lemma: for n points and k on the order of (ln n)/ε^2, with high probability (1 - ε)•k^(1/2)•|vi - vj| ≤ |f(vi) - f(vj)| ≤ (1 + ε)•k^(1/2)•|vi - vj| for every pair, i.e. all pairwise distances are preserved up to a (1 ± ε) factor after rescaling by k^(1/2)
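A minimal sketch of the projection itself (assuming numpy, Gaussian u_i's, and arbitrary choices of d and k):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 10_000, 200

# k random d-dimensional Gaussian vectors as rows; f(v) = U @ v
U = rng.standard_normal((k, d))

v1 = rng.standard_normal(d)
v2 = rng.standard_normal(d)

# Projection lengths concentrate around sqrt(k) times the original length
print(np.linalg.norm(U @ v1) / (np.sqrt(k) * np.linalg.norm(v1)))  # ~1

# Linearity: f(v1 - v2) = f(v1) - f(v2), so pairwise distances also scale by ~sqrt(k)
proj_dist = np.linalg.norm(U @ v1 - U @ v2)
true_dist = np.linalg.norm(v1 - v2)
print(proj_dist / (np.sqrt(k) * true_dist))                        # ~1
```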
Eigendecomposition of a Matrix
Recall an Eigenvector of a linear mapping is a non-zero vector that does not change direction after the mapping is applied
- v is an eigenvector of T if T(v) = λv (scalar multiple)
- λ is the eigenvalue
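A short numerical check of this definition (numpy's `eig` returns the eigenvalues and the corresponding eigenvectors as columns):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# Each eigenvector column v satisfies A v = lambda v
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))  # True
```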
3: Best Fit Subspaces & SVD
Singular Value Decomposition of a matrix can be viewed as finding the best-fitting k-dim subspace for the points given by its rows (k a natural number)
- minimizing the sum of squared perpendicular distances of points to the subspace
- equivalent to maximizing the sum of squared lengths of projections onto the subspace
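The equivalence of these two formulations is just the Pythagorean theorem applied point by point:

```latex
% For each data point a_i and subspace V:
%   squared length = squared distance to V + squared length of projection onto V
\|a_i\|^2 = \operatorname{dist}(a_i, V)^2 + \|\operatorname{proj}_V(a_i)\|^2
% Summing over all points, the left-hand side is fixed, so minimizing the total
% squared distance is the same as maximizing the total squared projection length.
```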
The best fitting subspace algorithm:
- start with special case, 1-dim line of best fit; line through the origin
- perform k applications of the best-fitting-line algorithm, where in the i-th iteration we find the best fitting line perpendicular to each of the previous i-1 lines
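A sketch of this greedy procedure, assuming numpy: each best-fitting line is found by power iteration restricted to the orthogonal complement of the lines already chosen, and the result is compared against the right singular vectors from numpy's SVD.

```python
import numpy as np

def best_fit_subspace(A, k, iters=1000):
    """Greedily find k orthonormal directions: in iteration i, pick the unit
    vector v maximizing |A v| among directions orthogonal to the first i-1."""
    _, d = A.shape
    rng = np.random.default_rng(0)
    directions = []
    P = np.eye(d)                       # projector onto the unexplored complement
    for _ in range(k):
        M = A @ P                       # A restricted to the remaining directions
        v = P @ rng.standard_normal(d)  # random start inside the complement
        for _ in range(iters):          # power iteration on M^T M
            v = M.T @ (M @ v)
            v /= np.linalg.norm(v)
        directions.append(v)
        P -= np.outer(v, v)             # remove the chosen direction
    return np.array(directions)

# The greedy directions should match the top right singular vectors of A
rng = np.random.default_rng(4)
A = rng.standard_normal((100, 8))
V_greedy = best_fit_subspace(A, k=3)
_, _, Vt = np.linalg.svd(A)
for i in range(3):
    print(abs(V_greedy[i] @ Vt[i]))     # ~1.0 (same direction up to sign)
```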