CS462 Formal Languages and Parsing

Back to UWaterloo

Combinatorics on Word: Definitions
Finite Automata and Regular Languages
3.8 Automata, graphs and boolean matrices
3.9 Myhill-Nerode theorem
3.10 Minimization of finite automata
3.11 State complexity
3.12 Partial Order
4 Context-free grammars
Turing Machines

Combinatorics on Word: Definitions

A primative x and power w form the word w = x^k for integer k
P₂ means the set of all primatives over the alphabet of length 2: {0,1}
x Ш y means shuffle, as in shuffling a deck (interleave)
lexicographic order is defined for same length strings
radix order sorts first by length then lexicographical
z is purely periodic if z = x^omega for some finite x
z is ultimately periodic if the suffix is purely periodic
x and y are conjugates if x is a cyclic shift of y
x is a border of y if x is not empty, x != y, x is prefix and suffix of y "entaglement"
Lyndon-Schutzenberger 1st theorem, the "alfalfa" theorem
can not avoid squares over a finite alphabet
a morphism is a map from one alphabet to another, preserving concatenation
if every subword of an infinite word appears at least twice, then every subword appears infinitely many times
if every subword appears infinitely many times, the infinite word is recurrent
thue-morse sequence is recurrent because any subword is from position i to j, and for the binary representation of j, you can always find the same subword ending at 11j
- another way to think about this is the complementary construction. complement of complement gets you the original string back at a later position

Proof tricks:

pigeon hole principle is useful for showing eventual repetition in infinite strings

Finite Automata and Regular Languages

moore vs mealy machines are representations of a DFA for computation, resulting in some output
- moore outputs different values at each state. there exists a mealy machine with the same # of states
- mealy outputs values at each transition. there exists a moore machine with |Q| |Δ| states
Quotient operator on languages: L1 / L2 is defined as {x in Sigma * where y in L2 and xy in L1 }
- we can define prefix(L) = L / Sigma*
Concatenation of languages: L1 L2 is {xy where x in L1, y in L2 }
Theorem: for language L and regular language R, then R / L is regular
- idea: start with a DFA for R, modify its final (accepting) states
example: remove trailing zeroes while preserving regularity. (L / '0'*) n {epsilon u sigma* u sigma - {0}
Δ^* is the set of all languages
substitution maps words to languages. It preserves concatenation.
- "substitution by regular languages" means regularity is preserved during substitution.
Theorem: substitution preserves regularity. Prove by induction on the regular expression operators.
inverse morphism on a language L is, given a morphism, the set of all words that when the morphism is applied to them, they result in a word in L
transducer is a generalization of the mealy machine which transforms input to a new output, which has finite states, nondeterministic transitions, and each transition takes some input and produce output of possibly a different alphabet

Proof tricks:

for induction, always construct the problem recursively. Then make sure during the induction step, we apply it to pieces recursively.
start with reprentations of an object (regular language represented by DFA, NFA, etc) and modify it to reach an answer

3.8 Automata, graphs and boolean matrices

A transition in a NFA can be represented using an incidence matrix: a boolean matrix where for a given input character a in Sigma, Ma[i][j] = 1 if there is a transition from state i to state j on character a, or delta(qi, a) = qj. (we use AND instead of multiply, OR instead of addition).

If you multiply matrices, we end up "following paths" along the NFA. Ma x Mb gives us the incidence matrix for word w = ab.

We learned from algos cs240 that there are more effective implementations of matrix multiplication than the naive O(n^3). This gives us a more efficient computation for NFA simulation.

There are a finite number of possible matrices, even after infinite operations on them since each state can only be 0 or 1.

Proof trick: construct a second DFA from the original, using tuples of matrices as the states (this works because there are finitely many possible incidence matrices). Define the final states using a sort of "query" on all states that satisfy some property, enabling traversal from the initial state (row 0) to final states in the original DFA.

3.9 Myhill-Nerode theorem

The following are equivalent:

L is regular
L can be written as the union some of the equivalence classes (indistinguishable) of E, where E is a right-invariant (xRy => xzRyz x,y in S, forall z in S) equivalence relation with finite index
- an equivalence relation R partitions a set into disjoint subsets of "equal" things (equivalence class)
  - R1 is a refinement of R2 if R1 subset of R2
- index is the number of classes
R_L is of finite index.
- Myhill-Nerode equivalence relation is a right-invariant relation R_L, s.t. if xz inlanguage L iff yz in L
- if L is the union of any right-invarant equivalence relation E's equivalence classes, then E is a refinement of R_L

3.10 Minimization of finite automata

Myhill-Nerode theorem allows minimal DFAs to be constructed from the Myhill-Nerode equivalence relation

Naively, enumerate all pairs of states. "Mark" all pairs that are immediately distinguishable (one's a final state, the other isn't). Run for n iterations, marking new pairs along the way. Any unmarked states after n iterations yields the equivalence classes.

3.11 State complexity

sc(L) is the minimum number of states in a DFA that accepts regular language L
sc(L1 n L2) <= sc(L1) x sc(L2). This can be seen by forming a DFA whose states are pairs, one state from L1 and one from L2. The upper bound case can be formulated using L1 = number of 0s in the string = 0 mod n, and L2 is number of 1s in the string = 0 mod m
NFAs are harder since the minimal NFA is not unique
we can find lower bounds if we construct the problem in a way that we can apply known theorems to it
- if the shortest string accepted by L is of length n, then nsc(L) >= n+1. Prove using pigeon hole principle to show some state is repeated, and pumping lemma idea to remove a cycle to find a shorter string.
- if you can find n words under the following constraints, then you can show that nsc(L) >= n: split each of the n words into a pair--some non-overlapping prefix-suffix pair--then show that for all choices of two of pairs, if you swap the prefixes, then at least one of the words is not in L.

3.12 Partial Order

subword (contiguous, x S y) is one natural partial ordering on strings, and subsequence (a | b) is another
if R is a partial ordering, then if a R b or b R a, then a and b are comparable (otherwise incomparable)
antichain is a set of pairwise incomparable elements
- subword can have infinite antichains (for ex. a b^n a)
- subsequence has no infinite antichains
  - proof using contradiction on minimality, using division-free subsequences
a minimal element has nothing less than or equal to it except itself
the set of minimal elements is finite (though hard to find in some cases, like powers of 2 in base 10)

4 Context-free grammars

recall pushdown-automata. ∂(q,a,A) is of the form (p, 𝛾) where A is the current top of stack, and 𝛾 is a string that replaces A. Input accepted if at final state or stack is empty, where stack starts with some unique Z0.
CFL closed under union, concat, kleene star, substitution (of symbol for a CFL) and therefore morphisms, inverse morphisms
recall the pumping lemma for context free languages. If a language is context free, then we can always find some pumping length such that all strings in the language greater than the pumping length must posess structural properties.
to show that a language is NOT context free, we assume some n exists. Then construct a string in L greater than any choice of pumping length n (so it must depend on n). Then we show that it does not hold the correct structural properties of a CFL.

4.3 Ogden's Lemma

Ogden's Lemma is a generalization of the pumping lemma for context free grammars. Similarly, we have a pumping length, based on some n = d^{k+1} where d is the max length of an production rule's right side (the max number of possible children for each vertex in the parse tree) and k+1 for k variables to satisfy the height requirement to arrive at the pigeon hole principle argument. The difference here is that we can mark symbols in a string.
proof-trick: factorial trick n! + n gives us a way to trap any a^n (decomposed as a^{ij} a^k) to be equal to a^{n! + n} for some i, for any choice of j,k.

4.2 & 4.6 Parikh's Theorem

all unary CFLs are regular. We can rewrite any unary CFL as the union of a finite number of regular languages, where the finite number is based on the number of rules and the size of them.
Parikh's theorem generalizes this to certain CFLs over any alphabet.
linear and semi-linear are general properties of sets.
linear sets of numbers can built using any r coefficients (finite?), and saying that any number in that linear set can be constructed using the some of these r coefficients multiplied by r other integers.
- this is generalized in this course as a linear set of k-tuples, described by r coefficients (which need be k-tuples)
semi-linear sets are described as the union over a finite number of linear sets.
- I thought about whether semi-linear sets can be re-described using the constraints of linear sets. You can't. As as geometric example, if you have r = 1 and k = 2, then this plots out some line in 2D space. If you take the union of two lines in 2D space, you get an X shape (or parallel lines). But if you tried to represent this with r = 2, you would end up describing a grid of points.
finally, Parikh's map ψ(w) applies these ideas to formal languages. It maps a string to a k-tuple where k = |∑|, counting the occurences of each letter in some string. It maps a language to the set of k-tuples that represent strings in the language.
if L is a CFL, then ψ(L) is a semilinear set of k-tuples.
if X is semilinear, than there exists a regular language L so that ψ(L) = X
- any finite set of strings can be mapped to a semilinear set by, for each word in the finite set, setting each to the "y-intercept" and have all 0 coefficients then taking the union of all.

so all in all, if L is a CFL, then there exists some regular language R s.t. ψ(L) = ψ(R)

so I guess we think about Parikh's theorem as a way to count things. It's kinda just saying if you forget about the order, we're just looking at patterns in the counting of each letter.

5 Parsing and recognition

CYK is a dynamic programming approach to parsing in O(n^3). Practically, large programs require better runtimes, so we use more restricted grammars to achieve these runtimes.

first convert grammar to CNF, then we construct an n x n DP table for some string w of length n to which we want to find a parse tree for.
- each entry at CYK[i][j] is the list of nonterminals that derive the substring w[i..j]
- each table entry can be computed in O(n) by finding all rules which derrive any of CYK[i][k] x CYK[k][j]

Earley's Method has worst case O(n^3), but runs in O(n^2) for unambiguous and O(n) for LR(1)

again, first convert to Chomsky Normal Form A -> α . β, where we add a dot to mark "everything before this point has been derived, AKA has a parse tree"
bottom-up parsing algorithm, using a table to store intermediate results
each table entry M[i][j] is a set of production rules A -> α . β where α is a nonterminal that expands to a derivation of w_i,j (substring from i to j) and also the start symbol S derives w_0,i. A

Knuth NFA

the subword that is derived by the some nonterminal is a handle. At each step of the algorithm, a handle is replaced by a nonterminal based on some item (a production rule augmented with a dot . symbol).
for some grammars, there are multiple handles for a partially parsed word, but LR(0) grammars are defined so that there is only ever one handle. This is how LR(0) grammars are parsable in linear time. From cs444, we think about this as representable as a DFA, so it has linear runtime.
"What makes LR(0) parsing work is Knuth’s amazing
observation that the set of items valid for viable prefix γ can
be computed by a finite automaton."
the Knuth NFA's language is the set of all viable prefixes, and states are the different items (production rules augmented with the dot . )
- the Knuth NFA is also used to define the set of valid items for every viable prefix (and a valid item is used to replace some handle with a nonterminal).
- we say an item A -> α . β is valid for a viable prefix if there is a (rightmost) derivation (from the start symbol) S derives δαβw and δα is a viable prefix from the Knuth NFA (where δ,α,β, are terminals or non terminals and w is a subword).

Turing Machines

01010101010... is a "simple" string because we can describe a pattern for it
random digits could be seen as "complex", or without pattern
complexity and randomness are inherently linked
Kolmogorov Complexity of a string C(x) is the shortest program P on input i that prints x and halts
- if C(x) is large, x is hard to describe
- could think of Kolmogorov as encoding or compression (from storage perspective) to recover x
- Kolmogorov complexity is upper bounded by O(x) since you can just input x and print it
by pigeon hole principle, there are incompressible or random strings (since for length n, there are 2^n strings, but only 2^n - 1 shorter representations, like counting nodes in a binary tree 2^(n-1) + 2^(n-2) + ... 4 + 2+ 1)
Kolmogorov complexity is uncomputable