Data Structures and Data Management

Back to UWaterloo

Summary

Introduction

Problem

Given a problem instance I (the input), find a solution (an output with a correct answer) in finite time; Size(I) denotes the size of the instance

Example:

Running Time Simplifications

Naive matrix multiplication performs n^2(n + n - 1) operations (each of the n^2 output entries takes n multiplications and n - 1 additions), i.e. O(n^3)
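A minimal sketch of the naive triple loop (illustrative Python; `mat_mult` is just a name for this note):

```python
def mat_mult(A, B):
    """Naive multiplication of two n x n matrices (lists of lists).
    Each of the n^2 output entries costs n multiplications and n - 1
    additions, hence the n^2 * (n + n - 1) = Theta(n^3) operation count."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```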

Order notation:

"Best Case & Worst Case" refer to instances of inputs into the algorithm, but this has nothing to do with upper/lower bounds. Lets look at a contrived example:

Proof: first derive an appropriate c and n0, then show that the desired relationship holds for all n ≥ n0

Prove that 2010n^2 + 1388n ∈ o(n^3) from first principles.
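One way the first-principles argument can go (a sketch of the derivation):

```latex
% For every c > 0 we must exhibit n_0 with 2010n^2 + 1388n < c n^3 for all n >= n_0.
% For n >= 1:  2010n^2 + 1388n <= 2010n^2 + 1388n^2 = 3398n^2,
% and 3398n^2 < c n^3 holds whenever n > 3398/c.
2010n^2 + 1388n \;\le\; 3398n^2 \;<\; c\,n^3
\quad\text{for all } n \ge n_0 = \max\!\left(1,\; \left\lceil \tfrac{3398}{c} \right\rceil + 1\right).
```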

Now let's compare two implementations of exponentiation:

Bruteforce(b,x): iteratively compute ret *= b, x times
Squaring(b,x): recursively compute ExpSquaring(b*b, x/2) when x is even, or b*ExpSquaring(b*b, (x-1)/2) when x is odd

we see that T_B(x) = x, while T_E(x) lies between 3 log2(x) and 4 log2(x) multiplications depending on the bits of x, i.e. T_E(x) = Θ(log x)
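Sketches of both implementations in Python (illustrative names, not the course's reference code):

```python
def bruteforce(b, x):
    """T_B(x) = x multiplications."""
    ret = 1
    for _ in range(x):
        ret *= b
    return ret

def exp_squaring(b, x):
    """Repeated squaring: Theta(log x) multiplications,
    one squaring per bit of x plus an extra multiply on odd bits."""
    if x == 0:
        return 1
    if x % 2 == 0:
        return exp_squaring(b * b, x // 2)
    return b * exp_squaring(b * b, (x - 1) // 2)
```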

Loop Analysis

Start from the inner-most loop, and use Θ-bounds throughout to obtain a Θ-bound for the complexity of the algorithm
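For example, analyzing a hypothetical nested loop from the inside out:

```python
def example(n):
    """The inner loop runs i times for i = 0..n-1, so the total work is
    sum_{i=0}^{n-1} Theta(i) = Theta(n^2)."""
    total = 0
    for i in range(n):          # outer loop: n iterations
        for j in range(i):      # inner loop: Theta(i) iterations
            total += j          # Theta(1) body
    return total
```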

Priority Queue

A Priority Queue is an ADT storing a collection of items, each associated with a priority

Heaps are a type of binary tree. A max-heap has two additional properties: structural (the tree is complete: all levels are filled except possibly the last, which fills left to right) and heap-order (each parent's key is at least as large as its children's keys).
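A minimal array-based max-heap sketch in Python (illustrative; assumes `delete_max` is called on a non-empty heap):

```python
class MaxHeap:
    """Binary max-heap stored implicitly in an array: node i has children
    2i+1 and 2i+2 and parent (i-1)//2. The structural property is kept
    automatically by appending/popping at the end of the array."""

    def __init__(self):
        self.A = []

    def insert(self, x):
        # add at the last position, then bubble up to restore heap order
        self.A.append(x)
        i = len(self.A) - 1
        while i > 0 and self.A[(i - 1) // 2] < self.A[i]:
            self.A[(i - 1) // 2], self.A[i] = self.A[i], self.A[(i - 1) // 2]
            i = (i - 1) // 2

    def delete_max(self):
        # swap the root with the last item, remove it, then bubble down
        A = self.A
        A[0], A[-1] = A[-1], A[0]
        top = A.pop()
        i = 0
        while True:
            l, r = 2 * i + 1, 2 * i + 2
            largest = i
            if l < len(A) and A[l] > A[largest]:
                largest = l
            if r < len(A) and A[r] > A[largest]:
                largest = r
            if largest == i:
                return top
            A[i], A[largest] = A[largest], A[i]
            i = largest
```

Both operations do Θ(log n) swaps, since the height of a complete binary tree is Θ(log n).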

Sorting and Randomization

Randomized algorithms

Average vs expected running time

Sorting

Comparison-based sorts are lower-bounded by Ω(n log n): a decision tree that distinguishes all n! permutations must have height at least log2(n!) = Θ(n log n)

Radix Sort

Quick Sort vs Merge Sort: quick sort has a slower worst case (Θ(n^2) vs Θ(n log n)), but under the expected case they are comparable. Merge sort is stable (retains the relative order of equal items), while quick sort can be done in place

Tail call elimination can be done using loops to reduce the number of stack frames, and optimized further by only recursing on the smaller partition
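A sketch of this, assuming a Lomuto-style partition helper; recursing only on the smaller side bounds the stack depth at O(log n):

```python
def quicksort(A, lo=0, hi=None):
    """In-place quicksort with tail-call elimination: recurse only on the
    smaller partition and loop on the larger one."""
    if hi is None:
        hi = len(A) - 1
    while lo < hi:
        p = partition(A, lo, hi)
        if p - lo < hi - p:           # left side is smaller: recurse on it
            quicksort(A, lo, p - 1)
            lo = p + 1                # the loop continues on the right side
        else:                         # right side is smaller: recurse on it
            quicksort(A, p + 1, hi)
            hi = p - 1                # the loop continues on the left side

def partition(A, lo, hi):
    """Lomuto partition around A[hi]; returns the pivot's final index."""
    pivot = A[hi]
    i = lo
    for j in range(lo, hi):
        if A[j] <= pivot:
            A[i], A[j] = A[j], A[i]
            i += 1
    A[i], A[hi] = A[hi], A[i]
    return i
```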

Tutorial: if A is a partially sorted array where all but n^k of the elements are already sorted, 0 < k < 1, then we can sort the whole array in Θ(n) time. First merge-sort the n^k unsorted elements in Θ(n^k log(n^k)) = Θ(k n^k log n) = Θ(n^k log n), which is o(n) since log n grows slower than n^c for any c > 0 and since k < 1; then merge the two sorted runs in Θ(n).

Consider Stooge Sort: partition the array into thirds A1, A2, A3. Sort A1 ∪ A2, sort A2 ∪ A3, then finally re-sort A1 ∪ A2. By the algorithm, each call sorts a subarray of size 2n/3 three times, so the recurrence is

T(n) = 3T(2n/3) + O(1), which solves to Θ(n^(log_1.5 3)) ≈ Θ(n^2.71)
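A direct Python transcription of the three-step scheme (a sketch; sorts A[lo..hi] in place):

```python
def stooge_sort(A, lo=0, hi=None):
    if hi is None:
        hi = len(A) - 1
    if A[lo] > A[hi]:
        A[lo], A[hi] = A[hi], A[lo]
    if hi - lo + 1 >= 3:
        t = (hi - lo + 1) // 3
        stooge_sort(A, lo, hi - t)   # sort the first two thirds
        stooge_sort(A, lo + t, hi)   # sort the last two thirds
        stooge_sort(A, lo, hi - t)   # re-sort the first two thirds
```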

Consider Bogo Sort; we want its expected-case running time. Take a probabilistic approach: T(n) = 1·cn + (1/n!)·d + (1 - 1/n!)·T(n). Each coefficient is just a probability: with probability 1 we pay cn to shuffle n objects and check whether the n objects are sorted; with probability 1/n! they are sorted and we pay a constant d to stop; otherwise (probability 1 - 1/n!) we start over. Solving gives T(n) = cn·n! + d ∈ Θ(n·n!).
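As a sketch (the shuffle and the sortedness check are each Θ(n) per round):

```python
import random

def bogo_sort(A):
    """Shuffle until sorted. Each round costs Theta(n) and succeeds with
    probability 1/n!, giving expected Theta(n * n!) running time."""
    while any(A[i] > A[i + 1] for i in range(len(A) - 1)):
        random.shuffle(A)
```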

Dictionary ADT

Tries

Tries, or radix trees, are a dictionary for strings. The basic trie stores binary strings and is thus a binary tree.
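A minimal binary-trie sketch in Python, using an `is_end` flag in place of a $ end-of-word marker (illustrative names):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a bit '0'/'1' to a child node
        self.is_end = False   # True if a stored string ends here

def trie_insert(root, s):
    """Walk/create one node per bit of s; Theta(|s|) time."""
    node = root
    for bit in s:
        node = node.children.setdefault(bit, TrieNode())
    node.is_end = True

def trie_search(root, s):
    node = root
    for bit in s:
        if bit not in node.children:
            return False
        node = node.children[bit]
    return node.is_end
```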

Compressed tries (Patricia tries) store extra information at each node: the bit position at which the strings below it next differ. This lets us encode strings with identical portions very efficiently.

Skip List

When inserting into a skip-list, we can use RNG to determine the height of the new tower, based on a random binary string (coin flips)
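For instance, the height can be drawn like this (each extra level survives with probability 1/2, so P(height ≥ k) = 2^-k and the expected height is constant):

```python
import random

def skip_list_height():
    """Flip coins: grow the tower one level per heads, stop at first tails."""
    h = 0
    while random.random() < 0.5:
        h += 1
    return h
```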

We can also make modifications to a plain linked list to optimize search times (e.g. the move-to-front heuristic)

Midterm

Multi-dimensional Data

d-dimensional data: each item has d aspects (coordinates)

Pattern Matching

DFAs give us a nice representation of how we can reuse information about the state of what we've matched thus far. Refer to cs241 notes for details.

KMP matches left to right, and shifts the pattern upon a bad match based on the largest prefix of our pattern that is a suffix of what we've matched so far.
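A sketch of the failure array and the search loop, where F[i] is the length of the longest proper prefix of P[0..i] that is also its suffix (illustrative names):

```python
def failure_array(P):
    """F[i] = length of the longest proper prefix of P[0..i] that is also
    a suffix of P[0..i]. On a mismatch, KMP falls back via F instead of
    restarting the scan."""
    F = [0] * len(P)
    k = 0
    for i in range(1, len(P)):
        while k > 0 and P[i] != P[k]:
            k = F[k - 1]
        if P[i] == P[k]:
            k += 1
        F[i] = k
    return F

def kmp_search(T, P):
    """Return the index of the first match of P in T, or -1."""
    if not P:
        return 0
    F, k = failure_array(P), 0
    for i, c in enumerate(T):
        while k > 0 and c != P[k]:
            k = F[k - 1]
        if c == P[k]:
            k += 1
        if k == len(P):
            return i - len(P) + 1
    return -1
```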

Boyer-Moore is reverse order searching (right to left)

We have to build two extra data structures to perform BM: the last-occurrence function and the good-suffix array
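As a sketch, the last-occurrence function is just a small table (the good-suffix array is more involved and omitted here):

```python
def last_occurrence(P, alphabet):
    """L[c] = index of the last occurrence of c in P, or -1 if absent.
    On a mismatch against text character c, BM shifts the pattern so that
    this last occurrence lines up under c."""
    L = {c: -1 for c in alphabet}
    for i, c in enumerate(P):
        L[c] = i
    return L
```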

Rabin-Karp Fingerprint uses hashing to compare our pattern to a substring (think hashing to check if your downloaded file is correct)
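A minimal sketch with an illustrative radix d and modulus q (a real implementation typically picks q more carefully, e.g. a random prime); fingerprints are compared first, and characters only on a hash hit, since different strings can collide:

```python
def rabin_karp(T, P, q=101, d=256):
    """Return the index of the first match of P in T, or -1."""
    n, m = len(T), len(P)
    if m > n:
        return -1
    h = pow(d, m - 1, q)              # weight of the leading character
    hp = ht = 0
    for i in range(m):                # initial fingerprints
        hp = (d * hp + ord(P[i])) % q
        ht = (d * ht + ord(T[i])) % q
    for s in range(n - m + 1):
        if hp == ht and T[s:s + m] == P:   # verify on a fingerprint hit
            return s
        if s < n - m:                 # roll the window one character right
            ht = (d * (ht - ord(T[s]) * h) + ord(T[s + m])) % q
    return -1
```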

What if we want to search for many patterns in the same text? Then preprocess the text T instead (e.g. build a suffix tree of T)


Compression

Transmission and storage of data

Text Transformation

Preprocessing our text so that we can make better use of compression algorithms

The Burrows-Wheeler Transform is a compression technique that reorders the characters of the text so that it is more easily compressible with Move-to-Front

Encoding for BWT:

  1. Place all cyclic shifts (or rotations) of S in a list L (n strings, Θ(n^2) total size)
  2. Sort the strings in L lexicographically ($ is first, matching the ordering used in decoding)
  3. Make a list C of the last characters of each string in L
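A naive Python sketch of the encoder, relying on Python's default lexicographic order (in which $ precedes letters, matching the convention above):

```python
def bwt_encode(S):
    """Naive BWT: build all n rotations (Theta(n^2) space), sort them,
    and take the last column. Assumes S ends with a unique '$'."""
    n = len(S)
    L = sorted(S[i:] + S[:i] for i in range(n))
    return "".join(row[-1] for row in L)
```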

Decoding for BWT, given string C from encoding:

  1. Make an array A of tuples (C[i], i)
  2. Sort A by the characters ($ is first; same characters are sorted by their i increasing) into an array N of the original indices
  3. Set j to the index of $ in C
  4. Repeatedly set j to N[j] (kind of a linked-list structure) and build the output string S by appending C[j] to S
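A matching decoder sketch, assuming a unique $ and the same ordering convention as the encoder:

```python
def bwt_decode(C):
    """Invert the BWT by following the steps above."""
    A = sorted((c, i) for i, c in enumerate(C))   # stable: ties break by i
    N = [i for _, i in A]                         # original index of each
    j = C.index('$')
    S = []
    while True:
        j = N[j]
        if C[j] == '$':
            break
        S.append(C[j])
    return "".join(S) + '$'
```

Round trip: `bwt_decode(bwt_encode("banana$"))` returns `"banana$"`.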

The reason this works well is that sorting all cyclic rotations groups any repeated substrings together in the encoding phase, in list L. The last character of each rotation is exactly the character preceding the first character of that rotation in S, and since the rotations are sorted, all occurrences of a given substring in the text end up adjacent. This happens not only for the first letter of the substring; each repeated sub-portion likewise has its preceding characters grouped together.

It's pretty remarkable that the output is simple and deterministically reversible through decoding. We can think of the inverse as follows: at any iteration of the decoding, we can treat the last column as the letter that precedes some prefix of a rotation. Since we have all possible rotations, at every step we can place each letter of C at the start of the corresponding string and re-sort. This gives a new list of reordered possible prefixes, all of which must exist because of the rotation property. Our decoding algorithm is just a restricted set of instructions that performs this.

Memory

In external memory, 2-3 trees are better than AVL trees: loading pages is expensive, and an AVL tree performs Θ(log n) page loads since each node it visits may sit on a different page

B-trees are a generalization of 2-3 trees. A B-tree of order M is an (a,b)-tree with a = ⌈M/2⌉ and b = M: every internal node has between a and b children (the root may have as few as 2), and all leaves are on the same level

Hashing in external memory may result in lots of disk transfers, since the data is often scattered. This gets worse with schemes like cuckoo hashing, where each lookup probes multiple independent locations