CS348: Databases

Back to UWaterloo

Introduction
Relational Model
SQL
Data Modeling and Entity-Relationship Model
Normalization Theory
Query Execution
Transaction Execution

Introduction

DBMS supports an underlying data model, and authorize users
- offer durability and concurrency control
a schema instance is a collection of data

Relational Model

set notation can describe subsets of data, this is a query: \(\{ (x_1, x_2, \dots, x_k) | \phi \}\), find some tuple given a logical equation \(\phi\)
answer to the query above over DB is the relation \(\{ (\theta(x_1), \dots, \theta(x_k)) \ | \ DB, \theta \models\phi \}\)
- so the relation is a set of results that satisfy our logical conditions \(\phi\) for tuples being mapped over our valuation \(\theta\)
example forgets about finite constraint, and models a three column table to compute addition
\(\{(x,y) | PLUS(x,y,5)\}\) is a query for all pairs that sum to 5
apply predicate logic for more expressive queries

relational model organizes information in a finite number of relations

universe is a set of values D with equality
relation, canonically a table
- has a predicate name R, and arity k of R (number of arguments, in this case number of columns)
- relation instance is \(R \subseteq D^k\)
database has a signature defining a finite set \(\rho = (R_1, \dots, R_n)\) of predicate names
- instance is a relation Ri for each \(R_i\), bold means relation for each predicate name
- notation: \(DB = (D, =, R_1, \dots, R_n)\), for example \(DB = (Z, =, PLUS, TIMES)\)
- AUTHOR might be some \(R_i\), the label, its relation instance Ri might have arity of 2, id and name, part of the universe of tuples \(Z \times Names\)
on truthiness, when a relationship between objects (tuples) is present, it is true, and false if relationship is absent
a valuation is a function \(\theta\) that maps variable names to values in the universe D: \(\theta : \{x_1, x_2, \dots\} \rightarrow D\)
- \(\theta[x \mapsto v]\) defines x to map to v in our function
the truth of a formula \(\phi\) (truth being existence of a relation) depends of a database instance and valuation

integrity constraints are laws, that must be true in every valid database instance

addition is commutative
- is a binary function
- is a total function
- is monotone

example:

every boss manages a unique department: \(\forall x_1,y_1, x_2, y_2 ,z.EMP(x_1,y_1,z) \land .EMP(x_2,y_2,z) \rightarrow y_1 = y_2\)

SQL

builds on the relational calculus

SQL is a standard, has many implementations
has three parts: Data Manipulation, Data Definition, Data Control

First Order SQL

Summary:

captures all of relational calculus
polynomial runtime (PTIME)
shortcomings: no aggregation (counting) or recursion (path in graph)

Details:

simple \(\exists, \land\) queries use SELECT, FROM, WHERE to declare
R AS p defines a tuple variable (correlation) p, with attribute names \(a_1, a_2, \dots a_n\), similar to relational calculus where we would say \(R(p.a_1, p.a_2, \dots, p.a_n)\)
- AS keyword is optional: SELECT r.publication FROM wrote r
- this is even still equivalent to SELECT publication FROM wrote
for \(\lnot, \lor\) we need set operators UNION, EXCEPT, INTERSECT
- we can use this to connect queries together, but both queries must return the same signature (be union-compatible)
- OR in the WHERE clause looks like UNION, but often doesn't cover the empty set case
- of course with \(\lnot, \exists\) we can form \(\forall\) queries now
for nested queries (something like \((A \lor B) \land C\)) can use the keyword WITH to name the results of a child query for use in another query
- WITH can be omitted by inlining queries SELECT book.title FROM (SELECT * FROM books) book
WHERE subqueries are syntactic sugar that allow you to inline queries that look like predicates, with keywords IN, SOME, ALL, EXISTS, NOT
- SELECT title FROM publication WHERE pubid in (SELECT pubid FROM article)
parametric subqueries allow queries to depend on (attributes from) the main query
- semantically, subqueries let us say "All x in R s.t. part of x doesn't appear in S"

Example: List all authors who always publish with someone else

SELECT DISTINCT a1.name 
FROM author a1, a2 
WHERE NOT EXISTS (
  SELECT * FROM publications p, wrote w 
  WHERE p.pubid == w.publication
    AND a1.aid = w.author
    AND a2.aid NOT IN (
      SELECT author
      FROM wrote
      WHERE publication = p.pubid
        AND author <> a1.aid
    )
)

Modifying Database

More SQL syntax for actually modifying state based on first order logic

Aggregation

Extension of first order SQL

finding a number of tuples (counting), min/max, mapping values over a column

Data Modeling and Entity-Relationship Model

Normalization Theory

anomalies in schema design lead to violations of transactional consistency
functional dependencies let us reason about ways to perform schema decomposition, thereby avoiding anomalies
computeX+(X, F) is an algorithm for determining which columns can be determined from the key + functional dependency set pair (by continuously adding any attributes to the set of X+ that are determined by elements in X+ currently)
a decomposition is dependency preserving if we get some equivalent F' and none of the functional dependencies are interrelational (requires a join)
Normalization is the process of decomposing a schema (set of relation schemas), so that it is in some standard form
BCNF is the most desirable form, followed by 3NF
multivalued dependencies capture anomalies/redundancies not captured by functional dependencies (when X ->> Y, means a set of values for Y is determined instead of a single unique value)
a dependency basis is a way to partition a relation schema so that we can further decompose schemas with multivalued dependencies, since splitting right-hand sides of dependencies (as in minimal cover) doesn't work for multivalued (because it uniquely defines a large set, the cartesian product of all right-side attributes' values)
- breaking down the example for Course-Teacher-Hour-Room-Student-Grade relation, "dependency basis for X = C is R-X = [Y1 = T, Y2 = HR, Y3 = SG] s.t. F |= C ->> Z iff Z - C is a union of some of the partitions of R-X"
5NF is even stronger, using join dependencies, but has no axioms

Query Execution

We use relational algebra for implementing query execution

all relational calculus queries are representable in relational algebra
naive implementations of each operator are intuitive, but slow
one optimization is to use indexes
calculating the cost of different physical plans can be done using the simple cost model for I/O, using several parameters to estimate relative cost of operations, based on assumptions of uniformity and independence