Skip to content

Latest commit

 

History

History
112 lines (96 loc) · 6.94 KB

CS348-Databases.md

File metadata and controls

112 lines (96 loc) · 6.94 KB
  • DBMS supports an underlying data model, and authorize users
    • offer durability and concurrency control
  • a schema instance is a collection of data
  • set notation can describe subsets of data, this is a query: ${ (x_1, x_2, \dots, x_k) | \phi }$, find some tuple given a logical equation $\phi$
  • answer to the query above over DB is the relation ${ (\theta(x_1), \dots, \theta(x_k)) \ | \ DB, \theta \models\phi }$
    • so the relation is a set of results that satisfy our logical conditions $\phi$ for tuples being mapped over our valuation $\theta$
  • example forgets about finite constraint, and models a three column table to compute addition
  • ${(x,y) | PLUS(x,y,5)}$ is a query for all pairs that sum to 5
  • apply predicate logic for more expressive queries

relational model organizes information in a finite number of relations

  • universe is a set of values D with equality
  • relation, canonically a table
    • has a predicate name R, and arity k of R (number of arguments, in this case number of columns)
    • relation instance is $R \subseteq D^k$
  • database has a signature defining a finite set $\rho = (R_1, \dots, R_n)$ of predicate names
    • instance is a relation Ri for each $R_i$, bold means relation for each predicate name
    • notation: $DB = (D, =, R_1, \dots, R_n)$, for example $DB = (Z, =, PLUS, TIMES)$
    • AUTHOR might be some $R_i$, the label, its relation instance Ri might have arity of 2, id and name, part of the universe of tuples $Z \times Names$
  • on truthiness, when a relationship between objects (tuples) is present, it is true, and false if relationship is absent
  • a valuation is a function $\theta$ that maps variable names to values in the universe D: $\theta : {x_1, x_2, \dots} \rightarrow D$
    • $\theta[x \mapsto v]$ defines x to map to v in our function
  • the truth of a formula $\phi$ (truth being existence of a relation) depends of a database instance and valuation

integrity constraints are laws, that must be true in every valid database instance

  • addition is commutative
    • is a binary function
    • is a total function
    • is monotone

example:

  • every boss manages a unique department: $\forall x_1,y_1, x_2, y_2 ,z.EMP(x_1,y_1,z) \land .EMP(x_2,y_2,z) \rightarrow y_1 = y_2$

builds on the relational calculus

  • SQL is a standard, has many implementations
  • has three parts: Data Manipulation, Data Definition, Data Control

First Order SQL

Summary:

  • captures all of relational calculus
  • polynomial runtime (PTIME)
  • shortcomings: no aggregation (counting) or recursion (path in graph)

Details:

  • simple $\exists, \land$ queries use SELECT, FROM, WHERE to declare
  • R AS p defines a tuple variable (correlation) p, with attribute names $a_1, a_2, \dots a_n$, similar to relational calculus where we would say $R(p.a_1, p.a_2, \dots, p.a_n)$
    • AS keyword is optional: SELECT r.publication FROM wrote r
    • this is even still equivalent to SELECT publication FROM wrote
  • for $\lnot, \lor$ we need set operators UNION, EXCEPT, INTERSECT
    • we can use this to connect queries together, but both queries must return the same signature (be union-compatible)
    • OR in the WHERE clause looks like UNION, but often doesn't cover the empty set case
    • of course with $\lnot, \exists$ we can form $\forall$ queries now
  • for nested queries (something like $(A \lor B) \land C$) can use the keyword WITH to name the results of a child query for use in another query
    • WITH can be omitted by inlining queries SELECT book.title FROM (SELECT * FROM books) book
  • WHERE subqueries are syntactic sugar that allow you to inline queries that look like predicates, with keywords IN, SOME, ALL, EXISTS, NOT
    • SELECT title FROM publication WHERE pubid in (SELECT pubid FROM article)
  • parametric subqueries allow queries to depend on (attributes from) the main query
    • semantically, subqueries let us say "All x in R s.t. part of x doesn't appear in S"

Example: List all authors who always publish with someone else

SELECT DISTINCT a1.name 
FROM author a1, a2 
WHERE NOT EXISTS (
  SELECT * FROM publications p, wrote w 
  WHERE p.pubid == w.publication
    AND a1.aid = w.author
    AND a2.aid NOT IN (
      SELECT author
      FROM wrote
      WHERE publication = p.pubid
        AND author <> a1.aid
    )
)

Modifying Database

More SQL syntax for actually modifying state based on first order logic

Aggregation

Extension of first order SQL

  • finding a number of tuples (counting), min/max, mapping values over a column

Data Modeling and Entity-Relationship Model

  • anomalies in schema design lead to violations of transactional consistency
  • functional dependencies let us reason about ways to perform schema decomposition, thereby avoiding anomalies
  • computeX+(X, F) is an algorithm for determining which columns can be determined from the key + functional dependency set pair (by continuously adding any attributes to the set of X+ that are determined by elements in X+ currently)
  • a decomposition is dependency preserving if we get some equivalent F' and none of the functional dependencies are interrelational (requires a join)
  • Normalization is the process of decomposing a schema (set of relation schemas), so that it is in some standard form
  • BCNF is the most desirable form, followed by 3NF
  • multivalued dependencies capture anomalies/redundancies not captured by functional dependencies (when X ->> Y, means a set of values for Y is determined instead of a single unique value)
  • a dependency basis is a way to partition a relation schema so that we can further decompose schemas with multivalued dependencies, since splitting right-hand sides of dependencies (as in minimal cover) doesn't work for multivalued (because it uniquely defines a large set, the cartesian product of all right-side attributes' values)
    • breaking down the example for Course-Teacher-Hour-Room-Student-Grade relation, "dependency basis for X = C is R-X = [Y1 = T, Y2 = HR, Y3 = SG] s.t. F |= C ->> Z iff Z - C is a union of some of the partitions of R-X"
  • 5NF is even stronger, using join dependencies, but has no axioms

Query Execution

We use relational algebra for implementing query execution

  • all relational calculus queries are representable in relational algebra
  • naive implementations of each operator are intuitive, but slow
  • one optimization is to use indexes
  • calculating the cost of different physical plans can be done using the simple cost model for I/O, using several parameters to estimate relative cost of operations, based on assumptions of uniformity and independence

Transaction Execution