Basic concepts and algebraic operations

The relational model

A data model is a collection of conceptual tools used to describe data, relationships between data, data semantics, and consistency constraints. Today, the primary data model used in commercial data processing is the relational model (that is, a database that is queried using SQL statements).

A relational database consists of tables. For example, a teacher table named instructor has id, name, dept_name, salary and other information. For example, {101,Tang,CS,20000} indicates that “The salary of Mr. Tang in the computer department is 20K with the number 101”. We can say that there is a relationship between a set of values in this row. A table is a collection of associations.

From a “mathematical” perspective, a table can be called a Relation, and a row of records can be called a Tuple. A relation is an unordered Set of tuples. Each column in the relationship and call it an Attribute (Attribute), every Attribute is a contains a collection of all possible values, such as gender for {‘ male ‘and’ getting ‘}, the average adult height (cm) for {x | 140.0 x 200.0} or less or less. This collection is called the Domain of this property. A domain is atomic if every tuple in it is nondivisible.

The definition of “non-separable” is a subjective concept. For example, a field containing mobile phone numbers {0451-44448888,010-55556666} is atomic if we consider the area code and the local code as a whole. If you choose to separate the two numbers,{{0451,44448888},{010,55556666}}, then the field is not atomic again.

model

When we talk about databases, a key word often comes up — Schema. The schema of the database is divided into three levels:

Logical patterns, often referred to simply as patterns. Refers to the logical structure of all data in the database, and the description of characteristics. There are two important features:

A database has only one schema, which can be understood as: the complex logical relationship of all the tables in a database is a schema.
When defining a schema, we should not only define the logical structure of data (such as what data items constitute a record, the name, type and value range of data items, etc.), but also define the security and integrity requirements related to data.

External Schema, also known as user Schema, subschema. It is a description of logical structure and characteristics of some data that can be seen and used by database users according to application requirements. To put it more generally, it’s the user view.

An Internal Schema, also known as a Storage Schema, describes the physical structure and Storage mode of data and represents data in a database (for example, whether records are stored sequentially, according to the B-tree structure or according to the hash method). How indexes are organized; Whether data is compressed and stored and encrypted; What is the structure of data storage record?

code

Each tuple must be guaranteed to be unique, so we must select one or more attributes from the tuple that can uniquely identify the tuple and combine them into superkeys. Any superset of a hypercode is a hypercode, but we usually only select the smallest supercode (which contains the least number of attributes) as a candidate key. In some cases, a table can select multiple candidate codes, and any one of them can be selected as the primary key, or primary key.

In a table, the principal codes of all tuples cannot be repeated. This rule is called the principal code constraint, such as id in the instructor or DEPt_name in the department table. In contrast, there is a foreign code constraint: for example, the dept_name attribute in the faculty instructor must be taken from the dept_name field of another department, and dept_name happens to be the main code of the department. Thus dept_name is the foreign code of the instructor reference department. Instructor and dept_name are referential relations, while department and dept_name are referential relations.

SQL

Structured Query Language (SQL), originally known as Sequel. Today, SQL has clearly established itself as the standard database language. Each product supports the SQL language (though there may be minor differences).

In terms of categories, SQL can be broadly divided into two types:

Data-definition Language (DDL) : used to organize and define database schemas. For example, a table is created and constraints can be set on it if necessary.

Create table department(dept_name varchar(20),building char(15),budget numeric(12,2), primary key (dept_name));Copy the code

After defining the attributes and data types of a table, we add additional constraints, such as primary key primary or foreign key foreign. References mentioned in the section “code” can be used with the Foreign key(dept_name) References Department.

You can also add a constraint to the declaration of an attribute: a non-null constraint. For example, declaring building char(15) not null indicates that the building attribute cannot be null.

We use DDL to either create or delete a table in a database. The DDL creates some additional output, which is stored in the Data Dictionary. A data dictionary describes information about the database itself, which is also called meta-information.

Data-manipulation Language (DML) : a Language used to access or manipulate Data in a table. CRUD operations are in this range.

SELECT * FROM `instructor`
Copy the code

Database engine

From the perspective of “how data should be stored and how it should be operated”, database modules can be roughly divided into two layers: storage manager and query processor.

Storage managers are divided into:

Permission and integrity manager, used to detect integrity constraints, user permissions.
Transaction manager, to ensure that the system in the event of failure, the database can also maintain a consistent state; And is responsible for ensuring that concurrent transactions do not conflict with each other.
File manager: Manages disk space allocation.
Buffer manager: Is responsible for scheduling data from disk into memory and deciding what to do with that data.

The storage manager also stores the following data structures:

Data files, which store the data of the database itself.
Data dictionary, which stores information describing the database itself.
Index, which provides fast access to data.

According to DDL and DML, query processors can be divided into:

DDL interpreter: Interprets DDL statements and outputs the corresponding meta-information to the data dictionary.
DML compiler: Translates high-level DML into low-level instructions for the query execution engine to execute.
The query execution engine is really responsible for executing the low-level query instructions derived from the DML transformation.

Important relational algebra with SQL DML

All kinds of queries to the database can be abstracted into algebraic calculations. The SQL statements listed by the author are tested in the MySQL database. Note that the SQL standard and MySQL implementations are not always the same.

Project

Unary operation, used to filter out specific attributes. In the instructor table, we just want to screen id,name and salary, which can be denoted as: π ID,name,salary (instructor). A projection operation corresponds to a Select in an SQL statement.

In MySQL, some fields are used as database keywords. To avoid confusion, we'll put backquotes around these fields.
MySQL does not limit the case of keywords and generally keeps them all uppercase.
SELECT ID.`name`,salary FROM `instructor` 
Copy the code

Additionally, if you want to filter out duplicate results, you can add a distinct keyword in the SELECT clause.

Select

Unary operation, used to screen out tuples that meet the conditions. For example, we want to screen the name of the person whose salary is greater than 50K in the instructor table, and the symbol is:

Π name (sigma salary > 50000 (instructor))

However, choices in relational algebra correspond to WHERE in SQL statements:

SELECT `name` FROM `instructor`
WHERE salary > 50000
Copy the code

Join

There are quite a few subcategories of join operations: cross join (Cartesian product), inner join, outer join.

Cartesian product: cross join. Let’s call r1 times R2 a Cartesian product. Assuming r1 is a relationship with three attributes, two tuples, and R2 is a relationship with four attributes, five tuples, the result of the Cartesian product will be a relationship with seven attributes, ten tuples.

Cartesian products correspond to cross Joins in SQL statements. But in MySQL implementations, Cartesian products can also be expressed as joins, inner joins, and pure commas.

-- Take the Cartesian product of 'instrctor' and department
SELECT * FROM `instructor` CROSS JOIN department

This is a shorter and more familiar way of writing.
SELECT * FROM `instructor`,department

This would be equivalent if cartesian products were unconstrained.
Inner join, cross join, and join have the same semantics if there is no on condition.
SELECT * FROM `instructor` JOIN department

From a pure SQL definition, an INNER JOIN is a cross JOIN and a natural JOIN.
SELECT * FROM `instructor` INNER JOIN department
Copy the code

A cross join has no constraint on the two tables, and any two tables can be cross-joined. In general, however, arbitrary cross-joins don’t make sense — most of the time, two joined tables should have common attributes and concatenate two tuples that have the same common attribute values. Such as:

SELECT * FROM `instructor` CROSS JOIN department
WHERE `instructor`.dept_name = department.dept_name
Copy the code

This semantics is more like a natural join, which is part of an equivalent join (which, conversely, is not always a natural join). In particular, in the SQL standard, inner join and cross Join have different semantics. Inner Join needs to be used with ON to achieve a more general equivalent join (that is, the attributes of the join are not necessarily the common attributes of the two tables and are specified by the user). In the MySQL standard, cross join and inner join are “blurred”. Inner join and cross join in MySQL – zhihu (zhihu.com)

The following SQL statement is equivalent to the above conditional cross join:

The common attribute of both tables is depT_name. The natural join automatically joins by this attribute and removes duplicate columns.
SELECT * FROM `instructor` NATURAL JOIN department
Copy the code

In addition, a cross join preserves all attributes of the two tables, so the joined relationship has duplicate common attributes, while a natural join automatically removes those superfluous attributes. Occasionally, if you want to specify which attribute to naturally join by, you can use the keyword join… The using. For example: ` `

-- Using does not allow Spaces and is followed by parentheses to specify one or more attributes of the connection.
SELECT * FROM `instructor` JOIN department USING(dept_name)
Copy the code

The join conditional keyword is on rather than WHERE, although we mostly choose the latter for equivalence implementation. In terms of results alone, there is no problem either way:

SELECT * FROM `instructor` CROSS JOIN department on `instructor`.dept_name = department.dept_name
Copy the code

However, from the perspective of the query process, the execution order of the two SQL statements is not exactly the same. The ON condition precedes the execution of the WHERE condition. The ON condition is executed when the temporary table is generated, and where is filtered after the temporary table is generated. Therefore, in the outer connection mentioned later, the performance of the two will show subtle differences.

Join R1 join R2 on(…) In, tuples that cannot be joined correctly are discarded. In the outer join, however, we can optionally keep tuples that should have been discarded in R1 or R2 (or both), and the left half (or left half) left empty due to failure to connect will be filled with null values.

If you choose to keepr1The missing tuple in, then calledLeft connection, the key wordleft join.
If you choose to keepr2The missing tuple in, then calledThe right connection, the key wordright join.
The union of left join and right join is calledAll connection, the key wordfull join.

In contrast, previous connections that lose tuples are called inner joins. Conceptually, inner join involves equivalent join, and special natural join. However, in MySQL implementations, inner joins without on conditions are equivalent to cross joins. Left join and right join are symmetric. For example, R1 left join R2 is equivalent to R2 right join R1.

There are two other special types of connections: semi-connections and anti-connections. Take R1 and R2 as an example to make a semi-connection, let the attributes of the two tables used for semi-(reverse) connection be K1 and K2 (K1 and K2 can be the same attributes, and the general situation is considered here), and the domain of the k2 attribute corresponding to the R2 table is D. Let t be a tuple of R1 and be selected as long as t.k1 can fall into field D (in other words, as long as t.k1 can concatenate at least one value of D). So r3 is going to be a subset of R1. If t.k1 is not selected in field D, then this is an anti-join operation. These two connections correspond to the EXISTS and not EXISTS keywords in the SQL statement.

-- R1 half-connects r2
SELECT r1.k1 FROM r1
WHERE EXISTS {
	SELECT k2 FROM r2
	WHERE r1.k1 = r2.k2
}

-- R1 connects r2 in reverse
SELECT r1.k1 FROM r1
WHERE NOT EXISTS {
	SELECT k2 FROM r2
	WHERE r1.k1 = r2.k2
}

-- another form of anticonjunction
SELECT r1.k1 FROM r1
WHERE EXISTS {
	SELECT k2 FROM r2
	WHERE r1.k1 <> r2.k2
}
Copy the code

Therefore, exists remains semantically a category of connection rather than a simple conditional judgment. It is fundamentally different from between, any, and all.

Set operations

Union: Query the set of all courses offered in the spring semester of 2017. We call it:

Condition 1: π Course_id (σsemester= “fall” ∧ year = 2017 (section)).

Query the set of all courses starting spring semester 2018, we’ll call it:

Condition 2: π course_id(σsemester= “spring” ∧ year = 2018 (section)).

To query both sets together, we call it:

π Course_id (σsemester= “fall” ∧ year = 2017 (section)) ∪ π Course_id (σsemester= “spring” ∧ year = 2018 (section))

This corresponds to the SQL statement:

-- r1
SELECT course_id FROM section
WHERE semester = "fall" and `year` = 2017
union   - r1 ∪ r2
-- r2
SELECT course_id FROM section
WHERE semester = "spring" and `year` = 2018
Copy the code

However, there is a more concise way to write this SQL statement:

SELECT course_id FROM section
WHERE (semester = "fall" and `year`) or ( semester = "spring" and `year` = 2018)
Copy the code

In other words, the following relational algebra is equivalent to the above:

Roach course_id(σ(semester = “fall” ∧ year = 2017) ∨ (semester = “Spring” ∧ year = 2018)(section))

Intersect: To query for courses that satisfy both condition 1 and condition 2, relational algebra is denoted as:

Course_id (σsemester= “fall” ∧ year = 2017 (section)) ∩ π Course_id (σsemester= “spring” ∧ year = 2018 (section))

Obviously, from the previous code block, we can easily infer the corresponding SQL statement:

Intersect is not a Mysql keyword.
-- r1
SELECT course_id FROM section
WHERE semester = "fall" and `year` = 2017
intersect   - r1 studying r2
-- r2
SELECT course_id FROM section
WHERE semester = "spring" and `year` = 2018
Copy the code

However, the MySQL database does not support a keyword like INTERSECT. We might “reason” into the SQL statement…… from the above union operation

SELECT course_id FROM section
WHERE (semester = "spring" and `year` = 2018 ) and (semester = "fall" and `year` = 2017)
Copy the code

It’s kind of weird. In order to find the problem more clearly, here, according to the associative law of logical operations, make a little equivalent substitution for the WHERE condition:

SELECT course_id FROM section
WHERE (semester = "spring" and semester = "fall") and (`year` = 2017 and `year` = 2018 )
Copy the code

Select * from class where the academic year is 2017 and 2018 and the academic term is spring and fall. For each attribute, it should be the only exact value, just as the author’s name is either “Joe” or “Joe”, as in the beginning of the semester. Obviously there is a semantic problem here, so this SQL query can’t find anything. What about statements like this?

SELECT course_id FROM section
WHERE (semester = "spring" or semester = "fall") and (`year` = 2017 or `year` = 2018 )
Copy the code

Not true either. According to the distributive law of logical operations, it is equivalent to:

SELECT course_id FROM section
WHERE (semester = "fall" and `year` = 2017) or (semester = "spring" and `year` = 2018) 
  and (semester = "spring" and `year` = 2017) or (semester = "fall" and `year` = 2017)
Copy the code

Now we need to switch thinking: let’s look for a temporary relation table1 that satisfies one of these conditions, and then find a tuple from that temporary relation table1 that satisfies the other condition. There are two implementations: exists or in, both of which involve nested subqueries.

-- In version, easy to understand
SELECT course_id FROM section
WHERE semester = "fall" and `year` = 2017 and course_id in (
    SELECT course_id FROM section
    WHERE semester = "spring" and `year` = 2018
)

-- exists version 
SELECT course_id FROM section as t1
WHERE (semester = "fall" and `year` = 2017) and EXISTS (
    SELECT course_id FROM section as t2
    WHERE semester = "spring" and `year` = 2018
    and t1.course_id = t2.course_id 
)
Copy the code

Both methods yield the desired result, but the IN clause is better for traversing small tables. Otherwise, use the exists clause. The difference between IN and EXISTS IN MYSQL

Except: Its algebraic operation.

π Course_id (σsemester= “fall” ∧ year = 2017 (section)) – π course_id(σsemester= “Spring” ∧ year = 2018 (section)).

Its SQL statement should look like this:

Intersect is not a Mysql keyword.
-- r1
SELECT course_id FROM section
WHERE semester = "fall" and `year` = 2017
except   -- r1 - r2
-- r2
SELECT course_id FROM section
WHERE semester = "spring" and `year` = 2018
Copy the code

MySQL also does not have the except keyword. However, given the previous example, it is not difficult to implement, even with a small change — adding a not to exists and in before the previous code block. However, it is really worth noting that the difference set operation does not satisfy the commutative law, and R1 – R2 is not equivalent to R2 – R1. Therefore, the results of the following two queries are different (the same applies to an EXISTS query) :

-- (R1-R2): courses will start in 2017 but not in 2018.
SELECT course_id FROM section
WHERE semester = "fall" and `year` = 2017 and course_id not in (
    SELECT course_id FROM section
    WHERE semester = "spring" and `year` = 2018
)

-- (R2-R1): starting in 2018 but not in 2017.
SELECT course_id FROM section
WHERE semester = "spring" and `year` = 2018 and course_id not in (
    SELECT course_id FROM section
    WHERE semester = "fall" and `year` = 2017
)
Copy the code