This is the 8th day of my participation in the August More text Challenge. For details, see: August More Text Challenge
Physical design and physical structure of the database
For a given logical data model, the process of selecting a physical structure that best fits the requirements of the application is the physical design of the database.
The storage structure and access method of the database on the physical device are called the physical structure of the database, which depends on the implementation of the selected DBMS.
Relational database physical design content:
- Select an access method (establish an access path) for the relational schema.
- Select a physical storage structure for relational, index, log, backup, and other database files.
(index) access method
Indexes are used to access data in relational databases. In other words, an index represents a way to access data.
Access methods can be divided into index method and cluster method.
Clustering, or clustering, usually requires the creation of an index on top of this, that is, a clustered index or clustered index, but the index corresponds to the actual physical storage order of the data in the database.
From this we can see that designing the index, designing the way the data is accessed, is also designing the corresponding physical structure.
The index
Why create indexes?
Improve access efficiency – query, insert, delete, update efficiency.
This corresponds to how to select indexes, such as which attribute columns to index, which indexes need to be unique or combined, and choosing the appropriate indexing method.
A common RDBMS provides several common indexing methods: b-tree (B+ tree), hash (hash) R-tree, Bitmap, etc.
General rules for selecting index access methods
- If an attribute (or set of attributes) appears frequently in the query condition, consider creating an index (or combination index) on the attribute (or set of attributes);
- Consider indexing an attribute if it is often used as an argument to an aggregation function, such as a maximum and a minimum.
- Consider indexing an attribute (or set of attributes) if that attribute (or set of attributes) appears frequently in the join condition of the join operation.
How to Create indexes
CREATE [UNIQUE] INDEX name ON table name [USING INDEX method] (column name 1, column name 2, [,...]) );Copy the code
Such as:
CREATE UNIQUE INDEX studentname ON student
USINGHash (sname);Copy the code
This is just a generic pseudocode example. The creation, use, and management of indexes vary according to different databases. The SQL statements related to indexes are described in detail in the following section.
B+ tree index access
B+ tree index
- Multi-point balanced tree, high access efficiency
- Both random search, and can be sequential search
- Add, delete and modify operations to maintain balance
Concepts in B/B+ trees:
- Rank: Number of data blocks on a node.
The structure of a B+ number is as follows:
- B+ number random search 1: there are relevant records
- B+ number random search 2: no relevant record
- B+ tree scope lookup
B+ tree scope search: Firstly, search the leaf nodes according to the method of random search to find the entry point of the scope; A range lookup is then performed along the sequential chain.
Photo from renmin University of China MOOC “Introduction to Database System (Advanced)” courseware, with changes
Hash index access
A Hash is called a Hash or Hash. Is a structure in the form of key-value pairs.
First, a quick look at hash structures, or hash tables, or hash tables.
If the key and value are the same, it’s also called a Set; When the contents of Key and Value are different, it is called a Map, which is a set of Key and Value pairs.
-
The hash index
Hash index is implemented based on a hash table. For each row of data, the storage engine computes a hash code for its index column. Different index columns compute different hash codes. The hash index stores all the hash codes in the index and stores Pointers to each row in the hash table.
Because hash codes are small in size, they take up less space. At the same time, the hash index adopts the continuous storage mode, and the access speed will be faster.
For the same hash, a linked list is usually used to resolve the conflict. Because the structure of the index is very compact, hash index queries are fast.
There are many different hash functions that generate hash codes, so the probability of the same hash codes is different. The SHA256 hash function, for example, does not generate duplicate codes at this stage. It depends on the implementation of the RDBMS.
-
Access to a hash index
Therefore, the access to the hash index is also very simple, using the hash key, directly retrieve the row record (there may also be a hash key in the list processing).
-
Rules for selecting the Hash access method
If the properties of a relationship appear mainly in the equivalent join condition or mainly in the equivalent comparison selection condition, and one of the following two conditions is met:
- The size of the relationship is predictable and constant;
- The size of the relationship changes dynamically, but the selected database management system provides a dynamic Hash access method.
Hash indexes can also be used if the columns in the table are of the big data object type (text, VARCHar (Max)) and are frequently used as a query condition.
The extra overhead of the index
Whether or not indexes need to be created, and which indexes to select, impose various additional costs on the database. So combine your choices to make sure that the benefits of using indexes outweigh the disadvantages.
● Cost of maintaining an index ● cost of finding an index ● cost of storing an index
And index maintenance generated index fragmentation, etc.
clustering
In order to improve the query speed of a certain attribute (or attribute group), tuples with the same value on the attribute (attribute group) are stored in continuous physical blocks called clustering.
This attribute (or group of attributes) is called a cluster key. Clustering or aggregation represents the actual location or order in which the physical data is stored.
Most RDBMSS provide clustering capabilities.
How to build a cluster (index)
- Let’s create a cluster
CREATE CLUSTER <The cluster name> (<The cluster size>) SIZE (<The size of the>);
Copy the code
- Create an index on the cluster
CREATE INDEX <Index name> ON CLUSTER <The cluster name>;
Copy the code
Such as:
CREATE CLUSTER emp_dept_cluster (deptno number(6) ) SIZE 1024;
CREATE INDEX emp_dept_cluster_index ON CLUSTER emp_dept_cluster;
Copy the code
Advantages of clustering
- It can greatly improve the efficiency of querying by clustering attribute.
Take the name of the major department for example. Suppose you want to query all the students in the computer science department.
- The student data table is stored randomly. 500 students in the computer science department may be stored on 500 different physical blocks. Apart from the query process, it may take up to 500 I/O operations just to read the data.
- If the student tuples of the same department are clustered together according to the name of the major department, the number of accesses to the disk can be significantly reduced. 500 students in the department of Computer Science, assuming the cluster is stored on 50 different physical blocks, perform only 50 I/O operations. Ideally, if the cluster is clustered on a contiguous physical block, only one I/O will be performed.
-
When the SQL statement contains clauses or phrases related to clustering code such as ORDER BY, GROUP BY, UNION, or DISTINCT, using clustering indexes can eliminate or reduce sorting operations on the result set.
-
In multi-table join, clustering can store multiple relationships in advance in the form of effect similar to “pre-join”, improving the efficiency of join operation (essentially, improving the efficiency of query).
For example, the result of selecting (name, course, grade) from a student list and a class schedule can be grouped together using “student numbers” to increase efficiency.
SELECT sname, cname, grade from student, scourse where student.sno=scourse.sno;
Copy the code
The join operation is usually time-consuming.
- It is suitable for both independent clustering of a single relationship and combination clustering of multiple relationships.
Limitations of clustering
- At most one cluster index can be created on a base table.
Clustering is physically stored together, a basic table, can only be physically stored in one way.
-
Clustering can only improve performance for certain applications. SQL statements that use clustering keys will improve performance.
-
The overhead of setting up and maintaining clusters is considerable
- The clustering of an existing relationship will cause the physical storage location of tuples in the relationship to be moved, and the original index on the relationship will be invalid and must be rebuilt.
- When a tuple’s cluster code (value) changes, the tuple’s storage location changes accordingly. (For example, if the tuple is clustered by the family name, if the family name is changed and the family name is changed, the physical location of the tuple will also be changed, increasing the overhead)
Therefore, the application conditions of cluster index are also strict, or the following conditions need to be carefully considered:
- Base tables are rarely added or deleted.
- The variable-length columns are rarely modified.
Storage structure
The data storage arrangement and storage structure can be seen from the data classification, such as:
- Relationships (user data, system data below)
- The index
- Database buffer
- The log
- The backup
Aspects of a storage structure
Determine whether the storage is “memory/disk” or “row/column”, whether the storage is “centralized/scattered” or “sequential/random/clustered”, as well as the page size, block size, storage path, etc., all belong to the storage structure consideration.
Usually these aspects of the setting, all need to refer to the specific DBMS system parameter configuration and method
Basic principles of physical arrangement
The basic principles of physical arrangement are:
Based on the application and physical environment (disk or disk array capacity, memory size)
- Data in volatile and stable parts are stored separately
- The frequently accessed part is stored separately from the less frequently accessed part
- Store log files separately from database objects such as tables and indexes
- Separate large tables, such as partitions, to separate files or disks, especially in a multi-user environment.
To cope with the demand of massive data and many users, it is an effective way to improve and improve the system performance to divide the data into different databases and tables, and store the data on different disks or disk arrays.
Parameters about storage allocation
The parameters for storage allocation usually have the following default values:
- The number of users using the database simultaneously
- Number of database objects open at the same time
- Memory allocation parameter
- Buffer allocation parameters (length and number of buffers used)
- Storage allocation parameter
- Physical block size
- Physical block loading factor
- Database size
- The number of locks, etc
A personal understanding of clustering
By definition, the clustering of tuples of attributes (groups of attributes) with the same value into contiguous physical blocks is called clustering.
Do tuples of different values need to be considered? Clustering should mean that the same values are clustered together and stored together. Tuples of different values may be stored in another contiguous space.
How are tuples of different values handled for access? This will correspond to the cluster index, through the cluster index to maintain different cluster code values, to achieve the search.
reference
The main content of this section is based on the introduction to database Systems (Advanced), and some other information on the Internet.