The article brief introduction

This paper illustrated the method describes the stages and the important concepts of relational database design, and focus on two core concept design and logical design phases, emphasized the steps in the construction of the e-r model in addition to adding the relationship between the standardization of the relational schema and how to solve the model of candidate code and other important points.


1 Overview of database design

1.1 What is Database Design?

Database design refers to a given application environment, the construction of the optimal database mode, the establishment of the database and its application system, so that it can effectively store data, to meet the application requirements of various users (information requirements and processing requirements).

Database is the core and foundation of information system, it organizes a large amount of data in the information system according to a certain model, and provides the function of storing, maintaining and retrieving data, so that the information system can obtain the required information conveniently, timely and accurately from the database. That is, the design goal of the database is to provide users and various application systems with an information infrastructure and high efficiency operating environment, where the high efficiency represents less redundancy, easy maintenance and easy to use.

Database design can be roughly divided into the following stages:

Requirement analysis → conceptual design → logical design → physical design → database implementation → operation and maintenance stage.

In a word, database design is a process of continuous iteration and refinement. In this process, the most important stage is conceptual design and logical design, and this paper focuses on these two stages.


1.2 Several concepts that must be clarified

This paper mainly discusses the analysis of the database design for the relational database, in the formal start to understand the database design stage to clarify several important concepts of the relational database.

Relationship: The data structure of a relational database is a two-dimensional table, popularly speaking, the name of the two-dimensional table is the name of the relationship.

Attributes: The columns in a two-dimensional table are called attributes (fields), and each attribute has an attribute name.

Range: The range of values of attributes in a two-dimensional table is called range, and each attribute has a range.

Relational schema: The structure of a two-dimensional table is called a relational schema. Let the relation be named R and its attributes be A1, A2… , An, the relational mode can be expressed as: R(A1,A2… ,An), a specific example: employee (employee number, name, gender, department).

Candidate code: AN attribute set is a candidate code if its value uniquely identifies a tuple of relationships without redundant attributes. There can be multiple candidate codes on a relationship.

Primary property: property that contains any candidate code.

Non-primary attribute: an attribute not contained in any candidate code.

Primary key: Sometimes a relationship has multiple candidate codes, and one of them can be selected as the primary key. Each relationship has one and only one primary key.

Foreign key: K in relational schema R is called a foreign key in relational schema R if the attribute K in relational schema R is the primary key of another relational schema.

2 Demand Analysis

Demand analysis stage is to analyze the needs and requirements of users, it is the starting point of the design of the database, the results of demand analysis can accurately reflect the actual requirements of users, will directly affect the subsequent stages of the design, and affect the design results are reasonable and practical.

The main task of demand analysis is to fully understand the various needs of users and finally determine the function of the system through detailed investigation of the objects to be dealt with in the real world, and must fully consider the possible expansion and change of the system in the future, not only according to the current application requirements to design the database.

The final product of the requirement analysis stage is to have a clear system requirement analysis report, which generally includes data flow diagram, function module diagram, data dictionary, etc., and it is an important basis for guiding the activities of the subsequent stage of database design.

Because this paper mainly introduces the database e-R model design, so the related content of demand analysis will not be described.

3 Conceptual Design

3.1 Transition from the real world to the machine world

Conceptual design is designed to form an independent DBMS conceptual data model, used to express the relationship between data and data, it directly faces the real world, so it is easy to be understood by users, convenient for database designers and users to communicate. First stage of the design and the user data structures related to the specific application – user view, and then continue to integrate the view change, in the end should have a correct and complete to the unit could satisfy the requirement of all kinds of processing data and contact and data model, and then to convert the conceptual model to the specific machine DBMS support data model.

Features of conceptual structure design:

  • Can truly and fully reflect the real world;
  • Easy to understand;
  • Easy to change;
  • Easy to convert to relational, mesh, hierarchical, and other data models.

The main description tool of the conceptual design stage is the entity-Relationship Model (E-R).


3.2 the e-r model

It can be seen from the above that e-R model is the main description tool in the conceptual design stage, playing a transitional role of “connecting the preceding and the following”, and its importance is self-evident. Therefore, let’s first understand several important concepts in e-R model.

3.2.1 Entities and attributes

Entities are the basic objects of the E-R model and are abstractions of various things in the real world. It can be physical things, such as people, cars, etc. They can also be abstract concepts, such as schools, departments, etc.

Attributes are indivisible data units used to describe the characteristics of the entity. For example, the teacher entity has the attributes of name, gender, address and so on.

The set of attributes that uniquely represent an entity is called a code.

In the E-R model, entities are generally represented as rectangles and attributes as ellipses. If the attribute is a primary code (primary key), it is underlined under the attribute name. As shown in the figure below.

This picture describes the physical object of a teacher and its attributes such as name and gender, which is an intuitive description of information in the real world.

Some attributes can also be divided into independent sub-attributes, which are called compound attributes. For example, a person’s name can be divided into surname and given name. Address attributes can be divided into sub-attributes such as zip code, province name, city name, district name, and street, which in turn can be divided into street name and house number in the following hierarchy:

Compound attributes serve two purposes:

  • Complex information structures that accurately simulate the real world;
  • When the user needs to use the compound attribute as a whole and needs to use each sub-attribute separately, the compound structure of the attribute becomes very important.

For example, the last name in the compound property “first name” may be sufficient to address the other person in an email.

Attributes can also be divided into single-value attributes and multi-value attributes according to the number of values. Single-valued attribute means that the attribute can only take one value for the same entity. Most of the attributes belong to single-valued attribute, for example, the same person can only have one age and one gender. But in some cases, the entity attribute may take multiple values, this is called a multi-valued attribute, for example, is the person’s contact information for more than a value, some people have a telephone, some people have 2 or 3, and so on, or can be divided into mobile phone contact, fixed telephone contact and email contact information, etc.

Note: The description of the multi-valued attribute is different from that of the single-valued attribute, which is represented by a double-line ellipse, and in the actual development process, if the multi-valued attribute occurs, it is generally classified as an entity or association.

There may be some connection between entity attributes. For example, there is an interdependent relationship between the age attribute and the date of birth. According to the date of birth, the age of a person can be deduced, which is called the derived attribute or derived attribute. Exported properties can be exported not only from other properties, but also from related entities. For example, the value of the number of employees attribute of a company entity can be obtained by adding up all employees of the company.

There is also a type of attribute called optional attribute, that is, not all attributes must have a value, some attributes can have no value, this is the optional attribute, after the ellipse text is represented by “(O)”.

3.2.2 Entity types and keys

The same entity class class entity or entity set, the characteristics of the entity class with entity model, entity model is essentially a collection of entities with the same properties, by an entity type name and a set of attributes to define, public structure is used to describe a group of entities, the entity in the collection any entity called an instance of the entity type.

For example, a company has thousands of employees, each employee need to be stored in the database of information, and the information of the employees are similar, such as name, age and so on are the employees of the same attributes, but for different employees to its specific attribute values are different, at this time we will be able to put these similar entity is an entity type, abstract, such as: Employee (Employee number, name, department, gender, age, title).

Therefore, the concept of entity type is also the relationship pattern we mentioned above, but changed the name of entity type is also the relationship pattern name.

In order to distinguish different entities in the entity type, the concept of “key” is introduced, which requires that the value of “key” must be different for different entities. For example, different employees must have a different “employee number” as a distinction. An entity type can have multiple keys, such as the name and date of birth of the person attribute being one key, and the ID number attribute being another.

Each simple attribute of entity type has a range of possible values, called the range of values. For example, the age range of “person” can be an integer range from 1 to 150.

3.2.3 Linkages between entities

(1) Type of contact

In the real world, there are always relationships within or between things, and relationships reflect relationships within or between entities.

The degree of a connection refers to the number of entities involved in a connection. For example: single entity contact, two entity contact and multiple entity contact.

It is more common for two entities to be connected. The following connections may exist between two entities:

  • One-to-one contact (1:1), for example, the contact between the department and the person in charge, a department has a person in charge, a person in charge of a department;
  • One-to-many connection (1: n), for example, the connection between departments and employees. A department has multiple employees, and each employee belongs to only one department.
  • Many-to-many connections (M: N), such as the connection between a project and an employee, can require more than one employee to participate in a project, and an employee can participate in multiple projects.

Connections are usually described in diamond shapes. The above connections can be represented in the following e-R diagram.

Single entity contact can also be divided into one-to-one, one-to-many, and many-to-many contact. For example, the “leader” relationship between employees is one-to-many contact, and the spouse relationship between employees is one-to-one contact.

In general, one-to-one, one-to-many, and many-to-many relationships also exist between more than two entities. For example, there are three entities in the student course selection system: students, teachers and courses, and the connection between them is as follows:

It means that a teacher can teach more than one course, and a course can be taught by more than one teacher. A student can choose more than one course, a course can be chosen by more than one student, students in the selection of courses at the same time choose teachers.

(2) The existence of connection

In addition to the type of connection above, consider the existence of an entity in a connection. After being converted to logical mode, existence is expressed as whether an attribute can be NULL, which is an undefined value and is represented by NULL in the DBMS.

In general, the connection has the following existence:

  • Mandatory existence: Mark a “1” on the line to indicate that the minimum cardinality is 1. An entity at one end of the connection is said to be mandatory if its instance must exist for instances of other entities of the connection.
  • Optional: 0 is marked on the line to indicate that the minimum cardinality is 0. An instance of an entity at one end of the connection is said to be optional if it is not required to exist for instances of other entities at the other end of the connection.
  • Unknown: If no “1” or “0” is marked on the connection, it is not known whether it is mandatory or optional.

For example, the above e-R diagram of employee participation in the project can complement and improve the existence of its connection:

It means that at least one employee must participate in a project (mandatory), and an employee may not participate in any project (optional).

3.2.4 Advanced E-R construction

Advanced E-R construction refers to the extension of e-R model, referred to as EER model (EXTENd-ER), EER model includes all concepts of E-R model, in addition, it also includes generalization level, aggregation level and weak entity concepts.

(1) Generalization level

Generalization levels involve concepts such as subclasses, superclasses, generalizations, and specializations, so let’s look at them first.

Subclasses and superclasses

In many applications, an entity-type entity needs to be further divided into multiple subsets and explicitly represented. For example, the member entities of entity-type teachers can be divided into four entity sets: professor, associate professor, lecturer and teaching assistant. These sets are all subsets of the entity set of teachers (which can be defined as subreal type), and they are called subclasses of entity-type teachers, while entity-type teachers are called superclasses of these subreal type.

The relationship between A subclass and A superclass IS called an IS-A relationship. The members of A subclass must be members of the superclass or they cannot appear in the database, but some members of the superclass may not belong to any subclass.

This concept is very similar to the concept of subclasses and superclasses in Java inheritance.

Generalization (induction)

Generalization, also known as induction, generalizes some common attributes in several entities and elevates them into a superclass of a higher level, while the original entity class becomes a subclass, which inherits the attributes of the superclass in addition to its own attributes.

Generalization level can be divided into mutually exclusive generalization level and overlapping generalization level. The mutual exclusion generalization level refers to that the subclasses are mutually exclusive, that is, an instance cannot appear in more than two subclasses. The mutual exclusion generalization level writes the letter D (Disjoint) in the circle of the generalization level diagram; Overlapping generalization Levels Write the letter O (Overlapping) in the circles in the generalization level map.

For example, the entity plane, train and automobile can be generalized to superclass transport vehicles. The double line in the figure indicates that each entity of the entity type of transport vehicle must belong to a subclass, that is, the subclass completely contains the superclass, and every instance in the superclass has the instance in the subclass. This situation is called completeness constraint.

Specialization (deduction)

Specialization is the reverse of generalization, also known as deduction, which deduces subclasses from superclasses. For example, entity teachers can deduce and form subclasses of professor, associate professor, teacher and assistant professor. This process classifies the entity of teachers according to their professional titles, and it can also use different classification rules to perform various deductions. For example, the entity of the teacher can be deduced according to the teacher’s specialty to obtain liberal arts teachers, science teachers and foreign language teachers, etc.

(2) Collection level

Aggregation is another important means of information abstraction, describing the connection between whole and parts, that is, between entities “… Is that… Is part of… By…… Composed of.

There is no inheritance in the pooling hierarchy, so related entities are not referred to as superclasses or subclasses, but as parent and component entities.

The concept of aggregation is similar to aggregation relationships in object-oriented concepts. For example, a classroom consists of rooms, doors and Windows, computers, projectors and so on, and there is no inheritance between them.

(3) Weak entities

In the real world, there are often some entity types that do not have their own keys (that is, all attributes are not enough to form a primary key). Such entity types cannot exist independently and must depend on a strong entity, so they are called weak entity types. Weak entities are represented by double frames in an E-R diagram.

The attribute values of different entities of weak entity type may be identical and difficult to distinguish, so it needs to be associated with the general entity type, the purpose is to distinguish different weak entities.

For example, in the personnel management system, the information of the employee’s family is based on the existence of the employee, the family entity is a weak entity, and the contact between the children and the employee is a dependent connection. For example, parents are a weak entity, because only the existence of the student entity, the parent entity will exist.

The figure above shows that the dependant entity is a weak entity and cannot exist alone and must rely on the employee entity. The dependency information marked here is existence dependency (E) and identification dependency (ID) :

Existence dependence: if the existence of an entity X depends on the existence of another entity Y, it is said that the existence of X depends on Y. This is a special connection, denoted by E and directed by an arrow, where X is called the weak entity, represented by a double box, and Y is called the parent entity of X.

Identity dependencies: An entity identity is said to be dependent on other entities if it cannot be uniquely identified by its own attributes, that is, it does not have its own primary key and is identified only by its association with other entities. This is also a special association, represented by an ID and indicated by an arrow.


3.3 Conceptual design methods and steps

After understanding the relevant concepts of e-R model, we need to understand several commonly used design methods and strategies in conceptual design, and the overall strategies and methods can be summarized into four kinds.

(1) Top-down: first define the framework of the global conceptual structure, and then gradually refine it.

(2) Bottom-up: first define the conceptual structure of each local application, and then integrate them, and finally get the global conceptual structure.

(3) Gradual expansion: first define the most important core concept structure, and then expand outward, gradually generate other concept structures in a snowball way, until the overall concept structure.

(4) Mixed strategy: the application is divided into different relatively independent functions, and the corresponding local E-R model is designed for each function. Finally, the global E-R model is formed by eliminating redundancy and inconsistency through induction and merger.

The most commonly used strategy is the bottom-up approach, in which top-down requirements analysis is followed by bottom-up conceptual design, as shown in the figure below.

As shown in the figure above, the conceptual pattern is typically designed from a single small module in the requirements, and then these local views are integrated into a global conceptual pattern that meets requirements analysis requirements.

Bottom-up conceptual structure design steps:

  1. Abstract data and design local views;
  2. Integrate local view to get global concept structure.


3.4 Local E-R model design

According to the above conceptual design strategy, under normal circumstances, we will first build the local E-R model according to the requirements module, and then integrate it into the global E-R model.

In the design of local E-R model, the scope of description of local E-R diagram should be determined firstly, and the principle of independence and scale moderation should be followed in general.

The principle of independence refers to the independence and integrity of the application functions in one scope, and the least connection with other applications in the other scope. The principle of moderate scale means that the scale of local E-R map should be moderate, generally about 6 entities should be appropriate.

The steps of establishing e-R model are as follows: distinguish entity and attribute → find out pooling level → find out generalization level → find out weak entity → define relation.

(1) Distinguish between entities and attributes

The entity must have a description. If an object has more than one description, consider it as an entity. But if an object has only one description, consider it as an attribute.

Some non-identifying attribute is said to be many-valued if it has multiple values corresponding to the entity, that is, multiple values of the attribute correspond to one value of the identifying attribute. Attribute a multi-valued attribute to another entity

Attribute attributes to the entity it most directly describes. For example, the attribute “office building name” should go to the physical department, not to the physical employees.

(2) Find out the pooling level

Base entities are analyzed, and if one entity is composed of other entities, they are constructed into an aggregation hierarchy, usually by pooling existing entities into a new entity.

(3) Find the generalization level

Once the partitioning of the base entities is complete, class hierarchies can be constructed using generalizations and specializations (both), and if the base entities change, the related generalization hierarchies and newly generated generalization hierarchies can be reconsidered. Experience plays an important role in generalization and specialization.

(4) Identify weak entities

For weak entities, it is suggested to consider them after establishing generalization levels. For example, the family members of employees of a company are weak entities compared with the company.

(5) Define the connection

With that done, you can define the relationships between entities. Of course, generalization level and aggregation level are also special types of connection, but generalization and aggregation analysis usually produce new entities, while ordinary connection refers to the interaction between entities, without producing new entities or reducing entities. That’s why you don’t think about contact until the end. It should be said that defining associations is the key to ER model, and most information associations are expressed by associations.

Note: Associations should be the last stage of analyzing definitions, so don’t prematurely consider associations when distinguishing entities from attributes.

It is possible to have multiple relationships between entities, but do not represent the same concepts, or you will have redundancy of relationships. Redundant connection refers to multiple connections representing the same concept. When generating relational patterns from e-R graphs, redundant connection will lead to non-standard and excessive redundancy of the generated relational patterns.

In addition, it is also possible to have connections among multiple entities, which is called multi-entity connection. For example, the student course selection mentioned above is an example of multi-entity connection.

(6) Several principles for establishing e-R model

In the process of creating an E-R model, we generally follow the following principles:

  • Attributes are indivisible;
  • Each entity has a unique identification, while the association has no identification, and the identification of the general association depends on the identification of the related entity;
  • Each subclass has a unique superclass. Subclasses do not define the identity themselves, but inherit the identity from the superclass.
  • Weak entities are not allowed as subclasses, but as superclasses;
  • Entity names, contact names, and attribute names should be unique in an E-R graph (local or global);
  • Multiple connections between the same entities should be distinguishable.

3.5 Integration of E-R model

Since the local E-R model reflects only the data view corresponding to the local sub-functions, and there may be inconsistencies between the local E-R graphs, it cannot be used as the basis for logical design. In this case, the E-R model can be integrated to remove the inconsistencies and duplicates, and finally merged into a global view.

The integration methods of local E-R model are as follows:

  • Multivariate integration method: multiple local E-R graphs are merged into a global E-R graph at one time.
  • Binary integration: integrate two local E-R graphs at a time by summing them up.

In practical application, integration methods are generally selected according to the complexity of the system and can be mixed. Regardless of the integration approach, each integration is divided into two phases:

  1. Merge: eliminate the inconsistencies between local E-R graphs and generate a preliminary E-R graph;
  2. Optimization: Eliminates (or reduces) data redundancy and generates a global E-R graph.

View integration is best done by one person, or always under one person’s auspices, otherwise old problems will not be solved and new ones will be created.

The content of these two stages will be described below.

3.5.1 track of merger

Due to the different problems faced by different local applications, and usually by different designers to design local E-R graphs, there must be many inconsistencies between local E-R graphs, that is, there are conflicts. The main task of merging local E-R model is to eliminate conflicts reasonably and form a unified conceptual model that can be understood and accepted by all users in the whole system.

Conflict can be divided into three categories: attribute conflict, naming conflict and structure conflict.

(1) Attribute conflict

① Attribute domain conflict, that is, the type and value range of attribute values are inconsistent. For example, whether an employee’s id is numeric or character.

② Attribute value conflict. For example, students’ grades may be measured on a 100-point scale, or on a five-point scale.

Such conflicts are caused by user engagement and must be resolved through negotiation.

(2) Naming conflicts

Naming conflicts can occur on entities, attributes, and associations, the most common being attribute conflicts.

① Homonyms: objects with different meanings have the same name in different local applications. For example, “unit” can be used either as a department of a person or as an attribute of measures such as length, weight, etc.

② Name and synonym: objects with the same meaning have different names in different local applications. For example, the “department” and “college” of a university are actually the same entity.

Such conflicts can usually be resolved by administrative means through negotiation.

(3) Structural conflict

① The same object has different identities in different local applications. For example, an entity in local model A is designed as an attribute in another local model B, which creates A structural conflict.

Solution: Turn entities into attributes or attributes into entities, keeping the structure unified.

② The attribute composition of the same object in different local applications is not exactly the same. For example, for the object of “employee” of the same class, in local model A, its attributes consist of four attributes: employee number, name, gender and age, while in another local model B, its attributes consist of three attributes: employee number, name and department.

Solution: take the union of entity attributes in different local applications, and properly design the order of attributes.

(3) The connections between the same entities are inconsistent in different local models. For example, entities E1 and E2 are one-to-many in local application A, but many-to-many in local application B.

Solution: Synthesize or adjust the types of entity connections based on application semantics.

3.5.2 optimization

Data redundancy and relation redundancy are the main redundancy problems of E-R model. Data that can be derived from other data is redundant data, and relation that can be derived from other relation is redundant relation. For example, an employee entity has both “date of birth” and “age” attributes, and “age” can be deduced from “date of birth” and is therefore redundant data.

The existence of redundancy is easy to destroy the integrity of data, resulting in database maintenance difficulties, should be eliminated. There are many ways to eliminate redundancy, and normalization theory is more commonly used in relational databases.


4 Logical Design

The e-R model obtained in the conceptual design stage is a conceptual model for users, which is independent of the specific DBMS, while the main task of the logical design stage is to transform it into the data model supported by the specific DBMS.

The design stage of its logical structure is mainly divided into: converting e-R graph into relational data model, normalization and optimization of relational model.

4.1 Transform e-R graph into relational data model

Data relational model is a set of relational patterns. Converting an E-R graph into a relational data model is actually about converting entities, attributes and associations into relational patterns. There are generally the following conversion principles.

(1) Transforming entities

When an entity is transformed into a relational schema, the entity attributes are the relational attributes, and the entity code (primary key) is the relational code.

For example, the relationship model of an employee is “employee (employee number, name, gender, age, title, department)”.

(2) Transform weak entities

If weak entities exist, a weak entity is transformed into a relational schema with the code of the strong entity it depends on as the code of the relationship.

For example, if an employee’s family member is a weak entity, it can be transformed into a relational model: family member (employee number, family name, family relationship).

(3) Transformation of aggregation level

For the aggregation level, add the key of the component entity of radix 1 to its parent entity as the external key, and add the key of the parent entity to the component entity of radix M as its external key.

(4) Transformation of generalization level

For the generalization level, the key of each superclass is the key and foreign key of its subclass.

(5) Transform multi-value attributes

If a many-valued attribute exists, the many-valued attribute is converted to a separate relationship with its entity code as the code for that relationship.

For example, if the contact information of an employee is a multi-valued attribute, it can be converted to “employee contact information (employee number, contact information)”.

(6) Transforming connection

A 1:1 relationship can be transformed into a separate relationship pattern, but it is more common to merge the relationship with the corresponding relationship pattern on either end.

If it is converted to an independent relational mode, the codes of all entities connected to the relation and the attributes of the relation itself are converted to the attributes of the relation, and the codes of each entity are the candidate codes of the relation.

If you combine the relational patterns of associations with one side entities, which side you choose depends on the context of your application, but the goal should be to minimize join operations.

For example, a director manages a department:

A 1:n association can be transformed into an independent relational pattern, but it is more common to merge the association with the corresponding relational pattern of the N-terminal.

If it is converted to an independent relation mode, the codes of the entities connected to the relation and the attributes of the relation itself are converted to the attributes of the relation, and the codes of the relation are the codes of the n-terminal entities.

In practice, the common transformation method is to combine the relation mode of the relation with n-terminal entity. For example, for a department consisting of multiple employees:

This can be converted to relational patterns: Department (department number, department name), employee (employee number, name, gender, department number).

An M :n relation is transformed into an independent relation mode. The codes of the entities connected to the relation and the attributes of the relation are transformed into the attributes of the relation, and the codes of the relation are the combination of the entity codes.

For example, students choose courses:

The course selection association can be transformed into a pattern: course selection (student id, course ID, course grade), where the course grade is the attribute of the association.


4.2 Normalization of relational patterns

In general, the result of database logic design (relational schema) is not unique. In order to further reduce the anomalies existing in relational schema and improve the performance of application system, normalization theory is an important theoretical basis and powerful tool.

4.2.1 What are function dependencies

Before proceeding, let’s look at the following three concepts of function dependencies. We assume that R(U) is a relational pattern on the attribute set U, and X and Y are subsets of U.

Function dependence: expressed by X→Y, called “X determines Y” or “Y function depends on X”. For example: {job number} → {employee name}.

Perfect function dependence: If “Y function depends on X” and Y cannot be determined for any true subset of X ‘, then “Y perfect function depends on X”. For example, in the relation model of (Student number, course number, grade), there exists function dependence {student number, course number} → grade, and any true subset of {student number, course number} cannot determine the grade, so this function dependence is the complete function dependence.

Transfer function dependence: if X→Z, Z→Y, and Z does not contain X, then “Y transfer function depends on X”. Suppose there is a relational model: (student number, student name, class, head teacher), the student number can get the class of the student, and the class can get the head teacher, otherwise it is not valid, so there is transfer function dependence in this model: {student number} → {head teacher}.

4.2.2 Learn to solve candidate codes

If we want to normalize the relational schema, we must first define the concepts of candidate code, primary attribute and non-primary attribute. And because, by definition, attributes that belong to candidate codes are master attributes, and attributes that do not belong to candidate codes are non-master attributes, the most important point is to learn how to solve candidate codes in relational patterns.

Let R<U, F>, U is the set of attribute names that constitute the relation, F is the set of function dependencies among attributes.

Now suppose R<U, F>, and U = {A, B, C, D, E}; F = {A→B, AC→D, CD→E, E→C}, to solve the candidate code of this relational pattern.

UL ={A}, UR ={B}, UB ={C, D, E}.

UL refers to the set of attributes that only appear on the left side of each dependency relation in function dependency set F. If UL is not empty, any attribute in UL must be included in the candidate code of relational mode R. UR represents the set of attributes that only appear on the right side of each dependency relation in function dependency set F. If UR is not empty, any attribute in UR must not be included in any candidate code of relational mode R. UB = U-ul-ur, which represents the set of attributes that appear on both the left and right sides of the dependency.

In combination with this example, A property only appear in the top left of the dependencies, so it must be included in the candidate code, and B properties only appear on the right side of the dependency, so it must be included in the candidate code and all the rest for the left and right sides, so they may be included in the candidate code or may not be included in the candidate code. So we’re going to do the following.

If UL+ = U, that is, the closure of UL is the whole attribute set of the relational pattern, then UL is the only candidate code of the relational pattern R. If UL+ ≠ U, it is necessary to solve the closure by combining UL with the attributes in UB successively.

UL = {A}, UL+ = {A, B}, that is, attribute B can only be deduced from attribute A and dependence in F (because there is A→B in F), Therefore, the closure result is {A, B} ≠ U (the closure operation is actually the total set of elements that can be derived from the current element), so the elements in UL and UB need to be combined in sequence to continue solving, as follows:

(AC)+ = {A, B, C, D, E} = U, so AC is A candidate code for the relational mode, thus ACD, ACE, ACDE are not candidate codes (why after solving AC, it is determined that the latter several are not candidate codes? This depends on the definition of candidate code, see 1.2)

(AD)+ = {A, B, D} ≠ U, so AD is not A candidate code;

(AE)+ = {ABCDE} = U, so AE is candidate code, and ADE is not candidate code.

The algorithm ends, so the candidate codes of the current relational mode are AC and AE, then the main attributes are A, C and E, and the non-main attributes are B and D.

Once the candidate codes, main attributes and non-main attributes are solved, the following steps of normalization theory can be used to eliminate different functional dependencies (mainly split patterns) of non-main attributes on the code, so as to achieve different normal form levels.

4.2.3 Normalization theory

The standardization of relational database is usually referred to as a paradigm for different levels of standardization requirements. According to the different properties and degree of normalization of relational patterns, relational patterns are divided into 1NF (first normal form), 2NF, 3NF, BCNF, 4NF and 5NF, which are progressive relations. Usually, the NTH normal form of pattern R is simply written as R∈nNF.

After understanding the above concept of function dependence and learning how to solve candidate codes, we can normalize patterns according to the definition of paradigms:

First normal form: if all the properties of the relational pattern R are simple, that is, every property is non-divisible (atomicity), then R is said to belong to the first normal form and denoized as R∈1NF.

Second normal form: if the relational pattern R∈1NF and every non-principal attribute is a complete function dependent on the code of R, then R is said to be in the second normal form and denoted as R∈2NF.

Third normal form: if the relational pattern R∈2NF and every non-principal attribute does not transfer function dependent on R’s candidate code, then R is said to belong to the third normal form and is written as belonging to 3NF.

BC normal form: if the relational mode R∈1NF, and for all function dependencies X→Y (Y ∉ X), determinant X contains a candidate code of R, then R is called BC normal form, written as R∈BCNF.

The progression of the above paradigms is shown in the figure below.


4.3 Optimization of relational mode

In order to improve the performance of database application system, it is necessary to modify and adjust the relational schema.

(1) Merger

The main purpose of merging multiple relational patterns is to reduce join operations and improve query efficiency. It is generally used when multiple relational patterns have the same primary key, and these relational patterns mainly deal with multi-relational query operations.

(2) Decomposition

In order to improve the efficiency of data operation and the utilization of storage space, the relational schema can be decomposed horizontally and vertically.

Horizontal decomposition: The relational mode is decomposed into several relational modes according to the classification query conditions, which can reduce the number of records that the application system needs to access each time, thus improving efficiency.

For example, if a university student database has a relational schema: students (student ID, name, age, native place), but the fact is that most queries involve only one class of students at a time, students can be split horizontally by category: undergraduates (…) ; Master (…). ; PhD (…). .

Vertical decomposition: Decompose the attributes of relational pattern R into several sub-sets to form several sub-relational patterns.

For example, the relationship model of employee situation: employee (employee number, name, gender, age, position, salary, length of service, address, telephone number), but the fact is that the first six items are frequently queried, and the last three items are rarely used, then the relationship model can be vertically decomposed: Employee information 1(employee number, name, gender, age, position, salary), employee information 2(employee number, length of service, address, telephone number).

Both frequently queried and infrequently queried attributes can be vertically segmented. Its advantage is to reduce the amount of data transfer and improve the speed of query.

5 Physical Design

The storage structure and access method of a database on a physical device is called the physical structure of a database, which depends on a given computer system.

Database physical design generally consists of two steps: determining the physical structure of the database (access method and storage structure) and evaluating the physical structure (time and space efficiency).

(1) Determine the physical structure of the database

① Determine the storage structure and storage location of data

Including identifying:

  • Relationships, indexes, clustering, logging, backups, etc.
  • Factors to consider: Access time, storage space utilization, and maintenance cost.

According to the application situation, the volatile part and the stable part, the high access frequency part and the low access frequency part are stored separately to improve the system performance.

② Design the appropriate access path;

③ Determine the system configuration, DBMS products generally provide some system configuration variables and storage allocation parameters.

(2) Evaluate the physical structure

① Evaluation content

The time efficiency, space efficiency, maintenance cost and various user requirements of various schemes in the process of database physical design are carefully evaluated, and a better scheme is selected as the physical structure of the database.

② Evaluation method

The storage space, access time and maintenance cost of various solutions are quantitatively estimated, and the estimated results are weighed and compared to select an optimal and reasonable physical structure. If the structure does not meet user requirements, the design needs to be modified.

6 Database Implementation

After the logical and physical structure of the database is well designed, it is necessary to establish the database in the actual computer system and test run, the main work of this stage has the following points.

  1. Establish database structure;
  2. Load data;
  3. Compile and debug application program;
  4. Database trial run;
  5. Organize relevant documents.

7 Database operation and maintenance

After trial operation, if the database conforms to the system design objectives, it can be formally put into operation. In the process of database operation, the application environment, physical storage of database and so on constantly change, at this time, DBA should constantly evaluate, adjust and modify the database design. In summary, database maintenance includes the following contents.

(1) Database dump and recovery

Dump and restore is one of the most important maintenance tasks after the system is officially running. Dbas need to make different dump plans for different applications and back up databases and log files periodically. Once a media failure occurs, database backup and log file backup can be used to restore the database to a consistent state as soon as possible.

(2) Database security and integrity control

The DBA must grant different operation rights based on the actual needs of users. In the process of database operation, due to the change of the application environment, the security requirements will also change, and the DBA needs to modify the original security control according to the actual situation.

As the application environment changes, database integrity constraints will also change, requiring DBA to constantly modify to meet user requirements.

(3) Database performance monitoring, analysis and improvement

In the process of database operation, DBA must supervise system operation, analyze the monitoring data, and find out the method to improve system performance.

Use the monitoring tool to obtain the values of a series of performance parameters in the process of system operation. By carefully analyzing these data, judge whether the current system is in the best running state. If not, adjust some parameters to further improve the database performance.

(4) database reorganization and reconstruction

The form of reorganization is divided into full reorganization and partial reorganization (reorganizing only the frequently added and deleted tables). It refers to rearranging storage locations, recycling garbage, reducing pointer chains, etc., according to the original design requirements to improve system performance, but it does not change the logic and physical structure of the original design.

Reconstruction mainly refers to partial reconstruction (if the changes are too large, the reconstruction is not useful). Its main job is to adjust the schema and internal schema of the database according to the new environment, such as adding new data items, changing the type of data items, changing the database capacity, adding or deleting indexes, and modifying integrity constraints. The essence of reconstruction is to modify partial schema and internal schema of the database.

8 summarizes

This paper mainly summarizes the database design stage steps some of the important concepts and relational database, conceptual design and logical design is discussed emphatically summarized two core stage, emphasis on the structure of the e-r model, in addition to adding the relationship between the standardization of the relational schema and how to solve the model of candidate code and other important points.

Through this paper, the first thing to do is to clarify the specific meaning of related terms or concepts in the database, such as real body type, relationship, relationship mode, etc.; Secondly, it is necessary to master relevant design strategies and theories, such as bottom-up analysis, standardization theory, etc. Finally, it is necessary to apply theoretical knowledge to practical application through continuous practical practice.


Shoulders of giants

1. Lei Jingsheng, Ye Wenjun, Lou Yuehuan. Principle and Application of Database [M]. Beijing: Tsinghua University Press, 2015.

2. Zhang Yong, Gu Guoqing. Journal of Shanghai Electric Power University, 2002.3, 18(1): 38-40


The author information

Hi, I’m CoderGeshu, a programmer who loves life. If this article is helpful, don’t forget to like it

In addition, you are welcome to visit 👉 CoderGeshu to get the latest sharing