Database self-optimization

Optimization ①: Add multiple data files and set automatic file growth (rough data partition)

1.1: Since SQL Server 2005, the database does not generate NDF data files by default; in general a single primary data file (MDF) is enough. However, for some large databases with a lot of data and frequent queries, it helps to store certain tables, or parts of tables, in separate data files. Because CPU and memory are much faster than hard-disk reads and writes, placing different data files on different physical disks lets a query hit multiple disks at the same time, making full use of CPU and memory and improving query speed. To understand why, it helps to know how files are written, because data files (MDF, NDF) and log files (LDF) are written in different ways:

Data files: SQL Server distributes new data across all available data files in the same filegroup in proportion to the free space in each file. Suppose there are three data files a.ndf, b.ndf, and c.ndf with 200MB, 100MB, and 50MB of free space respectively. If you write 70MB of data, SQL Server writes 40MB, 20MB, and 10MB to the three files in one round. If one of the data files is full, no more data is written to it.

Log files: log files are written sequentially. When one is full, the next one is written to.

It can be seen from the above that adding NDF data files helps query speed on large data volumes, while adding extra log files is useless.

1.2: In SQL Server 2005, the default MDF file starts at 5MB with a 1MB increment and unrestricted growth; the LDF file starts at 1MB, grows by 10%, and its growth is capped at a maximum size. For ordinary use the defaults that SQL Server provides are fine, but in a large database design it is best to set the initial size and the growth yourself. If the initial size is too small, the database fills up quickly. What happens when it is full and an operation arrives? When a data file is full and a write is performed, SQL Server makes the operation wait until the automatic file growth has finished, and only then does the original operation continue. If auto-growth takes too long, the original operation cannot wait and times out (the default threshold is 15 seconds); not only is the operation rolled back, the automatic file growth is also cancelled. In other words, the file does not grow at all that time. How long a growth takes depends on the auto-growth increment: if it is too small, a single operation may need several consecutive growths to be satisfied; if it is too large, the wait is long. So pay attention to the following points when setting automatic growth:

1) Set it to grow by a fixed size, not by a percentage. This avoids growing by too much or too little at a time. For smaller databases, set growth of 50MB to 100MB at a time; for large databases, set growth of 100MB to 200MB at a time.

2) Regularly monitor the usage of each data file, and try to ensure that the remaining space of each file is the same size, or the desired proportion.

3) Set a maximum file size to prevent SQL Server files from exhausting disk space and affecting the operating system.

4) After auto-growth occurs, check the space allocation of the new data files promptly, to avoid SQL Server always writing data to a single file.

Therefore, for a busy database the recommended setting is to enable automatic file growth, so the application does not fail when the database runs out of space, but to pre-size the files so that automatic growth is rarely triggered. Also, try not to use autoshrink.

1.3 Storing Data and Log Files on Different Disks. Operations on data files and log files generate a large amount of I/O. Where possible, log files should be stored on a different disk from the data files that hold the data and indexes; this spreads the I/O and also makes disaster recovery easier.
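As a concrete illustration of 1.1 to 1.3, here is a minimal sketch of adding a secondary data file on another physical disk with a fixed growth increment, and capping a log file's growth. The database name, logical file names, and paths are placeholders, not from the original article:

ALTER DATABASE MyDatabase
ADD FILE (
    NAME = MyDatabase_Data2,                       -- logical name (placeholder)
    FILENAME = 'E:\SQLData\MyDatabase_Data2.ndf',  -- put the NDF on a different physical disk
    SIZE = 500MB,
    MAXSIZE = 20GB,                                -- cap growth so the disk cannot be exhausted
    FILEGROWTH = 100MB                             -- fixed-size growth, not a percentage
);

ALTER DATABASE MyDatabase
MODIFY FILE (
    NAME = MyDatabase_Log,                         -- existing log file's logical name (placeholder)
    MAXSIZE = 10GB,
    FILEGROWTH = 100MB
);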

Optimization ②: table partitioning and index partitioning (optimization ① is a rough data partition; optimization ② is a precise data partition)

Why partition a table?

What do we want to do when a table has too much data? Split it into two or more pieces. Logically the table is still one table, but its contents are stored separately, which makes reads several times faster.

Principle: table data cannot be assigned to files directly, but files belong to filegroups and tables can be placed on filegroups, which indirectly lets table data be stored in different files. Partitioned storage covers tables, indexes, and large object data.

SQL Server 2005 introduced the concept of table partitioning. When the amount of data in a table grows, queries slow down and application performance drops; that is when you should consider partitioning the table. When a table holds a lot of data, it can be split into multiple partitions: because each scan touches less data, queries run faster, which greatly improves performance. Logically the table is still one complete table, but its data is physically stored in multiple table spaces (physical files), so a query does not have to scan the whole table every time.

2.1 When to use a partitioned table: 1. The size of a table exceeds 2GB.

2. The table contains historical data, and new data is added to the new partition.

2.2 Advantages and Disadvantages of Table Partitioning

Table partitioning has the following advantages:

1. Better query performance: when querying a partitioned object, you can scan only the partitions you care about, which speeds up the search.

2. Better availability: if one partition of a table fails, data in the other partitions is still available.

3. Easier maintenance: if one partition of the table fails and its data needs to be repaired, only that partition has to be repaired.

4. Balanced I/O: different partitions can be mapped to different disks to balance I/O and improve overall system performance.

Disadvantages: there is no way to directly convert an existing ordinary table into a partitioned table. (Oracle, by contrast, provides online table redefinition for this.)

2.3 Table Partitioning Operations in Three Steps

2.3.1 Creating a Partition Function

CREATE PARTITION FUNCTION xx1(int)

AS RANGE LEFT FOR VALUES (10000, 20000);

This creates a partition function named xx1 on an int column, dividing values into three ranges: less than or equal to 10000, from 10001 to 20000, and greater than 20000.

2.3.2 Creating a Partition Scheme

CREATE PARTITION SCHEME myRangePS2

AS PARTITION xx1

TO (a, b, c);

This creates a partition scheme named myRangePS2 on the partition function xx1 and maps its three partitions to the filegroups a, b, and c.

a, b, and c are the names of three filegroups, and the three NDF files must belong to these three filegroups. Once a filegroup has been created, it cannot be changed.
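Before the scheme can map partitions to a, b, and c, those filegroups and their NDF files must exist. A minimal sketch, assuming a database called MyDatabase and placeholder file paths:

ALTER DATABASE MyDatabase ADD FILEGROUP a;
ALTER DATABASE MyDatabase ADD FILEGROUP b;
ALTER DATABASE MyDatabase ADD FILEGROUP c;

-- one NDF per filegroup; repeat the ADD FILE step for filegroups b and c
ALTER DATABASE MyDatabase
ADD FILE (NAME = data_a, FILENAME = 'D:\SQLData\data_a.ndf', SIZE = 100MB, FILEGROWTH = 50MB)
TO FILEGROUP a;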

2.3.3 Partitioning a Table

In the table designer, under the regular data space specification, change the data space type to "partition scheme", then select the partition scheme name and the partition column list (the original post shows the result in a screenshot).

It can also be generated using SQL statements

CREATE TABLE [dbo].[AvCache](
[AVNote] varchar NULL,
[bb] int IDENTITY(1, 1)
) ON myRangePS2(bb); -- note that the table is created on the myRangePS2 partition scheme, partitioned by the bb column

2.3.4 Querying Table Partitions

SELECT *, $PARTITION.xx1(bb) AS PartitionNumber FROM dbo.AvCache



So you can see clearly how the table data is partitioned

2.3.5 Creating an Index Partition
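The original post illustrates this step with a screenshot. As a hedged T-SQL sketch, a nonclustered index can be created on the same partition scheme so that the index is partitioned (aligned) the same way as the table; the index name below is hypothetical:

CREATE NONCLUSTERED INDEX IX_AvCache_bb
ON [dbo].[AvCache] (bb)
ON myRangePS2 (bb);   -- the index uses the same partition scheme and partition column as the table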



Optimization ③: Distributed database design

A distributed database system is developed on the basis of a centralized database system. The idea is simple: split the overall database and distribute it across locations. In essence, distributed database systems come in two kinds: 1. Data that is logically unified but physically dispersed: the distributed database is logically one whole, but physically it is stored on different physical nodes. This is what is generally meant by a distributed database. 2. Both the logic and the physical storage are distributed: this is a federated distributed database. Because each member database of the federation is relatively "autonomous", such a system can accommodate databases with very different uses and large differences, and is better suited to integrating databases over a wide scope.

Distributed databases are relatively complicated and are not used or illustrated in detail here. They are typically used for systems whose users are strongly partitioned. Take a national retail chain: a common design gives each branch its own sales and inventory information, while headquarters keeps databases for employees, suppliers, store information, and so on. The per-store databases can be completely identical in structure, although in many systems they may also diverge. In this way each store's data is stored locally, which improves response speed and reduces communication cost; and because the data is distributed across different sites with multiple copies, a failure at an individual site does not paralyze the whole system. But this also brings many problems, such as data consistency, implementing remote data transmission, and reducing communication cost, which make developing a distributed database system more complex. This section is only meant to explain the principle; the specific usage is not covered in detail.

If your table is already indexed but still performing poorly, it is likely that index fragmentation has occurred and you need to perform index defragmentation.

What is index fragmentation?

Due to excessive insert, update, and delete operations on a table, index pages become split into many pieces, forming index fragmentation. If index fragmentation is severe, scanning the index takes longer, and the index may even become unusable, so data retrieval slows down.

How do I know if index fragmentation has occurred?

In SQL Server, DBCC SHOWCONTIG or DBCC SHOWCONTIG(table name) can be used to check index fragmentation, which guides us in rebuilding and reorganizing indexes periodically.
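For example, continuing the AvCache example from the partitioning section (the index name IX_AvCache_bb is the hypothetical one from the partitioned-index sketch above):

DBCC SHOWCONTIG;                                     -- report on all tables in the current database
DBCC SHOWCONTIG ('dbo.AvCache');                     -- report on one table
DBCC SHOWCONTIG ('dbo.AvCache', 'IX_AvCache_bb');    -- report on a single index of that table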



By analyzing the scan density (too low) and the logical scan fragmentation (too high) in the results, we can determine whether the index needs to be rebuilt, mainly based on the following two values:

Scan Density [Best Count:Actual Count]: the most useful percentage that DBCC SHOWCONTIG returns. It is the ratio of the ideal extent count to the actual extent count and should be as close to 100% as possible; a low value means there is external fragmentation.

Logical Scan Fragmentation: the percentage of out-of-order pages. It should be between 0% and 10%; a high value indicates external fragmentation.

Solution:

The first is to use DBCC INDEXDEFRAG to defragment the index.

The second is to use DBCC DBREINDEX to rebuild the index.

The DBCC INDEXDEFRAG command operates online, so the index remains available while the command runs, and the operation can be interrupted without losing the work already done. The disadvantage of this approach is that it does not reorganize the data as effectively as dropping and recreating a clustered index. Recreating the clustered index reorganizes the data, with the result that the data pages are filled up; the degree of filling can be configured with the FILLFACTOR option. The disadvantage of that approach is that the index is offline during the drop/recreate cycle and the operation is atomic: if index creation is interrupted, the index is not recreated. In other words, to get good results you still have to rebuild the index, so rebuilding the index is usually the choice.
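As a hedged sketch of the two options (the database, table, and index names continue the earlier examples and are placeholders):

-- option 1: defragment online; the index stays usable and the command can be interrupted safely
DBCC INDEXDEFRAG (MyDatabase, 'dbo.AvCache', 'IX_AvCache_bb');

-- option 2: rebuild the index; offline, but it reorganizes the data pages (here with FILLFACTOR 90)
DBCC DBREINDEX ('dbo.AvCache', 'IX_AvCache_bb', 90);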

Database table optimization

Optimization ① : design standardized tables to eliminate data redundancy

Database normal forms are a database design method that ensures a reasonable database structure, satisfies various query needs, and avoids abnormal database operations. A table that meets their requirements is called a normalized table. The normal forms were introduced in the early 1970s, and an ordinary table design only needs to satisfy the first three normal forms, so here is a brief introduction to them.

Let’s take a look at the definition given by Baidu Encyclopedia:

First normal Form (1NF) no duplicate columns

The first normal form (1NF) is a basic requirement on the fields of the relational model: all fields must be atomic, that is, each column of a database table is an indivisible atomic data item rather than a set, array, record, or other non-atomic item.

Second normal form (2NF): non-key attributes depend fully on the key

On the basis of 1NF, every non-key attribute must depend fully on the key [on the basis of 1NF, eliminate partial functional dependence of non-key attributes on the primary key].

Third normal form (3NF): no transitive dependence

On the basis of 2NF, no non-key attribute may depend on another non-key attribute [i.e., transitive dependencies are eliminated on the basis of 2NF].

To put it in plain terms (this may not be the most scientific or precise understanding):

First normal form: atomicity of attributes (fields); attributes must be atomic and cannot be subdivided. Second normal form: uniqueness of records; every record must have a unique identifier, i.e., an attribute that uniquely identifies the entity. Third normal form: no redundancy among attributes (fields); no field may be derivable from other fields. In layman's terms, columns that are not directly related to the primary key must be moved out (by creating a new table to store them, with foreign keys being the obvious exception).
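A tiny illustrative sketch of the third normal form (the table and column names are hypothetical): storing a customer's phone number directly in an orders table makes the phone depend on CustomerID rather than on the order key, so it is moved into its own table:

-- violates 3NF: CustomerPhone depends on CustomerID, not on the primary key OrderID
CREATE TABLE OrdersBad (
    OrderID int PRIMARY KEY,
    CustomerID int,
    CustomerPhone varchar(20)
);

-- 3NF: the customer attribute lives in its own table and is referenced by key
CREATE TABLE Customer (
    CustomerID int PRIMARY KEY,
    Phone varchar(20)
);
CREATE TABLE Orders (
    OrderID int PRIMARY KEY,
    CustomerID int REFERENCES Customer(CustomerID)
);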

If the database design is fully normalized, then when all tables are joined together through keys there will be no data duplication. The advantages of normalization are obvious: it avoids data redundancy and naturally saves space, it provides a fundamental guarantee of data consistency and puts an end to inconsistent data, and it also improves efficiency.

Optimization ② : appropriate redundancy, increase the calculation column

The practical principle of database design is to find the right balance between data redundancy and processing speed

A table that satisfies the normal forms is always a normalized table, but not necessarily the best design. In many cases, to improve the efficiency of database operation, it is necessary to lower the level of normalization: add redundancy appropriately and trade space for time. Suppose we have a table with product name, unit price, quantity in stock, and total value. This table does not satisfy the third normal form, because "total value" can be obtained by multiplying "unit price" by "quantity", which makes "total value" a redundant field. However, adding the redundant "total value" field speeds up query statistics, which is exactly the space-for-time approach. Reasonable redundancy can spread the concurrent load on tables with large data volumes and also speed up special-purpose queries. Redundant fields can effectively reduce the joins between database tables and improve efficiency.

The "total value" here is a computed column. A database has two kinds of columns: data columns and computed columns. A data column is a column whose value must be assigned manually or by a program; a computed column is calculated automatically from other columns in the table.

Create a computed column in SQL:

CREATE TABLE table1 (number decimal(18,4), price money, Amount AS number*price)

You can also manually add or modify column attributes in the table design as shown in the following figure



Whether or not to persist the computed column also deserves attention:

If it is set to No, the column is a virtual column that is recalculated every time it is queried, and it cannot be used for CHECK, FOREIGN KEY, or NOT NULL constraints.

If it is set to Yes, the column is a real, stored column that does not have to be recalculated every time, and you can, for example, create an index on it.
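Continuing the table1 example, here is a hedged sketch of a persisted computed column (the table name table2 and the index name are hypothetical). Because the column is persisted and deterministic, an index can be created on it:

CREATE TABLE table2 (
    number decimal(18,4),
    price money,
    Amount AS number * price PERSISTED   -- stored physically, not recomputed on every query
);

CREATE INDEX IX_table2_Amount ON table2 (Amount);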

Optimization ③: indexes

Indexes are an important factor in table optimization and occupy an extremely important place in it, so a separate article, “SQL index in one step”, will explain how to build and optimize indexes.

Optimization ④: The necessity of primary and foreign keys

The design of primary key and foreign key plays an important role in the design of global database. A primary key is an abstraction of an entity, and the pairing of a primary key with a foreign key represents a connection between entities.

Primary key: According to the second normal form, there needs to be a field to identify the record. The primary key is undoubtedly the best identification. However, many tables do not necessarily need primary keys.

The selection of the primary key also matters. In general, choose a key with a small total length: short keys compare faster, and they also give the primary key's B-tree structure fewer levels. When choosing a composite primary key, pay attention to the field order, because for a composite key different field orders can give very different performance. In general, put the fields with a low repetition rate and a high chance of being queried alone or in combination first.

Foreign keys: as database objects, foreign keys are considered by many to be cumbersome and unnecessary. In fact, foreign keys are useful in most cases, because they are the most efficient way to maintain data consistency.

The database's consistency requirements can be enforced, in order, by foreign keys, CHECK constraints, rule constraints, triggers, and client programs; it is generally believed that methods closer to the data are more efficient. Use cascading deletes and cascading updates with caution. They were new in SQL Server 2000 and were retained in 2005, and they do have their place. I am cautious here because cascading deletes and updates go beyond the traditional definition of a foreign key, and their power is a little too great: before using them, make sure you understand their scope of effect, otherwise they may cause your data to be modified or lost inexplicably. In terms of performance, cascading deletes and updates are more efficient than the other approaches.
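A hedged sketch of a foreign key with cascading actions (the table and constraint names are hypothetical). Deleting a category automatically deletes its products, so be sure this is really the behavior you want:

CREATE TABLE Category (
    CategoryID int PRIMARY KEY,
    Name varchar(50) NOT NULL
);

CREATE TABLE Product (
    ProductID int PRIMARY KEY,
    CategoryID int NOT NULL,
    CONSTRAINT FK_Product_Category FOREIGN KEY (CategoryID)
        REFERENCES Category (CategoryID)
        ON DELETE CASCADE
        ON UPDATE CASCADE
);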

Optimization ⑤: Proper use of stored procedures, views, and functions

Many people tend to keep complex operations in the application layer, but if you want to optimize data access performance, porting SQL code to the database (using stored procedures, views, functions, and triggers) is also a big improvement for the following reasons:

  1. Stored procedures reduce the workload of network transmission, processing, and storage, and are compiled and optimized to achieve fast execution speed and easy maintenance. In addition, when table structure changes, client applications are not affected

2. Using stored procedures, views, and functions helps reduce the drawback of duplicated SQL scattered through your applications, since the SQL is now processed centrally in one place

3. Implementing all T-SQL with database objects helps to analyze T-SQL performance problems, and it lets you manage T-SQL code centrally and refactor it more easily
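A minimal stored-procedure sketch along these lines, reusing the EMPLOYEE table that appears in the examples later in this article (the procedure name and parameter type are hypothetical):

CREATE PROCEDURE dbo.usp_GetEmployee
    @EmpID varchar(20)
AS
BEGIN
    SET NOCOUNT ON;
    -- the client calls the procedure instead of sending ad-hoc SQL over the network
    SELECT FNAME, LNAME
    FROM EMPLOYEE
    WHERE EMP_ID = @EmpID;
END

-- example call: EXEC dbo.usp_GetEmployee @EmpID = 'VPA30890F';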

Optimization ⑥: the legendary ‘three less’ principle

① : The fewer tables in the database, the better

② : The fewer fields in the table, the better

③ : The fewer composite primary keys and composite indexes, the better

Of course, “less” here is relative; it is an important design concept for reducing data redundancy.

Optimization ⑦: Split your tables to reduce table size

If you find that a table has too many records, such as more than 10 million records, split the table horizontally. In horizontal splitting, records of a table are split horizontally into two tables based on a value of the primary key of the table.

If you find that a table has too many fields, such as more than 80, split the table vertically, splitting the original table into two tables

Optimization ⑧ : field design principles

Field is the most basic unit of database, and its design has a great impact on performance. Note the following:

A. Use numeric data types wherever possible; comparing numbers is much faster than comparing character strings.

B. Make data types as small as possible, on the premise that they meet foreseeable future needs.

C. Do not allow NULL unless necessary; use NOT NULL + DEFAULT instead.

D. Use TEXT and IMAGE sparingly; binary fields are relatively slow to read and write, there are few ways to read them, and in most cases it is best not to use them.

E. Use auto-increment (identity) fields with caution; they are not conducive to data migration.

Program operation optimization

First, operator optimization

1. The IN and NOT IN operators

The relative performance of IN and EXISTS depends on which of the outer and inner tables drives the query, but for tables with large data volumes it is recommended to use EXISTS instead of IN.

Do not use NOT IN, which cannot use an index; use NOT EXISTS instead.
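For instance, a hedged sketch of the rewrite using hypothetical Customers and Orders tables:

-- IN on a large subquery
SELECT c.* FROM Customers c
WHERE c.CustomerID IN (SELECT o.CustomerID FROM Orders o);

-- preferred: EXISTS, correlated on the join key (NOT IN / NOT EXISTS work the same way)
SELECT c.* FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.CustomerID = c.CustomerID);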

2. The IS NULL or IS NOT NULL operation

Indexes do not index NULL values, so these operations cannot use indexes. They can be handled in other ways: for a numeric column, check for values greater than 0; for a string column, set a default value and check for that default value.
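A small sketch of the workaround (the table and column names are hypothetical): give the column a NOT NULL default and compare against the default instead of testing for NULL:

-- design: NOT NULL with a default instead of allowing NULL
CREATE TABLE Member (
    MemberID int PRIMARY KEY,
    Phone varchar(20) NOT NULL DEFAULT ''
);

-- query for "no phone" without IS NULL, so an index on Phone can be used
SELECT MemberID FROM Member WHERE Phone = '';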

3. The <> (not equal) operator

The not-equal operator never uses an index, so processing it only results in a full table scan. Rewrite a <> 0 as a > 0 OR a < 0, and rewrite a <> '' as a > ''.

4. Search text data with full text search instead of like search

Full-text search is always better than like search:

(1) Full-text search allows you to perform complex searches that like cannot, such as searching for a word or phrase, searching for a word or phrase that is similar to another word or phrase, or searching for synonyms;

(2) It is easier to implement full-text search than to implement like search (especially complex search);
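A hedged comparison of the two approaches, assuming a hypothetical Articles table on which a full-text index has already been created (the full-text catalog and index definitions are not shown here):

-- LIKE with a leading wildcard cannot use an ordinary index
SELECT ArticleID FROM Articles WHERE Body LIKE '%database%';

-- full-text predicate: uses the full-text index and supports words, phrases, inflections, etc.
SELECT ArticleID FROM Articles WHERE CONTAINS(Body, 'database');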

Second, SQL statement optimization

1. Do not use select * in queries

Why not? The whole world knows, yet many people are used to writing it anyway. Understand that resources should be saved wherever possible, and that a SELECT * query cannot use a “covering index”.
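For example, with a hypothetical covering index on the EMPLOYEE table used later in this article, listing only the needed columns lets the query be answered entirely from the index, which SELECT * cannot do:

-- a nonclustered index that covers the query (INCLUDE is available from SQL Server 2005)
CREATE NONCLUSTERED INDEX IX_Employee_EmpID
ON EMPLOYEE (EMP_ID)
INCLUDE (FNAME, LNAME);

-- covered: every referenced column is in the index
SELECT FNAME, LNAME FROM EMPLOYEE WHERE EMP_ID = 'VPA30890F';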

2. Try to write WHERE clauses

Try not to write SQL statements without WHERE

3. Note the WHERE clause after SELECT INTO

Because SELECT INTO inserts data into temporary tables, this process locks some system tables. If the WHERE clause returns too much data or runs too slowly, the system tables can stay locked for a long time, blocking other processes.

4. For aggregate queries, you can use the HAVING clause to further restrict the rows returned

5. Avoid temporary tables

(1) Avoid using temporary tables unless necessary, and use table variables instead;

(2) Most of the time (99%), table variables reside in memory and are therefore faster than temporary tables, which reside in the TempDb database, so operations on temporary tables need to communicate across databases and are naturally slower.
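A minimal sketch of the two alternatives (the column list is hypothetical):

-- table variable: usually lives in memory, no explicit cleanup needed
DECLARE @Temp TABLE (ID int PRIMARY KEY, Name varchar(50));
INSERT INTO @Temp (ID, Name) VALUES (1, 'a');

-- temporary table: created in TempDb, should be dropped when finished
CREATE TABLE #Temp (ID int PRIMARY KEY, Name varchar(50));
INSERT INTO #Temp (ID, Name) VALUES (1, 'a');
DROP TABLE #Temp;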

6. Reduce the number of database accesses:

It is better to keep some commonly used global variables in memory or reduce the number of database accesses in other ways in program design

7. Do less repetitive work and reduce invalid work as much as possible. The focus of this point is on the client program; pay attention to the following:

A. Avoid executing the same statement many times, especially for basic data; few programmers pay attention to this.

B. Reducing the number of data conversions, perhaps requiring data conversions is a design problem, but reducing the number of data conversions is something programmers can do.

C, eliminate unnecessary sub-query and join table, sub-query in the execution plan is generally interpreted as external join, redundant join table brings extra overhead.

D. Merge multiple UPDATE statements that modify the same record under the same condition into one. For example, the two statements UPDATE EMPLOYEE SET FNAME = 'HAIWER' WHERE EMP_ID = 'VPA30890F' and UPDATE EMPLOYEE SET LNAME = 'YANG' WHERE EMP_ID = 'VPA30890F' should be merged into the single statement UPDATE EMPLOYEE SET FNAME = 'HAIWER', LNAME = 'YANG' WHERE EMP_ID = 'VPA30890F'.

E, do not split the UPDATE operation into the form of DELETE operation +INSERT operation, although the function is the same, but the performance is very different.

F, do not write a meaningless query such as SELECT * FROM EMPLOYEE WHERE 1=2

Third, principles of WHERE clause usage

Compare the following two statements: SELECT * FROM table1 WHERE field1 <= 10000 AND field1 >= 0 and SELECT * FROM table1 WHERE field1 >= 0 AND field1 <= 10000. If most records have field1 >= 0, the first statement is far more efficient than the second, because the first condition of the second statement (field1 >= 0) consumes a lot of system resources while filtering out almost nothing. This gives the first rule: place the most restrictive condition first in the WHERE clause. Now consider SELECT * FROM TAB WHERE a = ... AND b = ... AND c = ...; if the index index(a, b, c) exists, then the order of the fields in the WHERE clause should be consistent with the order of the fields in the index. This gives the second rule: keep the field order in the WHERE clause the same as the field order in the index.

The following examples assume a unique index I1 on field1 and a non-unique index I2 on field2.

SELECT field3, field4 FROM TB WHERE field1 = 'SDF'

SELECT field3, field4 FROM TB WHERE field1 >= 'SDF'

SELECT field3, field4 FROM TB WHERE field4 LIKE '%R'

SELECT field3, field4 FROM TB WHERE upper(field2) = 'RMN'

In the last two statements, the LIKE pattern with a leading wildcard and the function applied to the indexed column prevent the indexes from being used. If a table has 20,000 records, it is recommended not to use functions on columns in the WHERE clause; if a table has more than 50,000 records, using functions is strictly forbidden; below 20,000 records there is no restriction.