Problem overview

We run MySQL 5.6 on Alibaba Cloud RDS for MySQL. One table, which records users' Internet access, gathered nearly 20 million rows in 6 months and has reached 40 million rows over the last year. Queries are extremely slow, the system stalls in day-to-day use, and the business is severely affected.

Background: this is a legacy system. Whoever designed it had probably not yet graduated from university; the table design and the SQL are not just garbage, they are painful to look at. The original developers have all left and the system was handed to me to maintain. This is the legendary "if you can't maintain it, run away" project, and I am the one who fell into the pit!!

I tried to solve the problem, and this log is the result.

Solution overview

Solution 1: Optimize the existing MySQL database. Advantages: no impact on the existing business, no code changes in the application, lowest cost. Disadvantages: optimization hits a ceiling; once the data volume passes 100 million rows, it is finished.

Solution 2: Upgrade to a database that is 100% compatible with MySQL. Advantages: no impact on the existing business, no code changes in the application, and database performance improves with almost no work. Disadvantages: it costs more money.

Solution 3: Go all the way to a big data solution and switch to a NewSQL/NoSQL database. Advantages: strong scalability, low cost, no data capacity bottleneck. Disadvantages: the source code must be modified.

The three solutions can be applied in sequence. There is no need to switch to NoSQL while the data volume is below the 100 million level, because the development cost is too high. I tried all three, and each produced a workable solution. Throughout the process I silently cursed the developers who ran away ten thousand times 🙂

Solution 1: Optimize the existing MySQL database

After phone calls with Alibaba Cloud database experts, searching Google for solutions, and asking the experts in chat groups, I summarized the following (all of it is the essence):

1. Consider performance when designing databases and creating tables

2. Pay attention to optimization when writing SQL

3. Partitioning

4. Table splitting

5. Database splitting

1. Consider performance when designing databases and creating tables

MySQL itself is highly flexible, which means performance is not guaranteed and depends heavily on the developer's skill. In other words, if the developer is competent, MySQL performs well. This is a common trait of many relational databases, which is why company DBAs usually command very high salaries.

Note when designing tables:

1. Avoid null values in table fields. Null values are difficult to optimize and occupy extra index space.

2. Use INT rather than BIGINT where possible, and add UNSIGNED if the values are never negative. TINYINT, SMALLINT, and MEDIUMINT are better still when the values fit.

3. Use enumerations or integers instead of string types

4. Try to use TIMESTAMP instead of DATETIME

5. Do not have too many fields in a single table. The recommended value is less than 20

6. Use an integer to store IP addresses
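As a rough illustration of item 6, here is a minimal Python sketch of the conversion, assuming IPv4 addresses only. The helper names ip_to_int and int_to_ip are made up for this example. MySQL also ships the built-in functions INET_ATON() and INET_NTOA(), which do the same conversion on the database side, and an IPv4 address fits in an INT UNSIGNED column.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert dotted-quad IPv4 text to its unsigned 32-bit integer form."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value: int) -> str:
    """Convert the stored integer back to dotted-quad text for display."""
    return str(ipaddress.IPv4Address(value))


stored = ip_to_int("192.168.1.10")
print(stored)             # 3232235786
print(int_to_ip(stored))  # 192.168.1.10
```

Storing the integer keeps the column at 4 bytes and lets range conditions on addresses work as plain numeric comparisons.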

Indexes

1. Consider creating indexes on the columns that appear in WHERE and ORDER BY clauses

2. Avoid testing fields for NULL in the WHERE clause. Otherwise, the engine may abandon the index and perform a full table scan

3. Fields with few distinct values are poor candidates for an index, for example "gender", which has only two or three values

4. For character fields, build prefix indexes only

5. Character fields should not be primary keys

6. Do not use foreign keys; enforce the constraints in the application

7. Try not to use UNIQUE

8. When using a multi-column index, keep the column order consistent with the query conditions, and delete unnecessary single-column indexes

In short, use the right data type and choose the right index

Select the right data type:

(1) Use the smallest data type that can hold the data. Integers < date, time < char, varchar < blob.

(2) Use simple data types. Integers are cheaper to process than characters because string comparison is more complex. For example, store times as int, and convert IP addresses to bigint with a function.

(3) Use a reasonable length for field attributes; fixed-length tables are faster.

Select the right index columns:

(1) Columns that appear in WHERE, GROUP BY, and ORDER BY clauses.

(2) Columns compared in WHERE conditions with <, <=, =, >, >=, BETWEEN, IN, or LIKE with a string plus a wildcard (%).

(3) Short columns: the more data that fits on a page, the better.

(4) Columns with high dispersion (many distinct values) go first in a composite index. Check the dispersion by counting the distinct values of each column; the larger the count, the higher the dispersion.
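The dispersion check described above can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for MySQL so the snippet runs anywhere, and the visits table, its columns, and the dispersion helper are hypothetical. The COUNT(DISTINCT ...) query itself is standard SQL and works the same way in MySQL.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "m"), (2, "f"), (3, "m"), (4, "f"), (5, "m"), (5, "m")],
)


def dispersion(conn, table, column):
    """Return (distinct values, total rows) for one column."""
    # Identifiers cannot be bound as parameters, so only pass trusted names.
    sql = f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    return conn.execute(sql).fetchone()


for col in ("user_id", "gender"):
    distinct, total = dispersion(conn, "visits", col)
    print(col, distinct, total)
# user_id has 5 distinct values and gender has 2, so user_id would go first
# in a composite index on (user_id, gender).
```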

The original developers have run away and the tables already exist; I cannot change them. So this approach cannot be carried out. Abandoned!

2. Pay attention to optimization when writing SQL

1. Use LIMIT to restrict the number of records a query returns

2. Avoid SELECT *; list the fields you actually need

3. Use joins instead of subqueries

4. Split large DELETE or INSERT statements

5. Enable the slow query log to find slow SQL

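6. Do not compute on columns: a condition such as SELECT * FROM table WHERE age + 1 = 10 operates on the column itself and leads to a table scan. Move the operation to the other side of the comparison instead (WHERE age = 9)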

7. Keep SQL statements as simple as possible: one statement runs on only one CPU; break large statements into small ones to reduce lock time; one large statement can block the entire database

8. Prefer IN to OR: the cost of OR is on the order of n, while that of IN is on the order of log(n). Keep the number of values in an IN list within 200

9. Avoid functions and triggers; implement the logic in the application

10. Avoid LIKE '%xxx' queries (a leading wildcard)

11. Use JOIN sparingly

12. Compare values of the same type: '123' with '123', 123 with 123

13. Avoid != and <> in WHERE clauses; otherwise the engine may abandon the index and do a full table scan

14. For consecutive values, use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5

15. Do not fetch the whole table for list data; paginate with LIMIT, and do not have too many pages
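Item 4 (splitting a large DELETE) can be sketched like this. It is a minimal sketch under loud assumptions: SQLite in memory stands in for MySQL, and the access_log table and the delete_in_batches helper are hypothetical. MySQL accepts DELETE ... LIMIT n directly; the subquery form is used here only because stock SQLite does not.

```python
import sqlite3


def delete_in_batches(conn, cutoff, batch_size=1000):
    """Delete rows older than cutoff in small chunks; return rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM access_log WHERE id IN ("
            "  SELECT id FROM access_log WHERE accesstime < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Hypothetical table; SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (id INTEGER PRIMARY KEY, accesstime TEXT)")
conn.executemany(
    "INSERT INTO access_log (accesstime) VALUES (?)",
    [("2017-%02d-01" % (i % 12 + 1),) for i in range(5000)],
)
conn.commit()

deleted = delete_in_batches(conn, "2017-07-01", batch_size=500)
print(deleted)
```

Each batch is its own short transaction, which is the point of the rule: other statements get a chance to run between batches instead of waiting behind one huge delete.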

The original developers have run away and the program is already live; I cannot change the SQL. So this approach cannot be carried out either. Give up!

Engine

MyISAM and InnoDB are two widely used engines:

MyISAM

The MyISAM engine is the default engine for MySQL 5.1 and earlier. It features:

1. Row locks are not supported. A read locks every table it needs to read, and a write takes an exclusive lock on the table

2. Transactions are not supported

3. Foreign keys are not supported

4. Safe recovery after a crash is not supported

5. New records can be inserted into a table while it is being read

6. Supports indexing the first 500 characters of BLOB and TEXT columns, and supports full-text indexes

7. Supports delayed index update, greatly improving write performance

8. Tables that are not modified can be compressed to greatly reduce disk space usage

InnoDB

InnoDB became the default engine as of MySQL 5.5 and features:

1. Supports row locking, using MVCC to support high concurrency

2. Supports transactions

3. Foreign keys are supported

4. Safe recovery after a crash is supported

5. Full-text indexes are not supported before MySQL 5.6 (InnoDB gained FULLTEXT indexes in 5.6)

In general, MyISAM is good for SELECT intensive tables, while InnoDB is good for INSERT and UPDATE intensive tables

MyISAM may be super fast and take little storage space, but the program requires transaction support, so InnoDB is a must. This approach cannot be carried out. Abandoned!

3. Partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting. The user adds partitioning parameters when creating the table; it is transparent to the application, so no code needs to change.

To the user, a partitioned table is a single logical table, but underneath it is made up of multiple physical sub-tables. The code that implements partitioning wraps a set of underlying tables in objects, and to the SQL layer it is a black box that completely hides them. The way MySQL implements partitioning also means that indexes are defined per partition sub-table; there is no global index.

The user's SQL statements need to be optimized for the partitioned table. The partitioning column should appear in the query conditions so that the query is narrowed to a small number of partitions; otherwise all partitions are scanned. EXPLAIN PARTITIONS shows which partitions a given SQL statement will touch, which helps with SQL optimization. Queries that do not include the partitioning column also got faster here, so this measure is worth a try.

The benefits of partitioning are:

1. More data can be stored in a single table

2. Data in partitioned tables is easier to maintain. You can delete a large amount of data in batches by clearing the entire partition, or add new partitions to support newly inserted data. In addition, a separate partition can be optimized, checked, repaired, and so on

3. Some queries can be determined from their conditions to touch only a few partitions, which makes them very fast

4. Partitioned table data can also be distributed on different physical devices, making use of multiple hardware devices

5. Partitioned tables can be used to avoid specific bottlenecks, such as mutually exclusive access to a single InnoDB index or inode lock contention on the ext3 file system

6. You can back up and restore a single partition

Limitations and disadvantages of partitioning:

1. A table can have a maximum of 1024 partitions (8192 as of MySQL 5.6.7)

2. If the table has a primary key or unique index, every primary key and unique index must include all of the partitioning columns

3. Partitioned tables cannot use foreign key constraints

4. NULL values defeat partition filtering

5. All partitions must use the same storage engine

Types of partition:

1. RANGE partitioning: assigns rows to partitions based on column values falling within a given continuous range

2. LIST partitioning: similar to RANGE partitioning, except that the partition is selected by matching the column value against a discrete set of values

3. HASH partitioning: the partition is selected from the return value of a user-defined expression, evaluated on the column values of the row being inserted. The expression can be any valid MySQL expression that produces a non-negative integer value

4. KEY partitioning: similar to HASH partitioning, except that you supply only one or more columns and the MySQL server provides its own hash function. The columns do not have to contain integer values, because MySQL's hash function always yields an integer

For details about MySQL partitioning concepts, search Google or consult the official documentation.

First, I split the Internet record table into 12 partitions by month. Query efficiency improved about 6 times, but the effect was still not obvious enough. So I switched to HASH partitioning on the ID column with 64 partitions, which significantly improved query speed. Problem solved!

Result: PARTITION BY HASH (ID) PARTITIONS 64

select count(*) from readroom_website; -- 11901336 rows

/* Affected rows: 0 Records found: 1 Warnings: 0 Duration for 1 query: 5.734 sec. */

select * from readroom_website where month(accesstime) =11 limit 10;

/* Affected rows: 0 Records found: 10 Warnings: 0 Duration for 1 query: 0.719 sec. */
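To make the HASH result more concrete: the MySQL manual describes HASH partitioning as taking MOD(expr, number of partitions). The Python sketch below mimics that rule for PARTITION BY HASH (ID) PARTITIONS 64. It is an illustration only, not MySQL's code; the function names are made up, and it assumes non-negative integer ids.

```python
NUM_PARTITIONS = 64  # matches PARTITIONS 64 in the result above


def hash_partition(row_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Partition index under the MOD(expr, N) rule; assumes row_id >= 0."""
    return row_id % num_partitions


def partitions_for_ids(ids):
    """Partitions that a query with WHERE ID IN (...) would need to read."""
    return sorted({hash_partition(i) for i in ids})


print(hash_partition(11901336))          # 24
print(partitions_for_ids([1, 65, 129]))  # [1]
# Consecutive ids spread over every partition, so the rows are evenly split.
print(partitions_for_ids(range(1, 1001)) == list(range(64)))  # True
```

A condition on ID lets the server go straight to one partition. A query such as the month(accesstime) example above carries no condition on ID, so it cannot be narrowed this way.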

4. Table splitting

Table splitting is for a large table that has been through all the optimizations above and still stalls on queries: split it into multiple tables, turn one query into several queries, and return the combined results to the user.

Splitting can be vertical or horizontal, usually keyed on one field. For example, split on the ID field into 100 tables: tableName_id%100
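The tableName_id%100 rule can be sketched in a few lines of Python. The helper names shard_table and group_ids_by_table are hypothetical, and the sketch assumes non-negative integer ids.

```python
def shard_table(base_name: str, row_id: int, shards: int = 100) -> str:
    """Name of the physical table that holds row_id under id % shards."""
    return f"{base_name}_{row_id % shards}"


def group_ids_by_table(base_name, ids, shards=100):
    """Group ids by physical table: one logical query becomes one per table."""
    groups = {}
    for row_id in ids:
        groups.setdefault(shard_table(base_name, row_id, shards), []).append(row_id)
    return groups


print(shard_table("readroom_website", 12345))  # readroom_website_45
print(group_ids_by_table("readroom_website", [5, 105, 42]))
# {'readroom_website_5': [5, 105], 'readroom_website_42': [42]}
```

Every query in the application has to go through routing like this, which is exactly the source code change that makes the approach expensive for a system that is already live.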

But table splitting requires changes to the source code, brings a lot of work to development, and greatly increases development cost. It only suits systems where a large data volume was anticipated early on and the splitting was designed in from the start. It does not suit an application that is already live; the cost is too high!! Choosing this plan also costs more than Solutions 2 and 3 described below. Therefore, it is not recommended.

5. Database splitting

Split one database into many. I suggest that read/write separation is enough; real database splitting also brings a lot of development cost, and the loss outweighs the gain! Not recommended.
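Read/write separation at the application level can be sketched roughly as below. This is a deliberately naive illustration: the class name and host names are hypothetical, and it ignores transactions, SELECT ... FOR UPDATE, and replication lag, all of which a real setup has to handle.

```python
import itertools


class ReadWriteRouter:
    """Send SELECTs to replicas in round-robin order, all else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def target_for(self, sql: str) -> str:
        parts = sql.split(None, 1)
        first_word = parts[0].upper() if parts else ""
        if first_word == "SELECT":
            return next(self._replicas)
        return self.primary


# Hypothetical host names.
router = ReadWriteRouter("primary:3306", ["replica1:3306", "replica2:3306"])
print(router.target_for("SELECT * FROM readroom_website LIMIT 10"))  # replica1:3306
print(router.target_for("INSERT INTO readroom_website VALUES (1)"))  # primary:3306
print(router.target_for("select 1"))                                 # replica2:3306
```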

Solution 2: Upgrade to a database that is 100% compatible with MySQL

If MySQL does not perform well, replace it. To avoid changing the source code and to migrate the existing services smoothly, the replacement must be 100% compatible with MySQL.

Open source options

1. TiDB https://github.com/pingcap/tidb

2. CUBRID https://www.cubrid.org/

Open source databases bring a lot of operation and maintenance cost, their production quality still lags behind MySQL, and there are many pitfalls to step in. If your company requires you to run your own database, choose this type of product.

Cloud database options

1. Alibaba Cloud POLARDB

https://www.aliyun.com/product/polardb?spm=a2c4g.11174283.cloudessentials.47.7a984b5cs7h4wh

POLARDB is Alibaba Cloud's self-developed next-generation relational, distributed, cloud-native database. It is 100% compatible with MySQL, offers storage capacity up to 100 TB, and delivers up to 6 times the performance of MySQL. POLARDB combines the stability, reliability, and high performance of commercial databases with the simplicity, scalability, and continuous iteration of open source databases, at only 1/10 the cost of a commercial database.

I signed up and tested it: it supports free MySQL data migration with no operational cost, performance improved by about 10 times, and the price is about the same as RDS. A good alternative!

2. Alibaba Cloud OceanBase

Used by Taobao and proven under the load of Double 11, with outstanding performance. It is still in public beta, so I could not try it, but it is worth looking forward to.

3. Alibaba Cloud HybridDB for MySQL (formerly PetaData)

https://www.aliyun.com/product/petadata?spm=a2c4g.11174283.cloudessentials.54.7a984b5cs7h4wh

Official introduction: Cloud database HybridDB for MySQL (formerly PetaData) is a Hybrid Transaction/Analytical Processing (HTAP) relational database that supports both OLTP and OLAP.

I have also tested it. It handles both OLAP and OLTP, but the price is too high, up to $10 per hour, which is wasteful if you only need storage. It suits businesses that need both storage and analysis.

4. Tencent Cloud DCDB

https://cloud.tencent.com/product/dcdb_for_tdsql

Official introduction: DCDB, also known as TDSQL, is a high-performance distributed database compatible with the MySQL protocol and syntax. It supports automatic horizontal splitting: the business sees a complete logical table, while the data is evenly split across multiple shards. Each shard works in active/standby mode by default, and the product provides a full set of solutions for disaster recovery, restore, monitoring, and capacity expansion. It targets massive data scenarios at the TB or PB level.

I do not like using Tencent's products, so I will not say much. The reason: when something goes wrong you cannot find anyone to help, and production problems that cannot be solved are a headache! But it is cheap, so it suits very small companies that want to play around.

Solution 3: Drop MySQL and process the data with a big data engine

Once the data volume passes 100 million, there is no choice but to go to big data.

Open source solutions

The Hadoop family, such as HBase/Hive. But operation and maintenance costs are high and the average company cannot afford to play; without an investment of a hundred thousand or so you will not get good output!

Cloud Solutions

Big data is the trend of the future. Professional companies provide professional services, and small companies or individuals buy those services. Big data will be present in every aspect of society, just like water, electricity, and other public utilities.

The best one in China is Aliyun.

I chose Aliyun's MaxCompute with DataWorks, which is super comfortable to use, billed by usage, and extremely low in cost.

MaxCompute can be thought of as similar to the open source Hive: it lets you manipulate data with SQL, MapReduce, AI algorithms, Python scripts, and shell scripts. Data is presented as tables, stored in a distributed fashion, and processed by scheduled tasks in batch mode. DataWorks provides a workflow-based way to manage your data processing tasks, with scheduling and monitoring.

You can also choose Alibaba Cloud HBase and other products. I mainly do offline processing, so I chose MaxCompute. Almost everything is done through a graphical interface; with about 300 lines of SQL and a cost of less than 100 yuan, the data processing problem was solved.