Problem overview
We were using Alibaba Cloud RDS for MySQL (MySQL 5.6). A user web-access record table had accumulated nearly 20 million rows in six months, and close to 40 million rows over the past year. Queries were extremely slow and hung every day, severely affecting the business.
Background: this is a legacy system. The table design and SQL read as if the original authors had barely graduated; the writing is garbage, hard to even look at. Those developers have long since left and maintenance fell to me. This is the legendary project whose authors ran away before it could be maintained, and I'm the one left holding the bag!!
I set out to solve the problem, and this post is the record.
Solution overview
- Solution 1: Optimize the existing MySQL database. Pros: no impact on the existing business, no source-code changes, lowest cost. Cons: optimization has a ceiling; past about 100 million rows it hits a dead end.
- Solution 2: Upgrade to a database that is 100% MySQL-compatible. Pros: no impact on the existing business, no source-code changes, database performance improves with almost no effort. Cons: costs more money.
- Solution 3: Go straight to a big-data solution and replace MySQL with a NewSQL/NoSQL database. Pros: strong scalability, low cost, no ceiling on data volume. Cons: requires source-code changes.
The three solutions can be applied in sequence. Below the 100-million-row level there is no need to switch to NoSQL; the development cost is too high. I tried all three, and every one of them produced a workable deployment. Throughout the process I cursed the runaway developers ten thousand times in my heart 🙂
Solution 1: Optimize the existing MySQL database
After phone calls with Alibaba Cloud database engineers, plus Googling and asking around in expert groups, the advice boils down to the following (all of it essential):
1. Consider performance when designing databases and creating tables
2. Pay attention to optimization when writing SQL
3. Partitioning
4. Splitting tables
5. Splitting databases
1. Consider performance when designing databases and creating tables
MySQL itself is highly flexible, which means performance depends heavily on the developer's ability: in capable hands MySQL performs well, in careless hands it performs badly. This is a common trait of relational databases, and it is also why good corporate DBAs are so well paid.
Note when designing tables:
- Avoid NULL in table fields; NULL is hard to optimize and takes extra index space. A default value of 0 is recommended.
- Use INT rather than BIGINT; if the value is non-negative, add UNSIGNED (which doubles the representable range). TINYINT, SMALLINT, and MEDIUMINT are better still where they fit.
- Use enumerations or integers instead of string types
- Use TIMESTAMP instead of DATETIME whenever possible
- Don't put too many fields in a single table; fewer than 20 is recommended
- Store IP addresses as integers
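As a sketch only, the advice above might produce a table like this (table and column names are hypothetical, MySQL 5.6 syntax):

```sql
-- Hypothetical access-log table following the design advice:
-- NOT NULL with defaults, small integer types, TIMESTAMP, IP stored as an integer.
CREATE TABLE access_log (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id     INT UNSIGNED NOT NULL DEFAULT 0,
    status      TINYINT UNSIGNED NOT NULL DEFAULT 0,  -- enum-like code, not a string
    ip          INT UNSIGNED NOT NULL DEFAULT 0,      -- store INET_ATON('1.2.3.4')
    accesstime  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Writing and reading an IP as an integer:
-- INSERT INTO access_log (user_id, ip) VALUES (42, INET_ATON('10.0.0.8'));
-- SELECT INET_NTOA(ip) FROM access_log;
```

MySQL's built-in INET_ATON/INET_NTOA pair handles the IPv4-to-integer conversion, so the application never has to do it by hand.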
Indexes
- More indexes are not always better. Create them for the actual queries: consider indexing the columns that appear in WHERE and ORDER BY clauses, and use EXPLAIN to check whether a query uses an index or falls back to a full table scan.
- Avoid testing fields for NULL in the WHERE clause, which makes the engine abandon the index and scan the whole table
- Don't index fields whose values are sparsely distributed, such as "gender", which has only two or three distinct values
- Index character fields by prefix only
- Character fields are best not used as primary keys
- No foreign keys; let the application enforce the constraint
- Avoid UNIQUE where possible; let the application enforce the constraint
- When using a multi-column index, keep the column order consistent with the query conditions, and remove unnecessary single-column indexes
In short: use the right data types and choose the right indexes.
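A hedged sketch of these rules (the table and column names are hypothetical): create only the indexes the queries need, prefix-index the character column, then verify with EXPLAIN.

```sql
-- Index the columns used in WHERE/ORDER BY; prefix-index the character field.
ALTER TABLE access_log
    ADD INDEX idx_user_time (user_id, accesstime),  -- matches WHERE user_id = ? ORDER BY accesstime
    ADD INDEX idx_url (url(20));                    -- prefix index on a character column

-- Verify that an index is used rather than a full table scan:
EXPLAIN SELECT id FROM access_log
WHERE user_id = 42 ORDER BY accesstime DESC;
-- Look for key = idx_user_time; type = ALL would mean a full table scan.
```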
Choose the right data type:
(1) Use the smallest data type that can hold the data: integer < date/time < char/varchar < blob.
(2) Use simple data types. Integers are cheaper to process than characters because string comparison is more complex; for example, store times as INT and convert IPs with a BIGINT-returning function.
(3) Use sensible lengths for field attributes; fixed-length tables are faster.

Choose the right index columns:
(1) Columns that appear frequently in WHERE, GROUP BY, ORDER BY, and ON clauses;
(2) columns compared with <, <=, =, >, >=, BETWEEN, IN, or LIKE 'string%' (trailing wildcard);
(3) short columns: the smaller the indexed field the better, because the database's storage unit is the page, and the more entries fit on a page the better;
(4) columns with high dispersion (many distinct values), placed first in a composite index. Check dispersion by counting distinct column values: the larger the count, the higher the dispersion.
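Dispersion can be checked with a query like the following sketch (table and column names are hypothetical); the column with the larger distinct count is the more selective one and should lead the composite index.

```sql
-- Compare how many distinct values each candidate column has.
SELECT COUNT(DISTINCT user_id) AS user_dispersion,
       COUNT(DISTINCT status)  AS status_dispersion
FROM access_log;
```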
But the original developers have run away and the tables are already built; I cannot modify them. Therefore: this advice cannot be applied here. Abandoned!
2. Pay attention to optimization when writing SQL
- Use LIMIT to bound the query result set
- Avoid SELECT *; list the fields you actually need
- Use JOINs instead of subqueries
- Split very large DELETE or INSERT statements
- Find slow SQL by enabling the slow query log
- Don't compute on columns: SELECT id WHERE age + 1 = 10 forces a table scan, as does any function or computed expression on the column. Where possible, move the operation to the right-hand side of the comparison.
- Keep SQL statements as simple as possible: one statement runs on only one CPU; break large statements into small ones to shorten lock times; a single huge statement can block the whole database
- OR is O(n) while IN is O(log n); it is recommended to keep IN lists under 200 items
- No stored functions or triggers; implement that logic in the application
- Avoid LIKE '%xxx' queries (leading wildcard)
- Use JOIN sparingly
- Compare values of the same type, e.g. '123' with '123' and 123 with 123
- Avoid != and <> in the WHERE clause, otherwise the engine abandons the index and scans the whole table
- For continuous ranges, use BETWEEN rather than IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
- Don't fetch the whole table for list views; paginate with LIMIT, and don't make pages too large
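A few of the rules above, illustrated as a hedged sketch (hypothetical table and column names):

```sql
-- Bad: computing on the column defeats the index (full table scan).
SELECT id FROM access_log WHERE YEAR(accesstime) = 2018;

-- Better: move the work to the right of the comparison so the index can be used.
SELECT id FROM access_log
WHERE accesstime >= '2018-01-01' AND accesstime < '2019-01-01';

-- Bad: leading wildcard cannot use an index.
SELECT id FROM access_log WHERE url LIKE '%login';
-- OK: a prefix match can.
SELECT id FROM access_log WHERE url LIKE '/login%';

-- Paginate list views instead of fetching everything:
SELECT id, url FROM access_log ORDER BY id DESC LIMIT 20 OFFSET 0;
```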
But again: the original developers have run away and the program is already live; I cannot modify its SQL. Therefore: this advice cannot be applied either. Give up!
Storage engines
MyISAM and InnoDB are two widely used engines:
MyISAM
The MyISAM engine was the default in MySQL 5.1 and earlier. Its characteristics:
- No row-level locks; reads take locks on every table they touch, and writes take exclusive table locks
- No transactions
- No foreign keys
- No safe recovery after a crash
- New records can be inserted into a table while queries are reading from it
- Indexes on the first 500 characters of BLOB and TEXT columns are supported, as are full-text indexes
- Delayed index updates are supported, which greatly improves write performance
- Tables that are never modified can be compressed, greatly reducing disk usage
InnoDB
InnoDB became the default engine in MySQL 5.5. Its characteristics:
- Row-level locking, with MVCC to support high concurrency
- Transactions supported
- Foreign keys supported
- Safe recovery after a crash
- No full-text indexing (InnoDB only gained full-text indexes in MySQL 5.6)
In general, MyISAM suits SELECT-intensive tables, while InnoDB suits INSERT- and UPDATE-intensive tables.
MyISAM may be extremely fast and compact on disk, but our program needs transaction support, so InnoDB is mandatory. This scheme cannot be executed either. Abandon!
3. Partitioning
Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting: the user adds partitioning parameters when creating the table, and it is transparent to the application, requiring no code changes.
To the user a partitioned table is a single logical table, but underneath it is made up of multiple physical sub-tables. The partitioning code is essentially an object-level encapsulation of a set of underlying tables, a complete black box as far as the SQL layer is concerned. The way MySQL implements partitioning also means indexes are defined per partition sub-table; there is no global index.
The user's SQL does need to be written with the partitioned table in mind: include the partitioning column in the WHERE conditions so the query can be pruned to a small number of partitions, otherwise every partition is scanned. You can check which partitions a given statement will hit with EXPLAIN PARTITIONS. In my experience even some queries that lack the partitioning column got faster, so this measure is worth a try.
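On MySQL 5.6, checking partition pruning might look like this (the table name is from this article; the predicate is a hypothetical example):

```sql
-- The "partitions" column of the output lists the partitions the query touches;
-- if it lists all of them, the WHERE clause is not pruning anything.
EXPLAIN PARTITIONS
SELECT * FROM readroom_website WHERE id = 123456;
```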
The benefits of partitioning are:
- A single table can hold more data
- Partitioned data is easier to maintain: large amounts of data can be removed in bulk by clearing a whole partition, and new partitions can be added for newly inserted data; individual partitions can also be optimized, checked, and repaired on their own
- Some queries can be pruned, from the query conditions alone, to just a few partitions, which can be very fast
- Partitioned data can be spread across different physical devices to make efficient use of multiple pieces of hardware
- Partitioned tables can be used to avoid certain special bottlenecks, such as mutex contention on a single InnoDB index or inode lock contention in the ext3 file system
- Individual partitions can be backed up and restored
Limitations and disadvantages of partitioning:
- A table can have at most 1,024 partitions
- If the partitioning field includes any primary-key or unique-index column, then all primary-key and unique-index columns must be included in it
- Partitioned tables cannot use foreign key constraints
- NULL values invalidate partition pruning
- All partitions must use the same storage engine
Types of partitioning:
- RANGE partitioning: assigns rows to partitions based on column values falling within given contiguous intervals
- LIST partitioning: like RANGE, except partitions are selected by the column value matching one of a discrete set of values
- HASH partitioning: selects the partition from the return value of a user-defined expression computed over the column values of the rows being inserted; the expression can be any valid MySQL expression that produces a non-negative integer
- KEY partitioning: like HASH, except that the partitioning key is simply one or more columns and the MySQL server supplies its own hashing function

For more details on MySQL partitioning, Google it or see the official documentation.
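Minimal sketches of the four partitioning types (all table and column names are hypothetical):

```sql
-- RANGE: rows assigned by contiguous intervals of an integer value.
CREATE TABLE t_range (id INT NOT NULL, created YEAR NOT NULL)
PARTITION BY RANGE (created) (
    PARTITION p2017 VALUES LESS THAN (2018),
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- LIST: rows assigned by membership in a discrete set of values.
CREATE TABLE t_list (id INT NOT NULL, region TINYINT NOT NULL)
PARTITION BY LIST (region) (
    PARTITION pnorth VALUES IN (1, 2),
    PARTITION psouth VALUES IN (3, 4)
);

-- HASH: rows spread by a user-supplied integer expression.
CREATE TABLE t_hash (id INT NOT NULL)
PARTITION BY HASH (id) PARTITIONS 8;

-- KEY: like HASH, but MySQL supplies the hashing function over the listed columns.
CREATE TABLE t_key (code CHAR(8) NOT NULL)
PARTITION BY KEY (code) PARTITIONS 8;
```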
First I partitioned the access-record table by month into 12 partitions, which improved query efficiency roughly sixfold, but the effect was still not dramatic. So I switched to HASH partitioning on the ID, split into 64 partitions, and query speed improved significantly. Problem solved!
Result: PARTITION BY HASH (id) PARTITIONS 64

```sql
select count(*) from readroom_website;  -- 11901336 rows
/* Affected rows: 0  Rows found: 1  Warnings: 0  Duration for 1 query: 5.734 sec. */

select * from readroom_website where month(accesstime) = 11 limit 10;
/* Affected rows: 0  Rows found: 10  Warnings: 0  Duration for 1 query: 0.719 sec. */
```
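The article does not show the exact DDL, but the two partition layouts described above would be declared roughly like this (hedged reconstruction; it assumes accesstime is a DATETIME column, since MySQL allows only UNIX_TIMESTAMP() as a partitioning function on TIMESTAMP columns, and requires the partitioning column to appear in every primary/unique key):

```sql
-- First attempt: 12 partitions by month of accesstime (sketch).
ALTER TABLE readroom_website
PARTITION BY RANGE (MONTH(accesstime)) (
    PARTITION p1  VALUES LESS THAN (2),
    PARTITION p2  VALUES LESS THAN (3),
    -- ... p3 through p11 ...
    PARTITION p12 VALUES LESS THAN MAXVALUE
);

-- Final layout from the article: 64 HASH partitions on id.
ALTER TABLE readroom_website
PARTITION BY HASH (id) PARTITIONS 64;
```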
4. Splitting tables
Splitting tables means taking a large table that is still slow after all of the optimizations above and dividing it into multiple tables, turning one query into several whose results are combined and returned to the user.
Table splits are either vertical or horizontal, and usually keyed on one field. For example, splitting on the ID field into 100 tables: tableName_id%100.
But: splitting tables requires source-code changes, which means a great deal of development work and greatly increased cost. It only suits systems that planned for large data volumes up front and did the split early; retrofitting it onto a live application costs far too much!! And if you are choosing this plan anyway, solutions 2 and 3 below are cheaper. Therefore: not recommended.
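The tableName_id%100 scheme routes each row by its ID. As a toy illustration (the table name is hypothetical):

```sql
-- Row with id = 123456 lands in shard 123456 % 100 = 56,
-- so the application reads and writes the table orders_56.
SELECT CONCAT('orders_', 123456 % 100) AS target_table;  -- 'orders_56'
```

In practice the application layer computes this table name and issues the query against it, which is exactly the source-code change that makes this plan expensive.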
5. Splitting databases
This means splitting one database into many. My suggestion: just set up read/write splitting instead. Truly splitting the database also brings heavy development costs, and the gains don't justify the effort. Not recommended.
Solution 2: Upgrade to a database that is 100% compatible with MySQL
If MySQL doesn't perform well enough, replace it. To keep the source code untouched and migrate the existing service smoothly, the replacement must be 100% MySQL-compatible.
Open-source options:
- TiDB https://github.com/pingcap/tidb
- CUBRID https://www.cubrid.org/

Open-source databases bring significant operations costs, and their industrial maturity still lags MySQL's; there are many pitfalls to step in. If your company requires a self-built database, this is the category to choose from.

Cloud options:
- Alibaba Cloud POLARDB
https://www.aliyun.com/product/polardb?spm=a2c4g.11174283.cloudEssentials.47.7a984b5cS7h4wH
POLARDB is Alibaba Cloud's self-developed next-generation relational distributed cloud-native database: 100% MySQL-compatible, storage capacity up to 100 TB, and performance up to 6x that of MySQL. It combines the stability, reliability, and high performance of commercial databases with the simplicity, scalability, and continuous iteration of open-source ones, at roughly 1/10 the cost of a commercial database.
I opened a trial: it supports free MySQL data migration with no operations overhead, performance improved roughly tenfold, and pricing is close to RDS. A good alternative!
- Alibaba Cloud OceanBase
Used by Taobao and battle-tested by Double Eleven, with outstanding performance, but it was still in closed beta and I couldn't try it. Worth watching.
- Alibaba Cloud HybridDB for MySQL (formerly PetaData)
https://www.aliyun.com/product/petadata?spm=a2c4g.11174283.cloudEssentials.54.7a984b5cS7h4wH
Official introduction: HybridDB for MySQL (formerly PetaData) is a Hybrid Transaction/Analytical Processing (HTAP) relational database that supports both OLTP and OLAP.
I tested it too. It genuinely covers both OLAP and OLTP, but the price is too high, up to about $10 per hour, and it is wasteful as plain storage; it suits workloads that need both storage and analysis.
- Tencent Cloud DCDB
https://cloud.tencent.com/product/dcdb_for_tdsql
Official introduction: DCDB, also known as TDSQL, is a high-performance distributed database compatible with the MySQL protocol and syntax. It supports automatic horizontal splitting: the business sees a single complete logical table while the data is evenly divided across multiple shards; each shard runs in active/standby mode by default, with a full set of disaster-recovery, backup, restore, monitoring, and scaling solutions. It suits TB- and PB-scale data.
I personally don't like using Tencent, and I'll leave it at that. The reason: when there's a problem you can't reach anyone, and unsolvable production issues are a headache! But it is cheap, suitable for very small companies to play with.
Solution 3: Drop MySQL and process the data with a big-data engine
Once data volume passes the 100-million-row level, there is no choice but to go to big data.
Open-source options:
- The Hadoop family: HBase or Hive, whichever you prefer. But operations costs are high; ordinary companies can't afford to play, and without an investment of a hundred thousand or more there is no good return.

Cloud options:
- Big data is the future. Professional companies provide professional services, and small companies or individuals buy those services; big data will permeate every corner of society the way utilities like water and electricity do.
- The best option in China is Alibaba Cloud.
- I chose Alibaba Cloud MaxCompute paired with DataWorks: very comfortable to use, pay-as-you-go, extremely low cost.
- MaxCompute can be thought of as a hosted counterpart to open-source Hive. It offers SQL, MapReduce, AI algorithms, Python scripts, and shell scripts for working with data; data is presented as tables, stored in a distributed manner, and processed by scheduled batch tasks. DataWorks adds workflow-style management of data-processing tasks, plus scheduling and monitoring.
- You could also choose Alibaba Cloud HBase or other products; my workload here was offline processing, so I chose MaxCompute. It is essentially all GUI operation; about 300 lines of SQL and under 100 yuan in cost solved the data-processing problem.