Abstract: The core of million-level data processing is the design of the data storage scheme. Whether the storage design is reasonable directly affects the performance of CRUD operations on the data. The overall design should consider several aspects: data storage structure design, index design, primary key design, and query scheme design.

In my opinion, the core of million-level data processing is the design of the data storage scheme. Whether the storage design is reasonable will directly affect data CRUD operations. The overall design can be considered from the following aspects:

  • Data storage structure design
  • Index design
  • Data primary key design
  • Query scheme design

Million-level data processing scheme:

Data storage structure design

Table field design

Define table fields as NOT NULL, with a default value such as 0 recommended, because NULL values are difficult to optimize in queries and consume extra index space.

For state or type fields such as status and type, try not to use negative values such as -1; that way the column can be declared UNSIGNED, which doubles its positive range.

Use the smallest integer type that meets the need, such as TINYINT or SMALLINT instead of INT, and avoid BIGINT unless necessary, because smaller types take up less space.

A string field takes up more space than a numeric type, so use integers instead of strings; many scenarios can be handled with encoding logic.

Do not set string column lengths arbitrarily; keep them only as long as the business requires.

Store IP addresses as integers.

Do not give a single table too many fields; 20 or fewer is recommended.

Consider reserving predictable fields in advance, because altering the table structure becomes time-consuming as the data volume grows.
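
A minimal DDL sketch applying these field-design rules; the table and column names are hypothetical, not from the original text:

CREATE TABLE user_info (
    id        INT UNSIGNED      NOT NULL AUTO_INCREMENT,  -- smallest integer type that fits
    status    TINYINT UNSIGNED  NOT NULL DEFAULT 0,       -- state field: NOT NULL, default 0, UNSIGNED
    city_code SMALLINT UNSIGNED NOT NULL DEFAULT 0,       -- integer code instead of a city-name string
    login_ip  INT UNSIGNED      NOT NULL DEFAULT 0,       -- IPv4 address stored as an integer
    nickname  VARCHAR(32)       NOT NULL DEFAULT '',      -- string length kept as small as the business allows
    ext1      INT UNSIGNED      NOT NULL DEFAULT 0,       -- reserved field for foreseeable needs
    PRIMARY KEY (id)
);

-- write and read the IP address with MySQL's built-in conversion functions
INSERT INTO user_info (status, login_ip) VALUES (1, INET_ATON('192.168.0.1'));
SELECT INET_NTOA(login_ip) FROM user_info WHERE id = 1;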

Index design

Index design essentially means designing indexes according to business requirements to handle millions of rows. Get into the habit of using EXPLAIN; it helps you write more reliable SQL and teaches you how your queries are executed.
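
For example, prefixing a query with EXPLAIN (against the hypothetical user_info table above) shows how it will be executed:

EXPLAIN SELECT id, status FROM user_info WHERE status = 1;
-- inspect the output: "key" shows the index actually chosen, "rows" estimates how many
-- rows will be examined, and "type: ALL" warns of a full table scan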

A common point worth remembering: more indexes is not always better, because indexes slow down write performance.

Keep index fields as short as possible; this saves a lot of index space.

Consider dropping foreign key constraints and enforcing the relationships in application code instead, for better performance.

Composite indexes follow the left-most prefix rule; keep the index column order consistent with your query conditions, and avoid unnecessary single-column indexes.
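
A sketch of the left-most prefix rule, assuming a hypothetical orders table with a composite index on (user_id, status):

ALTER TABLE orders ADD INDEX idx_user_status (user_id, status);

-- these can use the index: the left-most column user_id is present
SELECT * FROM orders WHERE user_id = 42;
SELECT * FROM orders WHERE user_id = 42 AND status = 1;

-- this cannot: filtering on status alone skips the left-most column
SELECT * FROM orders WHERE status = 1;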

Fields with low cardinality (few distinct values) are poor candidates for indexing; for example, if a field such as gender has only two or three values, indexing it makes little sense.

Fields used for sorting should be indexed, because indexes are already sorted, which improves query performance.

Index string fields with a prefix index instead of a full-column index, which greatly reduces index space.
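
For example, assuming the hypothetical user_info table has a VARCHAR email column, a 10-character prefix index might look like this:

-- only the first 10 characters of email go into the index tree
ALTER TABLE user_info ADD INDEX idx_email (email(10));

-- sanity-check the prefix length: the closer this ratio is to 1, the more selective the prefix
SELECT COUNT(DISTINCT LEFT(email, 10)) / COUNT(DISTINCT email) FROM user_info;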

Query statement optimization

Break complex nested queries into several simple queries where possible.

Do not use SELECT * in queries; query indexed fields instead, so the index can cover the query and avoid a trip back to the table.
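
A sketch of the difference, using the hypothetical idx_user_status (user_id, status) index from earlier:

-- covered: both selected columns live in the index, so no lookup back to the table
SELECT user_id, status FROM orders WHERE user_id = 42;

-- not covered: * forces a lookup back into the clustered index for every matching row
SELECT * FROM orders WHERE user_id = 42;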

Use LIMIT to cap the number of rows a query returns.

Make sure query conditions fall on indexed fields, especially composite indexes, and pay attention to left-most prefix matching.

Split up large DELETE/INSERT operations: they lock the table, which blocks other business operations, and MySQL also limits the length of a single SQL statement.
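
A minimal sketch of batching a large delete, assuming the hypothetical orders table has a created_at column:

-- delete in small batches instead of one huge statement; each batch holds
-- locks only briefly, so other business SQL can interleave between batches
DELETE FROM orders WHERE created_at < '2020-01-01' LIMIT 1000;
-- the application repeats this statement until ROW_COUNT() returns 0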

Avoid doing calculations with MySQL functions when the application can do them first. As with the points above, do not push work onto the database that the application can handle, because most server performance bottlenecks are in the database.

For counting rows, the performance ordering is roughly: COUNT(1) = COUNT(*) > COUNT(primary key) > COUNT(other field).

For query operators, prefer BETWEEN over IN, and prefer IN over OR.

Avoid operators such as != or <>, IS NULL or IS NOT NULL, and NOT IN, because such conditions generally cannot use indexes.

Keep SQL simple and use fewer joins; no more than two joins per query is recommended.

Ten-million-level data processing scheme:

Data storage structure design

At this stage, the data itself has become very valuable. Beyond regular business needs, there will be some data analysis requirements. At this point the data does not change much, so modifying the original structure is not considered; optimization is generally approached from three directions: partitioning, table splitting, and database splitting:

Partitioning

Partitioning is a horizontal split in which the database breaks a table into smaller, more manageable pieces according to certain rules. It is completely transparent to the application and does not affect business logic, i.e. no code changes are required. It lets you store more data, and queries and deletes can operate per partition, achieving the optimization goal. If you are considering partitioning, prepare ahead of time for the following restrictions (a DDL sketch follows the list):

  • A table can have a maximum of 1024 partitions (8192 since MySQL 5.6). Even so, it is best not to touch more than 100 partitions in a single operation, because opening partitions also takes time.
  • If the table has a primary key or unique index, the partition key must be part of it: every primary key and unique index on the table must include all columns used in the partitioning expression.
  • Foreign key constraints cannot be used in partitioned tables.
  • NULL values will invalidate partition filtering and will be placed in the default partition. Do not allow NULL values in partition fields.
  • All partitions must use the same storage engine.
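
A minimal sketch of a RANGE-partitioned table (names are illustrative); note that the partition column created_year is included in the primary key, as the restriction above requires:

CREATE TABLE logs (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_year SMALLINT        NOT NULL,
    message      VARCHAR(255)    NOT NULL DEFAULT '',
    PRIMARY KEY (id, created_year)       -- partition column must appear in every unique key
)
PARTITION BY RANGE (created_year) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- a query that includes the partition column only touches the matching partition
SELECT COUNT(*) FROM logs WHERE created_year = 2022;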

Table splitting

Table splitting comes in two forms: horizontal splitting and vertical splitting.

Horizontal splitting divides the data across several small tables with the same structure, such as Table1, Table2, ..., to relieve database read and write pressure.

Vertical splitting moves some fields out into a new table, so each table has a different structure; this can reduce table-lock contention under high concurrency.

As you can imagine, splitting tables requires changes to application logic. Therefore it is generally considered only in the early stage of a project, when a large data volume is foreseeable; introducing it late in a project is not recommended, because the cost is very high.
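
A sketch of horizontal splitting, assuming an existing orders table; the two small tables share one structure and the application routes rows by a modulo rule (all names and the rule are illustrative):

-- identical structure, different names
CREATE TABLE orders_0 LIKE orders;
CREATE TABLE orders_1 LIKE orders;

-- the application computes user_id % 2 to pick the table: user 42 lives in orders_0
SELECT * FROM orders_0 WHERE user_id = 42;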

Database splitting

Database splitting generally means a master-slave mode: data from the master node of a database server is replicated to one or more slave nodes, forming multiple databases. The master database handles write operations and the slave databases handle reads, achieving read/write separation, high availability, data backup, and so on.

Of course, master-slave mode also has drawbacks, such as replication lag and binlog issues, which I won’t go into here.

Other

Hot/cold table isolation. If historical data is queried by only a few users, it can be moved to a separate cold database that serves queries only, relieving the pressure of the large data volume in the hot table.

Database table primary key design

For the database primary key, I recommend an auto-incrementing numeric ID with a time attribute, generated by a distributed ID algorithm:

  • Snowflake algorithm
  • Baidu distributed ID algorithm
  • Meituan distributed ID algorithm

So why do we use these algorithms? It has to do with MySQL's data storage structure.

In business terms

When designing a database it is tempting to pick a business field as the primary key, but such fields are usually only theoretically unique; for example, a book number used as a primary key is unique in theory but may be duplicated in practice. So set a business-independent auto-increment ID as the primary key, and add a unique constraint on the book number instead.
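
A sketch of this rule for the book example: a business-independent auto-increment ID as the primary key, with uniqueness of the book number enforced as a constraint instead:

CREATE TABLE book (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,   -- business-independent surrogate key
    book_no VARCHAR(32)     NOT NULL,                  -- "theoretically unique" business number
    title   VARCHAR(128)    NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    UNIQUE KEY uk_book_no (book_no)                    -- uniqueness as a constraint, not as the PK
);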

Technically speaking

  1. If the table uses an auto-increment primary key, each new record is appended in order after the current last index node, and when a page fills up a new page is automatically allocated. The overall result is improved query and insert performance.
  2. InnoDB’s primary key index stores both the index value and the row data in its leaf nodes, which means the data file itself is organized as a B+ tree.
  3. If no primary key is defined, InnoDB uses the first non-null UNIQUE index; if there is no non-null UNIQUE index either, it generates a hidden 6-byte ROWID as the primary key. In a clustered index, N rows form a page (a page is typically 16 KB in size). If data is inserted in random order, keeping the B+ tree balanced causes frequent page splits and page rotations, and inserts become relatively slow. Therefore the primary key of the clustered index should be continuously increasing, not random (do not use random strings or UUIDs).
  4. So for InnoDB primary keys, use an integer type, and an increasing one; this is very efficient for both storage and queries.

MySQL interview questions

Query optimization schemes for MySQL tables with tens of millions of rows

With LIMIT, the larger the offset, the slower the query. This leads to two conclusions:

  1. The query time of a LIMIT statement is proportional to the position of the starting record.
  2. MySQL's LIMIT is very convenient, but it is not suitable for direct use on tables with many records.

The test table uses InnoDB as its storage engine, with id as an auto-increment primary key (and therefore a default primary key index):

SELECT id FROM test LIMIT 9000000,100;

At present there are two optimization schemes: use the id as the query condition via a subquery, or use a JOIN.

1. Implementation using the id >= (subquery) form:

SELECT * FROM test WHERE id >= (SELECT id FROM test LIMIT 9000000,1) LIMIT 0,100;

2. Implementation using JOIN:

SELECT * FROM test a JOIN (SELECT id FROM test LIMIT 9000000,100) b ON a.id = b.id

Both optimized queries take a similar amount of time because they rely on the same principle, so the results are similar. However, I recommend the JOIN form, to minimize the use of subqueries. Note: the queries above are at the ten-million-row level; at the million level, as below, they are even faster.

SELECT * FROM test a JOIN (SELECT id FROM test LIMIT 1000000,100) b ON a.id = b.id

Have you used MySQL storage engines? What are their features and differences?

This is a question senior developers are often asked in interviews, and one we run into in everyday development. MySQL has many storage engines, but in practice InnoDB and MyISAM are used the most, so if the interviewer asks which storage engines MySQL has, these two common ones are all you need to mention.

  • What are their characteristics and differences?

MyISAM: the traditional default table type, based on ISAM (Indexed Sequential Access Method), a standard method for storing records and files. It is not transaction-safe and does not support foreign keys. MyISAM is suitable if you perform a large number of SELECT operations.

InnoDB: a transaction-safe engine that supports foreign keys and row locks; transactions are its biggest feature. If you have a lot of updates and inserts, InnoDB is recommended, especially under heavy concurrency and high QPS. Note: before MySQL 5.5 the default storage engine was MyISAM; since MySQL 5.5 the default is InnoDB.

  • Differences between MyISAM and InnoDB:
  1. InnoDB supports transactions, MyISAM does not. In InnoDB, each SQL statement is wrapped in its own transaction by default, which hurts speed, so it is best to group multiple statements between BEGIN and COMMIT to form a single transaction;
  2. InnoDB supports foreign keys, while MyISAM does not.
  3. InnoDB uses a clustered index with B+Tree as the index structure: the data file and the (primary key) index are bound together (the table data file is itself organized as a B+Tree), so a primary key is required, and lookups through the primary key index are very efficient. MyISAM uses a non-clustered index, also with B+Tree: the index is separate from the data file, and the index stores pointers into the data file. Primary and secondary indexes are independent.
  4. InnoDB does not store the exact number of rows in a table, so SELECT COUNT(*) FROM table requires a full table scan; MyISAM keeps the row count of the whole table in a variable, so it can be read out directly.
  5. InnoDB long lacked full-text indexes, while MyISAM supports them and queries them more efficiently; InnoDB has supported full-text indexes since MySQL 5.6.
  6. InnoDB supports both table locks and row locks (row locks by default), while MyISAM supports only table locks.
  7. InnoDB tables must have a primary key (if you do not specify one, InnoDB finds or generates one itself), while MyISAM tables can have none.
  8. InnoDB stores .frm and .ibd files, while MyISAM stores .frm, .MYD and .MYI files.
  9. InnoDB: .frm is the table definition file and .ibd is the data file.
  10. MyISAM: .frm is the table definition file, .MYD is the data file, and .MYI is the index file.

How would you optimize complex MySQL query statements?

When it comes to complex SQL optimization, the most common culprit is large, complicated statements produced by multi-table joins. How do we optimize this kind of SQL? Optimization actually follows a routine; just work through it. A complex SQL optimization checklist:

  1. Examine the SQL with the EXPLAIN keyword. EXPLAIN helps you analyze the performance bottlenecks of a query statement or table structure; its output tells you how your primary key and indexes are used, how the table is searched and sorted, whether there is a full table scan, and so on;
  2. Use indexed fields in query conditions as far as possible. If a table is filtered on multiple conditions, prefer a composite index, and pay attention to the order of its fields when using it.
  3. Multi-table association uses JOIN as far as possible to reduce the use of sub-query. Table associated fields use primary keys if they can be used, that is, use index fields whenever possible. If the associated field is not an index field, consider adding an index as appropriate.
  4. Try to use limit for paging bulk queries rather than all at once.
  5. Absolutely avoid the use of select *, try to select specific fields, reduce unnecessary field query;
  6. Try to convert OR into UNION ALL (a sketch follows this list).
  7. Avoid using is NULL or IS not NULL.
  8. Be careful with LIKE: leading-wildcard and double-wildcard patterns (e.g. '%abc', '%abc%') cannot use the index.
  9. Minimize the use of functions on fields after WHERE, because a function on a column causes its index to be ignored.
  10. Avoid the not-equal operators (!= and <>), because they do not use indexes.
  11. It is more efficient to use exists instead of in and not exists instead of not in.
  12. Avoid the HAVING clause. HAVING filters the result set only after all records have been retrieved, which requires sorting, summing, and so on. If you can limit the number of records through the WHERE clause, you can reduce this overhead.
  13. Don’t ORDER BY RAND()
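
For example, the OR-to-UNION ALL rewrite from item 6 might look like this (table and columns are illustrative); each branch can then use its own index, where the OR form may fall back to a full table scan:

-- before: OR across two differently-indexed columns may prevent index use
SELECT id FROM orders WHERE user_id = 42 OR status = 1;

-- after: each branch uses its own index; the second branch excludes
-- the overlap so the combined result matches the original query
SELECT id FROM orders WHERE user_id = 42
UNION ALL
SELECT id FROM orders WHERE status = 1 AND user_id <> 42;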
