preface

Text has been included to my lot warehouse, welcome Star:https://github.com/bin392328206/six-finger plant a tree is the best time ten years ago, followed by now

Each sentence

A sword strong enough to cut water, a breeze weak enough to calm the waves.

omg

Mobile Internet era, under the high concurrency of vast amounts of user data generated on a daily basis, based on the user using the data analysis, such as all need to rely on data statistics and analysis, when the data amount of hours, database optimization are not very important, once the data quantity is more and more big, the system response will be slow, TPS plummeted, until the service is not available.

Business background

Today? Is actually want to put himself under the question bank business architecture of the comb, and what is not, of course, cow force of the company, may say things they contact, not so the cowhide, but somehow also summarize the herself, belong to an online education companies, then of course question bank is the underlying foundation to support many business, of course, I also is not so long, to the company But the question bank business has been reconstructed for two times. This year, 2020 will be the third. Let me first comb through the storage design process

  • The first stage was from 2015 to 2018. At that time, it was said that the company was a question bank system made for an outsourcing company. At that time, due to the small size of the company at that time, r&d and other reasons, it was outsourced to others anyway. Net+sqlserver, and then the back of the system maintenance ah, the system response and so on slow, because do the problem record we used to do a problem to save a data, you can imagine that the amount of data slowly more than 500W, after all, the volume of business is also more and more large.
  • The second stage was the reconstruction in 2018. Before I came to the company, data was transferred to hbase for the purpose of data production, so hbase+mysql storage was adopted. However, phoniex was used, which had no change at the code level for us writing business. In fact, Phoniex is good for OLTP services, such as indexing and multi-table joint lookup. However, for OLAP services, phoniex is not able to perform well. Reports of many services may cause transaction event length, resulting in hbase write timeout. At present, the amount of data has reached 4 billion, but writing and a small number of index queries are still within 100ms, fast or ok, but for OLTP business is a little inadequate, and then because of the reconstruction in 2018, Many business is a direct copy of the first phase of the business code, after years of development, a few years ago, plus document is complete, the personnel changes in maintenance, and iterative cost more and more big, so we have to consider refactoring the question bank system (the biggest difficulty is data reconstruction system and let the user have no perception of migration of the business, I will write an article to elaborate on this if I have the opportunity.
  • To finalise our technical director and director of product, decided to the company all the educational teaching, research, fully automatic all aspects of the adaptive learning system reconstruction, question bank business is the first, of course, of course refactoring actually before business, then report to comb, architecture design, module design, data synchronization, system in parallel, high concurrent design, Supporting tens of thousands of students in thousands of campuses to take exams at the same time, I will not elaborate on these for the moment, we will focus on the discussion of massive data storage scheme, and then we adopt the architecture of mysql+ ES +hbase+ Clickhuose to deal with the whole scenario and all-round business of the question bank. First, the questions are recorded in hbase+ Clickhuose, and then the questions and papers are placed in mysql+ ES. Anyway, according to the people above, it is the goal to make the structure of the same in 10 years, ha ha, for the above business background, small 66 would like to discuss with you, the current market mainstream for massive data processing.

What if the performance is slow due to massive data

Optimize the existing mysql database

  • Performance considerations in database design and table creation mysql database itself is highly flexible, resulting in poor performance, heavily dependent on the ability of developers. That is to say, if the developer is competent, the mysql performance is high. This is also a common problem with many relational databases, so corporate DBAs are often overpaid.

    • Table fields avoid null values. Null values are difficult to optimize and take up extra index space. The default value 0 is recommended.
    • Use INT instead of BIGINT, or UNSIGNED if non-negative (which doubles the size of the value). TINYINT, SMALLINT, MEDIUM_INT are better.
    • Use enumerations or integers instead of string types
    • Use TIMESTAMP instead of DATETIME whenever possible
    • Do not have too many fields in a single table, recommend less than 20
    • Use an integer to store IP addresses
  • The index

  • More indexes are not always better. To create specific indexes based on the query, consider creating indexes on the columns involved in the WHERE and ORDER BY commands. You can use EXPLAIN to check whether an index or full table scan is used

  • Try to avoid NULL values for fields in the WHERE clause, as this will cause the engine to abandon the index for a full table scan

  • Fields with a sparse distribution of values, such as “gender,” have only two or three values

  • Character fields are indexed by prefix only

  • Character fields should preferably not be primary keys

  • No foreign keys, the program guarantees the constraint

  • Try not to use UNIQUE, let the program guarantee constraint

  • The idea of using multi-column indexes is to keep the order and query conditions consistent, while removing unnecessary single-column indexes. In short, use the appropriate data type and select the appropriate index

The simple method can do 80% of the work, and the hard method can do 95% of the work, so we can also choose the simple method, the hard method has to use the time to come again.

In short, it is not necessary to introduce a responsible solution to carry out simple optimization first. Usually when the database first shows pressure, most of the reasons are not because the business really develops to the database can not support, but many slow queries are caused by;

Talk about the database and table

Why do we need separate databases and separate tables?

High number of requests: In the case of high concurrency, a large number of requests fall into the database, which eventually leads to an increase in the number of active connections in the database, approaching or even reaching the threshold for the number of active connections that the database can handle. At the business Services layer, there are few or no database connections available. Concurrency, throughput, connection exceptions, crashes, outages;

  • Slow data query:

    • 1. If the amount of single table or single database data is too large, refer to Article 3 for details;
    • Second, the total number of concurrent connections in a single library is close to the system threshold, resulting in this request to obtain less than the number of connections or has been obtained but encountered CPU bottleneck, resulting in the SQL query table even if there are few data rows also appear too slow query phenomenon;
  • Too much data: – one, when a repository of data storage is too big, even if each table concurrency is not much, but because it is a huge amounts of data, the reservoir in the presence of large amounts of data table, each table has a part of the concurrent requests, leading to a single final number of connections threshold (maximum number of connections the default 100, the biggest can be set to 16384, but generally according to the business of the hardware and libraries to reasonable configuration, Generally between 500-1200) become the bottleneck of the database;

    • 2. When there is too much data in a table, the query speed of a single table decreases seriously. Although the maximum number of rows allowed by the innoDB storage engine is 1 billion, if the data row records of a table reach hundreds of millions, even if I query a data through the index, it will need at least ten to dozens of disk IO. As a result, the speed of single table query decreases linearly; Generally, the data behavior of a table is about 10 million is the most appropriate, because when the table data is 10 million, the index established if the B+Tree type is generally between 3 and 5, so the speed of query is naturally very fast.
  • Common problem of single architecture: when a problem is encountered in a single library and needs to be fixed, all data in the whole library will be affected, while when a separate library needs to be fixed, only one library will be fine.

In fact, the above problems are database bottlenecks, but only according to the different types of database bottlenecks, but ultimately for the client is the database is not available or slow down.

Attention!! Don’t divide your inventory for the sake of dividing your inventory!! To introduce a phrase from SOA architecture: Architecture is not a sheer upheaval, but a gradual evolution

As for the commonsense problem of sub – library sub – table, small six six is not elaborated here, I think every back-end developer at least needs to understand it, its vertical sub – library, vertical sub – table, horizontal sub – library, horizontal sub – table. Its advantages (is single table can be solved under the condition of large amount of data, the efficiency of the query, because mysql index tree height at the time of 3 to 4 layers, the query is good), and then its disadvantages (transaction, multi-table LianZha, paging, sorting), etc., and then how to solve these problems, the industry also has a lot of open source framework, Mycat, ShardingSphere and so on. In fact, this is also a knowledge point, may be small six six behind this article, but today we can only be said to be a massive data solution

NoSQL/NewSQL

For example, hbase, ES and Redis used by our company are all NOSQL. Their characteristics are distributed and have strong ability of horizontal expansion.

As a newborn, NoSQL/NewSQL cannot compete with RDBMSS when reliability is our primary concern. RDBMSS have evolved for decades, becoming the core storage of choice wherever there is software.

At present, most of the company’s core data are: RDBMS storage as the main, NoSQL/NewSQL storage for the auxiliary! Internet companies are dominated by MySQL, state-owned enterprises & banks and other enterprises with good money are dominated by Oracle/DB2. No matter how awesome NoSQL/NewSQL is advertised, companies are now positioning it as a complement to, not a replacement for, RDBMS.

Recently very fire TiDB is a kind of NewSql, small companies also useless, but to play a, of course I’m not a professional testing personnel, for the performance of it I don’t really know exactly how much better than other database, but for usage, I feel a line still, because you’ll SQL syntax, you can use it, relatively simple to use. Its TIKV supports OLTP scenarios, and its TIflash supports OLAP scenarios. I believe it will be a very good database in the 5G era.

Mysql + no solution

At present, because of the popularity of NewSql, I think most companies are still mysql +Nosql as a supplement. You can comment and leave a message below to discuss your company’s solution.

  • Mysql > select * from ‘es’ where (select * from’ es’ where (select * from ‘es’ where (select * from’ es’))

    This kind of plan, also can, should have a company to use it

  • Hbase (Column storage RowKey design) + ES (query paging sorting) Some fields + RowKey

  • Solr +hbase(which is more mature and probably used more often), such as Lily hbase Indexer.

conclusion

Finally, several schemes are summarized as follows (sub-database sub-table is referred to as SC) :

In a word, for massive data, and a certain amount of concurrent sub-database sub-table, by no means to introduce a sub-database sub-table middleware can solve the problem, but a systematic project. You need to analyze the entire table-related business and let the right middleware do what it does best. For example, if a sharding column query goes to the database table, some fuzzy queries go to ES, or multiple unfixed criteria are filtered, and massive storage goes to HBase.

After doing so many things, there will be a lot of work to do, such as the consistency of data synchronization, and after running for a period of time, the amount of data in some tables slowly reached the single-table bottleneck, this time also need to do cold data migration. In a word, database and table is a very complex system engineering. Any massive data processing, is not simple things, ready to fight!

At the end

Talk so much today: next time have a chance to chat with all of you for high concurrent processing of bai, probably our business scenario is, to solve the problem. So to handle is much simpler than what order goods, but we also pay attention to design our high concurrency scenario, and after the reconstruction, even the table structure changed, how do we let users do not perceive migration, Refactoring is not just about building a new system, it’s still a long way to go, unless you can get rid of the old data, which is easy. Haha, in fact, we can only do 70 points, not full marks, I don’t know how big companies engage in refactoring, hope to have a big boss to comment on the message.

Daily for praise

Ok, everybody, that’s all for this article, you can see people here, they are real fans.

Creation is not easy, your support and recognition, is the biggest motivation for my creation, we will see in the next article

Six pulse excalibur | article “original” if there are any errors in this blog, please give criticisms, be obliged!