
1. Previously in this series

In my previous article, “How to design a high concurrency architecture with 100,000 queries per second”, I talked about the query platform in this system's architecture.

We used cold and hot data separation:

  • Cold data: based on HBase + Elasticsearch + a self-developed pure-memory query engine, providing high-performance, millisecond-level queries over massive historical data.
  • Hot data: based on a cache cluster + MySQL cluster, delivering tens-of-milliseconds query performance for the current day's data.

In the end, the entire query architecture stood up to 100,000 concurrent query requests per second without a problem.

In this final article in this series on architecture evolution, we discuss the topic of high availability. What does high availability mean?

Simply put, in such a complex architecture any link can fail: the MQ cluster may go down, the KV cluster may go down, the MySQL cluster may go down. So how do you guarantee that when any part of this complex architecture fails, the system as a whole keeps working?

This is the so-called full-link 99.99% high availability architecture. Because our platform is a paid product, we have to do our best for customers: availability must be guaranteed!

Let’s take a look at what the architecture looks like so far.

2. High availability guarantee scheme for the MQ cluster

Asynchronous to synchronous + rate-limiting algorithm + limited traffic discarding

An MQ cluster failure is a real possibility, and actually quite normal. At some large Internet companies, MQ cluster failures have left the whole platform unable to trade for several hours, and in serious cases a few hours of downtime has cost the company tens of millions. We have run into MQ cluster failures before as well, just not on this system.

What happens if the MQ cluster fails on this link?

Looking at the upper right corner of the diagram: the database binlog collection middleware would no longer be able to write data to the MQ cluster, so the flow control cluster behind it would have nothing to consume and could not store data into the KV cluster, and the whole architecture would grind to a halt.

Is this what we want? Certainly not. If that were the outcome, the availability of the architecture would be very poor.

![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2018/11/29/1675ede342302431~tplv-t2oaga2asx-image.image)

Therefore, we designed a high availability guarantee scheme for MQ cluster failure: asynchronous to synchronous + rate-limiting algorithm + limited traffic discarding.

In simple terms, if the database binlog collection middleware detects an MQ cluster failure, that is, it still cannot write data to the MQ cluster after multiple attempts, it triggers a degrade policy: instead of writing to the MQ cluster, it directly calls the standby traffic-receiving interface provided by the flow control cluster and sends the data there.
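
A minimal sketch of what such a degrade policy could look like, assuming a fixed retry count; the `MqProducer`, `FlowControlClient` and `BinlogEvent` types are hypothetical stand-ins for the real middleware clients, not the actual implementation:

```java
// Hypothetical stand-ins for the real middleware clients and binlog payload.
interface BinlogEvent {}
interface MqProducer { void send(BinlogEvent event) throws Exception; }
interface FlowControlClient { void receiveDirectly(BinlogEvent event); }

// Degrade policy of the binlog collection layer: try the MQ cluster a few times,
// and on repeated failure fall back to the flow control cluster's standby
// traffic-receiving interface (the asynchronous path becomes a direct call).
public class BinlogForwarder {

    private static final int MAX_MQ_RETRIES = 3;   // assumed retry count

    private final MqProducer mqProducer;           // normal path: write to the MQ cluster
    private final FlowControlClient flowControl;   // degrade path: standby interface

    public BinlogForwarder(MqProducer mqProducer, FlowControlClient flowControl) {
        this.mqProducer = mqProducer;
        this.flowControl = flowControl;
    }

    public void forward(BinlogEvent event) {
        for (int attempt = 1; attempt <= MAX_MQ_RETRIES; attempt++) {
            try {
                mqProducer.send(event);            // normal asynchronous path succeeded
                return;
            } catch (Exception mqUnavailable) {
                // MQ cluster looks unhealthy; retry, then degrade
            }
        }
        // Degrade: bypass the MQ cluster and push the event straight to the
        // flow control cluster's standby traffic-receiving interface.
        flowControl.receiveDirectly(event);
    }
}
```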

But this puts the flow control cluster in an awkward position: the MQ cluster was there precisely for peak shaving, letting data back up in the queue during traffic peaks so that excessive traffic would not wash over and crush the downstream systems.

Therefore, the standby traffic-receiving interface of the flow control cluster implements a rate-limiting algorithm: if traffic is too heavy and exceeds the threshold, part of it is discarded.

But which part of the traffic to discard takes some care. If you drop traffic carelessly, every merchant may end up with inaccurate data analysis results. The strategy chosen at the time was to pick only a small number of merchants and discard their data in full, while preserving the data of the vast majority of merchants in full.

That is to say, if the platform has 200,000 merchants, under this discarding strategy perhaps 20,000 of them cannot see today's data, but the data of the other 180,000 merchants is unaffected and fully accurate. That is far better than all 200,000 merchants ending up with inaccurate data, so a downgrade strategy always involves trade-offs.
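
A minimal sketch of how rate limiting plus this selective discard could fit together on the standby interface; the per-second threshold and the roughly 10% "sacrificial" merchant bucket are illustrative assumptions, not the production numbers:

```java
import java.util.concurrent.atomic.AtomicLong;

// Standby interface sketch: when the per-second volume exceeds the threshold,
// data from a small, deterministically chosen set of merchants is dropped in
// full, so the remaining majority of merchants keep complete, accurate data.
public class StandbyTrafficReceiver {

    private static final long MAX_EVENTS_PER_SECOND = 50_000;   // assumed threshold
    private static final int SACRIFICED_BUCKETS = 10;            // roughly 10% of merchants

    private final AtomicLong windowStartSecond = new AtomicLong();
    private final AtomicLong eventsInWindow = new AtomicLong();

    /** Returns true if the event was accepted, false if it was discarded. */
    public boolean receive(long merchantId, Object event) {
        long nowSecond = System.currentTimeMillis() / 1000;
        if (windowStartSecond.getAndSet(nowSecond) != nowSecond) {
            eventsInWindow.set(0);                // a new one-second window begins
        }
        boolean overloaded = eventsInWindow.incrementAndGet() > MAX_EVENTS_PER_SECOND;
        if (overloaded && isSacrificedMerchant(merchantId)) {
            return false;                         // drop this merchant's data in full
        }
        store(event);                             // hand off to the KV write path
        return true;
    }

    /** Deterministic: the same small set of merchants is always the one discarded. */
    private boolean isSacrificedMerchant(long merchantId) {
        return Math.floorMod(Long.hashCode(merchantId), 100) < SACRIFICED_BUCKETS;
    }

    private void store(Object event) {
        // ... write to the KV cluster (omitted in this sketch)
    }
}
```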

In this way, in the case of MQ cluster failure, some traffic may be discarded and the final data analysis results may be biased, but the data of most merchants is normal.

Take a look at the figure below: the high availability guarantee links are all drawn in light red, which makes them clear at a glance.

3. High availability guarantee scheme for the KV cluster

Temporary Slave cluster expansion + memory-level sharded storage + hour-level data granularity

Next question: what if the KV cluster dies? We really have run into this, not in this system but in another core system we were responsible for. The KV cluster there failed for many hours, the company's business nearly ground to a halt, and the losses ran into tens of millions of yuan.

Look at the right side of the diagram: what if the KV cluster fails? That would also be disastrous, because in our architecture the massive data storage sits directly on the KV cluster. If the KV cluster goes down and there are no high availability measures, the flow control cluster cannot write data into it, and the downstream links cannot continue their computation.

At the time, we considered introducing another storage layer for double writes, such as an HBase cluster, but that would add yet another complex dependency. As the saying goes, the blacksmith must be strong himself: better to optimize within our own architecture.

Therefore, the degradation plan for a KV cluster failure was: temporary Slave cluster expansion + memory-level sharded storage + hour-level data granularity.

In simple terms, as soon as a KV cluster fault is detected, an alarm fires. On receiving it, we immediately start the contingency plan: manually expand and deploy a Slave computing cluster N times its normal size.

At the same time, a degrade switch on the flow control cluster is flipped manually, and the flow control cluster then distributes data directly to the Slave compute nodes according to a preset hash algorithm.

This is the key point: we stop relying on the KV cluster for storage. Our Slave cluster is itself a distributed computing layer, so it can double as temporary distributed storage. The flow control cluster distributes data straight to the Slave nodes, and each Slave keeps its share in memory.

Then, when dispatching calculation tasks, the Master node makes sure each task goes to the Slave node that already holds the relevant data in its local memory.
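
A rough sketch of this degrade path, assuming a fixed list of Slave nodes and a simple modulo hash standing in for the "preset hash algorithm"; `SlaveNode` is a hypothetical stand-in for the real compute node client:

```java
import java.util.List;

// Degraded KV path: the flow control cluster routes each record to a Slave node
// with a preset hash, the Slave keeps it in local memory, and the Master later
// dispatches the compute task for a key to the same Slave, so computation
// happens where the data already lives.
public class DegradedDataRouter {

    private final List<SlaveNode> slaves;   // temporarily expanded Slave cluster

    public DegradedDataRouter(List<SlaveNode> slaves) {
        this.slaves = slaves;
    }

    /** Flow control cluster: route a record to the Slave chosen by the preset hash. */
    public void route(String key, byte[] record) {
        slaveFor(key).storeInMemory(key, record);
    }

    /** Master: send the compute task for a key to the Slave that holds its data. */
    public void dispatchComputeTask(String key) {
        slaveFor(key).computeLocally(key);
    }

    private SlaveNode slaveFor(String key) {
        int index = Math.floorMod(key.hashCode(), slaves.size());
        return slaves.get(index);
    }

    interface SlaveNode {
        void storeInMemory(String key, byte[] record);   // local in-memory storage
        void computeLocally(String key);                 // compute against local data
    }
}
```

Because routing and task dispatch share the same hash, each task lands on the node that already holds its data, which is exactly the local storage + local computation effect described next.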

The changes required on the Master and Slave nodes are not expensive, yet they achieve the effect of local data storage + local data computation.

But there was still a problem: a single day produces a huge amount of data. What happens if you try to hold all of it in Slave cluster memory?

Since this is a degraded mode anyway, a trade-off is acceptable. We chose the hour-level data granularity scheme: only the last hour of data is kept in the Slave cluster, and therefore only hourly data indicators can be produced.

Indicators that need a full day of data cannot be provided while downgraded, because memory holds only the last hour of data, precisely so that the Slave cluster's memory is not overwhelmed.
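
A minimal sketch of the hour-level granularity, assuming each Slave keeps a single retained hour bucket of simple summed metrics; the bucket and metric model are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hour-level data granularity on a Slave node: records are bucketed by hour and
// only the current hour is kept in memory, so degraded mode can serve per-hour
// indicators without exhausting Slave memory.
public class HourlyInMemoryStore {

    // hour bucket (hours since epoch) -> metric name -> running total
    private final Map<Long, Map<String, Long>> buckets = new ConcurrentHashMap<>();

    public void add(String metric, long value) {
        long hour = currentHourBucket();
        buckets.keySet().removeIf(h -> h < hour);   // evict everything older than this hour
        buckets.computeIfAbsent(hour, h -> new ConcurrentHashMap<>())
               .merge(metric, value, Long::sum);
    }

    /** Only the current hour's indicator is available while degraded. */
    public long hourlyIndicator(String metric) {
        return buckets.getOrDefault(currentHourBucket(), Map.of())
                      .getOrDefault(metric, 0L);
    }

    private long currentHourBucket() {
        return System.currentTimeMillis() / 3_600_000;   // milliseconds per hour
    }
}
```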

For users, this means they can see each hour's data indicators for the day, but temporarily cannot see the whole-day summary.

4. High availability guarantee scheme for the real-time computing link

Computing task redistribution + Active/standby switchover mechanism

The next piece is high availability for the real-time computing link. As mentioned before, the real-time computing link is a distributed architecture, so a failure means either a Slave node going down or the Master node going down.

In fact, neither is a problem. If a Slave node goes down, the Master node senses it and reassigns its computing tasks to other compute nodes. If the Master node goes down, an automatic active/standby switchover takes place, since the Master is deployed in an active-standby high availability setup.
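
A rough sketch of the Slave-failure side of this, assuming a heartbeat timeout and round-robin reassignment of the dead node's tasks; the Master's own active/standby switchover is left out, and the timeout and task model are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The Master tracks heartbeats, treats a silent Slave as dead, and redistributes
// its tasks round-robin over the surviving Slaves. Synchronization is omitted.
public class TaskReassignmentSketch {

    private static final long HEARTBEAT_TIMEOUT_MS = 10_000;   // assumed timeout

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();        // slaveId -> last beat
    private final Map<String, List<String>> assignments = new ConcurrentHashMap<>();  // slaveId -> task ids

    public void registerSlave(String slaveId) {
        assignments.putIfAbsent(slaveId, new ArrayList<>());
        onHeartbeat(slaveId);
    }

    public void onHeartbeat(String slaveId) {
        lastHeartbeat.put(slaveId, System.currentTimeMillis());
    }

    /** Called periodically by the Master: detect dead Slaves and move their tasks. */
    public void reassignFromDeadSlaves() {
        long now = System.currentTimeMillis();
        for (String slaveId : new ArrayList<>(assignments.keySet())) {
            if (now - lastHeartbeat.getOrDefault(slaveId, 0L) > HEARTBEAT_TIMEOUT_MS) {
                List<String> orphaned = assignments.remove(slaveId);   // node considered dead
                lastHeartbeat.remove(slaveId);
                redistribute(orphaned);
            }
        }
    }

    private void redistribute(List<String> tasks) {
        List<String> alive = new ArrayList<>(assignments.keySet());
        if (alive.isEmpty() || tasks == null) {
            return;   // nothing to move, or no survivors to take the work
        }
        for (int i = 0; i < tasks.size(); i++) {
            assignments.get(alive.get(i % alive.size())).add(tasks.get(i));   // round-robin
        }
    }
}
```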

Let’s just label the high availability link in the real-time computing link in red.

5. High availability guarantee scheme for hot data

Self-developed cache cluster query engine + JVM local cache + traffic limiting mechanism

Let's look at the data query block on the left. For hot data, namely the day's calculation results written by the real-time computing link, a MySQL cluster hosts the primary data, with a cache cluster mounted in front of it.

If a failure occurs here, there are only two cases: either the MySQL cluster fails, or the cache cluster fails.

If the MySQL cluster fails, the solution we adopt is to write the real-time calculation results directly to the cache cluster. But without MySQL, we can no longer use SQL to assemble report data.

Therefore, we developed a memory-level query engine on top of the cache cluster. It supports a simple query syntax with basic semantics such as conditional filtering, group aggregation and sorting, executes the query directly against the data in the cache, and returns the results.
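
A minimal illustration of such a memory-level query engine, here reduced to a single filter → group → sum → sort pipeline over rows pulled from the cache cluster; the `Row` shape and the query model are simplifications for illustration, not the engine's real syntax:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// In-memory query over cached rows: conditional filtering, group aggregation
// and sorting executed entirely against data already loaded from the cache.
public class CacheQueryEngine {

    /** A cached row: dimension/metric name -> value. */
    public record Row(Map<String, Object> fields) {}

    /** filter -> group by a dimension -> sum a metric -> sort descending by that sum. */
    public List<Map.Entry<Object, Long>> query(List<Row> rows,
                                               Predicate<Row> filter,
                                               String groupByField,
                                               String sumField) {
        Map<Object, Long> grouped = rows.stream()
                .filter(filter)
                .collect(Collectors.groupingBy(
                        r -> r.fields().get(groupByField),
                        Collectors.summingLong(r -> ((Number) r.fields().get(sumField)).longValue())));
        return grouped.entrySet().stream()
                .sorted(Map.Entry.<Object, Long>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toList());
    }
}
```

For example, filtering on a merchant ID, grouping by product and summing the order amount would yield that merchant's per-product totals for the day, sorted from highest to lowest, without touching MySQL at all.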

The only disadvantage is that the cache cluster holds far less data than the MySQL cluster, so some users can see their data while others cannot. But since this is a downgrade, losing part of the user experience is an accepted cost.

If the cache cluster fails, the query platform falls back on a local cache inside its own JVM, which can be built with frameworks such as Ehcache. Data fetched from MySQL is cached in the query platform's JVM-local cache, which still provides a degree of caching to support high concurrency. In addition, the query platform implements a rate-limiting mechanism: if query traffic exceeds its capacity, it limits the traffic and returns an error response for the excess queries.
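
A minimal sketch of this last line of defence, with a hand-rolled LRU map standing in for a real local-cache framework such as Ehcache, and a semaphore as the rate limiter; the capacity numbers are illustrative assumptions:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

// Query platform fallback: a small JVM-local LRU cache in front of MySQL, plus
// a concurrency limit that rejects queries beyond what the platform can carry.
public class QueryPlatformFallback {

    private static final int LOCAL_CACHE_SIZE = 10_000;
    private static final int MAX_CONCURRENT_QUERIES = 200;

    private final Semaphore permits = new Semaphore(MAX_CONCURRENT_QUERIES);

    // LRU map: evicts the least-recently-used entry once the cache is full.
    private final Map<String, Object> localCache = Collections.synchronizedMap(
            new LinkedHashMap<String, Object>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                    return size() > LOCAL_CACHE_SIZE;
                }
            });

    public Object query(String key) {
        if (!permits.tryAcquire()) {
            // Rate limit: over capacity, reject the query outright.
            throw new IllegalStateException("query platform over capacity, request rejected");
        }
        try {
            Object cached = localCache.get(key);
            if (cached != null) {
                return cached;                      // served from the JVM-local cache
            }
            Object result = queryMysql(key);        // fall through to the MySQL cluster
            localCache.put(key, result);
            return result;
        } finally {
            permits.release();
        }
    }

    private Object queryMysql(String key) {
        // ... assemble the report data from the MySQL cluster (omitted)
        return new Object();
    }
}
```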

6. High availability guarantee scheme for cold data

Query log collection + offline log analysis + caching of high-frequency query results

As you can see from the figure above, the cold data architecture is itself fairly complex, involving ES, HBase and so on. Designing a degradation plan for the case where ES or HBase goes down is genuinely difficult.

You can hardly say: ES is unavailable, so temporarily switch to Solr? Or HBase is down, so temporarily use the KV cluster? Neither works; that kind of implementation is far too complex to be appropriate.

So the approach we took was this: collect the cold-data query requests users have initiated over the recent period, and analyze those request logs in the early hours of every morning to work out which cold-data queries each user issues frequently and repeatedly. The results of those specific queries (a particular set of conditions, such as a specific time range and dimension combination) are then cached.

In this way, the cold-data query requests that users actually issue are analyzed dynamically every day and their results are placed into the cache cluster. For example, some users check the analysis results for the last week or the last month every day, so those results can be cached in advance.
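
A minimal sketch of that nightly job, assuming query logs carry a user ID plus a "condition fingerprint" (time range + dimension combination) and that three repeats in a day counts as high frequency; none of these names or thresholds come from the real system:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Nightly job: group yesterday's cold-data query logs by user + condition
// fingerprint, keep the ones that were run repeatedly, recompute those results
// from ES/HBase while they are healthy, and push the results into the cache.
public class ColdQueryPrecacher {

    /** One entry in the collected query log. */
    public record QueryLog(long userId, String conditionFingerprint) {}

    public void precacheHighFrequencyQueries(List<QueryLog> yesterdayLogs) {
        Map<String, Long> frequency = yesterdayLogs.stream()
                .collect(Collectors.groupingBy(
                        log -> log.userId() + "|" + log.conditionFingerprint(),
                        Collectors.counting()));

        frequency.forEach((userAndCondition, count) -> {
            if (count >= 3) {                                   // assumed "high frequency" threshold
                Object result = runColdQuery(userAndCondition); // recompute from ES/HBase
                cachePut(userAndCondition, result);             // keep it hot in the cache cluster
            }
        });
    }

    private Object runColdQuery(String userAndCondition) {
        // ... execute the cold-data query against ES/HBase (omitted)
        return new Object();
    }

    private void cachePut(String key, Object result) {
        // ... write the result into the cache cluster (omitted)
    }
}
```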

If the ES or HBase cluster fails, cold-data queries are served only from those pre-cached high-frequency results; non-high-frequency queries that were not cached simply cannot be answered until the cluster recovers.

7. Final summary

At this point the system has evolved into a very good state: the architecture has solved a whole series of technical challenges, including high-concurrency writes at the 10-billion request level, massive data storage, high-performance computing, high-concurrency queries and high availability guarantees, and the online production system is stable enough to handle problems at every production level.

In the future, this system architecture can of course keep evolving; the evolution of a large system architecture can go on for years. Behind it there is still a whole series of topics, such as full-link data consistency in distributed systems, high stability, and engineering quality assurance, but this article will stop here, because an article can only carry so much and it is hard to lay everything out clearly.

Many readers have told me they found this architecture evolution series hard to fully follow. That is quite normal: an article can only carry limited content, and there are many detailed technical schemes and implementation specifics I simply cannot include. All I can do is describe the process by which a large system architecture evolves and solves one online technical challenge after another.

For less experienced readers, the main value is understanding the process by which a system architecture evolves; for veterans who have done architecture design themselves, I hope it sparks some ideas. You are welcome to leave me a message in the official account backstage to discuss these technical problems.

END
