Preface

With microservices as popular as they are today, Bilibili (B站) is also under pressure from rapid business growth and is constantly optimizing its legacy systems. Behind the idea of "building big systems out of small pieces" lies a lot of effort, especially because the overall operational tooling around Go as a development language was relatively weak: development, deployment, testing, integration, monitoring, debugging, and so on. At the GopherChina 2017 conference, Mao Jian, technical director at Bilibili, shared the pitfalls encountered on the road to microservices and his thoughts on the microservice framework after its final evolution.

The talk covers the following parts: 1. the evolution of Bilibili's microservices; 2. high availability; 3. middleware; 4. continuous integration and delivery; 5. the operations and maintenance system.

About the author

Since 2015 I have been in charge of the UGC platform and infrastructure at Bilibili, where I developed goim, the open-source push service behind the live bullet-screen (danmaku) system, and BFS, Bilibili's distributed storage; guided the development of Bilibili's Cache Proxy and Bili Twemproxy; and carried out iteration and refactoring of the legacy main-site architecture. Six years ago I worked at Cheetah Mobile as a MySQL DBA and a C developer, where I built gopush-cluster for Cheetah Mobile's push system. I enjoy application performance diagnostics, kernel research, and evolving stable server architectures.

Evolution of microservices

The original architecture

When I first joined Bilibili, we faced several challenges in the technical architecture. First, deployment was very inconvenient: the whole main site had to be brought up as one piece, with nothing it could be deployed independently of. The code was split across two repositories, and to deploy you packaged everything up and threw it onto the servers, and it often failed to come up. Testing costs were also very high. As a result, overall development efficiency was low: responsibilities were unclear, the whole codebase was one big lump, nobody knew which module they were responsible for, and some shared parts had many owners at once. See Figure 1.

Figure 1

The upgrade approach

The first step was to tease out the business boundaries. Because Bilibili's business as a whole is very complicated, we first looked at how to split it from the overall picture, so I drew a diagram, as shown in Figure 2.

Figure 2

From my perspective, the top layer is the dimension of users and accounts. The services in the middle include membership, submissions, user profile information, the relationship chain, dynamic (feed) recommendation, and so on. At the bottom are some auxiliary services: captcha, IP lookup, push, the configuration center, and some service governance around Protobuf.

Why emphasize this? Because to start doing microservices, you must first clarify the responsibilities, or boundaries, of the whole business. The way we started was "the countryside surrounding the city": we gradually carved out everything around the core businesses. For example, users' watch history, favorites, comments, and so on are peripheral businesses not directly related to our main line, so we started with those. During the upgrade you must also consider compatibility; do not ship one API change that breaks everything. Even now, refactoring the account service at the lowest level is painful because so many departments are involved.

The second point is resource isolation. How do you isolate? Buy new servers first, and try not to put your code on the same machines as code from unreliable programmers; isolate first. The old code is bound to contain plenty of dark magic, so buy a new machine, deploy your newly refactored services there, and don't touch the old ones. Old code with no documentation or anything else is hard to reuse anyway, so it is best to start from a clean new machine and redeploy.

The third point is isolating external services. In 2015 Bilibili also ran into some security incidents: our APP KEY was reverse-engineered out of the client, and someone used it to request interfaces that were meant to be intranet-only. That was both a security incident and an ongoing security risk. Why was it a problem? Because we had never isolated the intranet, the internal APIs were exposed to the public network. That was when we decided to sort out the whole service layout.

As shown in Figure 3, we start at the top, from the SLB entry point, down to the Gateway. Take some of our external APIs for mobile: because a single page often needs to aggregate many services, we package them into one API at the gateway. User comments are a platform capability that can be exposed directly as an external API. However, some services must never be external, such as internal services and the audit operations of the operations platform. For example, if our operations platform needs to act on a certain account, that interface absolutely must not be public; if someone discovers it, no matter what protections you use, they will find a way to screw you. So we had to separate the intranet and the extranet. Once these roles are defined, the Service layer is the real core of the microservices: it faces the business directly and is the modular unit of the business.

Figure 3

Next comes the RPC framework. What features does the RPC need? Serialization comes first. We chose GOB, because my first idea was to unify on one language and write as much as possible in Go: when a business is short-handed or hits a bottleneck, it is hard to help out if the person doesn't know the language. With Go unified, GOB seemed the most convenient choice, since it supports all the built-in types. The second feature is timeout control, because one blocked call can kill you: if a provider gets stuck, requests pile up and the caller dies too. So there has to be timeout control, including passing contextual information along; APM concerns like the ones I just mentioned can also be carried through the context. The third is interceptors: even on the internal network you need some constraint mechanisms, including permission control, statistics, and rate limiting. The fourth is service registration. We compared many options and finally chose ZooKeeper, and we are gradually improving on ZK to make it behave more like an AP system. And finally load balancing, which I thought about for a long time. In the early days I actually used things like LVS, or DNS, for this kind of scheduling, but that is not the best in terms of performance: with LVS your traffic goes through the network a few more hops, and going straight to the node is actually best. Given the cost at the time, we implemented it directly with client-side load balancing.

Code-level implementation

Speaking of GOB: once we had chosen GOB, the most suitable RPC for us was the standard library's net/rpc, with some modifications.

The first modification is to support context, and the second is timeout control; Figure 4-1 shows a demo with two tests.

Figure 4-1

Some people asked me whether this was hard to retrofit. In fact, if you read the source code of net/rpc, the whole implementation is very simple, so we only had to change a very small amount of code, and the risk was relatively small.

So we add a context object: the first argument has to implement a context interface, and we're done. All of our RPC methods start with a context. See Figure 4-2.

Figure 4-2
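To make this concrete, here is a minimal sketch (not Bilibili's actual code) of what a context-first RPC method might look like after such a modification; the service name, argument types, and the assumption that the modified server passes a context.Context as the first parameter are all hypothetical.

```go
package arith

import (
	"context"
	"errors"
)

// Args is a hypothetical request payload; GOB can encode it because it
// only uses built-in types.
type Args struct {
	A, B int
}

// ArithService is a hypothetical service registered with the modified net/rpc.
type ArithService struct{}

// Multiply follows the convention described in the talk: the first argument
// is a context, so deadlines and per-request metadata flow through every call.
func (s *ArithService) Multiply(ctx context.Context, args *Args, reply *int) error {
	select {
	case <-ctx.Done(): // respect the caller's timeout / cancellation
		return ctx.Err()
	default:
	}
	if args == nil {
		return errors.New("nil args")
	}
	*reply = args.A * args.B
	return nil
}
```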

Figure 4-3 also starts with context injection: we put a lot of things into it, such as the method, the name, and so on. So our outer context actually carries a small RPC context inside it that holds the things we might need. The interceptor is actually quite simple. First I define an abstraction covering rate limiting, statistics, robustness checks and the like, and then we add a few lines of simple code on the server side of the RPC library.

Figure 4-3
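As a rough sketch of that abstraction (the Interceptor interface, its method names, and the server hook are all assumptions, since the talk only describes the idea), the "few lines on the server side" might look like this:

```go
package rpcx

import (
	"context"
	"time"
)

// Interceptor is a hypothetical abstraction matching the talk's description:
// one place to hang access control, rate limiting, and statistics.
type Interceptor interface {
	// Auth checks whether the caller may invoke the method.
	Auth(ctx context.Context, method string) error
	// Rate returns an error if the caller should be rejected (rate limited).
	Rate(ctx context.Context, method string) error
	// Stat records the outcome and latency of a finished call.
	Stat(ctx context.Context, method string, err error, d time.Duration)
}

// serve shows where the interceptor would run on the server side:
// before the actual handler dispatch and again after it finishes.
func serve(ctx context.Context, in Interceptor, method string, handler func(context.Context) error) error {
	if err := in.Auth(ctx, method); err != nil {
		return err
	}
	if err := in.Rate(ctx, method); err != nil {
		return err
	}
	start := time.Now()
	err := handler(ctx)
	in.Stat(ctx, method, err, time.Since(start))
	return err
}
```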

Then we made some improvements. First of all, if you benchmark net/rpc you will find that getRequest and freeRequest contend on a global lock, so we changed it into a request-scoped optimization. On the right side of Figure 4-4, we put the Response and Request inside the codec. And to reduce the number of allocated objects, instead of using pointers we embed the structs by value, which takes a little pressure off the GC.

Figure 4-4
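A minimal sketch of what that layout might look like (the field names and codec shape are assumptions, modeled loosely on the standard library's gob codec), shown only to illustrate the request-scoped, by-value idea:

```go
package rpcx

import (
	"bufio"
	"encoding/gob"
	"io"
	"net/rpc"
)

// gobServerCodec is a hypothetical request-scoped codec. Holding Request and
// Response by value inside the codec avoids the global free-list lock taken
// by net/rpc's getRequest/freeRequest and saves two pointer allocations per call.
type gobServerCodec struct {
	rwc  io.ReadWriteCloser
	dec  *gob.Decoder
	enc  *gob.Encoder
	buf  *bufio.Writer
	req  rpc.Request  // embedded by value, reused per request
	resp rpc.Response // embedded by value, reused per request
}

func newGobServerCodec(rwc io.ReadWriteCloser) *gobServerCodec {
	buf := bufio.NewWriter(rwc)
	return &gobServerCodec{
		rwc: rwc,
		dec: gob.NewDecoder(rwc),
		enc: gob.NewEncoder(buf),
		buf: buf,
	}
}
```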

Then look at the handshake. See Figure 4-5.

Figure 4-5

How does scheduling work? Load balancing. We define an interface and then implement methods such as broadcast, call, setting a timeout for a single method, and a global default timeout. After that, we put the address, protocol, group, and weight in the configuration file. What we did in the first version is a WRR (weighted round-robin) strategy: I have a bunch of backends, you tell me their weights, and I poll and schedule according to the weights. All my node information lives in ZooKeeper, so as long as I watch its node changes and refresh periodically, the client can do the balancing itself. I think this code is also relatively simple to write.
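As a rough illustration (this is not Bilibili's implementation; the node struct and selection logic are a minimal sketch), a classic weighted round-robin picker could look like this:

```go
package balancer

import "sync"

// Node is a hypothetical backend entry, mirroring the address/weight fields
// kept in the configuration file and refreshed from ZooKeeper.
type Node struct {
	Addr   string
	Weight int // assumed to be >= 1
}

// WRR is a minimal weighted round-robin picker: a node with weight 3 is
// returned three times as often as a node with weight 1.
type WRR struct {
	mu    sync.Mutex
	nodes []Node
	idx   int // index of the node currently being served
	left  int // remaining picks for the current node
}

func NewWRR(nodes []Node) *WRR {
	return &WRR{nodes: nodes, idx: -1}
}

// Next returns the address of the next backend, or "" if there are no nodes.
func (w *WRR) Next() string {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.nodes) == 0 {
		return ""
	}
	if w.left <= 0 {
		w.idx = (w.idx + 1) % len(w.nodes)
		w.left = w.nodes[w.idx].Weight
	}
	w.left--
	return w.nodes[w.idx].Addr
}
```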

The Group was actually added in a later version; I will come back to it when discussing high availability.

Figure 4-6

Now that the service layer is done, let's look at the gateway layer. The gateway needs to do aggregation. Going back to the earlier scenario: a single page on mobile touches four or five business parties, and it is impossible for the mobile team to call each of them directly. So we expose one API for it from the Gateway layer, which keeps the cost very low. Protocol unification is also done at the Gateway layer.

The second step is parallel optimization: because we depend on a large number of business parties, four to six per page, we make the calls in parallel with errgroup (see the sketch below). There are two ways to run a Gateway. In our early days we had a single Gateway for all traffic, but we later found that it didn't work: one careless bug could crash the process, and then it kept crashing, which was actually very dangerous. So we split it up by importance and similar criteria; it might be called the APP Gateway, the membership Gateway, and so on, but the point is that we did some isolation. Finally, there are the circuit breakers, degradation, rate limiting, and other high-availability measures we built into the Gateway. Let's focus on high availability.
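Here is a minimal sketch of the kind of parallel aggregation described above, using golang.org/x/sync/errgroup; the downstream clients, response types, and timeout value are hypothetical placeholders.

```go
package gateway

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// Hypothetical downstream result types; in the real system these would be RPC stubs.
type (
	Account struct{ Name string }
	Archive struct{ Title string }
	Fav     struct{ Title string }
)

// PageData is a hypothetical aggregate returned by the gateway for one page.
type PageData struct {
	Account  *Account
	Archives []*Archive
	Favs     []*Fav
}

func fetchAccount(ctx context.Context, mid int64) (*Account, error)    { return &Account{}, nil }
func fetchArchives(ctx context.Context, mid int64) ([]*Archive, error) { return nil, nil }
func fetchFavs(ctx context.Context, mid int64) ([]*Fav, error)         { return nil, nil }

// LoadPage fans out to several business parties in parallel and fails fast
// if any of them returns an error or the overall deadline expires.
func LoadPage(ctx context.Context, mid int64) (*PageData, error) {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	var page PageData
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		a, err := fetchAccount(ctx, mid)
		page.Account = a
		return err
	})
	g.Go(func() error {
		as, err := fetchArchives(ctx, mid)
		page.Archives = as
		return err
	})
	g.Go(func() error {
		fs, err := fetchFavs(ctx, mid)
		page.Favs = fs
		return err
	})
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return &page, nil
}
```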

High availability

The first measure is isolation. Look at Figure 5. Services differ in load: some are under high pressure and some under low pressure. Can they be isolated so that the high-pressure services have as little impact as possible on the stability of the others?

Figure 5

In the early days, when we had few servers, everything was deployed together on one machine; services A and B might both be running on it, and if A went wrong, B suffered. We also went through a period with no containers at all, just physical machines, literally using cgroups by hand to limit each service's resources, which was really quite bad. So physical isolation basically meant buying machines. Lighter-weight isolation matters too: with queues, for example, if heavy traffic and tiny traffic all share one Topic, they interfere with each other. When we transcode videos, there are extremely long files and extremely short ones; putting the super-short and super-long into separate queues is a kind of light isolation. The same goes for clusters, even down to separate deployments, the set-based deployment that Tencent often talks about.

The second thing to mention is timeouts. The most important thing in RPC is the timeout. There are many kinds, such as connect timeout, read timeout, and write timeout, as shown in Figure 6. I started writing this code with Go 1.3, and found that many places could not set timeouts at the time. As a result we once had a fault: a service in one data center was connected to a DB, the link went down, and every process blocked. We saw that CPU usage was not high, the database was reporting errors, and requests could not get through, so we used GDB to look at where the goroutines were stuck; there were some other tricks at the time too. Going forward, I hope to keep tuning these timeouts for our services based on the data we collect.

Figure 6
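As a generic sketch of the three kinds of timeouts mentioned (connect, read, write) using only the standard library, assuming a plain TCP connection rather than Bilibili's actual RPC client, and with example address and durations:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Connect timeout: fail if the TCP handshake takes longer than 200ms.
	d := net.Dialer{Timeout: 200 * time.Millisecond}
	conn, err := d.Dial("tcp", "127.0.0.1:9000") // placeholder address
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Write timeout: the next Write must finish within 300ms.
	conn.SetWriteDeadline(time.Now().Add(300 * time.Millisecond))
	if _, err := conn.Write([]byte("ping")); err != nil {
		log.Fatalf("write: %v", err)
	}

	// Read timeout: the next Read must finish within 500ms, so a dead
	// peer or a broken link cannot block the goroutine forever.
	conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
	buf := make([]byte, 512)
	if _, err := conn.Read(buf); err != nil {
		log.Fatalf("read: %v", err)
	}
}
```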

The third is rate limiting. There are many kinds of it too, as shown in Figure 7. For example, last year one of our services got jammed. What was the problem? No timeout had been set on calls to the upstream; a network switch failed over to standby, and as a result everything hung together. The conclusion we drew was that we still want rate-limiting protection: rate limiting prevents a wave of CC attacks from killing you outright. On top of that we also did distributed rate limiting and similar things. There are also important resources to watch: in Go, for example, you can cap the number of connections, which prevents too many connections being dragged in at once and killing you. With request-level rate limiting, as soon as you spot abnormal traffic you adjust its quota and cut off the traffic of whoever is running the CC attack.

Figure 7
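A minimal sketch of request rate limiting with golang.org/x/time/rate (a connection-count cap could similarly be done with a buffered-channel semaphore); the handler path and the limits are placeholders, not production settings:

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

func main() {
	// Allow 1000 requests per second with a burst of 100; the numbers are
	// arbitrary examples.
	lim := rate.NewLimiter(rate.Limit(1000), 100)

	http.HandleFunc("/api/demo", func(w http.ResponseWriter, r *http.Request) {
		// Fail fast when over the limit instead of queueing requests.
		if !lim.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", nil)
}
```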

Then there is degradation. There is a lot of it too, as Figure 8 shows. In our early days the first step was UI degradation. Every failure taught us that when the mobile app cannot load, users refresh it frantically hoping it will come back, and at exactly that moment the request volume rises. So later we added client-side degradation: for example, if you keep refreshing, I push down a TTL that the client honors; if things are really unbearable, I can send a 10-second or even 30-second TTL to tell the client not to send requests during that window. I remember Alipay does something similar, telling you it is too busy to take your payment right now; that is rate limiting on the client side, and it is the last-resort measure.

Then there is feature degradation, which is also relatively simple. For example, a page may contain a module of big-data or AI-driven recommendations. An experiment gone wrong in that module could leave a blank window or even make the whole page fail to open. There are many ways around this; for instance, you can fall back to a disaster-recovery (DR) recommendation pool, which at least keeps the whole page from going dark. So when your interface is being assembled and one of the businesses it depends on fails, you must consider whether you have some means to degrade: return some default data, or even static data.
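A minimal sketch of that idea (the recommender client and the static fallback list are hypothetical, not the actual implementation):

```go
package gateway

import (
	"context"
	"log"
)

// Item is a hypothetical recommendation entry.
type Item struct {
	ID    int64
	Title string
}

// drPool is a hypothetical static disaster-recovery pool, e.g. refreshed
// offline, so the page is never completely empty.
var drPool = []Item{
	{ID: 1, Title: "editor pick 1"},
	{ID: 2, Title: "editor pick 2"},
}

// fetchRecommend stands in for the real AI recommendation service.
func fetchRecommend(ctx context.Context, mid int64) ([]Item, error) {
	// ... real RPC call would go here ...
	return nil, context.DeadlineExceeded
}

// Recommend degrades gracefully: if the recommendation dependency fails,
// it logs the error and serves the static pool instead of failing the page.
func Recommend(ctx context.Context, mid int64) []Item {
	items, err := fetchRecommend(ctx, mid)
	if err != nil || len(items) == 0 {
		log.Printf("recommend degraded for mid=%d: %v", mid, err)
		return drPool
	}
	return items
}
```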

I remember JD.com shared that when they have a very serious failure, the product detail page can fall back to a static version; since most product content cannot change in real time anyway, I think that is also a valid approach. There are also automatic downgrades: we actually do a lot of degradation based on failure statistics. For example, if my requests to a certain interface show a high error rate or a lot of timeouts, can I just stop calling it? You depend on many business parties and can degrade any of them, but if one keeps timing out you end up with increased latency, so kick it out.

There are also cases, such as our core businesses or certain key features, where automatic degradation must not be applied. You cannot know in advance when it will be needed, so you have to build feature switches and flip them at the appropriate time, either not returning that data or not requesting it at all.

Figure 8

And finally, fault tolerance. See Figure 9. For fault tolerance we rely heavily on circuit breakers; we took a Java framework as a reference and ported the idea to Go. The core idea is very simple: when the error rate of my requests reaches a threshold, I stop calling the dependency and return immediately, which is called fail-fast. Once the fail-fast window is over, I let a little traffic through to probe whether the dependency has recovered. If it has, I consider the server stable, flip the switch, and let the full traffic back in; if not, the process repeats.

Figure 9
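Here is a minimal sketch of such a breaker. The states, thresholds, and window handling are greatly simplified compared to any production implementation, and it trips on consecutive failures rather than a true error-rate window:

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("breaker: open, failing fast")

type state int

const (
	closed state = iota
	open
	halfOpen
)

// Breaker is a toy circuit breaker: it trips open after maxFails consecutive
// failures, fails fast for cooldown, then lets one probe request through.
type Breaker struct {
	mu       sync.Mutex
	st       state
	fails    int
	maxFails int
	cooldown time.Duration
	openedAt time.Time
}

func New(maxFails int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFails: maxFails, cooldown: cooldown}
}

// Do wraps a call to a dependency with the breaker state machine.
func (b *Breaker) Do(call func() error) error {
	b.mu.Lock()
	if b.st == open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // fail fast
		}
		b.st = halfOpen // cooldown over: allow a probe request
	}
	b.mu.Unlock()

	err := call()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.st == halfOpen || b.fails >= b.maxFails {
			b.st = open // trip, or re-trip after a failed probe
			b.openedAt = time.Now()
		}
		return err
	}
	// success: close the breaker and reset the failure counter
	b.st = closed
	b.fails = 0
	return nil
}
```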

There are also important businesses, such as those that depend on queues, where we may use a guaranteed-delivery model. In that case we might retry indefinitely until the operation succeeds, or wait forever, sleeping roughly a second plus a little random jitter between attempts.
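A minimal sketch of that retry loop (the deliver function is a placeholder), retrying forever with about one second plus random jitter between attempts:

```go
package queue

import (
	"log"
	"math/rand"
	"time"
)

// deliver stands in for pushing one message to the downstream consumer.
func deliver(msg []byte) error {
	// ... real publish / RPC call would go here ...
	return nil
}

// mustDeliver retries forever until the message is accepted, sleeping about
// one second plus a random jitter between attempts so retries don't align.
func mustDeliver(msg []byte) {
	for i := 1; ; i++ {
		err := deliver(msg)
		if err == nil {
			return
		}
		log.Printf("deliver attempt %d failed: %v", i, err)
		time.Sleep(time.Second + time.Duration(rand.Intn(500))*time.Millisecond)
	}
}
```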

Due to its length, this article is split into two parts. The second part, to be released tomorrow, will cover Bilibili's practice with middleware, continuous integration and delivery, and the operations and maintenance system. Stay tuned.