Klook shares its Go practice: four takeaways from serving tens of millions of users worldwide

Klook, a four-year-old travel technology company based in Hong Kong, has emerged as a dark horse in Asia’s travel start-up scene, offering discounted tickets and travel bookings to tens of millions of users around the world. Klook’s continuous exploration and innovation in technology underpin the company’s international, standardized business. Here Xiong Chuanliang, technical director of Klook’s back end, shares how Klook puts Go into practice.

Contents

1. Application of Go in Klook

2. How to implement the new architecture with Go

3. Architecture challenges and technology evolution

4. A little exploration and reflection

1. Application of Go in Klook

Klook has been using Go for a long time: it began trying Go for project development in 2015. The website, the apps, the open APIs and other platforms are all backed by Go. Internal systems, risk control, intelligent recommendation, the microservice toolchain and so on also use Go. It is fair to say that Klook’s back end does almost everything in Go, and the back-end technology stack is built on Go.

Of course, Klook was not originally built with Go.

This is a very classic back-end architecture. Klook’s initial business services were built as a classic monolithic Java application. The problem was that, at Klook’s pace of growth at the time, it was foreseeable that the system would soon be unable to support the company’s business growth, and the technical architecture would need an upgrade. But before starting on a new architecture, we had to answer one question: how do you upgrade to and roll out a new architecture while keeping business development going?

Common concerns

I think there are a few things to consider when upgrading an architecture or rolling out a new one:

How can a new architecture or language be justified?

Is the team strong enough to solve technical problems for a new architecture or language?

How to solve the migration problem of the old system?

How to resolve the conflict between business requirements and technical transformation?

What if you can’t hire Go developers? How quickly can developers from other languages switch to Go?

If you haven’t thought these questions through, I think you should reconsider upgrading the architecture.

2. How to implement the new architecture?

2.1 You need some success stories

How and where do you start with a new architecture? You probably need some success stories to back it up. Klook started with two services:

1. Back-end service for the Hong Kong WeChat Wallet, a consumer-facing product (proved that the new language is reliable)

The first was WeChat Wallet, a project with Tencent’s WeChat team at the time. If you open the WeChat Wallet in Hong Kong, you can buy tickets and travel products there. That service is provided by Klook, and it is written in Go.

2. A Go scheduled-task service replacing Quartz (proved that old-system migration can be solved)

The other was the scheduled-task migration project. The background was that when Tomcat was deployed as multiple instances, scheduled tasks were triggered repeatedly. Our CTO Bernie suggested we take the opportunity to move scheduled tasks out of Java once and for all, so we spent a little time building this service. Its architecture is actually quite simple.

2.2 Triggering process of a scheduled task

First, every scheduled task is wrapped as an independent interface. A scheduling service written in Go then calls each task’s trigger interface according to that task’s schedule configuration. If the trigger succeeds, a success response is returned to the scheduling service. Once the business logic behind the interface has finished, its result is reported back to the scheduling service.

In addition, there is a web interface for managing and monitoring scheduled tasks. None of this is complicated technically, and it took little time to code. What I think it does well is turn scheduled tasks into interfaces, cleanly separating tasks from scheduling, which is more flexible and in line with the single responsibility principle. This service solved the scheduled-task migration problem. We later built many microservices in Go and added many scheduled tasks, yet this code has been in use ever since it was written and has hardly changed.
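The trigger flow described above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not Klook's actual code: the `Task` type, `TriggerFunc`, `RunDue`, the example URL, and the fixed interval (used here instead of a real schedule configuration) are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Task describes one scheduled task exposed as an HTTP interface.
// The fields are hypothetical; the talk does not show the real schema.
type Task struct {
	Name     string
	URL      string        // trigger interface of the task
	Interval time.Duration // simplified schedule; a real service would likely use richer rules
	LastRun  time.Time
}

// TriggerFunc calls a task's trigger interface and reports whether the trigger was accepted.
type TriggerFunc func(t Task) error

// HTTPTrigger triggers a task with a POST and treats any 2xx status as success.
func HTTPTrigger(client *http.Client) TriggerFunc {
	return func(t Task) error {
		resp, err := client.Post(t.URL, "application/json", nil)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode < 200 || resp.StatusCode > 299 {
			return errors.New("trigger rejected: " + resp.Status)
		}
		return nil
	}
}

// RunDue triggers every task whose interval has elapsed at time now and
// returns the names of the tasks that were triggered successfully.
func RunDue(tasks []Task, now time.Time, trigger TriggerFunc) []string {
	var triggered []string
	for i := range tasks {
		t := &tasks[i]
		if now.Sub(t.LastRun) < t.Interval {
			continue // not due yet
		}
		if err := trigger(*t); err != nil {
			fmt.Printf("task %s: trigger failed: %v\n", t.Name, err)
			continue // LastRun is unchanged, so the task is retried on the next pass
		}
		t.LastRun = now
		triggered = append(triggered, t.Name)
	}
	return triggered
}

func main() {
	tasks := []Task{{
		Name:     "settle-orders", // hypothetical task
		URL:      "http://localhost:8080/tasks/settle-orders",
		Interval: time.Minute,
	}}
	// A fake trigger keeps the example runnable without a server. A real
	// scheduler would pass HTTPTrigger(&http.Client{Timeout: 5 * time.Second}).
	fake := func(t Task) error { return nil }
	fmt.Println(RunDue(tasks, time.Now(), fake))
}
```

Keeping the trigger behind a function type mirrors the separation of tasks and scheduling: the scheduler only knows that a task can be triggered, not what the task does. Reporting the final business result back to the scheduler would be a second call in the opposite direction and is left out of this sketch.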

With these two projects as endorsement, Klook began to think seriously about how to build microservices with Go. Go has become popular in recent years; if you want to promote Go in your own company, you can borrow this approach to build confidence in it.

To do microservices with Go in a new architecture, you will probably want to take care of the following:

1. Log collection and alarm

2. Basic library accumulation

3. Business boundary division

4. Unified technical specifications

5. Code reviews are critical

Earlier talks have stressed the importance of log collection and alerting, and we agree. For the back end, when something goes wrong in production, querying logs is sometimes the only means of diagnosis.

A good internal architecture also needs a good, continuously updated base library. Then there is drawing business boundaries, which matters a great deal: as the company grows there will be more and more businesses and microservices, and without boundaries it all becomes a mess. Drawing them early is relatively cheap, and the broad business categories are unlikely to change much.

That leaves unified technical specifications and code review. I think their biggest benefit to development is that they are a very good channel for spreading best practices, and good practices help you avoid a lot of pitfalls. If your team does not have a code review mechanism yet, I suggest trying it internally. It helps improve both the quality and the stability of your code.

2.3 The three-piece set for service stability

Overload protection + Go recover() + process daemon (or container)

If a service is to be stable, it is recommended to pay attention to these points:

1. Overload protection. Without it, when the incoming load exceeds the service’s maximum capacity, the service may be knocked over, the services behind it may follow, and even your DB may go down. So performance testing matters: you need to know your service’s processing limit and set a relatively conservative threshold. That way your service will not be dragged into an avalanche by its sibling services.

2. recover as a backstop. As you know, Go has pointers, which are very useful, but where there are pointers there can be nil pointers, and a nil pointer dereference panics. When a panic occurs, the service is in danger, so make sure your code has recover as a safety net. Then an exception will not bring the service down.

3. A process daemon. If recover does not catch the failure, the service still goes down. At that point you need a daemon tool that automatically restarts the service when it dies, so that it stays reliable and available.

Overload protection keeps the service from collapsing, recover is the backstop that keeps it from crashing, and the process daemon restarts it even if it does crash. With these three in place, I think the basic stability of a service is guaranteed. Of course, none of this helps if the business logic itself is wrong.
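The first two pieces can be illustrated inside a single Go process. The sketch below is only an illustration and assumes names that are not from the talk (`Guard`, `NewGuard`, `ErrOverloaded`): a buffered channel acts as a counting semaphore for overload protection, and a deferred `recover` turns a panic into an error. The third piece, the process daemon, lives outside the process and is not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverloaded is returned when the guard is already running its maximum number of calls.
var ErrOverloaded = errors.New("service overloaded")

// Guard limits concurrent work and converts panics into errors.
type Guard struct {
	slots chan struct{} // buffered channel used as a counting semaphore
}

// NewGuard allows at most limit calls to run at the same time.
// The limit should come from load testing and be set conservatively.
func NewGuard(limit int) *Guard {
	return &Guard{slots: make(chan struct{}, limit)}
}

// Do runs fn if a slot is free; otherwise it rejects the call immediately
// instead of queueing it, so excess load is shed rather than accumulated.
func (g *Guard) Do(fn func()) (err error) {
	select {
	case g.slots <- struct{}{}:
		defer func() { <-g.slots }() // release the slot, even after a panic
	default:
		return ErrOverloaded
	}
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	g := NewGuard(1)

	var p *struct{ N int }
	// Writing through a nil pointer panics; the guard recovers and reports it.
	fmt.Println(g.Do(func() { p.N = 1 }))

	// The slot was released, so the next call still runs.
	fmt.Println(g.Do(func() {}))
}
```

In an HTTP service the same idea would typically sit in a middleware that answers rejected requests with a "busy" status. Note that `recover` only catches panics in the goroutine where it is deferred, so every goroutine a handler starts needs its own backstop.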

2.4 Moving the back end to microservices

After Klook moved to microservices, the architecture at that time looked like this:

Services are divided into service groups along business boundaries. Service registration, service discovery, the configuration center and so on are based on Consul, with Fabio doing load-balanced routing. On top of that we have our own monitoring center, link tracing, and our own service governance platform. Of course, many other things underpin this architecture as well.
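As a rough illustration of what registering a Go service with Consul and exposing it through Fabio can involve, the sketch below builds a registration payload and sends it to a local Consul agent over HTTP. It is written from my understanding of Consul's agent API (`PUT /v1/agent/service/register`) and of Fabio's `urlprefix-` tag convention, so check both against the official documentation before relying on it. The service name, address, `/health` path and the helper names `NewRegistration` and `Register` are assumptions made for the example, not details from the talk.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Check is an HTTP health check that the Consul agent runs periodically.
type Check struct {
	HTTP     string
	Interval string
}

// Registration is the payload sent to the Consul agent.
type Registration struct {
	ID      string
	Name    string
	Tags    []string
	Address string
	Port    int
	Check   Check
}

// NewRegistration builds a registration whose tag asks Fabio to route
// requests under routePrefix to this instance (assumed convention).
func NewRegistration(name, addr string, port int, routePrefix string) Registration {
	return Registration{
		ID:      fmt.Sprintf("%s-%s-%d", name, addr, port),
		Name:    name,
		Tags:    []string{"urlprefix-" + routePrefix},
		Address: addr,
		Port:    port,
		Check: Check{
			HTTP:     fmt.Sprintf("http://%s:%d/health", addr, port), // hypothetical health endpoint
			Interval: "10s",
		},
	}
}

// Register sends the payload to the Consul agent at agentURL,
// for example "http://127.0.0.1:8500".
func Register(client *http.Client, agentURL string, r Registration) error {
	body, err := json.Marshal(r)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, agentURL+"/v1/agent/service/register", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("consul agent returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Only print the payload here so the example runs without a Consul agent.
	r := NewRegistration("order-service", "10.0.0.5", 8080, "/orders")
	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
}
```

In practice most Go services would use the official Consul client library rather than raw HTTP; plain `net/http` is used here only to keep the example self-contained.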

I think these are open source projects that most of you know well, or have at least heard of. They form the cornerstone of Klook’s entire microservice system.

As the microservice transformation deepened, many pain points around microservice governance emerged. For example, with so many components and third-party services introduced, you keep switching between several windows or services to get one thing done, and the screen never feels big enough. As the number of services grows, rapid deployment and rollback become a big headache. For a major release, such as a big app version, there may be anywhere from a dozen to several hundred service instances. You painstakingly finish deploying them one by one, and then, during regression testing in production, QA tells you that unfortunately a bug has turned up somewhere and you need an emergency rollback. Rolling all of that back in a hurry is miserable.

Microservices also have a lot of parameters. Suppose you need to change one in bulk, such as an account name or password. You first need to know how many of the services running in production use that parameter, and then how to change them all. Do you change them one by one? That is a problem too.

In short, after the move to services, maintenance becomes more and more of a burden. Klook’s answer to these service governance problems is as follows:

We built a microservice governance platform. In its interface you can see how the services in production are running, and you can operate on a single instance. You can do batch deployment and fast rollback, query parameters and change them in bulk, and quickly set up and distribute WAF, flow-control and other rules from the same page. Most microservice pain points are solved in this one interface. At the very least, you can do the work with a mouse instead of switching back and forth between components.

Personally, I feel the interface has greatly improved efficiency and lifted a lot of mental burden: I no longer need to remember so many commands and script parameters. Another advantage is that it greatly reduces mistakes made in a rush. During a production emergency people tend to panic, and panic leads to errors. Working through the platform is more reliable than changing parameters by hand.

Of course, behind the platform lies a lot of rework of Klook’s infrastructure, and building it takes strong technical integration skills. I also think that using a relatively simple interface to tackle something as complex as microservice governance fits Go’s “less is more” philosophy. Once the governance platform, and its interface in particular, had become relatively stable, I believed Klook’s microservice transformation had entered a relatively mature period.

But it had not. Challenges are everywhere.

3. New challenges and architecture evolution

1. Global expansion puts forward higher requirements on the reliability, security and availability of services

2. With the continuous improvement of system complexity, the requirements of product quality are getting higher and higher

3. The iteration frequency of existing businesses is getting faster and faster, while new businesses are constantly emerging

4. Hiring can’t keep up with business growth

The company has been growing fast and faces all kinds of challenges. The biggest is globalization: the company keeps expanding into new countries and new destinations, which brings many difficulties and requirements. I don’t know how each of you understands globalization. My own picture of it is this: I feel like going out, and I just go, straight to Hong Kong Disneyland, or Universal Studios in Osaka, Japan, or some attraction in Dubai. I take out my phone, buy a voucher in the Klook app, skip the queue, and scan to get in. Whatever country you are in, whatever attraction you are at, the user experience is the same. That is what globalization means to me.

Delivering that experience means dealing with different time zones, languages, currencies, attractions, payment methods and so on. Globalization is very challenging and very interesting. Coming back to the service itself, globalization places high demands on reliability and stability. If the service is unstable, customers cannot get in and have to queue to buy tickets, which is a very poor user experience. So global expansion raises the bar for service reliability, and that is a very important problem.

At the same time, because the company keeps expanding, it inevitably faces ever more complex scenarios and businesses, and the business pressure is heavy. There is also a big staffing shortage, not only in the back end but across the whole technical team and even in every department at Klook. This demand for talent is probably something every company in a period of rapid growth runs into.

3.1 Other backend issues

1. Too many heavyweight components consume machine resources, and deploying, operating and maintaining them is quite complicated

2. Open source components have bugs, and keeping up with their frequent updates is a headache

Meanwhile, the back end has other problems. As you can see, we introduced a great many open source components and third-party services. All of these need people to configure and maintain them, and running them takes resources, which costs a lot of manpower and money. Then there are bugs, and not only in open source software; commercial software has plenty as well.

When you hit one, you look for the official fix and usually learn that the new version has fixed it, or that the next version will... Other issues include frequent version updates in many components and poor compatibility between versions.

Both of these are popular projects, Consul in particular, and it is great to see so many people involved and working to make it better. But note that it has resolved more than 1,900 issues and has some 300 more in progress. There are probably still bugs out there, and we have hit some of them. When I hit one, I looked up the solution and was told that the new version had fixed it. But production had hit the bug right then. How could we upgrade production directly without a full round of testing? This component is the cornerstone of our microservices: service registration, service discovery and the configuration center all depend on it.

So we rethought the foundations of the whole architecture. Do we really need to bring in so many open source components? What do they cost, how much do they benefit us, and how much do they hurt service stability? Are we using resources sensibly? In the end we concluded that we needed to move to the cloud and use cloud services more. In most cases, cloud services provide better security.

3.2 Moving microservices to the cloud

Microservice architecture

In the new architecture, you can see that many open source components and third-party services are gone. It is much simpler.

The architecture is divided into three parts: cloud services, open source components plus third-party services, and a chain of in-house tools such as source code generators, automation platforms and command-line tools. These three pieces make up the current back-end architecture. So why are there still open source components in the middle? First, cloud providers do not offer some functions. Second, some open source projects are of far better quality than the equivalent cloud offering. Not every service from a cloud provider is that good.

It is worth thinking about what benefits, or what changes in thinking, follow once the back end fully embraces Amazon, Google Cloud or another cloud provider.

3.3 Log processing and alerting: ELK Stack

Here is an example I prepared: the classic ELK Stack, which I believe many of you know. It handles log reporting and collection and is very easy to use. Our developers like it a lot, and Klook used it from the early days. There is nothing wrong with ELK itself; the problem is that the company’s log volume grew too fast. At first there was one ELK server, but developers soon complained that queries were too slow and data could not be found. One server quickly became two, then four. Why not go from four to eight? As a startup, we have to consider cost.

So I began to think something was wrong with this approach. We had put a lot of effort into scaling ELK, but the results did not match. Typically you upgrade, it is stable for a while, and then it quickly runs out of steam again. We did not have many people to spare for it, so all we did was keep upgrading. All things considered, the return was really not good.

If we look at the problem from a cloud-services point of view, we end up with something like this.

3.4 Log and archive data processing scheme

This is our log solution after the rework (see the slide); look at the blue line first. In the current solution, services send logs directly to a NATS queue using Protobuf. A data reporting service then uploads them to AWS S3 and raises alarms on exceptions. When developers or business teams need to look up logs, they retrieve the data through our data query platform.

The benefits are obvious. First, the four- or eight-node ELK cluster and its mounted log disks can be thrown away and no longer need human maintenance. S3 stores as much data as you have, and Athena’s query speed does not degrade as log volume grows; it stays stable. In addition, services no longer need to write logs to local disk. With ELK, logs were written to local disk and Filebeat, configured to watch a given log directory, shipped them. That meant logs had to hit the local disk, which could fill the disk and add delay. Another problem was that our services use glog as their logging package. To improve performance it buffers for 30 seconds and then writes logs in batches, but our alarms depend on logs, so alarms were delayed by 30 seconds. Because Protobuf and NATS are fast, log alerts are now more timely.
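The shape of the reporting service can be sketched as follows. To stay self-contained, the example replaces the real pieces with stand-ins: a Go channel plays the role of the NATS queue, the `Archiver` interface plays the role of the S3 upload, and plain structs replace Protobuf messages. All type and function names are hypothetical. The point it illustrates is the one made above: alarms fire as soon as an entry arrives, while archiving can still be batched.

```go
package main

import "fmt"

// Entry is one log record. In the setup described above this would be a
// Protobuf message; a plain struct keeps the sketch dependency-free.
type Entry struct {
	Service string
	Level   string
	Message string
}

// Archiver stores a batch of entries. An S3 uploader would implement it.
type Archiver interface {
	Archive(batch []Entry) error
}

// Report drains the queue until it is closed. Error entries raise an alarm
// immediately; all entries are archived in batches of batchSize.
func Report(queue <-chan Entry, a Archiver, alarm func(Entry), batchSize int) error {
	batch := make([]Entry, 0, batchSize)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		if err := a.Archive(batch); err != nil {
			return err
		}
		// Start a fresh slice so the archived batch is never overwritten.
		batch = make([]Entry, 0, batchSize)
		return nil
	}
	for e := range queue {
		if e.Level == "ERROR" {
			alarm(e) // no waiting for a buffer to fill or a timer to expire
		}
		batch = append(batch, e)
		if len(batch) >= batchSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush() // archive whatever is left when the queue closes
}

// memoryArchive keeps batches in memory; it stands in for remote storage.
type memoryArchive struct {
	batches [][]Entry
}

func (m *memoryArchive) Archive(batch []Entry) error {
	m.batches = append(m.batches, batch)
	return nil
}

func main() {
	queue := make(chan Entry, 3)
	queue <- Entry{Service: "order", Level: "INFO", Message: "order created"}
	queue <- Entry{Service: "payment", Level: "ERROR", Message: "gateway timeout"}
	queue <- Entry{Service: "order", Level: "INFO", Message: "order confirmed"}
	close(queue)

	archive := &memoryArchive{}
	alarm := func(e Entry) { fmt.Printf("ALARM %s: %s\n", e.Service, e.Message) }
	if err := Report(queue, archive, alarm, 2); err != nil {
		fmt.Println("report failed:", err)
	}
	fmt.Println("archived batches:", len(archive.batches))
}
```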

This approach also opens up more possibilities. Once data is reported to the service, log archiving is done at the same stroke, and other similar data can be reported to AWS S3 through the same path. The data query platform can do a lot too. In the past, a developer looking into one request checked the logs of one service. After the move to microservices, a single request fans out to many services, and troubleshooting requires aggregating their logs. Now the platform can query the set of related logs together so that you can locate the problem quickly. Also, some queries used to be slow because of data volume or query conditions, and the platform can apply limits or optimizations for those. A further benefit is that it supports SQL statements for combined log queries.
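The aggregation idea can be shown with a small Go sketch. It assumes that every log line carries a request identifier shared by all services that handled the request; the talk does not say how Klook correlates logs, so the `RequestID` field and the `Trace` function are purely illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// LogLine is one log record. RequestID is an assumed correlation field
// that every service copies from the incoming request.
type LogLine struct {
	RequestID string
	Service   string
	UnixMilli int64 // timestamp in milliseconds
	Message   string
}

// Trace returns every line that belongs to one request, across all
// services, ordered by time.
func Trace(lines []LogLine, requestID string) []LogLine {
	var out []LogLine
	for _, l := range lines {
		if l.RequestID == requestID {
			out = append(out, l)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].UnixMilli < out[j].UnixMilli
	})
	return out
}

func main() {
	lines := []LogLine{
		{RequestID: "r1", Service: "payment", UnixMilli: 30, Message: "charge failed"},
		{RequestID: "r2", Service: "order", UnixMilli: 15, Message: "order created"},
		{RequestID: "r1", Service: "gateway", UnixMilli: 10, Message: "request received"},
		{RequestID: "r1", Service: "order", UnixMilli: 20, Message: "order created"},
	}
	for _, l := range Trace(lines, "r1") {
		fmt.Printf("%d %-8s %s\n", l.UnixMilli, l.Service, l.Message)
	}
}
```

On a query platform that supports SQL, the same lookup would be a filter on the correlation field plus an ordering by timestamp.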

In general, when you change the way you think about architecture, broader ideas open up. Getting a relatively stable service with fewer people, less effort and lower cost is, I think, well worth it, and it brings more possibilities too. A solution like this one is simple and needs little code, yet it solves many problems. Klook is currently applying this idea to related parts of its architecture.

4. A little exploration and reflection

4.1 Take the initiative to change and find the most suitable architecture at different stages

I think every company is at a different stage, and you need to find the architecture that works best for the company. Of course, it’s best to upgrade or change the architecture in advance before it deteriorates to a certain extent, rather than waiting for the system to break down.

4.2 Be good at “Combination”

The second is the ability to combine. Integrating third-party open source components, cloud services and in-house tools into a complete microservice ecosystem that fits your business requirements, your scenarios and your company’s stage of development is very challenging.

4.3 Service stability is the most important

For back-end development, service stability is always the most important thing to remember.

Thank you!

[Q&A]

Q: I see you use Neo4j. How do you use it? I’ve been working on graph databases.

Xiong Chuanliang: It is used in our risk control system. Globalization brings some very challenging risk control problems, and we handle them with it.

Q: Your talk on microservices did not mention containerization. I’d like to hear about it.

Xiong Chuanliang: Yes, many people care about containerization. At present we use Docker only in the test environment. Why not in production? There is a reason. In my view, containers would bring our existing Go services two main benefits: service isolation and elastic scaling. On isolation, I don’t think it is a big problem as long as code quality is assured and services are allocated and deployed sensibly. On elastic scaling, the prerequisite is a large enough resource pool. We talked to the relevant people, including AWS architects, and evaluated it carefully. Doing this for the number and size of our existing services might multiply the company’s server count several times over, and the return would not be large enough to be attractive. We know which of our services are hot and can handle them in other ways, without necessarily going to such a scale. Besides, Klook’s microservice governance platform was already relatively mature by 2016, so things were fine and this was not urgent.

Q: Are there multiple data centers for internationalization?

Xiong Chuanliang: Yes, there must be multiple data centers. Each country and region has different requirements for data security and compliance.

Q: Do you already have them?

Xiong Chuanliang: Yes.

Q: I don’t understand why your company started with Java and then switched to Go for microservices. Java microservices are also very good. Why move to Go to embrace microservices?

Xiong Chuanliang: There is a reason. I know both Java and Go to some extent, and after weighing Go’s performance, reliability and ease of use, I preferred Go. I felt that the team and I could not handle doing microservices in Java. When you build a new architecture, evaluate the team’s capabilities. As I said in the slides earlier: when things go wrong, can your team handle it? I think I can handle a Go architecture, but I might not be able to handle a Java one.

Q: With microservices there are many services, and each may use a different database. Do the databases belong to the services, or do you build one data center with a database that all the services call into?

Xiong Chuanliang: Our current practice is that databases are divided by business boundary and belong to the business. A business contains many services, and those services may call the same database. To reach another business line’s database, you have to go through its interfaces.

Q: You mentioned using many open source components, each of which is effectively a service. Do you require high availability from them? If so, how do you achieve it? And with more services, maintenance costs rise. How do you deal with that?

Xiong Chuanliang: Whether an open source component deserves a high-availability setup depends on how important the service is. If it is relatively secondary, I don’t consider multi-cluster availability for it. As you saw, the initial architecture had many open source components, yet our team still has no dedicated operations staff, and business demand is heavy. We really do not want to spend too much energy building and operating open source components. After all, you use a component for its functionality, not in order to set it up and operate it.

Q: In moving to microservices, how do you handle data aggregation and data consistency?

Xiong Chuanliang: Data consistency is very important. It depends on how you divide your services, which in turn is based on business scenarios.

Q: On data aggregation and consistency: take a user, with a single user center. Your microservices all use data from it. How do you aggregate that data for the front end?

Xiong Chuanliang: The question is how to get data to the front end quickly for display. A request first goes to the outward-facing business service and then to the user center. The user center is a basic service, and other services fetch user data from it. On the other hand, for a very commonly used page or interface, we tend to go straight to a cache service and let it handle this.
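The caching pattern mentioned in the last answer can be sketched as a small read-through cache in front of the user center. This is only an illustration of the general technique: the answer does not describe Klook's cache service, and `User`, `Loader`, `UserCache` and the time-to-live policy are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// User is the data other services need from the user center.
type User struct {
	ID   string
	Name string
}

// Loader fetches a user from the user center, the basic service.
type Loader func(id string) (User, error)

type cachedUser struct {
	user    User
	expires time.Time
}

// UserCache answers repeated lookups from memory and only calls the
// user center when an entry is missing or has expired.
type UserCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	load  Loader
	items map[string]cachedUser
}

// NewUserCache creates a cache whose entries live for ttl.
func NewUserCache(ttl time.Duration, load Loader) *UserCache {
	return &UserCache{ttl: ttl, load: load, items: make(map[string]cachedUser)}
}

// Get returns the cached user if it is still fresh, otherwise it loads
// the user and stores the result. The lock is held during the load for
// simplicity, which serializes lookups; a production cache would avoid that.
func (c *UserCache) Get(id string) (User, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if it, ok := c.items[id]; ok && time.Now().Before(it.expires) {
		return it.user, nil
	}
	u, err := c.load(id)
	if err != nil {
		return User{}, err // failures are not cached
	}
	c.items[id] = cachedUser{user: u, expires: time.Now().Add(c.ttl)}
	return u, nil
}

func main() {
	calls := 0
	cache := NewUserCache(time.Minute, func(id string) (User, error) {
		calls++ // stands in for a network call to the user center
		return User{ID: id, Name: "user-" + id}, nil
	})
	cache.Get("42")
	cache.Get("42")
	fmt.Println("user center calls:", calls)
}
```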

In 2018, Gopher Meetup will be in Hangzhou for the second stop of its tour. This time, many new instructors will be invited to share their Go experience with us
