Background

As is well known, the main programming language on the back end of the Zhihu community is Python.

With the rapid growth of Zhihu's user base and the continuous increase in business complexity, core-business traffic has grown several-fold over the past year, and the pressure on the server side has grown with it. As the business expanded, Python, being a dynamically typed, interpreted language, exposed two problems: low runtime efficiency and high maintenance costs:

  1. Low runtime efficiency. Zhihu currently has limited machine-room and rack space, and at the present growth rate of users and traffic it is foreseeable that server resources will run out in the near term (partly for this reason, Zhihu is also upgrading from a single-datacenter architecture to a multi-datacenter, multi-active architecture).

  2. Python’s overly flexible language features lead to high collaboration and project maintenance costs.

Benefiting from the growth of the open source community and the maturing of key technologies such as containers in recent years, Zhihu's base platform has been fairly open in its technology choices. On top of open standards, mature open source middleware is available for every language, which lets each business pick the most appropriate language and tools for its problem scenario.

Against this background, in order to address the resource consumption and maintenance costs of a dynamic language, we decided to try refactoring the most resource-hungry core businesses in a statically typed language.

Why Golang

As mentioned above, Zhihu is relatively open in its back-end technology choices. In the past few years, in addition to Python as the main development language, services have also been written in Java, Golang, NodeJS, Rust and other languages.




Golang is one of the most actively discussed programming languages inside Zhihu. Considering the following points, we decided to try refactoring the high-concurrency core businesses in Golang:

  • Built-in concurrency support, especially well suited to IO-intensive applications

  • Zhihu's internal base components already have fairly complete Golang versions

  • Static typing makes multi-person development and long-term maintenance safer and more reliable

  • A build produces a single executable, which makes deployment easy

  • The learning cost is low, and development efficiency is not significantly lower than Python's

Compared with Java, another excellent candidate, Golang won out because of Zhihu's internal ecosystem, ease of deployment, and the engineers' interest.

Reconstruction results

So far, the Zhihu community's member service (RPC, peaking at hundreds of thousands of QPS), comment service (RPC + HTTP), and Q&A service (RPC + HTTP) have all been rewritten in Golang. Meanwhile, because the Golang base components were further improved during the migration, some new businesses have chosen Golang from day one, and Golang has become one of the recommended languages for new projects at Zhihu.

Compared with the situation before the migration, the improvements so far are:

  1. More than 80% of server resources are saved. Our deployment system uses blue-green deployment, and previously the most resource-hungry services could not run both colors at once because of container resource constraints and had to be rolled out sequentially. After the rewrite, resource usage dropped enough that this problem was effectively solved.

  2. The costs of multi-person development and project maintenance are significantly reduced. When maintaining a large Python project, you often have to trace a call several levels deep just to confirm a function's argument types and return values. In Golang, everything is written against explicitly defined and implemented interfaces, which makes coding much safer, and many problems that Python would only reveal at runtime are caught at compile time.

  3. The internal Golang base components were improved. As mentioned above, the relative completeness of Zhihu's internal Golang base components was one of the preconditions for choosing Golang. Still, during the refactoring we found some of them incomplete or missing, so we improved or contributed many base components along the way, which in turn made the Golang migration of other projects easier.




Implementation process

Thanks to Zhihu's thorough adoption of microservices, it is very convenient for an independent microservice to switch languages: a single business can be migrated with external dependents barely noticing.

Inside Zhihu, every independent microservice owns its own resources, services never share resources, and all interaction happens through RPC requests. Each container group of an externally exposed service (HTTP or RPC) serves traffic through its own HAProxy address. A typical microservice looks like this:




Our Golang migration therefore proceeds in the following steps:

Step 1. Refactor the logic in Golang

First, we start a new microservice whose business logic is reimplemented in Golang, with two constraints:

  1. The protocol exposed by the new service (HTTP and RPC interface definitions, and the returned data) stays exactly the same as before (keeping the protocol consistent is important, because it makes migrating dependent parties much easier later)

  2. The new service has no resources of its own; it uses the resources of the service being refactored:




Step 2. Verify the correctness of the new logic

When the code refactoring is complete, we verify the correctness of the new service before switching traffic to the new logic.

Read interfaces are idempotent, so calling them repeatedly has no side effects. Therefore, once a new interface is implemented, whenever the old service receives a request it also fires a coroutine that sends the same request to the new service, and the results of the old and new services are compared (a rough Go-flavoured sketch of the comparison idea follows the steps below):

    1. When the request reaches the old service, a coroutine is immediately initiated to request the new service, while the old service’s main logic executes normally.

    2. When both responses are back, the data returned by the old service is compared with that of the new implementation; if they differ, a metric is recorded and the difference is logged.

    3. Engineers locate bugs in the new implementation from the metrics and logs, fix them, and keep verifying (in the process we also found quite a few bugs in the original Python implementation).
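
Purely for illustration, the same idea can be sketched in Go; in reality the comparison ran inside the old Python service as a coroutine, and CompareReads, the Answer type and the callback parameters below are hypothetical names, not Zhihu's actual code.

    package shadow

    import (
        "context"
        "log"
        "reflect"
    )

    // Answer is a placeholder; the real request/response types come from the
    // generated thrift code.
    type Answer struct {
        ID      int64
        Content string
    }

    // CompareReads runs the old read path as usual, shadows the same request to
    // the new service, and reports any mismatch via metric + log, without ever
    // affecting the response returned to the caller.
    func CompareReads(
        ctx context.Context,
        id int64,
        oldRead func(context.Context, int64) (*Answer, error), // legacy path
        newRead func(context.Context, int64) (*Answer, error), // new Golang service
        incrMismatch func(), // metric hook
    ) (*Answer, error) {
        shadow := make(chan *Answer, 1)
        go func() {
            defer close(shadow)
            defer func() { recover() }() // the shadow path must never break real traffic
            if a, err := newRead(ctx, id); err == nil {
                shadow <- a
            }
        }()

        // The old main logic executes normally; its result is what we return.
        oldAnswer, err := oldRead(ctx, id)
        if err != nil {
            return nil, err
        }

        // Compare in the background: emit a metric and a log line on mismatch.
        go func() {
            if newAnswer, ok := <-shadow; ok && !reflect.DeepEqual(oldAnswer, newAnswer) {
                incrMismatch()
                log.Printf("shadow mismatch for answer %d", id)
            }
        }()
        return oldAnswer, nil
    }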




Most write interfaces, however, are not idempotent, so they cannot be verified this way. For write interfaces we mainly relied on the following to ensure the new logic is equivalent to the old:

    1. Unit test guarantee

    2. Developer verification

    3. QA validation

Step 3. Gray release

Once everything is verified, we start forwarding traffic to the new service by percentage, gradually increasing the share.

At this point, the request is still proxied to the container group of the old service, but instead of processing the request, the old service forwards the request to the new service and returns the data returned by the new service directly.

We do not switch at the traffic entry point right away so that stability is guaranteed and we can roll back quickly if anything goes wrong.




Step 4. Switch the traffic entry

When the percentage from the previous step reaches 100%, requests are still proxied to the old container group, but all returned data is already generated by the new service. At this point we can switch the traffic entry directly to the new service.




Step 5. Take the old service offline

The refactoring itself is now done. However, the new service's resources are still registered under the old service, and the old service, though carrying no traffic, has not actually been decommissioned.

So the final step is simply to transfer ownership of the old service's resources to the new service and take the old service offline.




At this point, refactoring is complete.

Golang project practice

During the refactoring we stepped into quite a few pitfalls. Some of them are listed here; if you have similar refactoring plans, they may serve as a reference.

Understanding the business is a prerequisite for a language rewrite

Don’t mindlessly translate original code, and don’t mindlessly fix a seemingly broken implementation. At the beginning of the refactoring, we found some things that looked like they could be done better, but after a bit of tinkering, some strange problems arose. The lesson here is that it’s important to understand the business and understand the original implementation before refactoring. Ideally, the entire refactoring process should involve the appropriate business engineers.

The project structure

We actually took quite a few detours before settling on the right project structure.

At first, based on our Python experience, each layer exposed its interface to the next simply as functions. It quickly became clear, however, that Golang cannot be monkey-patched for testing the way Python can.

Through evolution and reference to various open source projects, our current code structure looks something like this:

    .
    ├── bin            --> executable files generated by the build
    ├── cmd            --> main-function entry points of the various services (RPC, Web, etc.)
    │   ├── service
    │   │   └── main.go
    │   ├── web
    │   └── worker
    ├── gen-go         --> automatically generated from the RPC thrift interfaces
    ├── pkg            --> the real implementation (detailed below)
    │   ├── controller
    │   ├── dao
    │   ├── rpc
    │   ├── service
    │   └── web
    │       ├── controller
    │       ├── handler
    │       ├── model
    │       └── router
    ├── thrift_files   --> thrift definitions of the RPC interfaces
    ├── vendor         --> third-party dependencies (managed with dep ensure)
    ├── Gopkg.lock     --> dependency version lock
    ├── Gopkg.toml
    ├── joker.yml      --> application build configuration
    ├── Makefile
    └── README.md

The directories are:

  • bin: the executable files produced by the build; online services are usually started as 'bin/xxxx-service'

  • cmd: the main-function entry points of the various services (RPC, Web, offline tasks, and so on); programs are generally started from here

  • gen-go: code automatically generated by the thrift compiler; we usually configure the Makefile so it can be regenerated with 'make thrift' (one drawback of this approach is that upgrading the thrift version becomes harder)

  • pkg: the real business implementation (more on that below)

  • thrift_files: the thrift definitions of the RPC interfaces

  • vendor: the third-party libraries the project depends on

The real logic of the project lives under pkg, which is structured as follows:

    pkg/
    ├── controller
    │   ├── ctl.go          --> the interface
    │   ├── impl            --> the business implementation of the interface
    │   │   └── ctl.go
    │   └── mock            --> the mock implementation of the interface
    │       └── mock_ctl.go
    ├── dao
    │   ├── impl
    │   └── mock
    ├── rpc
    │   ├── impl
    │   └── mock
    ├── service             --> entry point of this project's RPC service interface
    │   ├── impl
    │   └── mock
    └── web                 --> the web layer
        ├── controller      --> web-layer controller logic
        │   ├── impl
        │   └── mock
        ├── handler         --> the HTTP handlers
        ├── model
        ├── formatter       --> formats internal data for output
        └── router          --> routing

In the structure above, note that each layer typically contains both an impl and a mock package.




This is because Golang cannot dynamically mock out an implementation as easily as Python, which makes testing harder. We take testing very seriously, and the Golang implementation has maintained over 85% test coverage. Therefore, each layer is first abstracted into interfaces (ctl.go above), and upper layers call lower layers only through those interface contracts. At run time, the real business logic is wired in by dependency-injecting the implementation in impl; at test time, the mock implementation in mock is injected instead.
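
A condensed sketch of this layering, with illustrative names rather than the real Zhihu code (in the actual layout the pieces live in the separate controller, impl and mock packages shown above):

    package controller

    import "context"

    // Answer is a placeholder model for this sketch.
    type Answer struct {
        ID      int64
        Content string
    }

    // ctl.go: the interface that upper layers (web handlers, the RPC service) call.
    type Controller interface {
        GetAnswer(ctx context.Context, id int64) (*Answer, error)
    }

    // impl/ctl.go: the real implementation, bound via dependency injection at run time.
    type controllerImpl struct {
        loadAnswer func(context.Context, int64) (*Answer, error) // stands in for the dao-layer interface
    }

    func (c *controllerImpl) GetAnswer(ctx context.Context, id int64) (*Answer, error) {
        return c.loadAnswer(ctx, id)
    }

    // mock/mock_ctl.go: the mock implementation, bound in tests so the layer above
    // can be unit-tested without touching the real dao/rpc layers.
    type controllerMock struct{}

    func (controllerMock) GetAnswer(_ context.Context, id int64) (*Answer, error) {
        return &Answer{ID: id, Content: "mocked"}, nil
    }

Because both implementations satisfy the same Controller interface, switching between impl and mock is a one-line change in the wiring or the test setup.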

At the same time, to make business development easier, we also built a Golang project scaffold that can directly generate a Golang service with HTTP & RPC entry points. The scaffold has been integrated into ZAE (Zhihu App Engine), and the default template code is generated as soon as a Golang project is created, so new Golang projects get a framework that works out of the box.

Static code review, the earlier the better

We realized rather late that static code checks are best introduced at the very start of a project, with strict quality standards enforced on the main branch.

The problem with introducing them late is that too much existing code already fails the checks, so we had to temporarily ignore many of them.

Humans cannot avoid very basic, even silly, mistakes 100% of the time, and that is precisely what linters are for.

In practice we use gometalinter, which performs no checks itself but integrates various linters behind unified configuration and output. We enabled go vet, golint and errcheck.

Degradation

At what granularity should degradation be applied? Some engineers would say at the level of RPC calls; our answer is at the level of features.

During the refactoring, for every feature point that could be degraded, we asked "if this feature is unavailable, what is the impact on users?" and added degradation accordingly, together with metrics and alerts for every degradation path. The net effect is that even if all the external RPC dependencies of the Q&A service go down (including basic services such as member and authentication), Q&A itself can still serve question and answer browsing.

Our degradation is built on top of circuit, around which we wrapped features such as metric collection and log output. Twitch also uses this library in production, and we have been running it for more than six months without any problems.
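
As a rough illustration only, a feature-level degradation wrapper built on the circuit library might look like the sketch below; the import path, circuit name, MemberInfo type and fetch callback are all assumptions, not Zhihu's actual wrapper.

    package degrade

    import (
        "context"
        "log"

        "github.com/cep21/circuit" // import path/version may differ from what is actually in use
    )

    // MemberInfo stands in for the member data attached to answers.
    type MemberInfo struct {
        ID   string
        Name string
    }

    var (
        manager = &circuit.Manager{}
        // The circuit guards a feature ("show member info"), not a single RPC call.
        memberCircuit = manager.MustCreateCircuit("qa.member_info")
    )

    // GetMemberInfo falls back to placeholder member data when the dependency is
    // unhealthy, so browsing questions and answers keeps working.
    func GetMemberInfo(ctx context.Context, memberID string,
        fetch func(context.Context, string) (MemberInfo, error)) MemberInfo {

        var info MemberInfo
        _ = memberCircuit.Execute(ctx,
            func(ctx context.Context) error {
                var err error
                info, err = fetch(ctx, memberID) // the real RPC to the member service
                return err
            },
            func(ctx context.Context, err error) error {
                // Degraded path: log it (and, in real code, emit a metric) and serve empty data.
                log.Printf("member_info degraded: %v", err)
                info = MemberInfo{ID: memberID}
                return nil
            })
        return info
    }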

Anti-pattern: panic/recover

One of the things that most people don’t get used to when they start developing with Golang is its error handling. A simple HTTP interface implementation might look like this:

    func (h *AnswerHandler) Get(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()

        loginId, err := auth.GetLoginID(ctx)
        if err != nil {
            zapi.RenderError(err)
            return
        }

        answer, err := h.PrepareAnswer(ctx, r, loginId)
        if err != nil {
            zapi.RenderError(err)
            return
        }

        formattedAnswer, err := h.ctl.FormatAnswer(ctx, loginId, answer)
        if err != nil {
            zapi.RenderError(err)
            return
        }

        zapi.RenderJSON(w, formattedAnswer)
    }

As above, every call is followed by an error check. Verbosity is the lesser problem; the bigger one is that if the return after handling the error is forgotten, the logic is not interrupted and the code keeps executing. We did make exactly this mistake in real development.

To address this, we catch panics in a middleware layer outside the framework: if the recovered value is a framework-defined error, it is rendered as an HTTP error; otherwise the panic is re-raised. The modified code looks like this:

    func (h *AnswerHandler) Get(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()

        loginId := auth.MustGetLoginID(ctx)
        answer := h.MustPrepareAnswer(ctx, r, loginId)
        formattedAnswer := h.ctl.MustFormatAnswer(ctx, loginId, answer)

        zapi.RenderJSON(w, formattedAnswer)
    }

As mentioned above, where the business logic used to call RenderError and return immediately, it now panics when an error is encountered. The panic is caught at the HTTP framework layer: if it is an HTTPError defined in the project, it is returned to the front end as the corresponding 4xx JSON; otherwise it keeps propagating and eventually reaches the front end as a 5xx.
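
A minimal sketch of that middleware layer, assuming a hypothetical HTTPError type and helper names (the real zapi framework code is not shown here):

    package zapiexample

    import (
        "context"
        "encoding/json"
        "net/http"
    )

    // HTTPError is a hypothetical framework-defined error carrying a status code.
    type HTTPError struct {
        Code    int
        Message string
    }

    func (e *HTTPError) Error() string { return e.Message }

    // Recover catches panics raised by the Must* helpers: a framework-defined
    // HTTPError is rendered as a 4xx JSON response, anything else is re-panicked
    // and eventually surfaces as a 5xx.
    func Recover(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            defer func() {
                if e := recover(); e != nil {
                    if httpErr, ok := e.(*HTTPError); ok {
                        w.Header().Set("Content-Type", "application/json")
                        w.WriteHeader(httpErr.Code)
                        _ = json.NewEncoder(w).Encode(map[string]string{"error": httpErr.Message})
                        return
                    }
                    panic(e)
                }
            }()
            next.ServeHTTP(w, r)
        })
    }

    // MustGetLoginID shows the shape of a Must-style helper: it panics with an
    // HTTPError instead of returning (value, error).
    func MustGetLoginID(ctx context.Context, getLoginID func(context.Context) (string, error)) string {
        id, err := getLoginID(ctx)
        if err != nil {
            panic(&HTTPError{Code: http.StatusUnauthorized, Message: "not logged in"})
        }
        return id
    }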

This implementation is not a recommendation, and it is not the officially recommended Golang style either. It does, however, solve a real problem effectively, so we offer it here as one more reference.

Goroutine startup

When building the response model, much of the logic has no mutual dependencies and can run concurrently; starting multiple goroutines to fetch data in parallel can greatly reduce response time.

However, one pitfall that new Golang users easily hit is that if a newly started goroutine panics, its "parent" cannot recover it — strictly speaking, there is no such thing as a parent goroutine; once started, it is an independent goroutine.

So be very careful here: if a newly started goroutine may panic, it has to recover inside itself. A better approach, of course, is to add a layer of encapsulation rather than starting goroutines "naked" in business code.

We therefore borrowed the idea of Java's Future and wrote a simple wrapper. Wherever a goroutine needs to be started, it is started through the wrapped Future, which handles panics and other corner cases.
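
What such a wrapper might look like, as a rough sketch only (the internal Future actually used at Zhihu is not shown in this article):

    package future

    import "fmt"

    // Future holds the result of a task running in its own goroutine.
    type Future struct {
        done  chan struct{}
        value interface{}
        err   error
    }

    // Go starts fn in its own goroutine and converts a panic into an error
    // instead of letting it crash the whole process.
    func Go(fn func() (interface{}, error)) *Future {
        f := &Future{done: make(chan struct{})}
        go func() {
            defer close(f.done)
            defer func() {
                if p := recover(); p != nil {
                    f.err = fmt.Errorf("goroutine panic recovered: %v", p)
                }
            }()
            f.value, f.err = fn()
        }()
        return f
    }

    // Get blocks until the task finishes and returns its result.
    func (f *Future) Get() (interface{}, error) {
        <-f.done
        return f.value, f.err
    }

Call sites then read as f := future.Go(...) followed later by f.Get(), so no goroutine in business code runs without panic protection.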

http.Response body not closed causes goroutine leaks

For a while we noticed that the number of goroutines in a service kept growing over time and dropped immediately whenever the container was restarted, so we suspected a goroutine leak in the code.




By inspecting goroutine stacks and adding logs in the dependent libraries, we eventually traced the problem to one of the internal base libraries: it used http.Client but did not call resp.Body.Close(), which leaked goroutines.

One lesson learned here: do not use http.Get directly in production; create your own http.Client instance, set a timeout, and always close the response body.
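
A sketch of the safer pattern (the timeout value and helper name are illustrative):

    package httpget

    import (
        "context"
        "io/ioutil"
        "net/http"
        "time"
    )

    // A shared client with an explicit timeout, instead of http.Get's default
    // client, which has no timeout at all.
    var client = &http.Client{Timeout: 5 * time.Second}

    // Get fetches a URL and always closes the response body; forgetting the
    // Close below is exactly the kind of bug that leaked goroutines for us.
    func Get(ctx context.Context, url string) ([]byte, error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := client.Do(req.WithContext(ctx))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        return ioutil.ReadAll(resp.Body)
    }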

After fixing this issue, it works fine:




Although the problem is only briefly described here, actually tracking it down took quite a bit of time; troubleshooting goroutine leaks deserves an article of its own.

Finally

The Golang refactoring of the core business was carried out by the Community Business Architecture team and the Community Content Technology team over Q2/Q3 2018. Some of the members of the two teams:

Yao Gangqiang @Adam Wen @Wan Qiping @Chen Zheng @Yetingsky @Wang Zhizhao @Chai Xiaomiao @XLZD

The Community Business Architecture team is responsible for the problems and challenges brought by the rapidly growing business complexity and concurrency at the back end of the Zhihu community. As Zhihu's business scale and user base keep growing, the team faces more and more technical challenges. We are currently building the multi-datacenter, multi-active architecture of the Zhihu community, while continuing to guarantee and improve the quality and stability of the Zhihu back end.

