Author | Zhang Hongxiao (alias: Fu Jian), Senior Technical Expert, Alibaba New Retail | Source: Alibaba Cloud Native official account

Preface

Kaola's full cloud migration started in October 2019, and the only goal at that time was to complete the migration within a short window. For less than four months, the Kaola team's sole concern was how to accomplish this mission as quickly as possible, and cloud native proved to be the right path for us.

The Practice Process

This article mainly covers Kaola's practice in the third stage, cloud product access, and the fourth stage, upgrading the development and operations model.

Cloud Product Access

1. Cloud native product definition

Cloud native is essentially a set of technologies and methodologies. With the development of container technology, continuous delivery, orchestration systems and other technologies, and driven by open source communities and distributed microservices, moving applications to the cloud has become an irreversible trend. Truly going cloud native is not only about changes to infrastructure and platforms; the applications themselves must also change. At every stage of architecture design, development model, and application operation and maintenance, new cloud applications should be built around the characteristics of the cloud, open source and standardization; these are cloud native applications.

Cloud native technologies enable organizations to build and run applications that scale flexibly in dynamic environments such as public, private and hybrid clouds. According to the CNCF's definition, cloud native technologies include containers, service meshes, microservices, immutable infrastructure, and declarative APIs. Alibaba Cloud provides cloud native middleware products such as the message queue RocketMQ, the message queue Kafka, the Application Real-Time Monitoring Service (ARMS), the Microservices Engine (MSE), the Application High Availability Service (AHAS), the performance testing service PTS and Function Compute (FC). These laid a solid foundation for Kaola to evolve from traditional applications to cloud native applications.

2. Mindset journey

In terms of mindset, we roughly went through three stages during cloud product access.

1) Stage 1: Great products, efficient access

This stage ran roughly from October 2019 to March 2020. The products connected at that time, such as the database, Redis and ASI, had large user bases and were quite stable, and were basically fully compatible with their open source counterparts, so only a few small changes were needed to access them.

2) Stage 2: Cloud products are so rich that they have everything

In the past we maintained many components ourselves, but as the number of connected instances grew and read/write volume increased, there were occasional outages. At that time we heard that the Microservices Engine (MSE) was very useful: it provides one-stop microservice capability support, including hosting of microservice dependency components and non-intrusive microservice governance, so that microservices run faster, more stably and at lower cost. We approached the MSE team, who assured us there would be no problem, and after switching to the product those issues indeed never recurred.

There are many more examples like this. The feeling at the time was that only by using cloud native products systematically can you truly appreciate their value.

3) Stage 3: Running-in and adaptation

As Kaola began to connect to the group's business platforms and the supply chain started to integrate with the group, we pushed the cloud migration further. There were challenges along the way, but after overcoming many difficulties we completed the transformation on schedule and sailed through several big promotions smoothly. Cloud native products supported the growth of the Kaola business very well.

3. Access process

1) Access strategy

Because there are some capability differences between the cloud products and Kaola's self-built products, we established a complete access mechanism of product evaluation and test-ground verification to keep the whole access process orderly and the functionality verifiable. It was the good operation of this mechanism that ensured no major fault occurred throughout these stability-critical, large-scale changes.

Our entire guarantee process is shown as follows:

2) Permission scheme

The first problem when accessing cloud products is how to manage cloud accounts and the resource permissions of cloud products. Alibaba Cloud provides RAM as the service for managing user identities and resource access permissions. So how should RAM accounts be associated with employee identities?

  • Do you apply for a sub-account for each product, which is shared by employees?

  • Or apply for a RAM sub-account for each person and manage resource permissions for each person separately?

  • Or apply for a sub-account per application and associate the sub-account's resource permissions with employees' application permissions?

Kaola has hundreds of employees. Schemes 2 and 3 both face high sub-account lifecycle and resource permission management costs. Therefore, in the early stage of using these middleware cloud products we adopted the first scheme for simplicity: applying for one sub-account per product and sharing it among developers.

The problem with this is that the granularity of resource permissions is too coarse. For example, with SchedulerX anyone who logs in to the console can operate all tasks of all applications, which is very dangerous for production security. So, for application security, our first requirement for middleware cloud products was the ability to authorize resources at application granularity on top of RAM.

When kaola.com users log in to the cloud console, they should not be aware of the RAM account. Based on the STS (Security Token Service) capability of RAM, we encapsulated a simple temporary-authorization layer for console redirection: when an STS token is generated, the current user is obtained from BUC, and an additional permission policy is generated and attached to restrict that user's permission to operate cloud resources (applications). The login page is as follows:

SchedulerX also builds on this STS capability, associating employee identities via RoleSessionName to manage permissions. Of course, this is only a temporary solution that solves part of the problem for Kaola; the final answer still depends on an overall solution, which we will discuss later.
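To make the mechanism concrete, here is a minimal sketch of the temporary-authorization idea, assuming the Alibaba Cloud STS Java SDK is used to assume a role with an extra inline policy that narrows access to a single application's resources. The role ARN, the policy content (action and resource format), the region and the account ID are illustrative assumptions, not Kaola's actual configuration, and exact class names may differ across SDK versions.

```java
import com.aliyuncs.DefaultAcsClient;
import com.aliyuncs.IAcsClient;
import com.aliyuncs.profile.DefaultProfile;
import com.aliyuncs.sts.model.v20150401.AssumeRoleRequest;
import com.aliyuncs.sts.model.v20150401.AssumeRoleResponse;

public class ConsoleJumpAuth {

    // Issues a short-lived STS token for the given employee, scoped to one application.
    // The role ARN, policy resource format and region below are illustrative assumptions.
    public static AssumeRoleResponse.Credentials issueToken(String employeeId, String appName) throws Exception {
        IAcsClient client = new DefaultAcsClient(
                DefaultProfile.getProfile("cn-hangzhou", "<accessKeyId>", "<accessKeySecret>"));

        // Inline policy that further restricts the assumed role to resources of a single app.
        String policy = "{\n"
                + "  \"Version\": \"1\",\n"
                + "  \"Statement\": [{\n"
                + "    \"Effect\": \"Allow\",\n"
                + "    \"Action\": [\"schedulerx:*\"],\n"
                + "    \"Resource\": [\"acs:schedulerx:*:*:app/" + appName + "\"]\n"
                + "  }]\n"
                + "}";

        AssumeRoleRequest request = new AssumeRoleRequest();
        request.setRoleArn("acs:ram::<account-id>:role/console-jump"); // hypothetical role
        request.setRoleSessionName(employeeId);  // ties the session to the employee identity (BUC user)
        request.setPolicy(policy);               // effective permissions = role policy ∩ this policy
        request.setDurationSeconds(900L);        // short-lived token

        AssumeRoleResponse response = client.getAcsResponse(request);
        return response.getCredentials();
    }
}
```

The returned temporary credentials are then used to generate a console sign-in URL; because RoleSessionName carries the employee identity, per-user auditing and SchedulerX's permission association become possible.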

3) Message scheme

Migration target:

Kaola's message system is built on the message queues Kafka and RabbitMQ, with a transaction message center and a delayed message product developed on top of it to meet the rich messaging needs of the business. After investigating the cloud message queue RocketMQ, we found that it is fully compatible with and supports Kaola's existing message system, provides sufficient performance and stability guarantees, and additionally offers message trace and message query functions, making it friendlier for business use.

Implementation process:

The overall migration involved hundreds of Kaola projects, so the transformation could not be arranged at one single time; instead, a migration plan spanning several months was formulated according to Kaola's scenarios. In addition, an SDK was developed to implement message double-writing, topic mapping, support for pressure-test messages and other functional scenarios specific to Kaola. Business developers did not need to invest much effort: messages could be double-written simply by upgrading the SDK and adding a few lines of configuration (see the sketch after the phase list below).

  • Phase 1: Message double-write transformation is performed for all services.
  • Phase 2: Message double-read transformation is performed for all services.
  • Phase 3: Final switchover: the business side switches to single-write, completely stripping away Kaola's original message system.
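Below is a minimal sketch of the double-write idea behind that SDK, assuming a thin producer facade over the legacy Kafka producer and the cloud RocketMQ producer. The class name, the topic-mapping table and the switchover flag are hypothetical illustrations, not the actual Kaola SDK.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

// Hypothetical facade illustrating phase 1 (double write): every message goes to the
// legacy Kafka cluster and, via a topic mapping, to cloud RocketMQ as well.
public class DualWriteProducer {

    private final KafkaProducer<String, byte[]> legacyProducer;   // legacy Kafka
    private final DefaultMQProducer cloudProducer;                // cloud RocketMQ
    private final Map<String, String> topicMapping;               // legacy topic -> cloud topic
    private volatile boolean writeCloudOnly = false;              // phase 3: single write to cloud

    public DualWriteProducer(KafkaProducer<String, byte[]> legacy,
                             DefaultMQProducer cloud,
                             Map<String, String> topicMapping) {
        this.legacyProducer = legacy;
        this.cloudProducer = cloud;
        this.topicMapping = topicMapping;
    }

    public void send(String topic, byte[] payload) throws Exception {
        String cloudTopic = topicMapping.getOrDefault(topic, topic);
        // Always write to cloud RocketMQ so downstream consumers can switch reads safely.
        cloudProducer.send(new Message(cloudTopic, payload));
        if (!writeCloudOnly) {
            // During the double-write phase, also keep the legacy Kafka path alive.
            legacyProducer.send(new ProducerRecord<>(topic, payload));
        }
    }

    // Phase 3: flip to single write once all consumers read from the cloud.
    public void switchToCloudOnly() {
        this.writeCloudOnly = true;
    }
}
```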

4) RPC scheme

RPC mainly involves the RPC framework and the service registry. Kaola uses the RPC framework Dubbok (an internal Dubbo branch) plus Nvwa (Kaola's self-developed registry), while the group uses HSF + ConfigServer.

Because of the early requirement to interoperate with the group's microservices, the Alibaba Cloud EDAS team provided us with a Dubbo extension for the ConfigServer registry, based on HSF's compatibility with the Dubbo protocol. After introducing this extension package, Kaola applications could register to and subscribe from ConfigServer, and inter-calls with the group's HSF applications became quick and easy.

Next, we began adopting Dubbo 3.0, on whose kernel HSF 3.0 was rebuilt. After the upgrade, the original Kaola Dubbo applications gained all the features of HSF and could interoperate seamlessly with group services. However, as a new SDK it inevitably faced great challenges in functionality and performance. In the early stage we introduced this SDK and ran a month-long functional test under Kaola's scenarios, solving nearly 40 functional issues; on the performance side, problems with call latency, registry push and caching were also resolved. At the same time, Kaola's Dubbo registry extension was adapted to support Dubbo 3.0, which finally underwent large-scale verification on Singles' Day.

Meanwhile, we adopted a dual-registration and dual-subscription mode, which also laid the foundation for migrating off and decommissioning Kaola's self-developed registry later. After an application is upgraded, it can be reconfigured to connect only to the ConfigServer connection string, after which Nvwa can be taken offline. Kaola has also moved to the cloud native Microservices Engine (MSE), and we are especially grateful to the Alibaba Cloud MSE team for supporting the Dubbo-related features of Kaola's original governance platform.
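A minimal sketch of the dual-registration idea follows, using Dubbo's API configuration to register one provider with two registries. The registry addresses and protocol names ("nvwa://", "configserver://") are placeholders for the internal extensions, and the service interface is illustrative.

```java
import java.util.Arrays;
import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.RegistryConfig;
import org.apache.dubbo.config.ServiceConfig;

public class DualRegistryExporter {

    public static void main(String[] args) {
        // Legacy self-built registry (Nvwa) and the group ConfigServer extension;
        // both addresses and protocol names are placeholders for the internal extensions.
        RegistryConfig nvwa = new RegistryConfig("nvwa://nvwa.internal:9090");
        RegistryConfig configServer = new RegistryConfig("configserver://cs.internal:8080");

        ServiceConfig<OrderService> service = new ServiceConfig<>();
        service.setApplication(new ApplicationConfig("kaola-order"));
        service.setInterface(OrderService.class);
        service.setRef(new OrderServiceImpl());
        // Dual registration: the provider is visible in both registries, so consumers
        // can switch their subscriptions gradually and Nvwa can be retired later.
        service.setRegistries(Arrays.asList(nvwa, configServer));
        service.export();
    }

    interface OrderService { String query(String orderId); }

    static class OrderServiceImpl implements OrderService {
        public String query(String orderId) { return "order-" + orderId; }
    }
}
```

Consumers do the symmetric thing for dual subscription: they reference both registries during the transition, then drop the Nvwa reference once the switchover is complete.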

5) SchedulerX scheme

Challenge:

We investigated and compared the cloud scheduled-task platform SchedulerX with Kaola's Kschedule scheduled-task platform, and found that SchedulerX can be regarded as an architectural upgrade of Kschedule: it not only covers basic scheduled scheduling and sharded scheduling, but also supports larger-scale task scheduling. For the overall migration, the biggest difficulty was how to migrate and synchronize Kaola's 13,000+ scheduled tasks, each of which would otherwise need manual code changes and platform configuration, a huge drain on manpower.

The migration plan:

  • A self-developed synchronization tool handles the synchronization of the 13,000+ scheduled tasks and their alarm information, sparing business developers massive manual work (see the sketch after this list).
  • The self-developed Kaola cloud native control platform synchronizes scheduled-task permission information to ensure data security after migration.
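The synchronization idea can be sketched as a simple batch copy of task metadata between the two platforms. KscheduleAdminClient, SchedulerxAdminClient and TaskDef below are hypothetical stand-ins for the real admin APIs, used only to illustrate the shape of the tool, not its actual implementation.

```java
import java.util.List;

// Hypothetical illustration of the self-developed synchronization tool: it walks all
// Kschedule task definitions and recreates them (plus alarm contacts) on SchedulerX.
// KscheduleAdminClient / SchedulerxAdminClient / TaskDef are illustrative stand-ins,
// not real SDK classes.
public class TaskSyncTool {

    interface KscheduleAdminClient { List<TaskDef> listAllTasks(); }

    interface SchedulerxAdminClient {
        void createJob(TaskDef task);
        void createAlarm(String jobName, List<String> contacts);
    }

    record TaskDef(String name, String cron, String workerGroup, List<String> alarmContacts) {}

    private final KscheduleAdminClient source;
    private final SchedulerxAdminClient target;

    public TaskSyncTool(KscheduleAdminClient source, SchedulerxAdminClient target) {
        this.source = source;
        this.target = target;
    }

    public void syncAll() {
        int migrated = 0;
        for (TaskDef task : source.listAllTasks()) {
            target.createJob(task);                               // recreate the scheduled task
            target.createAlarm(task.name(), task.alarmContacts()); // keep alarm settings consistent
            migrated++;
        }
        System.out.println("migrated " + migrated + " tasks");
    }
}
```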

6) Environmental isolation scheme

In microservice scenarios, environment governance is a big problem. In essence, environment isolation is about maximizing the use of test-environment resources and improving the efficiency of requirement testing. Kaola originally developed a set of environment routing logic based on Dubbo's routing strategy. The idea is a trunk environment plus project environments: only the applications changed by a requirement need to be deployed in a project environment, and traffic carrying the project label is routed to that project environment first; if an application is not deployed there, the services and resources of the trunk environment are reused. Therefore the stability of the trunk environment and the routing of project environments are the priorities of test environment governance.
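The gist of that routing strategy can be expressed as a simple selection step, under the assumption that each request carries a project label and each service instance is tagged with the environment it was deployed to. The class and label names below are illustrative, not Kaola's actual router.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of trunk + project-environment routing: prefer instances deployed
// in the request's project environment, otherwise fall back to the trunk environment.
public class ProjectEnvRouter {

    record Instance(String address, String envLabel) {}

    public List<Instance> route(List<Instance> allInstances, String projectLabel) {
        if (projectLabel != null && !projectLabel.isEmpty()) {
            List<Instance> projectInstances = allInstances.stream()
                    .filter(i -> projectLabel.equals(i.envLabel()))
                    .collect(Collectors.toList());
            if (!projectInstances.isEmpty()) {
                return projectInstances; // the changed application is deployed in the project env
            }
        }
        // Not deployed for this project: reuse the stable trunk environment.
        return allInstances.stream()
                .filter(i -> "trunk".equals(i.envLabel()))
                .collect(Collectors.toList());
    }
}
```

In practice the project label travels along the whole call chain (for example as an RPC attachment), so a single entry request stays inside its project environment wherever that environment has a deployment.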

After migrating to Alibaba Cloud, we found that Alibaba Cloud actually has a similar scheme based on SCM routing that achieves the same effect, as shown in the figure below:

Functionally, however, SCM did not support Kaola's RPC framework Dubbok or its message framework. Thanks to the excellent plug-in mechanism of ARMS, we packaged HSF's SCM plug-in, through code enhancement, into a plug-in for Dubbok, giving it the Aone SCM capability. Based on a combination of JVM-level metrics and the release platform, extensive early testing and close synchronization with QA and development, we switched to the group's SCM solution within a week. Since then, Kaola has basically carried out iterative requirement development using the trunk environment plus project environment approach.

7) High availability component solution

AHAS flow control:

There are three key points in flow control. The first is access: instrumentation must be embedded in application code or base components so that metrics can be collected and limiting actions performed. The second is the limiting capability itself, plus rule configuration and delivery. The third is monitoring and alerting.

AHAS and Kaola's original rate limiting component (NFC) are basically equivalent from the user's perspective, offering annotations, explicit API calls, a Dubbo filter, an HTTP filter and so on. During migration only the corresponding APIs need to be replaced, and because the component APIs are quite simple the access cost is low. AHAS also provides JavaAgent-based access that requires no code changes.
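AHAS flow control is built on the open source Sentinel API, so after migration the explicit-call style typically looks like the minimal sketch below; the resource name and the fallback behavior are illustrative assumptions rather than Kaola's actual code.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class OrderQueryService {

    public String queryOrder(String orderId) {
        Entry entry = null;
        try {
            // Acquire an entry for the protected resource; flow rules are configured on the console.
            entry = SphU.entry("queryOrder");
            return doQuery(orderId);
        } catch (BlockException e) {
            // Request rejected by the flow-control rule: degrade gracefully.
            return "system busy, please retry later";
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }

    private String doQuery(String orderId) {
        return "order-" + orderId;
    }
}
```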

In terms of capability, AHAS is more complete than Kaola's original component, providing system-load-based protection and circuit breaking with degradation. We had a requirement for cluster-level rate limiting, and the AHAS team was very supportive and launched it before 618 so we could use it. For monitoring and alerting, it provides real-time second-level monitoring, top-N interface views and other functions, and flow-control events can automatically trigger alerts via DingTalk.

AHAS fault drill:

Kaola's applications are deployed on ASI, and AHAS-Chaos completed its access transparently to the business through the Operator capability provided by Kubernetes, and successfully took part in the group's 527 joint drill.

8) Pressure test link transformation scheme

Kaola already had a full-link pressure testing shadow scheme. Its core consists of two parts:

  • Transparent pass-through of the full-link pressure test flag

  • Traffic interception for shadow routing, service mocks, and so on

The first step of the migration was to access the Application Real-Time Monitoring Service (ARMS); the second step was to access the performance testing service PTS, have it support both ARMS and Kaola's components, and take over Kaola's original shadow routing logic.

Both ARMS and PTS use the JavaAgent approach, instrumenting the various base components through bytecode enhancement, which keeps access cost low and keeps the business largely unaware of the change. In the end we successfully completed the full-link pressure testing transformation.
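To illustrate what the shadow scheme does at the data layer, here is a minimal sketch under the assumption that a pressure-test flag travels with the request context and data-layer components check it to redirect traffic to shadow resources. The class name, flag storage and shadow-table naming convention are illustrative; real cross-thread and cross-service propagation is what the JavaAgent instrumentation handles.

```java
// Illustrative sketch of shadow routing for full-link pressure testing: a test flag is
// kept in a thread-bound context, and the data layer redirects traffic to shadow
// resources when the flag is present. Names are illustrative, not the ARMS/PTS internals.
public class ShadowContext {

    private static final ThreadLocal<Boolean> PRESSURE_TEST_FLAG =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Called by the entry filter when the incoming request carries the pressure-test marker.
    public static void markPressureTest() { PRESSURE_TEST_FLAG.set(Boolean.TRUE); }

    public static void clear() { PRESSURE_TEST_FLAG.remove(); }

    public static boolean isPressureTest() { return PRESSURE_TEST_FLAG.get(); }

    // Data-layer decision: shadow table for test traffic, real table otherwise.
    public static String resolveTable(String table) {
        return isPressureTest() ? "shadow_" + table : table;
    }

    // Downstream-dependency decision: mock external calls for test traffic.
    public static boolean shouldMockExternalCall() {
        return isPressureTest();
    }
}
```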

9) Same-city active-active scheme

After moving into the group's data centers, Kaola still had self-built components, cloud products and group components coexisting for a period of time. Based on this situation, we designed our own active-active and SPE solution.

Normal online state:

Same-room priority based on DNS and Vipserver supports both random routing of daily traffic and traffic isolation within a single room, as sketched below.
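A minimal sketch of the same-room-priority rule, assuming each service address is tagged with its room and that fallback across rooms can be switched off during isolation drills; the types and room names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of same-room priority: prefer addresses in the caller's room and
// fall back to all rooms when the local room has no instances; during single-room
// isolation drills the fallback can be disabled.
public class SameRoomSelector {

    record Address(String host, String room) {}

    public List<Address> select(List<Address> all, String localRoom, boolean allowFallback) {
        List<Address> sameRoom = all.stream()
                .filter(a -> localRoom.equals(a.room()))
                .collect(Collectors.toList());
        if (!sameRoom.isEmpty()) {
            return sameRoom;          // keep traffic inside the caller's room
        }
        return allowFallback ? all : List.of();
    }
}
```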

Single-room pressure test state:

Infrastructure as Code (IaC)

1. What is IaC

Infrastructure as Code means treating infrastructure as code: a way to build and manage dynamic infrastructure using new technologies. It regards infrastructure, tools and services, and the management of the infrastructure itself, as a software system, and adopts software engineering practices to manage changes to that system in a structured, safe way.

My understanding is that by managing the software runtime environment, software dependencies and software code consistently (changes, versions, and so on), and by providing a BaaS-like decoupled model, the software is no longer bound to a specific environment and can be replicated quickly in any environment.

2. Practice content

1) Build and deployment system

On the basis of Kaola's original application DevOps system, we combined the concepts of IaC and GitOps and reworked application build, deployment, configuration loading, and daily operation and maintenance on top of AppStack and IaC. The build, deployment and static application configuration were all migrated into the Git source code. With Git hosting all configuration related to an application, configuration version iteration is much clearer than before, and version consistency across application source code, build configuration, container configuration and static configuration can be effectively guaranteed.

2) Lightweight containers

Taking this cloud native transformation as an opportunity, we aligned Kaola's original container image system with the group standard. The biggest change was switching the startup user from AppOps to admin.

On the other hand, we introduced lightweight containers. As one of the foundations of cloud native, the isolation capability of the container layer is a major selling point. Kaola switched over entirely, completing the lightweight container transformation and splitting each Pod into an application container, an operation and maintenance container, and custom containers. The whole deployment has become lighter and easier to control.

The deployment mode after transformation is shown in the following figure.

3) CPU share

The pattern shown above is cpu-set, where a container is bound to certain CPUs and at runtime uses only those CPUs; this is most efficient on a normal host because it reduces CPU switching. Kaola's deployments were instead switched entirely to cpu-share mode: under the same NUMA chip, a container may use any CPU under that chip (as long as its total CPU time slices do not exceed the limit configuration), so whenever there are idle CPUs under the chip, preemption does not become too intense and running stability improves greatly.

Finally, in the peak pressure test verification, the Shenlong (bare metal) machines maintained a relatively stable running state with CPU below 55%, ensuring overall service stability while making full use of resources.

4) Image-configuration separation

Image-configuration separation means storing an application's container image separately from the configuration it depends on (static configuration and release configuration). In this way application images can be reused to the greatest extent, the number of image builds is reduced, and build and deployment efficiency improve. In addition, once application code is migrated to AppStack, static configuration is rolled back automatically during a rollback; business teams no longer need to roll back static configuration manually in the static configuration center, which greatly reduces rollback risk.

Moreover, once image and configuration are separated, the image can be deployed in any environment without depending on environment-specific configuration. In this way our release process can shift from being change-oriented to artifact-oriented, and the image that goes live is exactly the image we tested.

3. Implementation strategy

1) Automation

The tasks of IaC migration are configuration migration, environment migration and overall standardization. Improving migration efficiency greatly speeds up IaC adoption and also has a positive effect on how business developers perceive the migration process.

  • The build and release configuration lives in Kaola's old deployment platform, and the static configuration lives in the self-developed configuration center. The old deployment platform was first connected to Kaola's configuration center and the group's GitLab code repository; then the various configurations in the old deployment platform and configuration center were automatically rendered into the standardized service.cue template and written into the business code, automatically completing the IaC configuration migration. This greatly saved business migration time and improved migration efficiency (a sketch of the idea follows this list).

  • We developed a set of APIs for the cloud native environment that can automatically create, modify and delete cloud native environments and cloud native pipelines, which also improved access efficiency for the business.
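A minimal sketch of the automatic migration flow under these assumptions: LegacyDeployClient, ConfigCenterClient and GitRepoClient are hypothetical stand-ins for the old deployment platform, the configuration center and the GitLab repository, and the service.cue rendering is reduced to plain string templating for illustration.

```java
// Illustrative sketch of the IaC auto-migration flow: pull build/deploy config from the
// old deployment platform and static config from the config center, render them into a
// standardized service.cue, and commit the result into the application's Git repository.
// LegacyDeployClient / ConfigCenterClient / GitRepoClient are hypothetical stand-ins.
public class IacMigrator {

    interface LegacyDeployClient { String fetchBuildAndDeployConfig(String app); }

    interface ConfigCenterClient { String fetchStaticConfig(String app); }

    interface GitRepoClient { void commitFile(String app, String path, String content, String message); }

    private final LegacyDeployClient deploy;
    private final ConfigCenterClient config;
    private final GitRepoClient git;

    public IacMigrator(LegacyDeployClient deploy, ConfigCenterClient config, GitRepoClient git) {
        this.deploy = deploy;
        this.config = config;
        this.git = git;
    }

    public void migrate(String app) {
        String buildDeploy = deploy.fetchBuildAndDeployConfig(app);
        String staticConf = config.fetchStaticConfig(app);
        // Render the standardized service.cue template; real rendering is more structured.
        String serviceCue = "// generated by IaC migration\n"
                + buildDeploy + "\n"
                + staticConf + "\n";
        git.commitFile(app, "iac/service.cue", serviceCue, "IaC migration for " + app);
    }
}
```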

After the automatic IaC migration function was implemented, each application took about one minute to complete the migration of its configurations and the creation of its cloud native environment and pipeline, all without business involvement. Once this configuration mapping and reconstruction is done, the application only needs a simple build and release; after any startup problems caused by compatibility issues are resolved, the IaC migration is complete, and the overall cost is relatively low.

2) Access support

IaC access differs from a middleware upgrade: it touches application release and deployment system changes, and at that stage the stability of AppStack was not particularly high. So our access strategy was closed, on-site access in a project room with dedicated technical support, making sure the problems the business encountered were solved first, improving business engagement and satisfaction, while also letting us collect problems promptly and optimize the access process. For example, in the early stage the business had to create pipelines manually, while later we could automatically create the corresponding pipelines via API for the businesses to be migrated.

IaC business migration was carried out in two stages, and we adopted different access modes in each. By using different support modes at different stages, we achieved stable yet fast business access.

Before Singles’ Day:

  • One member of the project team was stationed in the project room to provide support
  • From Monday to Friday, developers from different departments gathered in the meeting room to focus on migration
  • Training on the relevant knowledge every morning, with application switchovers in the afternoon and evening

After Singles’ Day:

  • Three people were assigned to the project room to support the project
  • Each week only designated departments were migrated, and each department sent designated people to complete all of that week's migration work
  • Training is held every Monday morning

The difference between the two stages mainly comes from the limited platform stability and business developers' low familiarity early on, so access was relatively cautious, more in a mindset of validation and promotion. Once things became relatively stable, access was rolled out horizontally at full speed.

Results

1. No major fault occurred

Kaola's cloud native transformation cycle was very long. With the full cooperation of project team members, no major failure occurred due to the cloud native transformation, whether in large promotions like 618 and Double 11 or in ordinary promotions like the monthly membership day.

2. Good integration results

  • Resolved the differences between Kaola's and the group's application deployment, becoming fully compatible with the group's current model and completing alignment with the group's technical system at the deployment level.
  • Resolved the differences between Kaola's internal calls and calls to the group.
  • Built SPE and active-active capabilities, aligning the disaster recovery (DR) system with the group.

3. Efficiency improvement and cost saving

  • Migrating the stateful containers cut each deployment batch by 100 seconds and resolved startup failures caused by IP changes.
  • Configuration and code are strongly bound, so subsequent rollbacks no longer require rolling back static configuration separately.
  • Scaling each application from daily capacity up to the big-promotion baseline water level now costs about 0.5 person-days.
  • The number of servers is reduced by 250.

4. Improved cloud product capabilities

  • Promoted the resolution of cloud product usability and stability issues and enriched the scenario coverage of cloud middleware products.
  • Promoted the resolution of issues such as production safety and account management in the cloud native process.

Looking ahead: Mesh is one of the directions we will invest in

Technology sinking into the infrastructure is the general trend of Internet development, and in the microservices era Service Mesh has emerged. Although introducing a Mesh proxy brings some performance loss, resource overhead, and operation and management cost for Mesh service instances, it shields developers from much of the complexity of distributed systems, letting them return to the business and focus on real value:

  1. It focuses on service logic and shields the complexity of distributed system communication (such as load balancing, service discovery, authentication and authorization, monitoring and tracking, and traffic control) through Mesh.
  2. Services can be written in any language.
  3. Infrastructure is decoupled and transparent to applications; Mesh components can be upgraded independently, so the infrastructure can iterate faster.

Kaola has been steadfastly carrying out its cloud native transformation over the past year. Although we encountered many challenges along the way, we never doubted the correctness of this direction and gained more business value each time a problem was solved. On this year's Double 11, the full cloud native upgrade helped Kaola cut 250 servers and deposited a complete set of cloud landing practices on IaaS + PaaS. Kaola's R&D efficiency on the cloud has also improved greatly; for example, with the service of the Alibaba Cloud live broadcasting center, Kaola quickly built its overseas live streaming service from 0 to 1, and new features such as "tree TV" and "Like community" have also been launched.

As the cloud native transformation continues, its dividends are becoming more and more significant. I believe that as business and infrastructure are further decoupled, one day business will be completely independent of infrastructure: business developers will only need to care about their own business and no longer worry about the runtime environment, greatly improving R&D efficiency.

About the author

**Zhang Hongxiao (alias: Fu Jian)**, Senior Technical Expert at Alibaba New Retail, with 10 years of experience in development, testing, operations and maintenance, and rich experience in infrastructure and R&D efficiency. He actively embraces cloud native and advocates sustainable, fast, high-quality software delivery.

If you want to learn more about the open source and cloud products behind Kaola's Double 11, you are welcome to join the Spring Cloud Alibaba Meetup in Guangzhou on January 30 to learn how Perfect Diary and Huya use the SCA family suite to empower their business and implement microservices.