Brief introduction: This is the first article in the series "How to Build a Lossless Traffic Architecture for Online Applications". The series consists of three articles that use the plainest language possible to classify the technical problems affecting the stability of online application traffic. Some of the solutions to these problems are only code-level details, some require supporting tools, and others need costly architectural changes. If you want a one-stop "lossless traffic" experience for your application on the cloud, take a look at Alibaba Cloud's Enterprise Distributed Application Service (EDAS), which will keep evolving toward lossless traffic access by default.

Authors | Guyi, Shimian

Preface

GitHub suffered a global outage of more than six hours due to a software update… Meta (formerly Facebook) also experienced a configuration push error that caused a global outage of more than six hours…

Major IT system failures like these pop up every once in a while. Building a secure and reliable online application architecture for the enterprise is a system architect's primary responsibility. Beyond understanding the business architecture well enough to handle current traffic safely, the architect needs to build for the future: the chosen architecture should be able to cope with business growth over the next few years. This ability has little to do with technology fashion; it depends on the talent market capacity of the chosen technology and on the business form and growth direction of the enterprise.

Setting aside the infrastructure of IT systems and the specifics of the enterprise's business, we can abstract IT down to two key metrics for online applications: traffic and capacity. Capacity exists to meet the basic needs of traffic, and the goal of our constant optimization is to find a balance point between these two metrics that represents "technological advancement". This balance point means serving existing and predictable business traffic efficiently and precisely with existing resources (capacity): efficiency means performance, and precision means no loss.

This three-part series aims to bring technology back to the root problem that a system architect needs to solve: how to serve the traffic of an online application with maximum efficiency and no loss.

Problem definition

Let us start from a common business deployment architecture diagram (note: the author's background is in Java, and the infrastructure discussed is mainly cloud services, so many of the examples use techniques from the Java ecosystem and cloud services; some details may not carry reference value for other programming languages):

This diagram shows a typical and very simple business architecture that serves users from around the world. When a user request arrives, it passes through load balancing into the microservice gateway cluster behind it. The gateway performs some basic traffic cleaning, such as authentication and auditing, and then routes the request to the microservice cluster behind it according to the business form. Finally, the whole microservice cluster exchanges (reads/writes) data with various data services.

Based on the description above, we divide the whole process of a traffic request being served into four main stages: traffic resolution, traffic access, traffic service, and data exchange. Traffic can be lost in each of these four stages, and the solutions differ completely from stage to stage: some problems can be solved with a few framework configurations, while others may require an overall architecture refactoring. We will examine each of these stages across the three articles, starting with traffic resolution and traffic access.

Traffic resolution

The essence of the resolution scenario is obtaining the address of a service from its name; this is ordinary DNS resolution. However, under the management policies of the various carriers, traditional domain name resolution often runs into problems such as domain name caching, domain name forwarding, resolution delay, and cross-carrier access. In a global Internet business especially, a traditional web service resolved through plain DNS does not take the end user's location into account: it picks one IP address at random and returns it, which not only reduces resolution efficiency because of carrier-side caching, but can also slow the end user down with, for example, a transatlantic round trip. All of these problems can directly cause traffic loss. To cope with these scenarios, common solutions include intelligent resolution and HTTPDNS, described below:

1. Intelligent resolution

Intelligent resolution was initially used mainly to solve network failures caused by resolution across different carriers: it selects an access point on the matching carrier network according to the address of the client, so that the correct address is resolved. As the technique iterated and evolved, some products added network-quality monitoring nodes in different regions on top of this, so that an address can be chosen from a group of machines even more intelligently according to the service quality of each node. Today, intelligent resolution on Alibaba Cloud relies on cloud infrastructure and can even dynamically scale the nodes in a pool per application, maximizing the value of cloud elasticity in this area:
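The selection logic described above can be sketched in a few lines. This is a minimal illustration, not Alibaba Cloud's implementation: the "line" keys (region plus carrier) and the address pools are hypothetical, and a real resolver would also weigh live network-quality data per node.

```java
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of intelligent resolution: instead of returning a random
 * record, pick the address pool that matches the client's region/carrier
 * "line", and fall back to a default pool for unknown lines.
 */
public class GeoAwareResolver {
    private final Map<String, List<String>> poolsByLine; // e.g. "cn-telecom" -> IPs
    private final List<String> defaultPool;

    public GeoAwareResolver(Map<String, List<String>> poolsByLine,
                            List<String> defaultPool) {
        this.poolsByLine = poolsByLine;
        this.defaultPool = defaultPool;
    }

    /** Resolve for a client line; unknown lines get the default pool. */
    public List<String> resolve(String clientLine) {
        return poolsByLine.getOrDefault(clientLine, defaultPool);
    }
}
```

In practice the client line is inferred from the source IP of the DNS query (or the EDNS Client Subnet option), which is exactly the signal plain DNS ignores.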

(Image from the Alibaba Cloud DNS resolution documentation)

2. HTTPDNS

HTTPDNS is a discovery mechanism that uses HTTP instead of the DNS protocol. Typically the service provider (or a self-built service) exposes an HTTP server with a very simple interface: before initiating a lookup, the client calls this interface through an HTTPDNS SDK, and falls back to the original LocalDNS resolution if the call fails. HTTPDNS is particularly effective against DNS hijacking, domain name caching, and similar scenarios. The downsides are that most solutions require embedding an SDK in the client, and the server side can be somewhat expensive to build. But as cloud vendors keep iterating in this space, more and more lightweight solutions are emerging, which will help HTTPDNS become one of the main directions of DNS evolution.
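The resolve-then-fallback flow is the heart of an HTTPDNS SDK, and it can be sketched independently of any vendor. In this sketch both resolvers are injected as plain functions (an assumption made for testability); a real SDK would issue an HTTP GET against the provider's endpoint and cache answers by TTL.

```java
import java.util.List;
import java.util.function.Function;

/**
 * Sketch of an HTTPDNS client: try the HTTP-based resolver first,
 * and fall back to LocalDNS when it fails or returns nothing.
 */
public class HttpDnsClient {
    private final Function<String, List<String>> httpResolver;  // HTTPDNS endpoint
    private final Function<String, List<String>> localResolver; // LocalDNS fallback

    public HttpDnsClient(Function<String, List<String>> httpResolver,
                         Function<String, List<String>> localResolver) {
        this.httpResolver = httpResolver;
        this.localResolver = localResolver;
    }

    public List<String> resolve(String host) {
        try {
            List<String> ips = httpResolver.apply(host);
            if (ips != null && !ips.isEmpty()) {
                return ips; // HTTPDNS answer wins when available
            }
        } catch (RuntimeException ignored) {
            // HTTPDNS unreachable or errored: fall through to LocalDNS
        }
        return localResolver.apply(host);
    }
}
```

Because the fallback path is ordinary LocalDNS, adopting HTTPDNS never makes resolution worse than the status quo; it only bypasses hijacking and stale caches when the HTTP path is healthy.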

Traffic access

After DNS resolves the correct IP address, the request enters the core entrance of traffic access. The main actor in this process is the gateway. A gateway usually plays an important role in routing, authentication, auditing, protocol conversion, API management, and security protection. The common pairing in the industry is load balancing plus a microservice gateway, but some scenarios are easy to overlook when the two components are combined. The following figure shows an example.

1. Traffic security

Insecure traffic falls into two types. The first is attack traffic; the second is traffic that exceeds system capacity. Defense against attack traffic is based on software and hardware firewalls; such solutions are mature and will not be described here. The second type is harder: traffic that is not an attack but exceeds system capacity is difficult to identify, and the real question becomes how to let as much normal traffic as possible receive service while protecting a business system that would otherwise avalanche.

A simple attempt is to apply RateLimit-style traffic control at the gateway in terms of request QPS, number of concurrent requests, requests per minute, or even interface parameters. But before doing so, the system architect needs a clear idea of the system's capacity. We recommend performing an overall capacity assessment before configuring such safety policies. The assessment here is not as simple as stress-testing a few APIs; a full-link stress test in the real production environment is recommended (the third article covers this in detail), with the security policies then tuned accordingly.
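As a concrete illustration of QPS-based control, here is a minimal token-bucket limiter of the kind a gateway filter might call per request. It is a sketch, not any particular gateway's implementation; the capacity and rate would come from the capacity assessment described above, and the clock is passed in explicitly to keep the logic deterministic.

```java
/**
 * Sketch of gateway-side rate limiting with a token bucket:
 * sustained rate = permitsPerSecond, burst size = capacity.
 * Tokens are refilled lazily on each acquire attempt.
 */
public class TokenBucket {
    private final long capacity;        // maximum burst, in requests
    private final double refillPerNano; // sustained QPS converted to tokens/ns
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double permitsPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;      // start full so startup bursts pass
        this.lastRefill = nowNanos;
    }

    /** Returns true if the request may pass; callers reject or queue otherwise. */
    public synchronized boolean tryAcquire(long nowNanos) {
        tokens = Math.min(capacity, tokens + (nowNanos - lastRefill) * refillPerNano);
        lastRefill = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production most Java stacks would reach for a maintained implementation (for example Guava's RateLimiter or Sentinel's flow rules) rather than hand-rolling this, but the sizing question is the same: the bucket's numbers are only as good as the capacity assessment behind them.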

2. Gateway application control

In the final analysis, a gateway is still an application. Judging from the customer systems we have worked with, in a business system with a complete microservice architecture this application occupies 1/6 to 1/3 of the machine resources of the whole system, which makes it a very large application. Being an application, it undergoes routine operation and maintenance actions such as starting, stopping, deploying, and scaling. But because the gateway is the throat of the business system, these actions cannot be performed casually, which means a few principles need to be communicated very clearly to the development team:

  • **Configuration and code decoupling:** capabilities such as protocol forwarding, rate limiting, security policies, and routing rules must be configurable and delivered dynamically.
  • **Do not mix in business logic:** return the gateway to its essence and keep business logic out of it. A gateway that does not play tricks of its own is a good gateway!
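The first principle can be made concrete with a small sketch: routing rules live behind an atomic reference that a config-center listener swaps, so new rules take effect without restarting or redeploying the gateway. The rule shape here (path prefix to upstream name) is a hypothetical simplification; real gateways carry far richer rule objects.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of configuration/code decoupling: the route table is replaced
 * atomically when the config center pushes new rules, with no restart.
 */
public class DynamicRouteTable {
    private final AtomicReference<Map<String, String>> rules =
            new AtomicReference<>(Map.of());

    /** Called by a config-center listener whenever rules are pushed. */
    public void onConfigPush(Map<String, String> newRules) {
        rules.set(Map.copyOf(newRules)); // immutable snapshot, lock-free swap
    }

    /** Longest-prefix match of the request path against the current rules. */
    public String route(String path) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : rules.get().entrySet()) {
            if (path.startsWith(e.getKey()) && e.getKey().length() > bestLen) {
                best = e.getValue();
                bestLen = e.getKey().length();
            }
        }
        return best; // null when no rule matches
    }
}
```

The design point is that in-flight requests always read a complete, consistent snapshot: a push never leaves the table half-updated.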

Besides these two principles on the development side, there are corresponding principles on the operations side. Beyond routine monitoring and alerting, operations also needs adaptive elasticity. But gateway elasticity touches many points. For example, the gateway must be operated together with the load balancer in front of it: a node should be taken offline or disabled at the load balancer before an application deployment, and put back online only after the deployment has started successfully. Elasticity also needs to cooperate with the automation of the application control system to achieve lossless traffic as nodes come and go.
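The offline-deploy-online sequence above usually hinges on a readiness signal that the load balancer polls. The following sketch shows only that flag logic (the HTTP endpoint wiring and the drain timeout are omitted, and the class name is our own); the point is that "offline" means failing the health check first, not killing the process.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Sketch of lossless node offline: the load balancer polls a readiness
 * endpoint; flipping the flag stops new traffic while in-flight requests
 * finish, after which the process can be stopped and redeployed safely.
 */
public class ReadinessGate {
    private final AtomicBoolean ready = new AtomicBoolean(true);

    /** What the LB health check sees: 200 when in the pool, 503 when draining. */
    public int healthStatus() {
        return ready.get() ? 200 : 503;
    }

    /** Deploy step 1: leave the LB pool but keep serving in-flight requests. */
    public void beginDrain() {
        ready.set(false);
    }

    /** Deploy step N: after startup checks pass, rejoin the LB pool. */
    public void markReady() {
        ready.set(true);
    }
}
```

An application control system automates exactly this handshake: call beginDrain, wait out the LB health-check interval plus the longest expected request, deploy, verify startup, then markReady.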

3. Traffic routing

The next step after the gateway is routing to the next application node. In Internet scenarios where multiple data centers are deployed for high availability, nearest-route selection is used to guarantee service quality. That is, if the gateway entrance is in the primary data center, the next hop is expected to route to a node in the same data center, and likewise each next hop inside the microservice cluster is expected to stay in the same data center. Otherwise, a call across data centers makes RT much longer, and request timeouts turn into traffic loss, as shown in the following figure:

To achieve same-data-center calls, we need a good understanding of the chosen service framework. The core principle is to prefer an IP address in the same data center when selecting the next hop. However, different frameworks and different deployment environments require some specific custom development to achieve strict same-data-center-first calling.
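The core principle reduces to one filtering step in the client-side load balancer, sketched below. The zone labels and record shape are hypothetical; in a real framework the zone would come from service-registry metadata, and the fallback to the full list is what keeps the system available when the local data center has no healthy instance.

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Sketch of same-data-center-first routing: restrict candidates to the
 * caller's zone, and only cross zones when no local instance exists.
 */
public class ZoneAwareSelector {
    public record Instance(String ip, String zone) {}

    /** Prefer instances in callerZone; fall back to the full candidate list. */
    public static List<Instance> select(List<Instance> all, String callerZone) {
        List<Instance> local = all.stream()
                .filter(i -> i.zone().equals(callerZone))
                .collect(Collectors.toList());
        return local.isEmpty() ? all : local;
    }
}
```

The usual load-balancing policy (round robin, weighted random, and so on) then runs over the returned list, so zone preference composes with whatever the framework already does.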

Conclusion

This first article of the series has covered traffic resolution and traffic access; the remaining technical problems affecting the stability of online application traffic, along with their solutions, will be classified in the same plain language in the following two articles. Some of the solutions are only code-level details, some require supporting tools, and others need costly architectural changes. If you want a one-stop "lossless traffic" experience for your application on the cloud, take a look at Alibaba Cloud's Enterprise Distributed Application Service (EDAS), which will keep evolving toward lossless traffic access by default. In the next article, we will look at the problem from a different perspective: online application release and service governance. Stay tuned.


This article is original content from Alibaba Cloud and may not be reproduced without permission.