
This post was published by Mariolu in the Cloud+ Community.

Prologue (Original intention)

The original intention of this system is to describe the storage model of a business (or machine cluster) and to analyze the relationship between proxy cache server disk storage and the back-to-source rate. Its practical significance is to provide quantitative guidance for equipment-room capacity expansion during Tencent Cloud cost optimization. The first half introduces the background and some theoretical thinking about the CDN cache model. The second half walks through building a miniature but complete distributed system architecture, and finally adds the background-related functions needed to solve the practical problems encountered in cost optimization.

Cache Server Storage Model Architecture (Background)

Tencent CDN's online routing runs from users distributed across various regions and carriers through OC -> SOC -> SMid to the origin server. Cache servers are deployed at every tier; part of the user request traffic hits the cache, and the rest becomes back-to-source traffic.

As service bandwidth and user bandwidth grow naturally, even with the service's back-to-source ratio unchanged, the disk cache is evicted and refreshed faster. This shows up as the following service bottlenecks: high IOWait, high back-to-source bandwidth, and a high back-to-source rate caused by cache eviction under limited disk space.

To illustrate the principle, consider two extremes. At one extreme, the device's disk capacity is infinite and caching of incoming traffic is limited only by the origin's cache rules: as long as the cache has not expired, objects can stay on disk indefinitely, and only first-time accesses generate back-to-source traffic, so the back-to-source volume (rate) depends only on the characteristics of the business (its repetition rate). At the other extreme, the disk is vanishingly small (zero): regardless of the cache expiry the service sets, client accesses go back to the origin 1:1. Now assume the average business cache period is 1 hour. The volume of content cached for the first time within 1 hour (multiple accesses to the same cache key count as one) is the disk space required; a disk of at least this size is large enough to accommodate the service volume. If this size cannot be reached, or could once be reached but the natural growth of the business makes the 1-hour first-cache volume larger, disk space becomes insufficient.
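To make the sizing intuition concrete, here is a small back-of-the-envelope sketch; the bandwidth figures and the helper function are illustrative assumptions, not measurements from the system.

```go
package main

import "fmt"

// requiredCacheGB estimates the disk space needed so that everything cached
// for the first time during one cache period still fits on disk:
//   space ≈ first-access bandwidth × cache period.
// firstAccessGbps is the portion of inbound bandwidth cached for the first
// time (repeated hits on the same cache key count as one).
func requiredCacheGB(firstAccessGbps, cachePeriodHours float64) float64 {
	seconds := cachePeriodHours * 3600
	return firstAccessGbps / 8 * seconds // Gbit/s -> GB/s, then integrate over the period
}

func main() {
	// Illustrative numbers only: 2 Gbps of first-access traffic,
	// average business cache period of 1 hour -> roughly 900 GB.
	fmt.Printf("~%.0f GB of cache needed\n", requiredCacheGB(2, 1))
	// If business growth pushes first-access traffic to 3 Gbps while the disk
	// stays the same, unexpired objects start being evicted and the
	// back-to-source rate climbs.
	fmt.Printf("~%.0f GB needed after growth\n", requiredCacheGB(3, 1))
}
```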

Capacity expansion is one solution. However, before a load-testing system existed, there was no objective data showing how much equipment had to be added; expansion counts were decided without grey-release verification, and machines were deployed online by guesswork once they arrived. To guide reasonable storage provisioning in the equipment room, we replay online logs on experimental machines and simulate the storage curve. That is the point of building a log replay system.

A Small but Complete Log Replay Model (Overview)

In this chapter, we define the following modules:

Log server: downloads the access logs of an online equipment room over a given period. Each log file stores 10 minutes of access records, and a room with several machines yields that many copies of the log. The log server also provides a query service for task-slice information. Suppose we need to replay the task slice with task ID pig_120t; the following figure shows the details of its task slices.

Task controller: a switch that starts or ends a task. Tasks are evenly allocated to specific workers and proxy servers. When a task is added to the task pool, the controller collects the server's real-time total traffic, back-to-source traffic, total request count, and back-to-source request count, and inserts them into the back-to-source-rate result table.

Worker (load generator): polls the task table in the task pool. If there is a task, it asks the log server to download the log slices according to the task details (time range and online machine IP address), then replays the requests against the specified proxy server (a sketch of this polling loop follows the figure descriptions below).

Proxy server: provides a real-time back-to-source data query service, and runs components such as the NWS cache server, making it equivalent in software to an online equipment-room machine.

Real-time dashboard: lets you view the real-time back-to-source rate and any abnormal status of tasks at any time.

Figure 3 shows the interaction between client and server. Figure 4 shows how the task controller coordinates with the other modules during a task.
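To make the module interplay concrete, here is a minimal, hypothetical sketch of what a task slice and the worker's polling loop could look like; the field names, helpers, and placeholder values are assumptions for illustration, not the system's actual schema or API.

```go
package main

import (
	"fmt"
	"time"
)

// TaskSlice is a hypothetical shape for one replay task slice, e.g. task ID
// "pig_120t": which online machine's log to replay, for which 10-minute
// window, and which experimental proxy server to replay it against.
type TaskSlice struct {
	TaskID    string        // e.g. "pig_120t"
	SourceIP  string        // online machine whose access log is replayed
	StartTime time.Time     // beginning of the log window
	Duration  time.Duration // normally 10 minutes per log file
	ProxyAddr string        // experimental proxy (NWS cache) to hit
}

// fetchTask and replaySlice stand in for the real task-pool query and the
// log download + replay steps; they are placeholders, not real APIs.
func fetchTask() (*TaskSlice, bool) { return nil, false }
func replaySlice(t *TaskSlice) error {
	fmt.Println("downloading log for", t.SourceIP, "and replaying to", t.ProxyAddr)
	return nil
}

func main() {
	// The worker simply polls the task pool; if there is no task it sleeps.
	for {
		task, ok := fetchTask()
		if !ok {
			time.Sleep(10 * time.Second)
			continue
		}
		if err := replaySlice(task); err != nil {
			fmt.Println("replay failed:", err)
		}
	}
}
```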

Distributed System Characteristics

The core of the log replay model is a high-performance load-testing system, with extra logic added around it: log download, log parsing and reconstruction, result collection, data reporting and presentation. The key concerns of a distributed system are whether it is scalable, recoverable, easy to set up, fault tolerant, and automated. The following sections address each in turn.

Start with high performance. In the general model we simulate online logs, and the system needs to be efficient because we replay logs faster than online QPS: the replay speed determines how quickly analysis results arrive, and higher speed means fewer worker resources. After comparing Python's URL request libraries with Golang, the author chose Golang to implement the worker client; it comes close to native C with epoll in speed, but the code is much simpler. We had previously analyzed the performance bottleneck of a single proxy in theory. Online logs are more complex than simulated logs, so a modest drop in QPS is inevitable, and the Golang client met its goal.
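The throughput in Go comes largely from connection reuse and bounded concurrency rather than anything exotic. Below is a small sketch of the kind of HTTP client tuning this implies; the numeric limits and the target URL are illustrative assumptions, not the client's actual settings.

```go
package main

import (
	"io"
	"net/http"
	"time"
)

// newReplayClient builds an HTTP client suited to replaying a large number
// of requests against a single proxy: keep-alive connections are reused
// instead of paying a TCP (and TIME_WAIT) cost for every request.
func newReplayClient() *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        1000,
			MaxIdleConnsPerHost: 200, // most traffic goes to one proxy host
			IdleConnTimeout:     90 * time.Second,
		},
	}
}

// drain reads and discards the body so the connection can be reused.
func drain(resp *http.Response) {
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}

func main() {
	client := newReplayClient()
	resp, err := client.Get("http://127.0.0.1:8080/example") // placeholder URL
	if err == nil {
		drain(resp)
	}
}
```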

Scalable: at any time we may add workers to the simulation cluster, or add more idle proxy server resources to a load-test task, so the system lets new machines be added to the available-machine data table at any time.

Recoverable: a distributed system differs from single-machine mode in that failure is inevitable. When part of the system fails, we would rather stop using that node than carry on with an unfinished result: it is either 0 or 1, with no intermediate state. In addition, network transmission delay in a distributed system is uncontrollable. The load-testing system therefore has a set of fault-tolerance mechanisms: heartbeat detection of failures with automatic removal of dead workers from the data table, fault-tolerant interfaces, timeouts that expire unfinished tasks, and a crontab job that periodically pulls up any exited process.
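As a rough illustration of the heartbeat and timeout ideas above, here is a hedged sketch; the registry structure, the TTL, and the placeholder worker address are assumptions for illustration rather than the system's actual tables.

```go
package main

import (
	"sync"
	"time"
)

// registry tracks the last heartbeat of each worker; anything silent for
// longer than ttl is considered dead and dropped from the available set.
type registry struct {
	mu   sync.Mutex
	seen map[string]time.Time // worker addr -> last heartbeat
	ttl  time.Duration
}

func (r *registry) Heartbeat(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.seen[addr] = time.Now()
}

// Sweep removes workers whose heartbeat has expired and returns them, so
// their unfinished task slices can be marked timed out and re-queued.
func (r *registry) Sweep() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var dead []string
	for addr, last := range r.seen {
		if time.Since(last) > r.ttl {
			dead = append(dead, addr)
			delete(r.seen, addr)
		}
	}
	return dead
}

func main() {
	r := &registry{seen: make(map[string]time.Time), ttl: 30 * time.Second}
	r.Heartbeat("10.0.0.1:9000") // placeholder worker address
	// A periodic (crontab-style) loop would call Sweep and re-queue any
	// tasks held by the returned dead workers.
	_ = r.Sweep()
}
```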

Easy setup: using the ajs interface and batch installation scripts, deployment of workers and servers is automated. This covers configuring DNS resolution for the relevant IPs (log server, task pool, back-to-source-rate result database), enabling TCP time_wait reuse, and not forgetting system limits (the ulimit fd limit is set to 100000; making it permanent requires editing /etc/security/limits.conf). If the worker has runtime dependencies, they are downloaded at the same time. The worker machine downloads the worker client and its configuration, the server machine downloads the server and its configuration, and both download the scheduled pull-up script and add it to crontab. All of this is done automatically by batch scripts.

Some Thoughts on Design Paradigms

Single-producer and Multi-consumer

In the design of the worker client, the producer reads the log file one record per line and pushes it into a message pipeline, and multiple consumer workers fetch URLs from the pipeline and execute the mock requests; the pipeline carries the log URLs to be executed. One case is CPU-consuming (producer-bound): the consumer's access finishes almost instantly, but the producer must do complex string processing on each log line (such as regular expressions), so consumers block on the pipeline the next time they fail to fetch data. The other case is IO-consuming (consumer-bound): the producer simply copies pre-processed log URLs into the pipeline, while the consumers face unpredictable network latency when accessing the URLs; even with more consumers than CPU cores (say 2x, since network wait time is included), reads from the pipeline are far slower than writes. In an experiment with one log file, we found it took 0.3 seconds to process 180,000 log lines but 3 minutes to complete the URL accesses, so ours is clearly the IO-consuming, consumer-bound case. For the producer-bound case, Golang also offers a fan-out message model, which could be designed like this: one writer writes to a single channel, a fan-out loop copies items from that channel into a list of channels, and multiple readers each read their own channel from the list.
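Here is a minimal sketch of this single-producer, multi-consumer pipeline in Go: one goroutine parses log lines into a channel, and roughly 2x-CPU-count consumer goroutines replay the URLs. The log file name and the whitespace-based parsing are simplifications; the real client reconstructs the request from the log with regular expressions.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"runtime"
	"strings"
	"sync"
)

func main() {
	f, err := os.Open("access.log") // one 10-minute log file (placeholder name)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	urls := make(chan string, 1024) // the "message pipeline"

	// Single producer: parse each log line into a URL to replay. Splitting
	// on whitespace stands in for the real regex-based reconstruction.
	go func() {
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			fields := strings.Fields(sc.Text())
			if len(fields) > 0 {
				urls <- fields[len(fields)-1]
			}
		}
		close(urls)
	}()

	// Multiple consumers: network-bound, so use ~2x the CPU core count.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU()*2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				resp, err := http.Get(u)
				if err != nil {
					continue
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	fmt.Println("replay finished")
}
```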

Map-reduce

We sometimes analyze the logs of a carrier's equipment room in a particular region. A room contains the IP addresses of several machines, and scheduling multiple worker clients to access those logs concurrently yields the combined back-to-source rate faster.

The parallel mechanism is the classic map-reduce. Log files are sharded by the IPs of the machines in the equipment room, and N workers access them in parallel; the same approach is also used to cross-check against online logs. Each mapper here also acts as a reducer, extracting and aggregating the data of interest into the generated data table.

This is a simplified map-reduce (not built on a distributed file system): data between map and reduce is transferred through a data table. The log data generated by each mapper is aggregated locally and then reported to the table. Because the table's capacity is limited, each worker can only upload its head (most-accessed) URLs. Done this way the data is incomplete, or not exactly correct, because the merged head data of two workers may be missing a URL that one worker did not upload (its count did not reach that worker's top-visits threshold).

So how do we solve this? The root cause is that the file system holding the summary data is local, not distributed (Hadoop's HDFS was presumably invented for exactly this need). For the status-code dimension this is fine, because the number of distinct HTTP status codes is small. For the URL dimension, however, a single 10-minute task for one equipment room on one worker can produce around 180,000 URL entries, so each worker only reports URLs that appear more than 100 times. The maximum error is therefore 100 times the number of workers; for a room replayed by 10 workers, any merged count greater than 1000 is trustworthy. For the domain dimension, a few head customers account for most of the bandwidth, the familiar hot-key phenomenon in which a few hot keys carry most of the traffic; so in the domain dimension you can scale up from the per-URL counts of the specified domain. If the locally aggregated data reported to the table is too large, URLs can be compressed into short addresses. Of course, to solve the problem properly rather than work around it, a distributed file system such as HDFS may be required.
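The threshold-and-merge logic for the URL dimension can be sketched as follows; the 100-visit threshold comes from the discussion above, while the function and variable names and the sample reports are purely illustrative.

```go
package main

import "fmt"

const perWorkerThreshold = 100 // a worker uploads only URLs it saw more than 100 times

// mergeTopURLs combines the per-worker reports pulled from the data table.
// Because each worker drops URLs below the threshold, a merged count can be
// under-reported by at most perWorkerThreshold per worker.
func mergeTopURLs(workerReports []map[string]int) (merged map[string]int, maxError int) {
	merged = make(map[string]int)
	for _, report := range workerReports {
		for url, count := range report {
			merged[url] += count
		}
	}
	return merged, perWorkerThreshold * len(workerReports)
}

func main() {
	// Illustrative reports from two workers.
	reports := []map[string]int{
		{"/video/a.ts": 5000, "/img/logo.png": 300},
		{"/video/a.ts": 4200, "/img/logo.png": 150},
	}
	merged, maxErr := mergeTopURLs(reports)
	for url, c := range merged {
		if c > maxErr { // only counts above the error bound are trustworthy
			fmt.Printf("%s: %d (under-count at most %d)\n", url, c, maxErr)
		}
	}
}
```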

Stream-Processing

To implement the worker client, the log required by a task (typically one machine's 10-minute access log) must be downloaded from the log server. The worker first queries the task server for a replay task, then downloads the log from the log server. If the simulation cluster sits in the DC network, downloading a 10-minute log (a file of roughly 150 MB) takes about 1 second. But if the distributed system is built in the OC network, the worker must download from the DC (for reliability, the log server is set up in the DC network) through NAT from the intranet to the public network, which takes about 10 seconds. Waiting for the log server download to finish is itself a time overhead.

In distributed systems, what distinguishes stream processing from batch processing is that the data has no boundary: you do not know when the log has finished downloading. Batch processing is like an assembly line in which the stages follow one another strictly: the first stage finishes before the second starts, and its output is fully known.

Stream processing, by contrast, starts the second stage as soon as the first partial output arrives, while the first stage keeps producing; the second stage must respond to processing events, which requires frequent scheduling.

Message broker: partial output of the previous stage is fed into the message system, and once the message system detects that the log is complete, the input for the next stage can be produced. Here we hit a problem: logs download much faster (about 10 seconds) than they can be replayed (about 3 minutes). A message system can respond by discarding when there is no buffer, buffering in a queue, or applying flow control to match the speeds of the two stages. We chose to buffer in a queue. In a rigorous distributed design, the message broker is a node capable of detecting data loss: it delivers the complete data to downstream stages and buffers data to disk as a backup in case a process core-dumps, and for a slow downstream stage it can be configured to discard or to apply flow control. A message broker differs from a database in that it stores unprocessed data only temporarily and removes messages once they are processed.
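Below is a hedged sketch of the queue-buffered hand-off: the download stage streams the log line by line into a buffered channel (our stand-in for the message queue), and the replay stage consumes at its own pace; when the buffer fills, the downloader is slowed by back-pressure rather than dropping data. The log server URL is a placeholder, and unlike a real broker this sketch does not persist the buffer to disk.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
)

func main() {
	// Stage 1: stream the log from the log server instead of waiting for the
	// whole ~150 MB file; lines are emitted as soon as they arrive.
	resp, err := http.Get("http://logserver.example/task/pig_120t/slice0") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The buffered channel is the queue between the ~10 s download and the
	// ~3 min replay; when it fills, the sender blocks (back-pressure)
	// instead of discarding data.
	lines := make(chan string, 100000)

	go func() {
		sc := bufio.NewScanner(resp.Body)
		for sc.Scan() {
			lines <- sc.Text()
		}
		close(lines) // end of stream: the log is complete
	}()

	// Stage 2: the replay side consumes at its own pace.
	replayed := 0
	for range lines {
		replayed++ // a real consumer would issue the mock request here
	}
	fmt.Println("replayed", replayed, "log lines")
}
```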

Conclusion

Of course, real production distributed systems are far more complex than this, but the small-but-complete distributed system built here from 0 to 1 has real practical value, and it was not built overnight but version by version. The system also fulfilled the author's KPI of storage-model analysis, and the design thinking behind it, along with the problems encountered and improvements made along the way, is summarized and shared here.
