Background

About Convenience Bee

Convenience Bee is a technology-driven retail company built around a new type of convenience store, with technology at the core of its operations. The company takes “quality life, convenient China” as its mission and, under the slogan “little happiness by your side”, aims to offer healthy, safe products and efficient, convenient, satisfying service. It currently operates more than 2,000 stores in China.

The role of the network in stores

Most of the equipment in a Convenience Bee store is smart, and dozens of devices depend on the network, covering everything from order payment to in-store operations. Algorithms support even the routine tasks in a store: ordering, display, inventory, waste disposal, hot-meal preparation, real-time price changes, self-service coffee, and order settlement all involve complex network interaction. The stability of the store network is therefore essential to normal store operation.

Store Network Architecture

The figure above shows the Convenience Bee store network topology. In this scenario, the balance between stability and cost needs special attention. On stability, the topology shows that the Internet egress uses a broadband main line with 4G as the fallback. An algorithm tracks how far the main line has degraded and decides whether to enable the 4G network, which keeps the store network reliable. On cost, the convenience store business depends on economies of scale, and at that scale the cost of each store's setup matters a great deal. Cost has three parts: equipment, Internet access, and operations and maintenance labor. Equipment cost is discussed in later sections. For Internet access, we usually use the cheapest broadband available as the main line.

Analysis of the difficulties

Given these requirements, to achieve the goal of [one person managing a thousand stores at high quality], we face several problems:

  1. How to define high quality;
  2. How to solve the problem of multi-brand equipment;
  3. With so many devices and configurations, network interruptions can occur every day for various reasons; how to manage a thousand stores.

The most critical component is the gateway. It makes the important line failover decisions, carries out intelligent probing and information collection, and sends that information to the center, where the network status of stores nationwide is analyzed in real time. The gateway is thus the eyes and hands of the whole management system, and the headquarters system is the brain.

Scheme selection

Hardware and system selection

Hardware selection should meet the following conditions:

  1. There must not be a single supplier, which is too risky;
  2. The complexity of multiple vendors must not hurt the goal of managing thousands of stores;
  3. Hardware stability must not be inferior to that of major vendors' equipment.

Our current selection strategy is as follows:

  1. Brand or ODM (original design manufacturer);

    1. Brand-name devices tend to have incompatible console interfaces and lack programmability, which makes them less flexible;
    2. ODM shipment volumes are often much lower than those of big brands, so choosing one requires some understanding of the hardware and careful selection;
    3. We are currently leaning towards ODM;
  2. Qualcomm or MTK;

    1. In terms of SDK maturity, Qualcomm leads; choosing MTK means accepting an older kernel version;
    2. In terms of cost, MTK dominates;
    3. We are currently leaning towards MTK;
  3. Official OpenWrt or the vendor SDK for the system;

    1. If you insist on a clean, up-to-date system and have some kernel debugging ability, official OpenWrt is recommended. We have explored this route and it is entirely workable;
    2. All things considered, we currently favor vendor SDKs.

Given these strategies, Convenience Bee will inevitably run multiple systems side by side.

Why Rust was chosen as the development language

We have three kinds of embedded hardware, two ARM and one MIPS. The lowest configuration is an MT7621 with an 880MHz MIPS CPU, 512M of memory (400M available), and 370M of flash, which is firmly an embedded environment.

Considering the embedded environment and language maturity, the candidate languages were Golang, C, Lua, Shell, and Rust. The advantages and disadvantages of each are as follows:

  1. Golang:

    1. Advantages: portable to many platforms, strong asynchronous programming support, and rapid development;
    2. Disadvantages: requires a runtime, has high memory and CPU usage, and a memory leak was found in the MIPS build during testing;
    3. Conclusion: excluded;
  2. C:

    1. Advantages: simple code, lightweight, high execution efficiency, good portability;
    2. Disadvantages: low development efficiency, and memory safety problems must be dealt with;
    3. Conclusion: candidate;
  3. Lua:

    1. Advantages: the natural choice on OpenWrt because of LuCI, lightweight, and integrates well with C;
    2. Disadvantages: porting the SDK to other operating systems is a lot of work;
    3. Conclusion: excluded;
  4. Shell:

    1. Advantages: lightweight, fast development, easy to get started, built into the system;
    2. Disadvantages: loose type definitions and checking, unsuitable for large projects, and a poor fit for the demands of high-quality delivery;
    3. Conclusion: excluded;
  5. Rust:

    1. Advantages: fast at run time, memory safe, no runtime or GC (zero-cost abstractions), cross-platform;
    2. Disadvantages: steep learning curve, hard to get started, a relatively new language, and many basic libraries still need improvement;
    3. Conclusion: candidate.

In the end the choice came down to C and Rust. After some experiments we decided on Rust; its high-quality delivery was the advantage we cared about most.

Rust in practice

Definition of network quality

The most direct way to measure a store's network quality is HTTP or ICMP probing, with the results evaluated by packet loss rate, delay, and related indicators. From the existing store monitoring data we derived reasonable grading ranges for Ping results, as follows:

  • Grade A: delay <= 200ms or packet loss <= 10%
  • Grade B: delay <= 500ms or packet loss <= 20%
  • Grade C: delay <= 600ms or packet loss <= 40%
  • Grade D: delay > 600ms or packet loss > 40%

Grade A quality has essentially no impact on stores; grade B causes lag and temporary service failures in the store network; grade C has a larger impact, with the store network temporarily unavailable; grade D means the network is unavailable and store business is seriously affected.
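The grading table can be sketched as a small function. The list joins the two indicators with "or", which leaves the combined rule open; this sketch assumes the overall grade is the worse of the grade implied by delay and the grade implied by packet loss. The names `Grade` and `classify` are illustrative and do not come from the gateway code.

```rust
/// Network quality grade for one probe window, ordered from best to worst.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Grade {
    A,
    B,
    C,
    D,
}

/// Grade implied by the average round-trip delay alone, in milliseconds.
fn grade_by_delay(delay_ms: f64) -> Grade {
    if delay_ms <= 200.0 {
        Grade::A
    } else if delay_ms <= 500.0 {
        Grade::B
    } else if delay_ms <= 600.0 {
        Grade::C
    } else {
        Grade::D
    }
}

/// Grade implied by packet loss alone, as a fraction between 0.0 and 1.0.
fn grade_by_loss(loss: f64) -> Grade {
    if loss <= 0.10 {
        Grade::A
    } else if loss <= 0.20 {
        Grade::B
    } else if loss <= 0.40 {
        Grade::C
    } else {
        Grade::D
    }
}

/// Assumption: the overall grade is the worse of the two indicators.
/// `Ord` is derived in declaration order, so `max` picks the worse grade.
pub fn classify(delay_ms: f64, loss: f64) -> Grade {
    grade_by_delay(delay_ms).max(grade_by_loss(loss))
}
```

Under this assumption a line with 150ms delay but 15% loss is graded B, and any line with more than 40% loss is graded D regardless of delay.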

The line failover problem

Because the broadband is cheap and access is decentralized, the network environment of some stores is even less reliable than a home network, so the gateway is paired with a 4G router as a standby. It switches to the standby line when the main line is interrupted and switches back when the main line recovers, so that the store network is not affected. At the same time, real-time monitoring data from the stores is collected at the center, and the global status is monitored in real time through stream processing. When a single store has a problem, a work order is generated and first-line operations staff step in. When the overall proportion of grade A stores falls, the second-line network team receives a phone alarm and intervenes immediately.
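The two escalation paths can be illustrated with a small batch evaluation over the latest grade of each store. This is only a sketch: the real system uses stream processing, and the rule that grades C and D trigger a work order, the `min_a_ratio` threshold, and all type names are assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grade {
    A,
    B,
    C,
    D,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    /// A single store has a problem: open a work order for first-line operations.
    WorkOrder { store_id: u32 },
    /// The share of grade A stores fell too low: phone the second-line network team.
    PhoneAlarm { a_ratio: f64 },
}

/// Evaluates the latest grade of every store.
/// Assumptions: grades C and D count as "a problem in a single store",
/// and `min_a_ratio` is a hypothetical alarm threshold.
pub fn evaluate(stores: &[(u32, Grade)], min_a_ratio: f64) -> Vec<Action> {
    let mut actions = Vec::new();
    if stores.is_empty() {
        return actions;
    }
    let a_count = stores.iter().filter(|(_, g)| *g == Grade::A).count();
    let a_ratio = a_count as f64 / stores.len() as f64;
    if a_ratio < min_a_ratio {
        actions.push(Action::PhoneAlarm { a_ratio });
    }
    for (store_id, grade) in stores {
        if matches!(grade, Grade::C | Grade::D) {
            actions.push(Action::WorkOrder { store_id: *store_id });
        }
    }
    actions
}
```

Keeping the fleet-wide alarm separate from per-store work orders mirrors the split between second-line and first-line staff described above.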

Ideally the main line should always be better than the standby line, so when both lines have the same grade the main line takes priority. Otherwise, whichever line has the better grade is used.
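The selection rule fits in a few lines. A minimal sketch, with hypothetical `Grade` and `Line` types:

```rust
/// Network quality grade, ordered from best (A) to worst (D).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Grade {
    A,
    B,
    C,
    D,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Line {
    Main,
    Standby,
}

/// Picks the line to use from the current grades of both lines.
/// A smaller enum value is a better grade, and ties go to the main line.
pub fn preferred_line(main: Grade, standby: Grade) -> Line {
    if main <= standby {
        Line::Main
    } else {
        Line::Standby
    }
}
```

The standby line wins only when it is strictly better, which keeps traffic off 4G whenever the two lines are equally good.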

But there is more to it than that. The main line usually does not fail outright; instead it degrades into a weak-network state. Most devices can only observe that the network has deteriorated and cannot switch. We switch to the better line as soon as the ABCD rules match and the result passes debouncing. This leads to a new problem: a faulty main line jitters frequently between A and D, which causes the main/standby switchover to flap. Since each route switch briefly disrupts the services' HTTP access, we reduce flapping as much as possible with exponential backoff. That exposes yet another problem: during the backoff period frequent switching is suppressed, so if the line currently in use becomes unusable, the whole store network is interrupted. An abnormal intervention mode is needed at that point to avoid network paralysis.
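The interplay between backoff and abnormal intervention can be sketched as a small state machine. This is an illustration under assumptions, not the gateway's implementation: the hold time doubles with every switch up to a cap, time is passed in as plain seconds, debouncing is left out, and `FlapGuard` with its parameters is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Line {
    Main,
    Standby,
}

/// Suppresses main/standby flapping with exponential backoff.
pub struct FlapGuard {
    active: Line,
    base_hold_secs: u64,
    max_hold_secs: u64,
    /// Switches made so far; this is the exponent of the backoff.
    recent_switches: u32,
    /// Earliest time at which a normal switch is allowed again.
    hold_until: u64,
}

impl FlapGuard {
    pub fn new(base_hold_secs: u64, max_hold_secs: u64) -> Self {
        FlapGuard {
            active: Line::Main,
            base_hold_secs,
            max_hold_secs,
            recent_switches: 0,
            hold_until: 0,
        }
    }

    pub fn active(&self) -> Line {
        self.active
    }

    /// `wanted` is the line the grading rules currently prefer.
    /// `active_usable` says whether the line in use still carries traffic.
    pub fn decide(&mut self, now: u64, wanted: Line, active_usable: bool) -> Line {
        if wanted == self.active {
            return self.active;
        }
        // Inside the hold period, switching is suppressed unless the active
        // line is dead; that exception is the abnormal intervention.
        if now < self.hold_until && active_usable {
            return self.active;
        }
        self.active = wanted;
        // Next hold time: base * 2^switches, capped at the maximum.
        let factor = 1u64.checked_shl(self.recent_switches).unwrap_or(u64::MAX);
        let hold = self.base_hold_secs.saturating_mul(factor).min(self.max_hold_secs);
        self.hold_until = now.saturating_add(hold);
        self.recent_switches = self.recent_switches.saturating_add(1);
        self.active
    }

    /// Resets the backoff once the line has been stable for long enough.
    pub fn reset(&mut self) {
        self.recent_switches = 0;
    }
}
```

With `FlapGuard::new(10, 300)`, the first switch is held for 10 seconds, the second for 20, the third for 40, and a dead active line overrides the hold at any time.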

Rapid research to clear the obstacles

With the requirements and the technology selection clear, what remains is the POC.

First, the concurrency library. The Rust community offers several options:

  • crossbeam (multithreading)
  • async-std (asynchronous)
  • tokio (asynchronous)

Crossbeam is an excellent library and would have worked, but we had to take the hardware resource constraints of our devices into account, especially CPU, so we preferred an asynchronous runtime.

Both Tokio and async-std are good choices. When development began, the async-std project had only just started and many features were incomplete. Tokio had also been proven in many projects, so we considered it the more stable of the two. We therefore chose Tokio as the asynchronous runtime for the whole program.

Second, network probing. Network quality probing covers not only the main and standby lines but also specific services, and the number of IP addresses typically reaches 20 to 30. The frequency, packet size, and interface of these probes also differ. Our network equipment comes in many models, and the output format of the Ping command is not uniform across them. The Ping command is also a blocking operation, so running it from asynchronous code means spawning workers. Supporting multiple devices this way would require adapting to the output of the Ping command on every device, and embedding that in the program is tedious and expensive. So we began to investigate other Ping implementations.

We have the following requirements for using Ping:

  • The frequency and size of the packets differ per address

    • Large packets, low frequency: large-packet probing is used for some resource-heavy addresses, sent at low frequency so that store bandwidth is not affected
    • Small packets, high frequency: used for some sensitive addresses to ensure evenly spaced probes every minute
    • Small packets, low frequency: for insensitive addresses that only need link probing, which saves bandwidth
  • Ping packets must be bound to an interface

    • To support main/standby line probing, bind to an interface (fallback: static routes).
  • Traceroute (ICMP) support is preferred

    • Complete Traceroute information for key monitored lines should be uploaded to the data center for later fault analysis
  • Works with the Tokio runtime and supports MIPS, ARM, AARCH64 and other architectures

    • Our network equipment is diverse, so the implementation must be cross-platform and light on resources
  • Control over the starting sequence number

  • Programmable

    • Customizable and ergonomic
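These requirements amount to a per-address probe policy. The sketch below shows one way such a policy could be described as data. Every name and number in it (`ProbeTarget`, the preset intervals and payload sizes, the interface name passed by the caller) is an assumption for illustration, not the Agent's real configuration.

```rust
use std::net::IpAddr;
use std::time::Duration;

/// One probed address together with its own sending policy.
#[derive(Debug, Clone)]
pub struct ProbeTarget {
    pub addr: IpAddr,
    /// Interface the ICMP socket is bound to, for example the main or standby uplink.
    pub interface: String,
    pub interval: Duration,
    /// ICMP payload size in bytes.
    pub payload_len: usize,
    /// First ICMP sequence number to use.
    pub first_seq: u16,
}

impl ProbeTarget {
    fn preset(addr: IpAddr, interface: &str, secs: u64, payload_len: usize) -> Self {
        ProbeTarget {
            addr,
            interface: interface.to_string(),
            interval: Duration::from_secs(secs),
            payload_len,
            first_seq: 0,
        }
    }

    /// Large packets at low frequency, for resource-heavy addresses.
    pub fn large_low_freq(addr: IpAddr, interface: &str) -> Self {
        Self::preset(addr, interface, 60, 1400)
    }

    /// Small packets at high frequency, for sensitive addresses.
    pub fn small_high_freq(addr: IpAddr, interface: &str) -> Self {
        Self::preset(addr, interface, 1, 56)
    }

    /// Small packets at low frequency, for plain link checks.
    pub fn small_low_freq(addr: IpAddr, interface: &str) -> Self {
        Self::preset(addr, interface, 30, 56)
    }

    /// Average probe traffic in bytes per second, counting the payload
    /// plus the 8-byte ICMP header and ignoring the IP header.
    pub fn bytes_per_sec(&self) -> f64 {
        (self.payload_len + 8) as f64 / self.interval.as_secs_f64()
    }

    /// Sequence number of the n-th probe, wrapping around at u16::MAX.
    pub fn seq_at(&self, n: u16) -> u16 {
        self.first_seq.wrapping_add(n)
    }
}
```

Expressing the policy as plain data makes it easy to check that 20 to 30 targets together stay within a store's bandwidth budget before any packet is sent.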

The community has several implementations:

  • fastping: similar to Go's fastping implementation, it can Ping multiple addresses at the same time. However, we need to Ping multiple addresses at different intervals, with different packet sizes, and bound to different network interfaces, so although it also does batch Ping, it is not a good fit;
  • tokio-ping: tokio-ping was initially the best fit for our project, but the Rust asynchronous ecosystem was in transition between old and new. tokio-ping targeted the earlier generation of async, was not fully compatible with our project, and needed a compat layer to use; its author had also essentially stopped maintaining it.

With that in mind, we implemented our own asynchronous Ping library, surge-ping, to suit our needs.
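Whatever library sends it, an ICMP ping comes down to an echo request with a caller-chosen identifier and sequence number and a valid checksum. The following std-only sketch builds such an ICMPv4 packet. It is not surge-ping's API or code; the function names are illustrative, and opening the raw socket and binding it to an interface are left out.

```rust
/// Internet checksum (RFC 1071): one's complement of the one's complement
/// sum of all 16-bit big-endian words, with an odd trailing byte zero-padded.
pub fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        sum += u32::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    if let [last] = chunks.remainder() {
        sum += u32::from(*last) << 8;
    }
    // Fold the carries back into the low 16 bits.
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Builds an ICMPv4 echo request: type 8, code 0, checksum, identifier,
/// sequence number, then the payload.
pub fn echo_request(ident: u16, seq: u16, payload: &[u8]) -> Vec<u8> {
    let mut pkt = Vec::with_capacity(8 + payload.len());
    pkt.push(8); // type: echo request
    pkt.push(0); // code
    pkt.extend_from_slice(&[0, 0]); // checksum placeholder
    pkt.extend_from_slice(&ident.to_be_bytes());
    pkt.extend_from_slice(&seq.to_be_bytes());
    pkt.extend_from_slice(payload);
    let csum = internet_checksum(&pkt);
    pkt[2..4].copy_from_slice(&csum.to_be_bytes());
    pkt
}
```

Because the caller passes `seq` directly, the starting sequence number is fully under the program's control, which is one of the requirements listed above.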

Code examples:

Program structure

The structure of the gateway Agent is shown above. An Agent is deployed in each store to monitor the quality of the store network. The collected data is uploaded to the monitoring system for real-time monitoring, so that faults are detected as early as possible and store losses are reduced.

Multi-device compatibility

Because the network equipment in our stores is diverse, cross-platform support is required. We take ARM64 and MIPS, our two main platforms, as examples to show Rust's cross-platform capabilities.

MIPS cross-compile

This relies on the vendor's build SDK. Because we use an old ROM (mipsel-uclibc), the target is Tier 3 in Rust's platform support tiers and is not among the targets with official prebuilt support, so we use Xargo to compile.

Determine the device's target triple ({arch}-{vendor}-{sys}-{abi}). The toolchain directory name reveals the libc in use. Ours, for example, is toolchain-mipsel_24kec_gcc-4.8-linaro_uclibc: arch is mipsel, ABI is uclibc, vendor is generally unknown, and the system is linux, so our triple is mipsel-unknown-linux-uclibc.
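The derivation can be written down as a small helper. This is a sketch that only handles directory names shaped like the example (architecture as the first underscore-separated field, libc as the last); the function name and the libc-to-ABI mapping for musl and glibc are assumptions added for illustration.

```rust
/// Derives a Rust target triple from an OpenWrt-style toolchain directory name,
/// for example "toolchain-mipsel_24kec_gcc-4.8-linaro_uclibc".
/// Returns None when the name does not follow that pattern.
pub fn target_triple(toolchain_dir: &str) -> Option<String> {
    let name = toolchain_dir.to_ascii_lowercase();
    let rest = name.strip_prefix("toolchain-")?;
    // The first field is the architecture, the last one names the libc.
    let arch = rest.split('_').next().filter(|s| !s.is_empty())?;
    let libc = rest.rsplit('_').next()?;
    let abi = if libc.starts_with("uclibc") {
        "uclibc"
    } else if libc.starts_with("musl") {
        "musl"
    } else if libc.starts_with("glibc") {
        "gnu"
    } else {
        return None;
    };
    // The vendor is generally "unknown" and the system is "linux".
    Some(format!("{}-unknown-linux-{}", arch, abi))
}
```

For the directory name above this yields mipsel-unknown-linux-uclibc, the triple used in the rest of this section.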

The operation is as follows:

TIPS: Early builds ran into many problems with libc library support, so we had to consult the uClibc source code to help improve Rust's libc library.

ARM64 cross-compile

You need the Cross project: github.com/rust-embedd…

Cross has few pitfalls and works on most Unix-like operating systems; by default it compiles inside Docker. Usage:

For TLS, non-MIPS architectures can use the rustls-tls library directly instead of OpenSSL, as follows:

If MIPS + OpenSSL is used, OpenSSL compilation must be specified in the Dockerfile. The Dockerfile is as follows:

Then compile.

Conclusion

Rust is well suited to this kind of embedded store scenario. Its complete, easy-to-use toolchain, high-quality community, and safe memory management significantly reduce runtime failures and improve delivery quality. Rust has supported 100% of the in-store network for 2 years with 99.9999% stability, with requirements iterated about every half month.

Open Source Community Contributions

We also contribute to the community in small ways:

  • Github.com/rust-lang/l…: helped improve the libc library's support for mips-uclibc;
  • Github.com/rust-lang/s…: added support for binding to an interface, and added support for mipsel-uclibc;
  • Github.com/kolapapa/su…: an asynchronous Ping implementation that can also be used for traceroute (ICMP);

About the authors

Liu is an operations development engineer at Convenience Bee, mainly responsible for developing and maintaining the monitoring and alerting system and the diversified store network projects.

Mr. Pei is the head of operations at Convenience Bee.

Finally, Convenience Bee is looking for excellent colleagues. We take every resume seriously and look forward to meeting you.

Recruitment website

bianlifeng.gllue.me/portal/home…