Article source: Baidu App Technology WeChat official account

One, foreword

Network optimization is widely regarded as one of the deepest areas of client-side engineering, so Baidu App is publishing a series of in-depth network optimization articles: part one covers DNS optimization, part two connection optimization, and part three weak-network optimization. We hope the series helps you learn and practice networking.

Baidu started with search, and the company's network architecture and deployment are built on standard Internet protocols. Today it is HTTPS across the full stack. In the mobile Internet era the overall infrastructure remains unchanged, but a great deal of optimization work has to be done on the client side.

The Domain Name System (DNS) resolves domain names to IP addresses. It is a prerequisite for HTTP: an HTTP request can proceed only after the domain name has been correctly resolved to an IP address.

Two, background

DNS optimization has to solve two core problems:

[1] Service unavailability caused by DNS hijacking or DNS failure, which hurts both user experience and the company's revenue.

[2] Performance degradation caused by inaccurate DNS scheduling, which hurts user experience.

Baidu App carries traffic from hundreds of millions of users. Every year it runs into carrier DNS hijacking or carrier DNS failures, and the overall impact is severe, so DNS optimization is urgent.



Figure: Principle of carrier hijacking or failure

Three, HTTPDNS

Given how serious this problem is, how can we optimize DNS? The answer is HTTPDNS.

Standard DNS clients mostly talk to DNS servers over UDP. HTTPDNS talks to the DNS service over HTTP instead, bypassing the carrier's Local DNS. This effectively prevents domain name hijacking and improves resolution efficiency.
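As a rough illustration of the idea, the sketch below issues a resolution request over HTTP and parses the answer. It is not Baidu's protocol: the endpoint address, the `domains` query parameter, and the JSON response shape are all assumptions made up for this example, since the article does not publish the real interface.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint (documentation-range IP); the real address is not published.
HTTPDNS_ENDPOINT = "https://203.0.113.10/resolve"


def build_query_url(domains, endpoint=HTTPDNS_ENDPOINT):
    """Build the HTTP request URL for one or more domain names."""
    query = urllib.parse.urlencode({"domains": ",".join(domains)})
    return f"{endpoint}?{query}"


def parse_response(body):
    """Parse an assumed JSON body: {"a.example": {"ips": [...], "ttl": 60}}."""
    results = {}
    for domain, record in json.loads(body).items():
        ips = record.get("ips", [])
        ttl = int(record.get("ttl", 0))
        if ips and ttl > 0:  # skip empty or malformed records
            results[domain] = (ips, ttl)
    return results


def query_httpdns(domains, timeout=3.0):
    """Send the query over HTTP(S) rather than over UDP port 53."""
    with urllib.request.urlopen(build_query_url(domains), timeout=timeout) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Because the request is an ordinary HTTP call to a fixed address, the carrier's resolver never sees the lookup.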



Figure: HTTPDNS principle

The client-side HTTPDNS implementation in Baidu App is built on the HTTPDNS service provided by Baidu's SYS team. The following figure shows the server-side deployment structure of HTTPDNS.



Figure: HTTPDNS deployment structure

The HTTPDNS service is accessed through BGP. BGP (Border Gateway Protocol) is a routing protocol that dynamically exchanges routing information between autonomous systems. With BGP, users are routed to a Baidu service site according to their carrier. The service site then queries the authoritative DNS of the requested domain name through the CDN node that Baidu has deployed in that carrier's network, and obtains the optimal IP address for the domain name under that carrier.

Baidu App implemented the client-side HTTPDNS SDK independently. The following figure shows the overall architecture of HTTPDNS on the client.



Figure: The overall architecture of client-side HTTPDNS

DNS interface layer:

The DNS interface layer hides the low-level details and exposes a simple, clean API to callers, lowering the cost of getting started and improving development efficiency.

DNS policy layer:

The DNS policy layer combines several policies so that the HTTPDNS service stays at a high level of performance, stability, and availability. The motivation and concrete implementation of each policy are explained below.

1. Disaster recovery (DR) policy

This is a key policy. It addresses the availability of the HTTPDNS service and has proven to help Baidu App recover a lot of traffic in abnormal situations.

[1] When the HTTPDNS service is unavailable and there is no local cache, or the cache is invalid, the downgrade policy is triggered and resolution falls back to the carrier's localDNS. This carries the risk of carrier failure or hijacking, but it keeps DNS resolution available.

[2] When both the HTTPDNS and localDNS services are unavailable, the backup policy is triggered and the backup IP addresses stored on the client are used.
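The downgrade chain described in [1] and [2] can be sketched as a single function. This is a minimal illustration, not the SDK's code: the function names, the plain-dict cache, and the use of `OSError` to signal an unavailable resolver are assumptions chosen for the example.

```python
import socket


def system_lookup(domain):
    """Resolve through the system resolver (the carrier's localDNS path)."""
    infos = socket.getaddrinfo(domain, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def resolve(domain, cache, httpdns, local_dns, backup_ips):
    """Return (ips, source), trying cache -> HTTPDNS -> localDNS -> backup IP."""
    ips = cache.get(domain)
    if ips:
        return ips, "cache"
    for source, lookup in (("httpdns", httpdns), ("localdns", local_dns)):
        try:
            ips = lookup(domain)
        except OSError:  # covers socket.gaierror and network failures
            ips = None
        if ips:
            return ips, source
    ips = backup_ips.get(domain)  # only core domain names have an entry
    if ips:
        return ips, "backup"
    raise LookupError(f"no address available for {domain}")
```

Passing `system_lookup` as `local_dns` gives the localDNS fallback; the `source` tag in the result makes it easy to measure how often each level is used.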

What is a backup IP? Backup IPs are several groups of IP lists classified by domain name. The lists can be updated dynamically from the cloud, which makes it easy for operations engineers to adjust server node IPs later. Not every domain name has a backup IP list; Baidu App only guarantees availability for its core domain names.

How is a backup IP selected? The central idea is to spend as little as possible on the client while still taking server-side load balancing into account, and to arrive at a reasonably correct result. Carrier and geographic information would let us pick a relatively optimal IP, but obtaining geographic information is time-consuming, and because IP selection happens very frequently, the cost would be high. We therefore chose the RR (Round-Robin) scheduling algorithm instead, which keeps the client's cost to a minimum while still balancing load on the server side.
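Round-robin selection over per-domain backup lists can be sketched as follows. The class name and its methods are hypothetical; the article only states that backup IPs are grouped by domain name, are updatable from the cloud, and are picked with RR.

```python
import itertools


class BackupIpPool:
    """Per-domain backup IP lists with round-robin selection (illustrative)."""

    def __init__(self, ips_by_domain):
        self._cursors = {}
        self.update(ips_by_domain)

    def update(self, ips_by_domain):
        """Replace all lists, for example after a cloud-delivered update."""
        self._cursors = {
            domain: itertools.cycle(list(ips))
            for domain, ips in ips_by_domain.items()
            if ips
        }

    def next_ip(self, domain):
        """Return the next backup IP for the domain, or None if it has no list."""
        cursor = self._cursors.get(domain)
        return next(cursor) if cursor is not None else None
```

Each call costs one dictionary lookup and one iterator step, which fits the goal of minimal client-side cost.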

2. Security policy

[1] The core problem HTTPDNS addresses is security. Standard DNS queries mostly use UDP and can also use TCP, for example when UDP is blocked. Neither plain UDP nor plain TCP provides any security guarantee. HTTPDNS queries use the standard HTTP protocol, and to secure them we add a TLS (Transport Layer Security) layer under HTTP, that is, HTTPS.

[2] With transport security solved, we still need to resolve the domain name of the HTTPDNS service itself. As mentioned above, the HTTPDNS service is accessed through BGP, and the client sends its HTTPDNS requests to a VIP. (A VIP is a Virtual IP: it is not bound to one particular device and can move as conditions change, for example between primary and backup, and a VIP service is backed by one or several servers.) Because this initial request goes directly to an IP address, it does not depend on the carrier's localDNS, so even if the carrier fails or is hijacked, the availability of Baidu App is not affected.

3. Task scheduling policies

The HTTPDNS service provides two HTTP interfaces for requesting optimal resolution results. The first is the multi-domain interface, which returns results for the domain names configured for each product line. The second is the single-domain interface, which returns the result only for the domain name being queried. This design is essentially the same as a standard DNS query, except that the transport changes from UDP to HTTP.

[1] The multi-domain interface is requested once on App cold start and once on every network switch, so that results are fetched ahead of time whenever the App's network environment is initialized or changes. This also reduces the number of requests to the single-domain interface.

[2] The single-domain interface is requested when a user action triggers a network request and the local cache for that domain name has expired. The request that triggered the refresh falls back to the localDNS result; subsequent requests, as long as the cache has not expired again, get the HTTPDNS result.

4. IP address selection policy

The core problem of the IP selection policy is choosing the optimal IP address, so as to avoid the cross-carrier latency caused by picking the wrong access point. The HTTPDNS server returns IP addresses ordered from best to worst, and the client uses the first one by default. The client does not perform a connectivity check.

5. Cache policy

DNS caching is familiar to everyone. Its main purpose is to speed up access, and both operating systems and network libraries cache DNS results.

An important concept in DNS caching is TTL (time to live). In localDNS, the TTL differs from one domain name to another. In HTTPDNS, the TTL is delivered dynamically by the server. If no new IP address is available after the entry expires, the old IP address continues to be used. Alternatively, you can choose not to use the old IP address and fall back to the localDNS result instead, in which case the behavior depends on how localDNS handles expired addresses.
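A TTL cache that can keep serving the old address after expiry might look like the sketch below. The class, the `allow_stale` switch, and the injectable clock are assumptions introduced for illustration; only the TTL and "keep using the old IP" behaviors come from the text.

```python
import time


class DnsCache:
    """In-memory DNS cache with server-provided TTLs (illustrative)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # domain -> (ips, expires_at)

    def put(self, domain, ips, ttl):
        self._entries[domain] = (list(ips), self._clock() + ttl)

    def get(self, domain, allow_stale=True):
        """Return (ips, is_fresh); (None, False) means the caller must fall back."""
        entry = self._entries.get(domain)
        if entry is None:
            return None, False
        ips, expires_at = entry
        if self._clock() < expires_at:
            return ips, True
        if allow_stale:
            return ips, False  # expired, but still better than nothing
        return None, False
```

With `allow_stale=False`, an expired entry is reported as a miss, which corresponds to downgrading to localDNS instead of reusing the old address.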

6. Hit ratio policy

If the HTTPDNS hit ratio reaches 100%, and the HTTPDNS service itself is stable and efficient, we can both prevent hijacking and schedule accurately.

[1] To raise the HTTPDNS hit ratio, we use the multi-domain interface to pull results for a batch of domain names on cold start and on network switch, and cache them locally for the requests that follow.

[2] To raise the hit ratio further, whenever a user action triggers a network request and the IP for a domain name is looked up, the client first checks a local expiration time of 60 seconds. If the entry has expired, a single-domain request is issued and its result is cached, which extends the lifetime of the result for that domain name. The local expiration time and the TTL mentioned above are the client-side and server-side expiration times respectively; keeping both ensures that expiration stays accurate in abnormal cases.
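The batch prefetch and the 60-second local expiry check can be sketched together. Several details here are assumptions rather than facts from the article: the callables `fetch_many` and `fetch_one` stand in for the multi-domain and single-domain interfaces, an entry is treated as expired when either the local limit or the server TTL is reached, and the refresh is done synchronously for brevity.

```python
import time

LOCAL_EXPIRY_SECONDS = 60  # local expiration time stated in the article


class HttpDnsScheduler:
    """Prefetch on start or network switch, refresh single domains on expiry."""

    def __init__(self, fetch_many, fetch_one, clock=time.monotonic):
        self._fetch_many = fetch_many  # hypothetical multi-domain interface
        self._fetch_one = fetch_one    # hypothetical single-domain interface
        self._clock = clock
        self._cache = {}               # domain -> (ips, fetched_at, ttl)

    def _store(self, domain, ips, ttl):
        self._cache[domain] = (ips, self._clock(), ttl)

    def prefetch(self):
        """Call on cold start and on every network switch."""
        for domain, (ips, ttl) in self._fetch_many().items():
            self._store(domain, ips, ttl)

    def _expired(self, fetched_at, ttl):
        # Assumption: whichever of the two limits is shorter wins.
        return self._clock() - fetched_at >= min(LOCAL_EXPIRY_SECONDS, ttl)

    def lookup(self, domain):
        entry = self._cache.get(domain)
        if entry is not None and not self._expired(entry[1], entry[2]):
            return entry[0]
        ips, ttl = self._fetch_one(domain)  # a real client would do this off-thread
        self._store(domain, ips, ttl)
        return ips
```

Every lookup served from `_cache` is an HTTPDNS hit, so the prefetch raises the hit ratio for the first requests after start-up and the expiry check keeps it high afterwards.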

Basic capability layer:

The basic capability layer provides what the DNS policy layer needs: IPv4/IPv6 stack detection, data transmission, and caching. The implementation of each capability is described below.

1. IPv4/IPv6 stack detection:

Baidu App's IPv6 migration is in full swing. When HTTPDNS selects an IP on the client, knowing which protocol stack the current network supports becomes a key question, and the check has to be cheap because IP selection happens very frequently.

The solution we chose is UDP connect. What is UDP connect? As we all know, TCP is connection-oriented: before transmitting data, the client calls connect, which establishes a connection through a three-way handshake. UDP is connectionless and can send and receive data without establishing a connection. When connect is called on a UDP socket, the system only checks whether the port is available and the address is valid, records the peer's IP address and port number, and returns to the caller. Unlike a TCP connect, a UDP connect therefore performs no three-way handshake and generates no real network traffic. A UDP client generates real network traffic only when it calls send or sendto.



Figure: UDP connect principle

On top of UDP connect, we added a cache in the upper layer to reduce the cost of system calls. At present, detection is triggered only on cold start and on network switch. Within the same network, one detection is basically enough to establish whether the current network is an IPv4 stack or an IPv6 stack.
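The UDP connect probe and the cache in front of it can be sketched as below. The probe targets are public resolver addresses chosen here only because they are globally routable; the article does not say which addresses Baidu App uses, and the `StackDetector` class is a hypothetical wrapper.

```python
import socket

# Assumed probe targets; any globally routable address would do.
PROBE_V4 = ("8.8.8.8", 53)
PROBE_V6 = ("2001:4860:4860::8888", 53)


def _can_route(family, target):
    """connect() on a UDP socket sends no packet; it fails if there is no route."""
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.connect(target)
            return True
    except OSError:
        return False


def detect_stack():
    return {
        "ipv4": _can_route(socket.AF_INET, PROBE_V4),
        "ipv6": _can_route(socket.AF_INET6, PROBE_V6),
    }


class StackDetector:
    """Caches the probe result between cold start and network switches."""

    def __init__(self, probe=detect_stack):
        self._probe = probe
        self._cached = None

    def refresh(self):
        """Call on cold start and on every network switch."""
        self._cached = self._probe()
        return self._cached

    def current(self):
        return self._cached if self._cached is not None else self.refresh()
```

Hot paths such as IP selection call `current()`, which is a plain attribute read once the first probe has run.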

Currently the Baidu App client follows a conservative IPv4/IPv6 dual-stack strategy: it uses IPv6 addresses only on IPv6-only networks and IPv4 addresses everywhere else. The dual-stack case still needs to be optimized. The current industry standard is the Happy Eyeballs algorithm. Why "Happy Eyeballs"? The name comes from the goal that a glitch in IPv4 or IPv6 should not leave the user's eyeballs waiting on a loading or failing page. Happy Eyeballs is specified in RFC 6555 (version 1) and RFC 8305 (version 2), the former from Cisco and the latter from Apple. Its core problem is choosing between IPv4 and IPv6 addresses in complex environments. It is a complete solution that specifies how to handle domain name queries, address sorting, connection attempts, and so on. Interested readers can refer to [5] and [6] in the reference materials.
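The conservative strategy (IPv6 only on IPv6-only networks, IPv4 otherwise) reduces to a small filter. The function name and parameters are hypothetical; this is not Happy Eyeballs, which races connection attempts rather than filtering addresses up front.

```python
import ipaddress


def select_ips(ips, has_ipv4, has_ipv6):
    """Keep IPv6 addresses only when the network is IPv6-only; otherwise IPv4."""
    v4 = [ip for ip in ips if ipaddress.ip_address(ip).version == 4]
    v6 = [ip for ip in ips if ipaddress.ip_address(ip).version == 6]
    if has_ipv6 and not has_ipv4:
        return v6
    return v4
```

The relative order of the addresses is preserved, so the server's best-first ordering still applies within the chosen family.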

2. Data transmission:

Data transmission mainly provides network requests and data parsing.

[1] Retry on failure. The success rate of HTTPDNS requests strongly affects the HTTPDNS hit ratio, so the client retries failed requests up to three times to keep the success rate high.

[2] If the HTTPDNS result is abnormal, it does not overwrite the cache on the client.
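Both rules can be sketched in one refresh function. The names are hypothetical, "abnormal" is interpreted here as an empty list or an entry that is not a valid IP address, and the retry count is read as three attempts in total; the article does not define any of these precisely.

```python
import ipaddress

MAX_ATTEMPTS = 3  # one reading of the article's "three retries"


def is_valid(ips):
    """Treat an empty list or any non-IP entry as an abnormal result."""
    if not ips:
        return False
    try:
        for ip in ips:
            ipaddress.ip_address(ip)
    except ValueError:
        return False
    return True


def refresh(domain, fetch, cache, attempts=MAX_ATTEMPTS):
    """Fetch with retries; update the cache only with a valid result."""
    for _ in range(attempts):
        try:
            ips = fetch(domain)
        except OSError:
            continue  # network failure: try again
        if is_valid(ips):
            cache[domain] = ips
            return True
    return False  # the previous cache entry, if any, is left untouched
```

Leaving the old entry in place on failure means a bad response can never make things worse than they were before the refresh.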

3. Cache implementation:

A cache can live on disk, in memory, or both. Which should HTTPDNS use? Baidu App uses a memory cache only. The reason is operational: when a problem occurs, our operations engineers may switch traffic in an emergency, and with a disk cache Baidu App could remain unavailable even after a restart. The trade-off is that during a cold start, before the HTTPDNS results come back, requests are exposed to carrier failure or hijacking. We consider this acceptable overall: in that extreme case only some requests during the cold start phase are affected, and everything returns to normal as soon as the HTTPDNS results arrive.

Four, best practices for HTTPDNS

For historical reasons, Baidu App's client network architecture has not yet been unified, although we are working toward that goal. The following sections focus on where HTTPDNS sits in the Android and iOS network architectures and how it is used there.

Location and practice of HTTPDNS in Android Network architecture

All of Baidu App's Android network traffic runs on OkHttp. Above it sits a network facade that hides the internal implementation details and exposes a friendly API for business and basic modules. Within OkHttp we extended the DNS module, replacing the original system DNS with HTTPDNS.



Figure: The location of HTTPDNS in the Android network architecture

Location and practice of HTTPDNS in iOS network architecture

Baidu App's iOS network traffic runs on Cronet (the Chromium net module). Above it, we use AOP to inject the Cronet stack into URLSession, so that callers can use the URLSession API directly for network operations, which is also easier to maintain. A network facade on top of that is used by business and basic modules. Inside Cronet we modified the DNS module, adding HTTPDNS logic alongside the original system DNS logic.

Some iOS traffic still goes through the native URLSession, mainly because certain third-party services do not use Cronet but still want to use HTTPDNS on its own, so there is also a separate HTTPDNS encapsulation layer. It works by replacing the domain name with an IP address at the upper layer. The domain name matters to many lower-level mechanisms, for example HTTPS certificate validation, cookies, redirection, and Server Name Indication (SNI). After switching from the domain name to a direct IP connection, we handle each of these cases so that requests keep working.
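To make the IP-direct problem concrete, the sketch below connects to an IP address while keeping the original domain name for SNI, certificate validation, and the Host header. It is a language-neutral illustration written in Python, not the iOS implementation, and the function names are hypothetical. Cookie scoping and redirect handling need the same care but are not shown.

```python
import socket
import ssl


def open_ip_direct(ip, hostname, port=443, timeout=5.0):
    """Connect to `ip`, but use `hostname` for SNI and certificate checks."""
    context = ssl.create_default_context()
    raw = socket.create_connection((ip, port), timeout=timeout)
    try:
        # server_hostname sets the SNI extension and is the name the
        # certificate is verified against, even though we dialed an IP.
        return context.wrap_socket(raw, server_hostname=hostname)
    except Exception:
        raw.close()
        raise


def build_request(hostname, path="/"):
    """HTTP/1.1 request whose Host header carries the original domain name."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")
```

Without `server_hostname`, the TLS handshake would validate the certificate against the IP address and typically fail, which is exactly the class of problem the encapsulation layer has to handle.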



Figure: Location of HTTPDNS in the iOS network architecture


Five, benefits

DNS optimization brings two main benefits. The first is preventing DNS hijacking, which becomes critical whenever an incident occurs. The second is reducing network latency, since inaccurate scheduling increases latency and degrades the user experience. Both benefits have to be measured against a concrete business. Taking Baidu App's Feed business as an example, the first benefit was substantial: the iOS hijacking rate dropped from 0.12% to 0.0002%, and the Android hijacking rate dropped from 0.25% to 0.05%. The second benefit was not obvious, because the Feed business mainly targets users in China, where Baidu has a dense node layout and the overall quality of service is high. Even when scheduling is inaccurate there, the difference is small, though it could be much larger abroad.

Six, conclusion

DNS optimization is an ongoing topic. The experience and practices of Baidu App described above are not perfect, but we will keep optimizing in depth to safeguard Baidu App's DNS capability. Finally, thank you for reading; we hope this article helps. Next in the Baidu App in-depth network optimization series is part two, connection optimization. Stay tuned.

Seven, personal experience

How an engineer can do network optimization well is a topic worth discussing. I think it comes down to the following five points.

[1] Learn and consolidate the fundamentals. Networking covers a great deal of material, much of it complicated and hard to learn. Anyone who has read the RFCs published by the IETF will know the feeling.

[2] Learn to make the invisible network visible. Many engineers who think they know networking well can recite TCP principles, congestion control algorithms, sliding window sizes, and so on, but have no idea where to start when a real online problem appears. For client work, we need to learn to use tcpdump and Wireshark, as well as Fiddler and Charles, on the PC. In many cases the network environment of the PC differs from that of the phone, so we also need tools on the phone itself, such as iNetTools, Ping&DNS, or a terminal app. Once you can use the tools, you need to be able to create different network environments, for example with Apple's Network Link Conditioner or Facebook's ATC (Augmented Traffic Control). With tools and environments in place, the fundamentals from the first point come into play: you need to be able to read the handshake process, the transfer process, the abnormal disconnect process, and so on.

[3] With the two preparations above, you need a platform where all kinds of network problems actually occur, so that you can accumulate experience and be tempered by online problems under high pressure.

[4] Network optimization needs data to back it up, but collecting and analyzing data takes experience. Some data is wide of the mark, and some shows negative gains no matter how it is analyzed. In general, data is analyzed in three stages: first, collect and analyze offline data and confirm a positive gain; second, collect and analyze gray-release data; third, collect and analyze online data and confirm a positive gain.

[5] Positive gains in the data do not fully prove that the user experience has improved, so it is often necessary to analyze and optimize for specific scenarios and cases. Even WeChat, which is widely recognized as doing this well, cannot guarantee the best experience in every scenario.

Eight, reference materials

[1] https://chromium.googlesource.com/chromium/src/+/HEAD/docs/android_build_instructions.md

[2] https://chromium.googlesource.com/chromium/src/+/HEAD/docs/ios/build_instructions.md

[3] https://github.com/Tencent/mars

[4] https://tools.ietf.org/html/rfc7858

[5] https://tools.ietf.org/html/rfc6555

[6] https://tools.ietf.org/html/rfc8305