A CDN caches all kinds of Internet content on edge servers that are deployed around the world, close to users. This reduces access latency and significantly cuts the traffic that crosses the Internet's core network, and CDNs have become an essential part of Internet services. Traditional website protection focuses on the origin server: customers buy firewalls, WAFs, and similar products to keep their core business content from being stolen. However, this traditional approach does not fully cover scenarios in which business traffic is distributed through a CDN:
The protection is deployed in front of the origin server and mainly protects the origin. In a CDN architecture, pages are largely cached on the CDN, so crawlers can scrape users' sensitive business data directly from the CDN without ever reaching the origin.
Identification mainly relies on embedding JavaScript in the user's page. This modifies the page itself, which is highly intrusive. It also works only for web traffic and has no effect on API traffic.
Handling usually relies on rate limiting of high-frequency IPs and similar features, which is easy to bypass. Crawlers now generally use proxy IP pools and randomize request header fields, so it is hard to find a stable feature to rate-limit on.
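The weakness of per-IP rate limiting can be illustrated with a small simulation. This is a minimal sketch, not the product's logic: the `PerIpRateLimiter` class, the limit of 20 requests per second, and the 500-address proxy pool are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter keyed by client IP (illustrative only)."""

    def __init__(self, max_requests, window_ms):
        self.max_requests = max_requests
        self.window_ms = window_ms
        self._hits = defaultdict(deque)  # ip -> timestamps (ms) of allowed requests

    def allow(self, ip, now_ms):
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now_ms - hits[0] >= self.window_ms:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now_ms)
        return True


def count_blocked(limiter, ips, interval_ms=10):
    """Send one request per entry in `ips`, spaced `interval_ms` apart."""
    return sum(not limiter.allow(ip, i * interval_ms) for i, ip in enumerate(ips))


# 1,000 requests in 10 seconds from a single IP: most are blocked.
single_ip = ["203.0.113.7"] * 1000
blocked_single = count_blocked(PerIpRateLimiter(20, 1000), single_ip)

# The same 1,000 requests rotated over a hypothetical pool of 500 proxy IPs:
# each IP is seen only twice, so the limiter never triggers.
pool = [f"10.0.{i // 250}.{i % 250}" for i in range(500)]
rotating = [pool[i % len(pool)] for i in range(1000)]
blocked_rotating = count_blocked(PerIpRateLimiter(20, 1000), rotating)

print(blocked_single, blocked_rotating)
```

The request volume is identical in both runs. Only the key that the limiter counts on changes, which is why a feature that the attacker controls (the source IP) makes a poor basis for handling.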
The CDN now carries a large share of the main site's business, so it must both guarantee the browsing and transaction experience and prevent content from being maliciously scraped. As more and more business data is cached on CDN edge servers, edge security carries more and more weight. Machine traffic management based on the edge cloud was developed to address these CDN edge security risks and to protect user application data.
The analysis and handling flow of machine traffic management on CDN edge nodes is shown in the figure below:
Internet access generally falls into normal user access, commercial search engine access, malicious crawler access, and so on. Machine traffic management extracts features from request messages at the edge, identifies the request type from those features, and blocks malicious crawler access at the edge, which protects resources cached on the CDN from being scraped.
Machine traffic management has the following advantages:
It is built on the CDN edge network architecture. It identifies the request type for a domain name from request message features and distinguishes normal requests from malicious machine requests, so users can manage their own traffic and block malicious requests.
It identifies the request type for a domain name and labels the message type of each request in real time, so the mix of message types in current business traffic is directly visible. Customers can see how access to their website is distributed by type and handle abnormal message types.
It acts on the message type instead of the IP. As long as the message type of a malicious request stays the same, the attacker cannot evade handling by randomizing header fields or by using a "second-dial" proxy IP pool (a pool of IPs that are re-dialed and replaced within seconds).
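To make the idea of acting on a message type instead of an IP concrete, here is a minimal sketch. The article does not say which message features are used, so the feature chosen here (the order of header names) is purely an assumption, and `message_type`, `crawler_request`, and `browser_request` are hypothetical names.

```python
import hashlib
import random
from collections import Counter


def message_type(headers):
    """Label a request by the order of its header names only (assumed feature).

    Header values and the source IP are deliberately ignored, so randomizing
    them does not change the label.
    """
    names = ",".join(name.strip().lower() for name, _ in headers)
    return hashlib.sha256(names.encode("utf-8")).hexdigest()[:12]


rng = random.Random(0)


def crawler_request():
    """Hypothetical crawler: random IP and random header values, fixed header order."""
    ip = ".".join(str(rng.randrange(1, 255)) for _ in range(4))
    headers = [
        ("Host", "example.com"),
        ("Accept", "*/*"),
        ("User-Agent", rng.choice(["Mozilla/5.0 (A)", "Mozilla/5.0 (B)", "Mozilla/5.0 (C)"])),
        ("X-Trace", str(rng.random())),
    ]
    return ip, headers


def browser_request():
    """Hypothetical browser: a different header order."""
    headers = [
        ("Host", "example.com"),
        ("User-Agent", "Mozilla/5.0 (D)"),
        ("Accept", "text/html"),
        ("Accept-Language", "en-US"),
    ]
    return "192.0.2.10", headers


requests = [crawler_request() for _ in range(100)] + [browser_request() for _ in range(10)]

# Distribution of message types across the traffic, regardless of source IP.
distribution = Counter(message_type(headers) for _, headers in requests)

# Blocking one type covers every crawler request, however many IPs it used.
crawler_type = message_type(crawler_request()[1])
blocked = sum(1 for _, headers in requests if message_type(headers) == crawler_type)

print(distribution, blocked)
```

In this toy traffic the 100 crawler requests come from many different IPs with varying header values, yet they collapse into a single type that can be handled with one rule. A real classifier would need features that an attacker cannot easily vary. Header order alone would not be enough.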
Verifying machine traffic management in practice
In the Double 11 business scenario, machine traffic management identified all traffic to the main site's detail pages and classified the bot traffic. The core strategy was to let legitimate commercial crawlers such as search engines through and to restrict or block malicious crawlers.
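A handling strategy of this kind can be thought of as a mapping from traffic category to action. The sketch below is only illustrative: the category names, the `Action` enum, and the default of allowing unknown categories are assumptions, not the product's actual configuration.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    BLOCK = "block"


# Hypothetical category labels; the article only names normal users,
# commercial search engines, and malicious crawlers.
POLICY = {
    "normal_user": Action.ALLOW,
    "search_engine": Action.ALLOW,            # legitimate commercial crawlers pass
    "malicious_crawler_low": Action.RATE_LIMIT,  # "restrict"
    "malicious_crawler_high": Action.BLOCK,      # "block"
}


def decide(category):
    """Return the action for a category; unknown categories are allowed (assumption)."""
    return POLICY.get(category, Action.ALLOW)


for category in ("search_engine", "malicious_crawler_high", "unlabeled"):
    print(category, decide(category).value)
```

Defaulting unknown categories to `ALLOW` is one possible design choice that favors not disturbing real users. A stricter deployment might default to rate limiting instead.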
Analysis of detail-page traffic and of the behavioral characteristics of the requests showed that nearly 40% of requests were malicious. Before November 11, enabling the handling strategy helped one main-site business block more than 70% of its crawler traffic. The following figure compares traffic before and after the strategy was enabled: the blue line shows the traffic trend before, and the green line shows the trend after. The blocking effect is clear, and actual business operation was not affected.
On November 11 itself, the access characteristics of the requests stayed essentially unchanged. In the end, hundreds of millions of malicious requests were blocked, involving millions of malicious IPs and tens of millions of product IDs targeted by malicious crawling.
As CDN machine traffic management took on more of the protection for main-site services, it was found that some requests crawling main-site content could get past the protection strategy; in other words, the crawling behavior had changed. Analysis of the online QPS surge showed that the mutated crawler mainly used the IE browser engine and that its source IPs were largely "second-dial" proxy IPs, both clear signs of a commercial crawler. After this was reported, an emergency plan was quickly drawn up and the abnormal types were promptly handled.
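The kind of QPS surge analysis described here can be sketched as follows. This is an assumed workflow, not the team's actual tooling: the log format (a list of `(second, user_agent)` pairs), the surge threshold of three times the median QPS, and the function names are all hypothetical. The only fixed fact used is that Internet Explorer user agents contain `MSIE ` or `Trident/`.

```python
from collections import Counter
from statistics import median


def browser_engine(user_agent):
    """Coarse engine label from a User-Agent string."""
    if "Trident/" in user_agent or "MSIE " in user_agent:
        return "ie"
    if "AppleWebKit/" in user_agent:
        return "webkit_or_blink"
    if "Gecko/" in user_agent:
        return "gecko"
    return "other"


def surge_seconds(records, factor=3.0):
    """Seconds whose request count exceeds `factor` times the median QPS."""
    qps = Counter(ts for ts, _ in records)
    baseline = median(qps.values())
    return {ts for ts, count in qps.items() if count > factor * baseline}


def engine_breakdown(records, seconds):
    """Count requests per browser engine within the given seconds."""
    return Counter(browser_engine(ua) for ts, ua in records if ts in seconds)


# Synthetic log: steady browser traffic, plus a burst of IE-engine requests
# during seconds 5 and 6.
CHROME = "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0 Safari/537.36"
IE11 = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

records = [(ts, CHROME) for ts in range(10) for _ in range(10)]
records += [(ts, IE11) for ts in (5, 6) for _ in range(90)]

surge = surge_seconds(records)
breakdown = engine_breakdown(records, surge)
print(sorted(surge), breakdown)
```

Breaking the surge down by engine makes the dominant client type stand out even when the source IPs keep changing.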
PS: To learn more about Aliyun Edge Cloud, search for and join DingTalk group 35469210.