My original-content permission on Juejin (Nuggets) has just been granted, so I am moving this article over. If you have already read it there, feel free to skip it.

This article takes a comprehensive look at how crawlers work, the current state of the technology, and its open problems. If you are new to crawlers, this is for you; if you are a seasoned crawler developer, the Easter egg at the end may still be of interest.

The need

There are countless pages on the World Wide Web, containing a huge amount of information of every kind. Most of the time, however, whether for data analysis or for product needs, we only want to extract the interesting and valuable content from certain sites. Even in the 21st century, a human still has only two hands and two eyes and cannot possibly open every page, read it, and copy-paste the relevant parts. So we need a program that automatically fetches web content and extracts the parts we care about according to specified rules: a crawler.

How it works

A traditional crawler starts from one or several seed URLs and obtains the URLs on those initial pages; while crawling, it keeps extracting new URLs from the current page and putting them into a queue, until certain stop conditions are met. The workflow of a focused crawler is more complicated: it filters out links irrelevant to the topic according to some page-analysis algorithm, keeps the useful links in the URL queue waiting to be fetched, then selects the next URL from the queue according to a search strategy, and repeats the process until a stop condition is reached. In addition, every page the crawler fetches is stored by the system for analysis, filtering, and indexing, so that it can be queried and retrieved later. A complete crawler therefore generally contains the following three modules:

  1. Network request module
  2. Crawl flow control module
  3. Content analysis extraction module

Network request

As we often say, a crawler is really just a bunch of HTTP(S) requests: find the links to crawl, send a request packet, and receive a response packet. (HTTP keep-alive and HTML5's stream-based WebSocket protocol also exist, but are not considered here.) So the core elements are:

  1. url
  2. Request header, body
  3. Response header, body (content)

URL

A crawler starts from an initial URL, parses the links out of the HTML it fetches, and continues crawling from them. This is like a multi-way tree: starting from the root node, each step generates new nodes. To make the crawler terminate, a maximum depth is usually specified.
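
As a concrete illustration, here is a minimal sketch of that loop in Python, assuming the requests library: a breadth-first crawl from a seed URL with a fixed depth limit. Link extraction uses a crude regex; a real crawler would use a proper HTML parser plus politeness controls (robots.txt, delays, persistent de-duplication).

import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed, max_depth=2):
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        print(depth, url)
        if depth >= max_depth:
            continue
        # Crude link extraction for illustration only.
        for link in re.findall(r'href="(http[^"]+)"', html):
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

# crawl("https://example.com/")   # example seed URL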

The HTTP request

An HTTP request consists of a method (the request line), headers, and a body. Since the request line is the first line of the header block, you can also say the request header contains the request method. Here is part of a request header sent by Chrome:

GET / HTTP/1.1
Connection: Keep-Alive
Host: gsw.iguoxue.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36
Accept-Encoding: gzip, deflate, sdch, br

For the full list, see the W3C HTTP Header Field Definitions. What matters for a crawler is that when the request method is POST, the request parameters must be URL-encoded before being sent. After receiving the request, the server may perform some validation that affects the crawler. The relevant header fields are as follows:

  • Basic Auth

This is an old and insecure way of authenticating a user. It usually requires the user name and password (essentially in plain text) in the Authorization header field; if authentication fails, the request is rejected.
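
For illustration, a minimal sketch of sending Basic Auth credentials with Python's requests library; the URL and credentials are placeholders.

import requests
from requests.auth import HTTPBasicAuth

# A 401 status means the credentials were rejected.
resp = requests.get("https://example.com/private", auth=HTTPBasicAuth("user", "password"))
print(resp.status_code)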

  • Referer

The source of the request. Some links must be accessed with a Referer field; the server validates the source, and back ends commonly use this field as the basis for hotlink protection.

  • User-Agent

The server usually uses this field to determine the device type, operating system, and browser version. The networking packages of some programming languages send a recognizable default User-Agent, so in a crawler it should be set to a real browser's UA.
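
A small sketch of overriding the default User-Agent (and adding a Referer) with the requests library; the URLs and header values are illustrative.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
    "Referer": "https://example.com/",
}
resp = requests.get("https://example.com/page", headers=headers)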

  • Cookie

Generally, after the user logs in or performs certain operations, the server includes Cookie information in the response and asks the browser to set it. A request arriving without the expected Cookie is easily identified as forged.

Some sites also generate encrypted values locally with JavaScript, based on information returned by the server, and store them in the Cookie.
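
A minimal sketch, assuming the requests library and a made-up login endpoint: a Session object stores the cookies set by the server and sends them along on later requests automatically.

import requests

session = requests.Session()
# Hypothetical login endpoint; the server's Set-Cookie is stored in the session.
session.post("https://example.com/login", data={"user": "alice", "pwd": "secret"})
# The stored cookie is sent automatically with this request.
resp = session.get("https://example.com/profile")
print(session.cookies.get_dict())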

  • JavaScript encryption

When transmitting sensitive data, JavaScript is often used for encryption. For example, Qzone encrypts the user's login password with RSA before sending it to the server, so a crawler simulating login needs to request the public key first and then encrypt the password with it.
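
As a rough sketch of the idea (not Qzone's actual protocol), using the third-party rsa package and a hypothetical /getkey endpoint that returns the key's modulus and exponent in hex:

import base64

import requests
import rsa

key_info = requests.get("https://example.com/getkey").json()   # hypothetical endpoint
pub_key = rsa.PublicKey(int(key_info["n"], 16), int(key_info["e"], 16))

# Encrypt the plaintext password locally, then send only the ciphertext.
ciphertext = rsa.encrypt("my-password".encode("utf-8"), pub_key)
payload = {"user": "alice", "pwd": base64.b64encode(ciphertext).decode("ascii")}
resp = requests.post("https://example.com/login", data=payload)  # hypothetical endpoint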

  • Custom field

Since HTTP allows custom headers, be aware that third parties may add custom field names or field values.

Process control

The crawl flow is simply the order in which pages are fetched according to the rules. For small crawl tasks, flow control is not much trouble; many frameworks such as Scrapy already handle it, and you only need to implement your own parsing code. But when crawling large sites, say all of JD's product comments or all of Weibo's posts and follow relationships, the hundreds of millions to tens of billions of requests force you to think about efficiency: there are only 86,400 seconds in a day, so at 100 requests per second you get only about 8.64 million requests a day, and it still takes more than 100 days to reach the billion-request level.

Large-scale crawling demands good crawler design, and most open-source frameworks are limited here, because many other problems come into play: data structures, filtering out already-crawled pages, and above all bandwidth utilization. Distributed crawling therefore becomes essential, and with it flow control. A distributed crawler is mostly about scheduling and coordinating threads across multiple machines, usually by sharing a URL queue and communicating through a message system; the more and faster you want to crawl, the higher the throughput that message system must sustain. There are open-source distributed frameworks such as scrapy-redis, which replaces Scrapy's scheduler, queue, and pipeline, uses a Redis database to share the request queue across nodes, and uses scrapyd-api to launch crawls and fetch the data.
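
A minimal sketch of the shared-queue idea, loosely in the spirit of scrapy-redis but not its actual API: workers on different machines pop URLs from a Redis list and use a Redis set for de-duplication. The key names and the extract_links helper are illustrative.

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def enqueue(url):
    # sadd returns 1 only for URLs not seen before, so each URL is queued once.
    if r.sadd("crawler:seen", url):
        r.lpush("crawler:queue", url)

def worker():
    while True:
        # Block until a URL is available, then fetch it and queue any new links.
        _, url = r.brpop("crawler:queue")
        html = requests.get(url.decode("utf-8"), timeout=10).text
        for link in extract_links(html):    # extract_links: your own parsing code
            enqueue(link)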

Content analysis extraction

The Accept-Encoding field in the request headers is how the browser tells the server which compression algorithms it supports (gzip being the most common). If the server has compression enabled, the response body comes back compressed, and the crawler has to decompress it itself.
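
For illustration: requests decompresses gzip responses transparently, but with plain urllib you may need to handle it yourself, roughly like this (the URL is a placeholder).

import gzip
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={"Accept-Encoding": "gzip", "User-Agent": "Mozilla/5.0"},
)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    # Decompress only if the server actually sent gzip-compressed data.
    if resp.headers.get("Content-Encoding") == "gzip":
        raw = gzip.decompress(raw)
html = raw.decode("utf-8", errors="replace")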

The content we want usually comes from the HTML document itself: what we decide to crawl is contained in the HTML. But with the rapid development of web technology in recent years, more and more pages are dynamic, especially on mobile, where SPA applications make heavy use of Ajax. The page we see in the browser is no longer just what the HTML document contains; much of it is generated dynamically by JavaScript. Broadly, the pages we end up seeing include the following three types of content:

  • Content contained in the HTML document itself

This is the easiest case. Static pages generally have their content written in directly, and dynamic pages rendered from server-side templates also reach the browser with all the key information already in the HTML, so anything visible on the page can be extracted from specific HTML tags. Analysis here is very simple, with a few common approaches (a small sketch of all three follows the list):

  1. CSS selectors
  2. XPath (this is worth learning)
  3. Regular expression or plain string lookup
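
Here is a small sketch of the three approaches applied to a toy HTML snippet, using BeautifulSoup for the CSS selector, lxml for XPath, and the re module for the regular expression.

import re

from bs4 import BeautifulSoup
from lxml import etree

html = '<html><body><h1 class="title">Hello crawler</h1></body></html>'

# 1. CSS selector
print(BeautifulSoup(html, "html.parser").select_one("h1.title").text)

# 2. XPath
print(etree.HTML(html).xpath('//h1[@class="title"]/text()')[0])

# 3. Regular expression (fragile; a last resort)
print(re.search(r'<h1 class="title">(.*?)</h1>', html).group(1))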

  • Content loaded by JavaScript code

Generally there are two cases. In the first, the page data is embedded in JavaScript code rather than in the HTML tags of the requested document. The page looks normal in the browser only because the JS runs there and dynamically inserts the content into the tags. When our program requests the URL, the response contains the page skeleton and the JS code, but since the JS is never executed, the HTML tags we look for are empty; Baidu's home page works this way, for example. In this case the trick is to locate the JS string that contains the content and extract it with regular expressions instead of parsing HTML tags. In the second case, JavaScript generates DOM only during user interaction, such as a dialog popping up when a button is clicked; such content is usually just user prompts with little value, and only rarely is it worth analyzing the JS execution logic.
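
A sketch of the first case, assuming a hypothetical page that embeds its data as var pageData = {...}; inside a script tag: the regex pulls out the object literal, and json.loads can parse it as long as the literal happens to be valid JSON.

import json
import re

import requests

html = requests.get("https://example.com/page").text     # hypothetical page
match = re.search(r"var\s+pageData\s*=\s*(\{.*?\})\s*;", html, re.S)
if match:
    data = json.loads(match.group(1))   # works only if the literal is valid JSON
    print(data.get("title"))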

  • Content fetched by Ajax/Fetch asynchronous requests

This is very common nowadays, especially when content appears without a page refresh or only after some interaction with the page. For this type of page, we need to track every request made during analysis and find out where the data is actually loaded from. Once we locate the core asynchronous request, we can simply crawl that request directly; and if the original page contains nothing useful, there is no need to fetch it at all.
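
A minimal sketch with a made-up endpoint and parameters: once the browser's developer tools show which request actually carries the data, the crawler can call it directly and parse the JSON.

import requests

api = "https://example.com/api/comments"          # hypothetical Ajax endpoint
params = {"product_id": "12345", "page": 1}
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/item/12345"}

data = requests.get(api, params=params, headers=headers, timeout=10).json()
for item in data.get("comments", []):
    print(item.get("content"))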

Current state of crawler technology

Languages

In theory, any language that supports network communication can be used to write a crawler. Although the choice of language matters little for the crawler itself, some languages are smoother and simpler than others. At present most crawlers are written in back-end scripting languages, with Python by far the most widely used thanks to excellent libraries and frameworks such as Scrapy, BeautifulSoup, PyQuery, and Mechanize. Search-engine crawlers, however, have higher efficiency requirements and tend to choose C++, Java, or Go (well suited to high concurrency); see, for instance, the lists of top 50 open-source web crawlers used for data mining. Back in college I implemented a multi-threaded crawling framework in C++, only to find that its efficiency was not noticeably better than Python's: for simple crawlers the bottleneck is data parsing and extraction, and network efficiency has little to do with the language.

It is worth noting that Node has grown rapidly in recent years and made JavaScript popular, and some people are trying to write crawlers in Node. But it is no different from any other back-end scripting language, and not as convenient as Python, because in Node you still cannot issue the original page's Ajax requests or execute its DOM: Node's JavaScript runtime is not the same as the browser's. So is it really impossible to write a crawler in JS and extract content with jQuery the way you would in a browser? That is a bold idea; let's shelve it for now.

Runtime environment

A crawler itself does not care whether it runs on Windows, Linux, or OS X; but from a business perspective, crawlers that run on a server (in the background) are called background crawlers, and today almost all crawlers are background crawlers.

Three problems with background crawlers

Popular as they are, background crawlers have some thorny problems with no good solution to date, and at root these problems come from inherent limitations of background crawlers themselves. Before the formal discussion, let's consider a question: what do crawlers and browsers have in common, and how do they differ?

Similarities

Essentially, both request Internet data over the HTTP/HTTPS protocol.

Differences

  1. Crawlers are generally automated programs that require no user interaction, whereas browsers are driven by a user
  2. Different running environments: browsers run on the client side, while crawlers generally run on the server side
  3. Different capabilities: browsers contain a rendering engine and a JavaScript virtual machine, while crawlers generally have neither

With those points in mind, let's look at the problems background crawlers face.

Problem 1: Interaction problem

Some pages require interaction with the user before you can proceed, such as entering a verification code, dragging a slider, or selecting certain Chinese characters. Sites mostly do this to verify whether the visitor is a human or a machine, and a crawler has a hard time with it. Traditional, simple image CAPTCHAs can be read with image-processing algorithms, but CAPTCHAs have grown ever more varied and ever more perverse (anyone who has bought a train ticket knows the urge to swear), and the problem keeps getting worse.

Problem 2: JavaScript parsing problems

As mentioned earlier, JavaScript can generate the DOM dynamically. Most pages today are dynamic, with their content filled in by JavaScript; on mobile in particular, SPA/PWA applications are increasingly popular. Most useful data is fetched via Ajax/Fetch and then inserted into the page's DOM tree by JS, while the pure static HTML contains very little of it. The main approach at present is to request the Ajax/Fetch URLs directly. However, some Ajax request parameters are generated dynamically by a piece of JavaScript, such as a request signature or the encryption of a login password. Reimplementing in a background script what JavaScript is supposed to do requires a thorough understanding of the original page's code logic, which is not only very tedious but also makes the crawling code huge and bloated. Worse still, some things JavaScript does are difficult or even impossible for a crawler to imitate, for example CAPTCHA mechanisms that require dragging a slider to a certain position.

Summing up these drawbacks, they ultimately come from the fact that a crawler is not a browser and has no JavaScript engine. The main remedy is to embed a JavaScript engine in the crawler, such as PhantomJS, but this has obvious downsides: it consumes a lot of resources when the server runs many crawl tasks at once, and these windowless JavaScript engines often do not behave exactly as they would in a real browser environment, making the flow hard to control when the page performs a jump.
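
For illustration, a sketch using Selenium with headless Chrome (PhantomJS, mentioned above, played the same role before it was deprecated); the URL and selector are placeholders.

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/spa-page")    # hypothetical JS-rendered page
    # The DOM is now the result of the page's JavaScript having run.
    print(driver.find_element(By.CSS_SELECTOR, "h1").text)
finally:
    driver.quit()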

Problem 3: IP restrictions

This is by far the deadliest problem for background crawlers. A site's firewall limits the number of requests from a given IP within a certain period: stay under the limit and data is returned normally; exceed it and the request is rejected (QQ Mail does this, for example). Note that IP limits are sometimes not aimed specifically at crawlers but exist for site security, as a defense against DoS attacks. With a limited number of machines and IP addresses, it is easy to hit the limit and have requests rejected. The main remedy at present is to use proxies, which gives you more IPs to work with, but proxy IPs are still limited, so this problem cannot be solved completely.
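
A small sketch of rotating requests through a proxy pool with the requests library; the proxy addresses are placeholders.

import random

import requests

proxy_pool = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

def fetch(url):
    # Pick a proxy at random so no single IP absorbs all the requests.
    proxy = random.choice(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

print(fetch("https://example.com/").status_code)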

Easter egg

This article has examined the current state of crawler technology and its problems in detail. But don't be discouraged: 2017 brought a big advance in crawler technology. What is it? Don't bother searching; you won't find it online. If you are interested, follow me, and I'll reveal it next time.

Finally

It's the same old story: I have just started this blog and am looking for follows, likes, and shares.

This article may be reproduced free of charge, but please credit the original author and link to the original text.