Let's highlight the key points up front: grab some melon seeds, pull up a comfortable stool to sit down and study, and get ready to compete for the prize presented at the end of the article!

The most common and largest crawlers around us belong to the major search engines. However, the way search engines crawl is quite different from what we crawler engineers deal with, so it offers little direct reference value. What we are going to talk about today are crawlers for the public opinion direction (their architecture and key technical principles), mainly involving:

  1. Intelligent extraction of web page body text;
  2. Distributed crawlers;
  3. Crawler data/URL deduplication;
  4. Crawler deployment;
  5. Distributed crawler scheduling;
  6. Automated rendering technology;
  7. Applications of message queues in the crawler field;
  8. Various forms of anti-crawler measures.

I. Intelligent text extraction of web pages

Public opinion work is, in essence, about tracking what the public is saying. To grasp public opinion, you must first gather enough content, and apart from a few large content/social platforms with open commercial interfaces (such as Weibo), everything else has to be collected by crawlers. As a result, crawler engineers working in the public opinion direction have to face thousands upon thousands of sites with different content and structures. The problem they face is, in a nutshell, the following:

Yes, their collectors must be able to adapt to the structure of thousands of sites, extracting the main content — title, body, date of publication, and author — from HTML text that varies in style.

What kind of design would you use to meet your business needs?

I have thought about this question myself, and I have seen similar questions raised in other technical groups, but satisfactory answers are hard to come by. Some people say:

  1. Use classification: group similar pages together, then configure extraction rules for each class of content;
  2. Use regular expressions (re) to extract the content of specified tags;
  3. Use deep learning and NLP semantic analysis to figure out where the meaningful content is and extract it;
  4. Use computer vision: have people click on the content, then generalize by page similarity (essentially an automated version of classification);
  5. Use an algorithm that computes text density and extracts the body accordingly.

All sorts of ideas have been floated, but none of them has been heard of working in practice. At present, most companies rely on manually configured XPath rules: during collection, the URL is matched to its extraction rule, which is then invoked to crawl multiple sites. This method is very effective and has been applied in enterprises for a long time; it is relatively stable, but its drawbacks are also obvious: it costs time, labor, and money!

One day I saw someone in a WeChat technology group (Qingnan, an excellent Python engineer) publish a library for automatic body text extraction: GeneralNewsExtractor[1] (hereinafter GNE). This library is based on the paper "Web Page Body Text Extraction Method Based on Text and Symbol Density" by Hong Honghui, Ding Shitao, Huang Ao, Guo Zhiyuan et al. of the Wuhan Institute of Posts and Telecommunications, and implements the paper's method in Python code, namely GNE. Its principle is to extract the text and punctuation marks from a web page's DOM; based on the density of punctuation within the text, the algorithm extends from a sentence to a paragraph and then to the full article.

GNE can effectively exclude advertisements, recommendation bars, introduction bars and other "noise" content outside the main body of a web page, accurately identifying the body text with a recognition rate as high as 99% (the test content was articles from mainstream domestic portals/media platforms).

For details of the GNE algorithm and its source code, see Chapter 5 of the Python3 Web Crawler Bible.

With it, more than 90% of the parsing needs of public opinion crawlers can basically be met; the remaining 10% can be handled with tweaked or fully customized extraction rules, freeing a large wave of XPath engineers.
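To give a sense of how little code is involved, here is a minimal usage sketch (based on the interface described in the GNE README; check the project for the current API, and note that the URL below is just a placeholder):

```python
import requests
from gne import GeneralNewsExtractor  # pip install gne

# Fetch an article page; any news article URL will do.
html = requests.get("https://example.com/some-news-article", timeout=10).text

extractor = GeneralNewsExtractor()
result = extractor.extract(html)

# The result is a dict with the extracted fields such as title, author,
# publish_time and content (field names follow the GNE documentation).
print(result["title"], result["publish_time"])
print(result["content"][:200])
```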

II. Crawler data/URL deduplication

The public opinion business has to keep an eye on whether a website has published new content, and the sooner it is found the better. However, due to various hardware and software constraints, the usual requirement is to detect new content within 30 minutes, or even within 15 minutes. To monitor changes in a target site's content, a workable choice is polling: keep visiting the page and check whether there is "new content"; if there is, crawl it, and if not, do not.
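A bare-bones polling loop might look like the sketch below (the interval and the helper function are illustrative placeholders; the real check is exactly the deduplication problem discussed next):

```python
import time

POLL_INTERVAL = 15 * 60  # seconds, i.e. check each monitored site every 15 minutes

def check_for_new_articles(list_page_url: str) -> None:
    # Placeholder: fetch the list page, compare its article URLs against the
    # already-crawled record (see the deduplication sketch below) and enqueue
    # anything that has not been seen before.
    print("checking", list_page_url)

def poll(list_page_urls):
    while True:
        for url in list_page_urls:
            check_for_new_articles(url)
        time.sleep(POLL_INTERVAL)
```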

So how does an application know what is “new” and what is “old”?

To break it down, "new content" is content that has not been crawled yet. So we need something that records whether an article has already been crawled, and every article about to be crawled must be compared against that record; that is how the problem is solved.

What do you rely on for comparison?

We all know that article URLs are almost always unique and do not repeat, so the URL of an article can serve as the basis for the decision: store the URLs that have been crawled in a set-like container, and each time a URL is about to be crawled, check whether it is already in the container. If it is, the article has already been crawled, so discard the URL and move on to judging the next one. The overall logic looks something like this:
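Here is a minimal sketch of that check, assuming Redis's set type as the container (key name and URLs are illustrative; the choice of container is discussed below):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
SEEN_KEY = "crawler:seen_urls"  # set holding URLs that have already been crawled

def is_new(url: str) -> bool:
    # SADD returns 1 if the member was newly added (not seen before) and 0 if it
    # already existed, so checking and marking happen in one atomic command.
    return r.sadd(SEEN_KEY, url) == 1

for url in ["https://example.com/a.html", "https://example.com/a.html"]:
    if is_new(url):
        print("crawl", url)
    else:
        print("skip duplicate", url)
```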

This is deduplication in the crawler field. Roughly speaking, it can be divided into content (data) deduplication and link (URL) deduplication. What we have described here is only the deduplication need in the public opinion direction. For deduplication in the e-commerce direction, the URL cannot be used as the basis for judgment, because the purpose of an e-commerce crawler (such as price-comparison software) is mainly to detect price changes. In that case the judgment should be based on the product's key information (such as price and discount), that is, data deduplication.

The principle of deduplication is clear. What should be used as the container for the deduplication records? MySQL? Redis? MongoDB? In-process memory? In practice, most engineers choose Redis as the container, but MySQL, MongoDB and memory can all serve as containers. As for why Redis is usually chosen, and how it compares with the other stores, check out Chapter 3 of the Python3 Web Crawler Bible.

III. Distributed crawlers

Crawlers in both the public opinion direction and the e-commerce direction have to bear a very large crawl volume: from a million records a day at the low end to a billion records a day at the high end. The single-machine crawlers we have known so far cannot meet this demand in either performance or resources. If one machine is not enough, then use 10, or 100! This is the context in which distributed crawlers emerged.

As we all know, the problems faced by a distributed system are different from those of a single machine. Apart from the business goal being the same, a distributed setup must also consider collaboration between its many nodes, especially resource sharing and competition.

When there is only one crawler application, only one crawler reads the to-be-crawled queue, only one stores data, and only one judges whether a URL is a duplicate. But when there are dozens or hundreds of crawler applications, their work has to be coordinated to avoid multiple crawlers visiting the same URL (which wastes both time and resources). Moreover, when there is only one crawler application, it only needs to run on one computer (server); when there are suddenly so many crawler applications, how should they be deployed onto different computers? Upload them manually one by one and then start them one by one?

The resource problem

First, let's talk about resource sharing and competition. To share the to-be-crawled URL queue and the crawled-URL record, these queues (that is, the containers for storing URLs mentioned above) must be placed somewhere that multiple crawler applications can access, such as a Redis instance deployed on a server.

At this point a new situation arises: as the data volume grows, more and more URLs need to be stored, which may require a lot of storage space and drive costs up. Because Redis stores data in memory, the more URLs it stores, the more memory it needs, and memory is relatively expensive hardware, so this problem has to be considered.

Fortunately, a man named Bloom invented an algorithm called the Bloom filter, which uses hash mapping to mark whether an object (in this case a URL) exists, and this can significantly reduce memory usage. Compared with storing the MD5 values of 100 million 32-character URLs, the Bloom filter's footprint differs by roughly a factor of 30. For an explanation of the Bloom filter algorithm and a code implementation, refer to Chapter 3 of the Python3 Web Crawler Bible.
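The idea can be illustrated with a toy, pure-Python Bloom filter (a minimal sketch for illustration only; the class name, bit-array size and hash scheme are my own arbitrary choices, and a real project would use a tuned library or Redis bitmaps):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_in_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8)  # the whole filter is just this bit array

    def _positions(self, item: str):
        # Derive several bit positions from the item by salting a hash function.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/a.html")
print("https://example.com/a.html" in bf)  # True
print("https://example.com/b.html" in bf)  # almost certainly False
```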

The deployment problem

Uploading files and starting crawlers by hand is exhausting. You can ask your operations colleagues for technical support, but you can also explore automated deployment options that reduce the workload. The well-known continuous integration and deployment options in the industry are GitLab's GitLab Runner and GitHub Actions, or containerized deployment with Kubernetes (K8S). But they can only help you with deployment and startup; the management features a crawler application needs cannot be counted on. So today I am going to introduce another approach: using Crawlab.

Crawlab is a distributed crawler management platform developed by an engineer at a well-known foreign company. It not only supports crawlers written in Python, but is compatible with most programming languages and applications. With Crawlab we can distribute crawler applications across different computers (servers), set up scheduled tasks in its visual interface, and check the status of crawler applications and their environment dependencies on the platform.

Faced with such a useful platform tool, we engineers couldn’t help but ask:

  1. How does it distribute files to different computers?
  2. How does it realize communication between different computers (multi-node communication)?
  3. How does it achieve multi-language compatibility?
  4. …

Among these, the multi-node communication we care most about is implemented with Redis, and the distributed file synchronization is implemented with MongoDB. See Chapter 6 of the Python3 Web Crawler Bible for more details.

In addition to platforms like this, Python crawler engineers often work with the Scrapy framework and related libraries. The Scrapy team has officially developed a library called Scrapyd, dedicated to deploying crawler applications built with the Scrapy framework. When deploying a Scrapy application, it usually takes only one command to deploy the crawler to the server. Do you want to know the logic behind it?

  1. In what form is the program delivered to the server?
  2. How does the program run on the server?
  3. Why can the start time and end time of each task be viewed?
  4. How is cancelling a task mid-execution implemented?
  5. How is versioning implemented?
  6. Can Python applications be monitored and managed in the same way without Scrapy?

In fact, the Scrapy application is packaged into a compressed package with an ".egg" suffix and delivered to the server over HTTP. When the server-side program needs to execute it, it copies it to a temporary folder on the operating system, imports it into the current Python environment, and deletes the file after execution. As for execution times and interruption, these actually rely on Python's process interfaces; consult Chapter 6 of the Python3 Web Crawler Bible for details.
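A rough sketch of that import-an-egg mechanism (for illustration only; Scrapyd's real implementation differs in detail, and the module name below is hypothetical):

```python
import shutil
import sys
import tempfile
from importlib import import_module

def run_from_egg(egg_path: str, module_name: str):
    # Copy the uploaded egg into a temporary directory, as described above.
    tmp_dir = tempfile.mkdtemp(prefix="egg-run-")
    tmp_egg = shutil.copy(egg_path, tmp_dir)

    # Python eggs are zip-importable: putting the .egg on sys.path makes the
    # packaged modules importable in the current Python environment.
    sys.path.insert(0, tmp_egg)
    try:
        return import_module(module_name)  # e.g. "myproject.spiders.news"
    finally:
        sys.path.remove(tmp_egg)
        shutil.rmtree(tmp_dir, ignore_errors=True)  # clean up after execution
```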

IV. Automated rendering technology

To achieve eye-catching effects, or to save bandwidth on static resources, many websites use JavaScript to optimize page content. Python programs cannot interpret JavaScript or render HTML themselves, so they cannot obtain the content we "see" in the browser. That content is not actually in the raw page: it is produced by the browser's rendering and exists only inside it. The HTML document is just text, the JavaScript files are just code, and the images, videos and effects do not appear in the code; everything we see is the browser's work.

Since Python cannot get the browser-rendered content, when we write code to crawl the data in the usual way, we find that what we get is not what we see, and the task fails.

This is where automated rendering technology comes in. In fact, browsers such as Chrome and Firefox expose interfaces that allow other programming languages to control the browser according to a protocol specification. On this technical basis, teams have developed tools such as Selenium and Puppeteer, and we can then use Python (or other languages) to operate the browser: have it fill in usernames and passwords, click login buttons, render text and images, slide captchas, and so on. This breaks down the barrier between Python and the browser itself: the browser renders the content and returns it to Python, so we get the same content we see on the page.
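A minimal Selenium sketch of the idea (assuming Chrome and a matching driver are installed; the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run without opening a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/js-rendered-page")
    # page_source contains the DOM *after* the browser executed JavaScript,
    # which is what a plain HTTP request would not see.
    rendered_html = driver.page_source
    print(rendered_html[:300])
finally:
    driver.quit()
```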

In addition to browsers, apps face a similar situation. For detailed operations and case studies, refer to Chapter 2 of the Python3 Web Crawler Bible.

V. Applications of message queues in the crawler field

In the descriptions above we have not mentioned any details of the crawling itself. Consider a common crawler scenario: the crawler first visits the website's article list page, then follows the URLs on the list page into the detail pages to crawl them. Note that the number of detail pages is necessarily N times the number of list pages; if the list shows 20 items per page, the ratio is 20 times.

If we need to crawl many websites, we use a distributed crawler. But if the distributed crawler simply clones one crawler into N copies and runs them, resources will be allocated unevenly, because in the scenario above every crawler has to do both jobs. In fact, there is a better way to combine them and make full use of resources: the flow from list page to detail page can be abstracted as a producer-consumer model:

Some crawler applications act only as producers: they extract detail-page URLs from list pages and push them into a queue. The other crawlers act as consumers: they take detail-page URLs from the queue and crawl them. When the gap between the number of list pages and detail pages is large, we can increase the number of consumer crawlers; when the gap is small, we can reduce the consumers (or increase the producers, depending on the situation).

Relative to this "data collection production line", the crawlers that feed the queue are the producers and the crawlers that read from it are the consumers. With such a structure we can adjust the capacity of the producers or consumers to maximize resource utilization. Another benefit is that when the producers generate more and more URLs while the consumers cannot keep up for a while, the URLs simply stay in the queue and are drained again once consumption catches up. With such a production line we do not have to worry about a sudden flood of URLs or the queue suddenly being emptied. This peak-shaving and valley-filling ability of queues shines not only in back-end applications but also in crawler applications.
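Here is a minimal sketch of the pattern with a Redis list as the queue (key names and URLs are illustrative; a real deployment might use RabbitMQ, Kafka or Scrapy-Redis instead):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
QUEUE_KEY = "crawler:detail_urls"

def producer(list_page_urls):
    """Parse list pages and push detail-page URLs onto the shared queue."""
    for list_url in list_page_urls:
        # ... fetch the list page and extract real detail URLs here (omitted) ...
        detail_urls = [f"{list_url}/item/{i}" for i in range(20)]
        r.rpush(QUEUE_KEY, *detail_urls)

def crawl_detail_page(url: str) -> None:
    print("crawling", url)  # real fetching and parsing would go here

def consumer():
    """Block waiting for detail-page URLs and crawl them one by one."""
    while True:
        _, url = r.blpop(QUEUE_KEY)  # blocks until a URL is available
        crawl_detail_page(url.decode())
```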

The implementation and details of connecting crawlers (and distributed crawlers) to message queues can be found in Chapter 4 of the Python3 Web Crawler Bible.

VI. Various forms of anti-crawler measures

I won’t give you what you want!

Websites will not let you crawl their content easily. They exploit the differences between humans and machines in programming languages, network protocols, browsers and other aspects to set obstacles for crawler engineers. Common measures include slider captchas, jigsaw captchas, IP bans, cookie checks, request-header checks, complex encryption logic, obfuscated front-end code, and so on.
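To take the simplest example, a basic request-header check can be passed by sending browser-like headers (a minimal sketch; the header values and URL are illustrative, and most of the measures above require far more work):

```python
import requests

headers = {
    # Pretend to be a normal browser; many basic anti-crawler checks only verify
    # that a plausible User-Agent and Referer are present.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://example.com/",
}
resp = requests.get("https://example.com/protected-page", headers=headers, timeout=10)
print(resp.status_code)
```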

Move and counter-move, back and forth! The contest between crawler engineers and the target site's engineers is as absorbing as a game between generals. The book Python3 Anti-Crawler Principles and Bypass in Practice covers more than 80% of the anti-crawler measures and crawler techniques on the market, explaining in detail the tricks used by both sides, so that readers can learn a great deal from them. For the specifics, read the book and get a taste of the jianghu of this technical field!

Summary

Today we covered the key technical points of building a large-scale crawler that handles over 100 million records a day, including intelligent body-text extraction, distributed crawlers, crawler deployment and scheduling, deduplication, and automated rendering. Once these techniques are understood and mastered, building a crawler with a daily volume of over 100 million records is not a problem.

These experiences come from front-line crawler engineers; the technologies and designs have been verified over a long period and can be applied directly in your work.

Activity

The Python3 Web Crawler Bible was mentioned many times above, and I have bought several copies to thank everyone for their support. If you would like a copy, leave a comment below saying why you want the book to take part in the giveaway.

Activity Rules:

1. The top 10 friends in the comment section will each receive a copy of the book as a gift; copies are limited (10 in total).

2. Friends taking part in the giveaway, please forward this article to your WeChat Moments before leaving a comment, and add the Python assistant's WeChat given at the end of the article. The editor will check this when drawing the winners; if the article was not forwarded, the prize passes to the next person;

3. The activity runs until 2020-12-04 18:00;

4. The books will be mailed by the Publishing House of Electronics Industry (within 7 working days). After the draw, the editor will contact the winners for their shipping address, so please be sure to add the Python assistant's WeChat mentioned at the end of the article.

References

[1] GeneralNewsExtractor: github.com/kingname/Ge…

Author: NightTeam. The Python Chinese Community official account reserves the right of final interpretation of this activity.