This is the 25th day of my participation in the Gengwen (daily writing) Challenge.
We must rebuild our inner world every single day, or we will accomplish nothing. – Demian (Hermann Hesse)
In earlier posts I covered the Requests library and a common code framework for web crawlers. If you are still not sure what crawlers are actually used for, open a job-hunting app and search for "crawler engineer": the salary is around 10,000 yuan per month, assuming some work experience, of course.
So is the casual crawler we write the same thing as a corporate crawler? Not quite; the main difference is scale. In a company, the volume of crawled data is huge and crawl speed matters a great deal; imagine searching something on Baidu and having to wait several minutes for the results.
Here is a simple classification by scale:
- Small scale: small amounts of data and little sensitivity to crawl speed. The Requests library is usually enough; the typical job is fetching individual web pages (see the minimal sketch after this list).
- Medium scale: fairly large amounts of data and real sensitivity to crawl speed. The Scrapy framework is the usual choice; the typical job is crawling whole websites or a series of sites.
- Large scale: search engines such as Baidu and 360, where crawl speed is critical. The technology is generally custom-built, and the target is the entire web.
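As a minimal sketch of the small-scale case with Requests (the URL below is just a placeholder, not a site from this post):

```python
import requests


def fetch(url: str) -> str:
    """Fetch a single page and return its text; raise on HTTP errors."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()                 # fail fast on 4xx/5xx responses
    resp.encoding = resp.apparent_encoding  # guess the encoding for non-UTF-8 pages
    return resp.text


if __name__ == "__main__":
    print(fetch("https://example.com")[:200])  # placeholder URL
```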
Because crawling generates a large number of requests, a crawler, depending on how well its code is written and what it is used for, can impose a heavy load on the target web server.
Of course, there are also anti-crawler measures and the Robots protocol to keep crawlers in check:
- Anti-crawler measures: mainly source review, that is, checking the User-Agent of incoming requests and limiting access frequency (see the sketch after this list for how a crawler declares its User-Agent).
- The Robots protocol: the site publishes its crawling rules and asks every crawler to abide by them.
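A minimal sketch of passing a User-Agent header with Requests; the header value and URL are illustrative placeholders, not values required by any particular site:

```python
import requests

# Many sites reject the default "python-requests/x.y" User-Agent,
# so a crawler usually declares one explicitly.
headers = {"User-Agent": "MyCrawler/1.0 (+https://example.com/contact)"}  # placeholder

resp = requests.get("https://example.com", headers=headers, timeout=10)  # placeholder URL
print(resp.status_code)
```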
Full name of the Robots protocol: Robots Exclusion Standard, i.e. the web crawler exclusion standard.
What it does: the site's administrators tell crawlers which pages may and may not be crawled.
Form: a robots.txt file placed in the root directory of the website.
So how do we, as ordinary users, abide by this protocol?
Generally, before crawling a site you should read its robots.txt and then crawl accordingly. As for whether compliance is mandatory: my advice is to comply whenever you can. If your request volume is small and you make no commercial profit, it is usually not a problem; after all, the Robots protocol is self-regulatory rather than legally binding, so you can ignore it, but that carries legal risk.
In short, if there is no commercial interest and your traffic is modest, I still recommend complying; if your crawling serves a commercial interest, or you crawl the whole web the way Baidu or Sogou do, then you must comply, otherwise you may bear legal responsibility.
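As a minimal sketch of "read robots.txt first, then crawl", Python's standard library ships `urllib.robotparser`; the site and user-agent string below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user agent; swap in the site you actually crawl.
SITE = "https://example.com"
USER_AGENT = "MyCrawler/1.0"

rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")   # robots.txt always sits in the site root
rp.read()                          # download and parse the rules

page = SITE + "/some/page.html"    # hypothetical page
if rp.can_fetch(USER_AGENT, page):
    print("Allowed to crawl:", page)
else:
    print("Disallowed by robots.txt:", page)
```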
Finally, web crawling carries risks:
The data on a server is someone else's property. If you use it for commercial trade you may have to bear legal responsibility, and careless crawling can easily lead to user data leaks. To let something slip here: I have seen user data being traded before, and once data leaks, the most annoying consequences are harassment, such as spam calls and spam text messages. That is why I keep reminding everyone to protect their privacy, even though there is hardly any privacy left on the Internet.
One last reminder: don't let your crawler studies take you "from getting started to getting locked up".
Python crawler series, to be continued…