This article was first published as: Testing the waters, talking about my six months of [crawler development] experience
Preface
In my work I have been writing crawlers on and off for nearly half a year. In that time I went from a complete Python beginner, to getting the hang of crawlers, to shipping real features. From writing PHP in the morning and Python in the afternoon, to now switching freely between PHP and Python, the twists and turns speak for themselves, and I took plenty of detours along the way. But it paid off: over these six months I grew alongside Python while using it. Here is a summary, which I hope will help anyone who needs it.
I have written several Python and Scrapy articles before; links below:
- Python Learning Notes (Basics)
- Python Learning Notes (Common extensions)
- Research on Python crawlers
- Python scrapy
- How does Python crawl dynamic websites?
Learning
Python
When you first get into crawlers, the first things you need to learn are how to set up a Python environment and the language's syntax and features. There are many tutorials online; the following are some I used while learning, and they are very helpful for getting up to speed with Python quickly.
- Liao Xuefeng's Python tutorial is where I started: read the articles and write small demos as you go
- The basic Python tutorial on the Rookie tutorial site can also serve as a quick-start guide
- The Python 3 documentation, handy for looking things up when you forget how a function works
Crawler frameworks
Once you have a good grasp of Python's basic syntax, you can start learning a Python crawler framework. The best known is Scrapy. Learn how Scrapy works and how to use it by reading the documentation, blog posts, and videos. Here are some of the better Scrapy posts and videos I came across along the way:
- Python Scrapy notes and simple practice (my own summary post)
- Other Scrapy introductory posts and tutorials
- This video isn’t long, but it’s enough to get started with Scrapy
- Learn how to use Scrapy to crawl Jianshu
- A Bilibili Scrapy video (3): I learned Scrapy by following this video, and the instructor is very good. When I was writing this article, however, I found that the original video had been taken down; if you want to know more, you can contact the uploader
In addition to the links above, GitHub also has some well-built Python and Scrapy projects that you can study to deepen your understanding.
XPath
XPath is a language for navigating through elements and attributes in an XML document. In other words, you can use XPath to locate elements on a page and retrieve their contents. Almost all of the crawler code I write uses XPath to extract page content, so learning XPath well really matters.
XPath works much like the DOM in JavaScript: it can locate a specific element by its id, class, and so on, and fetch the corresponding content. Some common XPath usages are listed briefly below:
- / : selects elements one level down (direct children)
- // : selects descendant elements at any depth
- . : the current node
- get() : returns a single matched value
- getall() : returns all matched values
See the XPath tutorial at W3School for details.
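To make the list above concrete, here is a minimal sketch, assuming a recent Scrapy version and using a made-up HTML snippet, of how these expressions are used with Scrapy's Selector:

```python
from scrapy.selector import Selector

# A made-up HTML snippet, used only for illustration
html = """
<div id="post">
  <h1 class="title">Hello XPath</h1>
  <ul>
    <li>first</li>
    <li>second</li>
  </ul>
</div>
"""

sel = Selector(text=html)

# '//' searches descendants at any depth; [@class="..."] filters by attribute
title = sel.xpath('//h1[@class="title"]/text()').get()       # 'Hello XPath'

# '/' steps down one level at a time; getall() returns every match
items = sel.xpath('//div[@id="post"]/ul/li/text()').getall()  # ['first', 'second']

print(title, items)
```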
Regular expressions
Crawling naturally involves regular expressions; there are many scenarios where you need a regex to extract specific content from a string. If you are not comfortable with regular expressions, it will directly affect your efficiency (of course, you can always ask a colleague who knows them, but doing it yourself keeps you fed and clothed). I paid the price for not knowing regex well, and only studied it properly when crawler work forced me to, so I took that opportunity to learn it again.
For documentation, refer directly to the Rookie tutorial's regular expressions page.
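As a small illustration of the kind of regex extraction that comes up in crawler work (the string and patterns below are made up for the example):

```python
import re

# Hypothetical raw text pulled from a page, used only for illustration
raw = 'var articleId = "10086"; var viewCount = 2333;'

# Extract the quoted article id captured by the first group
article_id = re.search(r'articleId = "(\d+)"', raw).group(1)

# Extract every bare number in the string
numbers = re.findall(r'\d+', raw)

print(article_id)  # 10086
print(numbers)     # ['10086', '2333']
```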
Practice
Once you have learned how to use Python, Scrapy, and XPath, you are ready to try writing a crawler yourself. I started by trying to write a crawler that crawls the whole of Jianshu. In testing it could fetch data, but after crawling a large amount of data some problems appeared (these problems are discussed below). There is also a Python Scrapy demo that crawls all the articles on my own website: the code is very simple, fetching every article title and body on my blog, so even a beginner should find it easy to follow.
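For reference, a minimal spider along the lines of that demo might look like the sketch below; the domain, URLs, and XPath expressions are placeholders, not my blog's real structure:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["https://example.com/archive"]  # placeholder list page

    def parse(self, response):
        # Follow every article link found on the list page
        for href in response.xpath('//a[@class="post-link"]/@href').getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Yield the title and body text of a single article
        yield {
            "title": response.xpath("//h1/text()").get(),
            "content": "".join(
                response.xpath('//div[@class="post-body"]//text()').getall()
            ),
        }
```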
Getting blocked
As mentioned above, problems appear once you crawl more than a certain amount of data. The main one is being blocked from crawling: because I fetched a large number of Jianshu articles in a short time, my IP was briefly put on Jianshu's blacklist and I could no longer retrieve data. After a while it recovers, but the crawler then runs only until the IP is blocked again, and the amount of data retrieved each time is much less than the first run. This is Jianshu's anti-crawling mechanism.
I have also written about anti-crawling and how to counter it in Research on Python crawlers, which lists common anti-crawling mechanisms and strategies for dealing with them. It is a summary of what I learned about anti-crawling over a short period, so it may not go very deep, but it works as an overview.
Proxy IPs
Proxy IPs deserve their own section, because a crawler project has to rely on many IPs to get the job done. Otherwise, once a single IP is blocked by the target site the business grinds to a halt, which is not acceptable. So we need to build a proxy IP pool for the crawler, keep the healthy, usable IPs in it, and switch to a working IP as the proxy whenever the current one gets blocked.
There are many solutions online for building a proxy IP pool. Because the IP quality of those free solutions is not great, I did not try them, but if you want to experiment you can build one based on an open-source IP proxy pool project. The basic idea: crawl public IP proxy sites, save the addresses into your own pool (a database or Redis), and write a script that periodically checks whether each IP still works, keeping the healthy IPs in the pool and removing the dead ones.
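A very rough sketch of the "check and keep" part of such a pool, assuming the candidate IPs have already been crawled and using Redis as storage (the set name, test URL, and timeout are arbitrary choices for the example):

```python
import redis
import requests

POOL_KEY = "proxy:pool"  # arbitrary Redis set name for this sketch
r = redis.Redis(host="localhost", port=6379, db=0)


def is_alive(proxy: str) -> bool:
    """Return True if the proxy can fetch a test page within a few seconds."""
    try:
        resp = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
            timeout=5,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False


def refresh_pool(candidates):
    """Add working proxies to the pool and evict dead ones already in it."""
    for proxy in candidates:
        if is_alive(proxy):
            r.sadd(POOL_KEY, proxy)

    for proxy in r.smembers(POOL_KEY):
        if not is_alive(proxy.decode()):
            r.srem(POOL_KEY, proxy)
```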
Common paid IP proxy providers, such as Kuaidaili, sell a fixed quota of proxy IPs, and every switch to a new IP uses up one from the quota. In testing, the IP quality turned out to be quite high, but that limited quantity could not meet our business needs.
You can also use tunnel-style proxy services, where the number of IPs is not limited: you send every request to a single tunnel endpoint and the provider forwards it through rotating IPs. Xiaoxiang proxy is one such service. However, its IPs turned out to be fairly ordinary, or perhaps it was just the particularity of our business, but Xiaoxiang's IPs were not very useful to us.
The last option is ScrapingHub's Crawlera proxy service. The proxy quality is surprisingly high and stable. Since it is an overseas service, teams with enough budget can consider it (about $349 a month).
Captchas
Captchas are one of the most common anti-crawling measures, and when I first ran into them I could not think of how to get past them. I looked into the mature OCR technology available, but it is particularly tedious to use and the failure rate is high; even if you crack one captcha, subsequent requests will still hit captchas, and OCR recognition is slow, so the overall crawling efficiency drops.
If you cannot crack captchas efficiently, is there another way? The answer is yes: it is far simpler and more efficient to use a downloader middleware to check whether the response is a captcha page, and if it is, simply switch to another proxy IP and retry the request. With Crawlera, you just issue the request again.
Cracking captchas takes time and effort; simply switching IPs is simple and efficient, and that is what I recommend.
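Here is a hedged sketch of what such a downloader middleware might look like in Scrapy; the detection rule, the proxy list, and the middleware name are placeholders that depend on the target site and on your own proxy pool:

```python
import random


class CaptchaRetryMiddleware:
    """Downloader middleware sketch: if a response looks like a captcha page,
    switch to another proxy IP and retry the request.

    Enable it in settings.py, for example:
    DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.CaptchaRetryMiddleware": 543}
    """

    # Placeholder proxy list; in practice this would come from your proxy pool
    PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

    def process_response(self, request, response, spider):
        # Placeholder detection rule: adjust to however the site marks captcha pages
        if b"captcha" in response.body.lower():
            retry_req = request.replace(dont_filter=True)  # allow re-crawling the same URL
            retry_req.meta["proxy"] = random.choice(self.PROXIES)
            return retry_req
        return response
```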
Scrapy Redis
Scrapy-Redis is used to build distributed crawlers. Essentially, the links to be crawled are stored in a Redis queue, so multiple crawler processes running on different servers can consume the same queue, which achieves distributed crawling.
Switching to Scrapy-Redis is also easy: have the spider class inherit from RedisSpider, add a redis_key attribute to the spider specifying the queue name, remove start_urls, and add the necessary Scrapy-Redis settings and Redis connection information to the settings file:
```python
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_HOST = ""      # your Redis host
REDIS_PORT = 6379    # your Redis port (6379 is the Redis default)
REDIS_PARAMS = {
    'db': 0,         # Redis database index
    'password': '',  # Redis password, if any
}
```
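To make the conversion concrete, here is a minimal sketch of a spider rewritten for Scrapy-Redis; the spider name, queue name, and XPath are placeholders rather than a real project:

```python
from scrapy_redis.spiders import RedisSpider


class DistributedBlogSpider(RedisSpider):
    name = "blog_redis"
    redis_key = "blog:start_urls"  # the Redis queue this spider consumes
    # Note: no start_urls here; push seed URLs into Redis instead, e.g.
    #   redis-cli lpush blog:start_urls https://example.com/archive

    def parse(self, response):
        yield {
            "title": response.xpath("//h1/text()").get(),
            "url": response.url,
        }
```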
Scrapy-Redis introduction and practice
Scrapy Crawlera
Crawlera is a third-party proxy service that provides a proxy IP pool for distributed downloading. Our production business has been using this proxy; it is very stable, and blocked or failed requests are almost nonexistent. It is just a little expensive.
Scrapy-crawlera
ScrapingHub Crawlera introduction and fees
ScrapingHub Crawlera API documentation
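For reference, wiring Crawlera into a Scrapy project is mostly a settings change. The sketch below follows the settings documented by the scrapy-crawlera plugin, with a placeholder API key; check the current plugin documentation before relying on the exact names:

```python
# settings.py (sketch, based on the scrapy-crawlera plugin)
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-crawlera-api-key>"  # placeholder
```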
Tips
XPath Helper
XPath Helper is a small browser extension that lets you type an XPath expression directly on a web page and check whether the expression you wrote actually selects what you expect.
Scrapy Shell
The Scrapy shell is another debugging tool that Scrapy provides. From the command line it can conveniently open a specified web page, after which you can type code interactively to work out how to extract the page content.
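A typical debugging session looks something like the sketch below (the URL and expressions are placeholders):

```python
# Launched from the command line:
#   scrapy shell "https://example.com"
# Then, inside the interactive shell:
response.xpath("//title/text()").get()   # the page title
response.xpath("//a/@href").getall()     # every link on the page
view(response)                           # open the downloaded page in a browser
```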
Conclusion
That is the whole of this article. It summarizes a beginner's learning path for Python crawlers, the practical next steps, and a few small tips. Using them in real work can certainly improve efficiency, but the knowledge you need goes far beyond this; if you want to go further, you have to keep learning, exploring, and trying.
What I have used so far is only the tip of the iceberg; there is much more to learn.