Preface
Scrapy is coming!!
After seven basic crawler articles, I finally got to Scrapy. Scrapy ushered in the era of crawler 2.0, bringing crawlers to developers in a new form.
I started learning Scrapy during my internship in 2018, spending a month on it with a mix of theory and practice. This article is not about code operations; it covers only the background and theory, in the hope that you come away knowing what Scrapy is and why it exists.
Problems with native crawlers
Whether you’re using Java’s Jsoup or Python’s requests, there are a number of problems with developing crawlers:
1. Distributed crawling
Crawlers usually run on a single host. If the same crawler is deployed on several hosts, each copy is just a separate crawler. To build a truly distributed crawler, the usual idea is to split the work into URL collection and data collection.
The URLs are first crawled and put into a database, then parceled out by a WHERE condition, or more directly by using a Redis list, so that crawlers on different hosts read different URLs and then crawl the data, as sketched below.
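As a concrete illustration of the Redis-list idea, here is a minimal sketch (not code from the original project). It assumes a Redis server on localhost and the redis-py package; the key name `todo_urls` and the `crawl()` stub are invented for illustration.

```python
# Minimal sketch of sharing a URL queue between hosts via a Redis list.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def push_urls(urls):
    """URL-collection side: push discovered URLs onto a shared list."""
    for url in urls:
        r.lpush("todo_urls", url)

def crawl(url):
    print("crawling", url)  # placeholder for the actual page-fetching logic

def worker_loop():
    """Data-collection side: each host pops URLs, so no two hosts get the same one."""
    while True:
        item = r.brpop("todo_urls", timeout=30)  # blocks until a URL is available
        if item is None:       # queue stayed empty for 30s -> stop this worker
            break
        crawl(item[1].decode())
```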
2. URL deduplication
When crawling, you often encounter duplicate URLs, and crawling them again is a waste of time. The usual idea for URL deduplication is to put every crawled URL into a set and check the set before each request. But if the program stops, the in-memory set is gone, and after a restart you can no longer tell which URLs have already been crawled.
So you use a database instead, inserting each crawled URL into a table; even if the program restarts, the record of crawled URLs is not lost. But if I simply want to start crawling from scratch, do I have to clean the URL table by hand? And the time spent on each database query is also something to consider. A rough sketch of both approaches follows.
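Here is a rough sketch of the two approaches, assuming SQLite for persistence; the table name `crawled` and the helper names are invented for illustration.

```python
# In-memory set vs. database-backed deduplication (illustrative only).
import sqlite3

seen = set()  # in-memory: fast, but lost as soon as the program stops

conn = sqlite3.connect("crawler.db")
conn.execute("CREATE TABLE IF NOT EXISTS crawled (url TEXT PRIMARY KEY)")

def should_crawl(url):
    """Skip URLs that were already crawled, surviving restarts via the database."""
    if url in seen:                       # cheap in-memory check first
        return False
    row = conn.execute("SELECT 1 FROM crawled WHERE url = ?", (url,)).fetchone()
    if row is not None:
        seen.add(url)                     # cache the DB hit for next time
        return False
    return True

def mark_crawled(url):
    seen.add(url)
    conn.execute("INSERT OR IGNORE INTO crawled (url) VALUES (?)", (url,))
    conn.commit()
```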
3. Resumable crawling
Suppose there are 1,000 pages to crawl and the program suddenly crashes at page 999, just as the progress bar is about to fill up. So close, yet the job is not done. What then? I restart the program, but how do I make it resume directly from page 999?
Here is the first crawler I ever wrote: it crawled the POI information of 10+ cities.
It was during my internship, the first time I developed a crawler. I did not know about the AutoNavi POI API, so I found a website to crawl POI information from. The site was probably in its infancy at the time: the server bandwidth was low, access was really slow, and it went down for maintenance frequently, so my program had to stop along with it. If I had to start over from scratch every time, I probably would not have finished in years, so I came up with a workaround.
First, I manually entered the total number of items (shown on the website) for every district and county of every city into a database table. Every time the crawler restarted, I counted how many items each district already had in the result table and compared that with the total. If it was smaller, the district was not finished; then the number already crawled divided by the number of items shown per page told me which page to resume from, and the remainder told me how many items on that page to skip (see the sketch below). In this way I crawled 1.63 million records without losing any.
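The page/offset arithmetic can be sketched like this; `PER_PAGE` and the function name are assumptions, since the original code is not shown.

```python
# Resume-point arithmetic described above (hypothetical sketch).
PER_PAGE = 20  # assumed number of items the site shows per page

def resume_point(total_items, crawled_items):
    """Return (page, offset) to continue from for one district, or None if done."""
    if crawled_items >= total_items:
        return None                       # this district is already finished
    page = crawled_items // PER_PAGE + 1  # assuming pages are 1-based
    offset = crawled_items % PER_PAGE     # items already taken from that page
    return page, offset

# e.g. 163 of 500 items crawled, 20 per page -> resume on page 9, skipping 3 items
print(resume_point(500, 163))  # (9, 3)
```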
To put it another way: store the URLs in a table, and on restart check whether a URL already exists in the table before crawling it; if it does, skip it. That also achieves resumable crawling, and it reuses the same idea as database-based URL deduplication.
4. Dynamic loading
The sixth article, about funds, dealt with JSONP dynamic loading, a relatively simple case: once you find the API the page requests, you can just process the data it returns. The seventh article dealt with the eval() JavaScript encryption used by TV Cat, a very complex case of dynamic loading: the API parameters are encrypted, and it took a lot of time digging through dense JS to work out how the 186-bit parameter is computed.
So, is there a way I can get away from reading and analyzing JS and bypass dynamic loading?
Sure! Dynamic loading can be understood as the browser engine executing JS to render data on the front end. So if we embed a browser engine in our program, can't we simply grab the page data after the JS has finished rendering?
Selenium + Chrome, PhantomJS, and pyvirtualdisplay are the usual tools for handling dynamic loading, but all of them have performance issues to one degree or another. A minimal Selenium sketch follows.
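For example, a headless-Chrome sketch with Selenium 4 looks roughly like this (the URL is a placeholder; recent Selenium versions can resolve chromedriver automatically, otherwise it must be installed separately).

```python
# Render the page with a real browser engine, then read the resulting HTML.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # run without opening a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")    # placeholder URL
    html = driver.page_source            # HTML *after* the JS has executed
    print(len(html))
finally:
    driver.quit()
```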
With all that said, you should know what I’m going to say.
About Scrapy
My impression of Scrapy: clear modules, well-encapsulated structure, powerful features.
WHAT
Scrapy is a crawler framework (distributed crawling is available through extensions). I think of it as the Spring of crawlers: requests is like the Servlet API, where all the functional logic has to be implemented yourself, while Spring does the integration and keeps the underlying machinery transparent to the user.
Just as Spring initializes beans from the application configuration file and defines database operations in Mappers, without users caring how Spring reads those configuration files, Scrapy provides the same kind of configuration-driven experience.
Scrapy is a crawler framework, while requests is just a crawler module.
WHY
My politics teacher once said: there is no love or hate without reason. Based on my personal experience, here is why I recommend Scrapy:
- Performance: asynchronous requests built on Twisted, no pressure at all
- Configuration: request concurrency, download delay, and retry counts are all defined in the configuration file (see the settings sketch after this list)
- Rich plugins: solutions for dynamic loading, resumable crawling, and distributed crawling are available, ready to use with a few lines of configuration
- Command-line operation: you can generate, start, stop, and inspect crawlers from the command line
- Web interface operation: web interfaces can be integrated to start, stop, and monitor crawlers
- Test environment: an interactive shell environment is provided for testing
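To make the configuration point concrete, here is a sketch of what the relevant entries in a project's settings.py can look like; the setting names are real Scrapy settings, but the values are just examples.

```python
# settings.py -- example values only
BOT_NAME = "demo"

CONCURRENT_REQUESTS = 16   # how many requests Scrapy issues in parallel
DOWNLOAD_DELAY = 0.5       # seconds to wait between requests to the same site
RETRY_ENABLED = True
RETRY_TIMES = 3            # retry a failed request up to 3 times

ROBOTSTXT_OBEY = True
```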
HOW
Scrapy is a framework, and its features are so powerful; isn't it hard to learn?
There is no need to worry. Installing Scrapy is as easy as installing any other Python module once you know what the few modules it depends on do, and Scrapy crawler development means less code and clearer layering, much simpler than the equivalent requests code. A minimal spider is sketched below.
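To show how little code is involved, here is a minimal spider sketch against the public demo site quotes.toscrape.com (not from this article's project; the selectors are specific to that site).

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # each quote block becomes one yielded item
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the "next page" link, if present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```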
Application scenarios
Some people find the Scrapy framework too heavyweight and prefer to stick with requests. All I can say is that the application scenarios and priorities are different.
Scrapy development is more like an engineering project. It is typically used to integrate crawled data from multiple sources, for example consolidating video, novel, music, and comic information into one data table. Developers only need to agree on the data fields in advance, because a Scrapy spider can simply yield its data and let a pipeline write it to the database, without explicitly calling any storage method (a rough sketch follows).
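The yield-and-store workflow can look roughly like this: the spider yields plain dicts, and an item pipeline, enabled via ITEM_PIPELINES in settings.py, persists them. The pipeline class, table, and field names below are assumptions for illustration.

```python
# pipelines.py -- sketch of storing yielded items without the spider
# ever calling a save method itself.
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("media.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS media (title TEXT, category TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO media (title, category) VALUES (?, ?)",
            (item.get("title"), item.get("category")),
        )
        self.conn.commit()
        return item                      # pass the item on to later pipelines

    def close_spider(self, spider):
        self.conn.close()
```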
requests is better suited to developing a single crawler that does not need unified management or distributed deployment.
Conclusion
Strictly speaking, the first article should have covered Scrapy's architecture and installation, but I believe you need to understand what a technology does and where it applies before using it, which is why I wrote this theory piece first.
I actually wrote this article twice: after the first draft, for some reason, it got overwritten in the editor, and I had to write it again. Fortunately I had sent screenshots of the middle section to a friend, so there was a little less to rewrite. I finally understand the feeling behind that old internet line: when the dog tears up your homework, you really don't want to write it again.
I hope this article has given you a deeper understanding of the theory behind crawlers. Looking forward to our next encounter.
I am a post-95 programmer who writes about everyday work and practice, putting myself in a beginner's shoes and going from 0 to 1, detailed and earnest. Articles are published first on my public account [entry to give up the road]; I look forward to your follow.