Hello everyone, I'm Four MAO. I recently started a personal public account, "to program with Python" — feel free to follow it so you can receive quality articles.
Today I'm going to share some tips on finding crawler entry points.
The reason for writing this article is a work task involving Homework Help, which only exists as an APP. Of course, the data can be obtained by capturing the APP's network packets. Alternatively, the data can be collected step by step by following the related questions linked from each question. I started with the latter approach, but it had a few drawbacks:
(1) It is easy to collect dirty data. For example, my original task was to collect data about Chinese-language questions, but when I used a Chinese question as the seed URL to crawl from, I found that most of the data obtained afterwards was English questions.
(2) Efficiency is very low.
So, is there a better entry point for this crawling task? Apparently, there is.
Finding the crawler entry point
1. Entry point for this task
A better entry point for this crawler is the search engine we use every day. Although there are many kinds of search engines, they all essentially do the same thing: crawl web pages, process them, and then provide search services. In everyday use, we usually just type in keywords and search directly, but there are actually many search techniques. For this task, as long as we search with the right keywords, we can get the data we want.
Now let's try it on Baidu, Google, Sogou, 360 and Bing:
So it is clearly better to use the search results as the entry point for this task. As for dealing with anti-crawler measures, that will test each person's fundamentals.
2. Other entry points
(1) Mobile entry point
Data can often be obtained better and faster through a website's mobile version.
The easiest way to find the mobile entry point is to open Chrome's developer tools, click the device-toolbar icon that looks like a phone, and then refresh the page.
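In a crawler, the same effect can often be reproduced in code. Below is a minimal sketch assuming the common (but not universal) convention that the mobile site lives at an `m.` subdomain; the helper name `to_mobile_url`, the example hosts, and the User-Agent string are all illustrative, not taken from any specific site.

```python
from urllib.parse import urlsplit, urlunsplit

def to_mobile_url(url: str) -> str:
    """Rewrite a www. host to the m. subdomain, a convention many sites follow."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = "m." + host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

# Sending a mobile User-Agent header makes many sites serve the lighter
# mobile page even at the desktop URL.
MOBILE_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/90.0 Mobile Safari/537.36"),
}
```

Pass `MOBILE_HEADERS` with whatever HTTP client you use; the mobile page is usually smaller and easier to parse.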
(2) Sitemap
A sitemap is a file that site administrators provide to tell search engines which pages on the site can be crawled. Through these sitemaps, we can reach the site's lower-level pages more efficiently and conveniently.
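As a minimal sketch, a sitemap can be parsed with the standard library to collect the URLs listed in its `<loc>` tags. The sample XML below is illustrative; a real sitemap is usually fetched from a path such as `/sitemap.xml`, often referenced in the site's robots.txt.

```python
import xml.etree.ElementTree as ET

# Illustrative sample; real sitemaps can contain thousands of <url> entries.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""

def extract_urls(sitemap_xml: str) -> list:
    """Return every URL listed in the sitemap's <loc> elements."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]
```

The extracted URLs then become a clean, site-approved crawl queue instead of links discovered page by page.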
(3) Modify values in the URL
First of all, this technique is not a panacea.
The idea is to get as much of the required data as possible from a single request by modifying the values of certain fields in the URL, thereby reducing the number of requests, lowering the risk of being blocked by the website, and improving crawler efficiency. Here's an example:
When crawling all of a singer's music data from QQ Music, packet capture yields a URL in the following format:
https://xxxxxxxxx&singermid=xxxx&order=listen&begin={begin}&num={num}&songstatus=1
The following packet is returned:
Some of the fields here are replaced by xxx. Note the begin and num fields, which control pagination: begin is the index of the first item on the page, and num is the number of items on the page. Normally we would fetch the data page by page, and QQ Music's default page size is 30. Do we really have to make at least 4 requests to get the full data?
Of course not. We can try changing some of the values in the URL, and the returned result will change accordingly. For example, set begin=0 and num=1: the response is tiny, but it still contains the total number of songs.
So we can get all the data in just two requests: the first request gets the total number, and the second changes num in the URL to that total to fetch everything at once.
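The two-request trick above can be sketched as follows. Since the real QQ Music URL is elided in the capture, an in-memory fake API stands in for the endpoint; the response shape (a `total` count plus a `list` of items) is an assumption for illustration only.

```python
# Pretend this singer has 103 songs on the server side.
SONGS = [f"song_{i}" for i in range(103)]

def fake_api(begin: int, num: int) -> dict:
    """Stand-in for one paginated request: returns `num` songs starting at `begin`."""
    return {"total": len(SONGS), "list": SONGS[begin:begin + num]}

def fetch_all() -> list:
    # Request 1: ask for a single item just to learn the total count.
    total = fake_api(begin=0, num=1)["total"]
    # Request 2: ask for everything in one go by setting num to the total.
    return fake_api(begin=0, num=total)["list"]
```

With the default page size of 30, the same 103 songs would have taken 4 paged requests; here 2 requests suffice regardless of how many songs the singer has.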
A field that often plays the same role is pageSize.
Conclusion
The tips above for finding crawler entry points can help us get more results with less effort, and sometimes at a lower cost.