Probably the simplest crawler Demo ever
First, the simplest crawler Demo:
Our first crawler, written in just two lines of code:
import urllib.request # Python 3: urllib.request must be imported explicitly
print(urllib.request.urlopen(urllib.request.Request("http://example.webscraping.com")).read())
These two lines of code work fine under Python 3.6 and fetch the content of the page at example.webscraping.com.
Note: alternatively, you can use the third-party requests library, again in two lines of code:
import requests #Python3
print(requests.get('http://example.webscraping.com').text)
If you don't have the requests library, install it with the pip install requests command.
Note: most of the code in this handout is written for Python 3.6. Appendix A of the handout contains a comparison table of the most important crawler libraries in Python 2 and Python 3; with this table, porting crawler code between Python 2 and Python 3 is straightforward.
Second, a review of the HTTP and HTTPS protocols
1. About URLs:
A URL (Uniform/Universal Resource Locator) describes the address of a web page or any other resource on the Internet.
Basic format: scheme://host[:port#]/path/…/[?query-string][#anchor]
scheme: the protocol (for example HTTP, HTTPS, FTP)
host: the IP address or domain name of the server
port#: the server's port (can be omitted when the protocol's default port is used, e.g. 80 for HTTP)
path: the path of the resource being accessed
query-string: parameters, i.e. the data sent to the HTTP server
anchor: the anchor (jumps to the specified anchor position within the page)
Such as:
http://www.baidu.com
ftp://192.168.1.118:8081/index
The URL is the entry point for the crawler, which makes it very important.
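As a quick check, here is a minimal sketch using Python 3's standard urllib.parse module (the example URL below is illustrative) that splits a URL into exactly the components listed above:
from urllib.parse import urlparse
parts = urlparse("http://example.webscraping.com:8080/places/default/index?page=2#results")
print(parts.scheme)   # 'http'
print(parts.hostname) # 'example.webscraping.com'
print(parts.port)     # 8080
print(parts.path)     # '/places/default/index'
print(parts.query)    # 'page=2'
print(parts.fragment) # 'results'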
2. The HTTP and HTTPS protocols
HyperText Transfer Protocol (HTTP) is the method for publishing and receiving HTML pages. HTTP is an application-layer protocol; it is connectionless (each connection processes only one request) and stateless (each transfer is independent of the others).
Hypertext Transfer Protocol over Secure Socket Layer (HTTPS) is a secure version of HTTP that adds an SSL layer beneath HTTP. HTTPS = HTTP + SSL (Secure Sockets Layer). It is a secure web transport protocol that encrypts the network connection at the transport layer to ensure the security of data transmitted over the Internet.
The default HTTP port number is 80; the default HTTPS port number is 443.
3. HTTP requests: there are two common request methods:
Get: requests information from the server; the data passed along travels in the URL, so it is exposed (not private), and its size is limited.
Post: transmits data to the server in the request body, so the data is not exposed in the URL, and its size is theoretically unlimited.
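A minimal sketch with the requests library makes the difference visible (it uses the public httpbin.org echo service, which is my choice of test endpoint, not part of the original text):
import requests
r_get = requests.get("http://httpbin.org/get", params={"q": "crawler"})
print(r_get.url)  # parameters appear in the URL: http://httpbin.org/get?q=crawler
r_post = requests.post("http://httpbin.org/post", data={"q": "crawler"})
print(r_post.url) # http://httpbin.org/post -- the data travels in the request body instead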
4. About the User-Agent
The User Agent (UA for short) is a special string header that enables the server to identify the operating system and version, CPU type, browser and version, browser rendering engine, browser language, and browser plug-ins used by the client.
What User-Agent does our simplest crawler report to the server when it runs? It turns out that Python's HTTP libraries send a default User-Agent containing the library name and version number, which makes it easy to recognize that the crawler is written in Python. So if we keep the default User-Agent, an anti-crawler program can tell at a glance that we are a Python crawler, which is bad for us.
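You can inspect this default yourself; for example, the requests library exposes its default string through requests.utils.default_user_agent() (the exact version in the output depends on your installed requests version):
import requests
# The default User-Agent announces the library and its version to every server.
print(requests.utils.default_user_agent()) # e.g. 'python-requests/2.18.4'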
So, how can we modify the User-Agent to disguise our crawler?
Headers ={" user-agent ":"Mozilla/5.0 (Windows NT 6.1; Win64; X64) AppleWebKit / 537.36 (KHTML, Like Gecko) Chrome/65.0.3325.181 Safari/537.36"} req = request.request ("http://www.sina.com.cn", Headers =headers) # return http.client.httpresponse response = request.urlopen(req)Copy the code
5. HTTP Response: response status codes:
200-level codes mean success; 300-level codes mean redirection;
400- and 500-level codes mean an error:
Note: the status information the server returns can be used to judge whether our crawler is currently running normally;
When an abnormal error occurs: generally speaking, on a 500-level error the crawler should go to sleep for a while, because the server is probably overloaded or down; on a 400-level error the crawling strategy needs to be modified, because the website may have been updated or the crawler may have been banned. In a distributed crawler system, such problems are easier to discover and the strategy is easier to adjust.
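A minimal sketch of this policy (the fetch function and its parameters are illustrative, not from the original text): back off and retry on 500-level errors, stop and rethink on 400-level errors:
import time
import requests

def fetch(url, retries=3):
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code < 400:
            return resp # success (redirects are followed automatically)
        if resp.status_code >= 500:
            time.sleep(2 ** attempt) # server-side trouble: sleep, then retry
            continue
        # 400-level: our request or strategy is wrong -- retrying won't help
        raise RuntimeError("Got %d for %s; adjust the crawl strategy" % (resp.status_code, url))
    return None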
6. The HTTP response body is the part of the protocol that our crawler really needs to care about:
Python's interactive environment makes it easy to view the request and response information intuitively, showing off Python's Swiss-Army-knife versatility.
>>> import requests # Python 3
>>> html = requests.get('http://example.webscraping.com')
>>> print(html.status_code)
200
>>> print(html.elapsed)
0:00:00.818880
>>> print(html.encoding)
utf-8
>>> print(html.headers)
{'Server': 'nginx', 'Date': 'Thu, 01 Feb 2018 09:23:30 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'web2py', 'Set-Cookie': 'session_id_places=True; httponly; Path=/, session_data_places="6853de2931bf0e3a629e019a5c352fca:1Ekg3FlJ7obeqV0rcDDmjBm3y4P4ykVgQojt-qrS33TLNlpfFzO2OuXnY4nyl5sDvdq7p78_wiPyNNUPSdT2ApePNAQdS4pr-gvGc0VvnXo3TazWF8EPT7DXoXIgHLJbcXoHpfleGTwrWJaHq1WuUk4yjHzYtpOhAbnrdBF9_Hw0OFm6-aDK_J25J_asQ0f7"; Path=/', 'Expires': 'Thu, 01 Feb 2018 09:23:30 GMT', 'Pragma': 'no-cache', 'Cache-Control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Content-Encoding': 'gzip'}
>>> print(html.content)
(output omitted here: the full page source is too long to show)
Third, on crawler crawling strategies
Generally, when crawling data, we don't just grab the content of a single entry URL and stop. So what do we do when there are multiple URL links to crawl?
1. Depth-first algorithm
Depth-first means the crawler starts from one link on the first page, enters the linked page and scrapes its content, then keeps following links on the current page deeper and deeper, until it reaches the deepest page with no unvisited links; the crawler then backtracks and follows another link on the first page, as shown in the figure below.
2. Breadth-first (width-first) algorithm
Breadth-first means the crawler finishes all the links on the current level of the hierarchy before moving down to the next level.
As shown below:
Exercise: construct a complete binary tree and implement its depth-first and breadth-first traversal algorithms (a solution sketch follows the traversal results below).
In a binary tree, if only nodes on the lowest level may have degree less than 2, and the nodes on that lowest level are packed into the leftmost positions of the level (so that only some nodes on the right side of the last level are missing), then the binary tree is called a complete binary tree.
The complete binary tree is as follows:
Results of depth-first traversal: [1, 3, 5, 7, 9, 4, 12, 11, 2, 6, 14, 13, 8, 10]
Results of breadth-first traversal: [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]
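Here is one possible solution sketch for the exercise: it stores the complete binary tree in an array (the children of index i sit at indices 2*i+1 and 2*i+2) and reproduces the two result lists above:
from collections import deque

values = [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10] # level-order node values

def dfs(i, out):
    # preorder depth-first traversal over the array-backed tree
    if i >= len(values):
        return
    out.append(values[i])
    dfs(2 * i + 1, out) # left child
    dfs(2 * i + 2, out) # right child

def bfs():
    # breadth-first traversal: visit each level fully before going deeper
    order, queue = [], deque([0])
    while queue:
        i = queue.popleft()
        order.append(values[i])
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(values):
                queue.append(child)
    return order

result = []
dfs(0, result)
print(result) # [1, 3, 5, 7, 9, 4, 12, 11, 2, 6, 14, 13, 8, 10]
print(bfs())  # [1, 3, 2, 5, 4, 6, 8, 7, 9, 12, 11, 14, 13, 10]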
3. How to combine crawling strategies in practice
1. Generally speaking, important web pages are located close to the entry site;
2. Breadth-first crawling makes it easy for multiple crawlers to cooperate in parallel;
3. A practical crawling strategy therefore combines depth and breadth: crawl breadth-first, but limit the maximum depth, as in the sketch below.
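A minimal sketch of this combined strategy (the crawl function, the naive regular-expression link extraction, and the seed URL are all illustrative): a FIFO queue yields breadth-first order, and a depth counter enforces the maximum depth:
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_url, max_depth=2):
    seen = {seed_url} # URLs already queued, to avoid revisiting
    queue = deque([(seed_url, 0)]) # FIFO queue of (url, depth) -> breadth-first order
    while queue:
        url, depth = queue.popleft()
        html = requests.get(url).text # fetch the page (a real crawler would parse it here)
        if depth >= max_depth:
            continue # depth limit reached: do not expand this page's links
        for link in re.findall(r'href="(.*?)"', html):
            absolute = urljoin(url, link) # resolve relative links against the current page
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

crawl("http://example.webscraping.com", max_depth=2)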
Summary: a general crawler therefore works in a loop: start from seed URLs, fetch each page, extract its data and its links, and feed the new links back into the crawl queue according to the chosen strategy.