Classification of crawlers by usage scenario:
1. General crawler: an important component of search-engine crawling systems; it crawls the data of entire web pages.
2. Focused crawler: built on top of a general crawler; it crawls specific local content within a page.
3. Incremental crawler: monitors a website for data updates, so that only new or changed data is crawled.
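The incremental idea can be sketched by remembering a fingerprint of each page already seen, so unchanged pages are skipped on the next crawl. A minimal sketch (not from the original text), assuming a content hash is an acceptable change detector:

```python
import hashlib

# Fingerprints of page contents already crawled
seen = set()

def is_new(content: str) -> bool:
    """Return True if this page content has not been crawled before."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

print(is_new("<html>page v1</html>"))  # True: first time this content is seen
print(is_new("<html>page v1</html>"))  # False: unchanged, an incremental crawler skips it
```

In practice the fingerprint store would be persisted (e.g. to a database) between crawl runs.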
Anti-crawl mechanism:
A portal website can prevent crawlers from scraping its data by adopting corresponding strategies or technical means.
Anti-anti-crawl strategy:
A crawler can break through a portal website's anti-crawl mechanisms by adopting relevant strategies or technical means of its own, and so obtain the website's data.
robots.txt protocol (a "gentlemen's agreement"):
It specifies which data on a website may and may not be crawled by crawlers.
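Python's standard library can check this agreement for you. A minimal sketch using `urllib.robotparser` with a made-up robots.txt (the rules and URLs here are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything is allowed except /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```

A real crawler would instead call `rp.set_url("https://site/robots.txt")` followed by `rp.read()` to fetch the live rules.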
The HTTP protocol
Concept: a form of data interaction between a server and a client.
Common request headers:
- User-Agent: the identity of the request carrier
- Connection: whether to close or keep the connection alive after the request completes
Common response headers:
- Content-Type: the data type of the server's response to the client
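These headers can be set explicitly when building a request. A small sketch with the requests module (the browser User-Agent string here is a stand-in example; setting one is a common way to pass simple User-Agent-based anti-crawl checks):

```python
import requests

# Example headers; the UA string is illustrative, not a requirement
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Connection": "close",  # ask the server to drop the connection after responding
}

# Build and prepare the request without sending it, to inspect what would go out
req = requests.Request("GET", "https://www.sogou.com/", headers=headers)
prepared = req.prepare()

print(prepared.headers["User-Agent"])
print(prepared.headers["Connection"])
```

After an actual `requests.get(...)` call, the response's Content-Type is available as `response.headers["Content-Type"]`.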
The HTTPS protocol:
Secure hypertext transfer protocol. Encryption schemes:
- symmetric-key encryption
- asymmetric-key encryption
- certificate-based key encryption
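To illustrate the symmetric-key idea only (the same key encrypts and decrypts), here is a deliberately toy XOR cipher. This is NOT secure and is not what HTTPS uses; real HTTPS employs ciphers such as AES, with certificates and asymmetric keys used to exchange the session key:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each byte with the repeating key.
    Applying it twice with the same key recovers the original data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"secret"                                # shared by both sides
ciphertext = xor_cipher(b"hello server", key)  # encrypt
plaintext = xor_cipher(ciphertext, key)        # decrypt with the same key

print(plaintext)  # b'hello server'
```

With asymmetric-key encryption, by contrast, the encryption key (public) and decryption key (private) differ, which is what lets a server publish its public key in a certificate.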
Case: crawling the Sogou home page
This case uses the requests module. Install it from the terminal with: pip install requests
The case code is as follows:

```python
import requests

if __name__ == "__main__":
    url = "https://www.sogou.com/"
    # Send the GET request and grab the response body as text
    response = requests.get(url=url)
    page_text = response.text
    print(page_text)
    # Persist the page locally so it can be opened in a browser
    with open("./sougou.html", "w", encoding="utf-8") as fp:
        fp.write(page_text)
    print("End of crawl")
```
After executing the above code, a sougou.html file is generated locally; open it in a browser of your choice to view the crawled page.