Classification of crawlers by usage scenario:
1. General crawler: an important component of search-engine crawling systems; it crawls the data of entire web pages.
2. Focused crawler: built on top of a general crawler; it crawls specific local content within a page.
3. Incremental crawler: monitors a website for data updates, so that only new or changed data is crawled.
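The incremental idea can be sketched by remembering a fingerprint of each page already seen, so unchanged pages are skipped on the next crawl. A minimal sketch (not from the original text), assuming a content hash is an acceptable change detector:

```python
import hashlib

# Fingerprints of page contents already crawled
seen = set()

def is_new(content: str) -> bool:
    """Return True if this page content has not been crawled before."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

print(is_new("<html>page v1</html>"))  # True: first time this content is seen
print(is_new("<html>page v1</html>"))  # False: unchanged, an incremental crawler skips it
```

In practice the fingerprint store would be persisted (e.g. to a database) between crawl runs.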
Anti-crawl mechanism:
A portal website can prevent crawlers from scraping its data by adopting corresponding strategies or technical means.
Anti-anti-crawl strategy:
A crawler can break through a portal website's anti-crawl mechanisms by adopting relevant strategies or technical means of its own, and so obtain the website's data.
robots.txt protocol (a "gentlemen's agreement"):
It specifies which data on a website may and may not be crawled by crawlers.
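Python's standard library can check this agreement for you. A minimal sketch using `urllib.robotparser` with a made-up robots.txt (the rules and URLs here are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything is allowed except /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```

A real crawler would instead call `rp.set_url("https://site/robots.txt")` followed by `rp.read()` to fetch the live rules.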
The HTTP protocol
Concept: a form of data interaction between a server and a client.
Common request headers:
- User-Agent: the identity of the request carrier
- Connection: whether to close or keep the connection alive after the request completes
Common response headers:
- Content-Type: the data type of the server's response to the client
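These headers can be set explicitly when building a request. A small sketch with the requests module (the browser User-Agent string here is a stand-in example; setting one is a common way to pass simple User-Agent-based anti-crawl checks):

```python
import requests

# Example headers; the UA string is illustrative, not a requirement
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Connection": "close",  # ask the server to drop the connection after responding
}

# Build and prepare the request without sending it, to inspect what would go out
req = requests.Request("GET", "https://www.sogou.com/", headers=headers)
prepared = req.prepare()

print(prepared.headers["User-Agent"])
print(prepared.headers["Connection"])
```

After an actual `requests.get(...)` call, the response's Content-Type is available as `response.headers["Content-Type"]`.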
The HTTPS protocol:
Secure hypertext transfer protocol. Encryption schemes:
- symmetric-key encryption
- asymmetric-key encryption
- certificate-based key encryption
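To illustrate the symmetric-key idea only (the same key encrypts and decrypts), here is a deliberately toy XOR cipher. This is NOT secure and is not what HTTPS uses; real HTTPS employs ciphers such as AES, with certificates and asymmetric keys used to exchange the session key:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each byte with the repeating key.
    Applying it twice with the same key recovers the original data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"secret"                                # shared by both sides
ciphertext = xor_cipher(b"hello server", key)  # encrypt
plaintext = xor_cipher(ciphertext, key)        # decrypt with the same key

print(plaintext)  # b'hello server'
```

With asymmetric-key encryption, by contrast, the encryption key (public) and decryption key (private) differ, which is what lets a server publish its public key in a certificate.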
Case: crawling the Sogou home page
This case uses the requests module. Install it from the terminal with: pip install requests
The case code is as follows:

```python
import requests

if __name__ == "__main__":
    url = "https://www.sogou.com/"
    # Send the GET request and grab the response body as text
    response = requests.get(url=url)
    page_text = response.text
    print(page_text)
    # Persist the page locally so it can be opened in a browser
    with open("./sougou.html", "w", encoding="utf-8") as fp:
        fp.write(page_text)
    print("End of crawl")
```
After executing the above code, a sougou.html file is generated locally; open it in a browser of your choice to view the crawled page.