Web crawler


D: a program that simulates a client (usually a browser) to send network requests and receive the corresponding responses. It automatically captures Internet information according to certain rules.


E: data collection, software testing, and other uses (ticket snatching, vote inflation, SMS bombing, etc.)


Classification:


1. General crawler: usually refers to crawlers run by search engines and large Web service providers


2. Focused crawler: a crawler that obtains the required data from specific websites; it can be divided into three categories (see the sketch after this list)


1. Accumulative crawler: crawls data continuously from start to finish, de-duplicating URLs along the way

2. Incremental crawler: for pages already downloaded, crawls only the data that has changed, or crawls only newly generated pages

3. Deep web crawler: crawls pages that cannot be reached through static links because they are hidden behind search forms and are only available after the user logs in or submits data
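A minimal sketch of the accumulative style (an illustration only, with a hypothetical start_url and crude link extraction; the point is the de-duplication with a set of seen URLs):

import re
import requests

def accumulative_crawl(start_url, max_pages=50):
    seen = set()          # URLs already fetched (de-duplication)
    queue = [start_url]   # URLs waiting to be fetched
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = requests.get(url, timeout=10)
        # crude link extraction; a real crawler would parse the HTML properly
        for link in re.findall(r'href="(http[^"]+)"', response.text):
            if link not in seen:
                queue.append(link)
    return seen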



Limitations of generic search engines:


① Around 90% of the returned page content is useless to the user

② Keyword-based engines have difficulty understanding the semantics of queries (especially in Chinese)

③ Limited information possession and coverage

④ Only keyword search is supported well; pictures, databases, audio, and other multimedia are poorly supported

⑤ Poor community and personalization features: differences in users' region, gender, and age are not taken into account

⑥ Poor at crawling dynamic web pages




Robots protocol (robots.txt)


D: the "Robots Exclusion Protocol" (also called the "Web Crawler Exclusion Protocol"), through which a website tells search engines which of its pages can and cannot be crawled.


Location: The robots.txt file should be placed in the root directory of the site
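A minimal check of robots.txt from Python, using the standard library's urllib.robotparser (the site URL is just a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # robots.txt lives in the site root
rp.read()
# True if the given user agent is allowed to fetch the URL
print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))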




Focused crawler principle


[Diagram: focused crawler workflow (image not included)]



The difference between HTTP and HTTPS


Compared with HTTP, HTTPS adds an SSL (Secure Sockets Layer) between the application layer (HTTP) and the transport layer (TCP). HTTPS is more secure than HTTP, but its performance is worse.






A GET request differs from a POST request:

Role: GET is used to request a resource; POST is used to submit entity data

Size of transmitted data: GET is small (limited by the maximum URL length); POST can be large

Location of entity data: GET carries it in the URL (query string); POST carries it in the request body
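A minimal sketch with requests showing where the data ends up; the httpbin.org echo endpoints are used only for illustration:

import requests

# GET: the parameters are appended to the URL as a query string
r_get = requests.get("https://httpbin.org/get", params={"q": "web crawler"})
print(r_get.request.url)    # ...?q=web+crawler
print(r_get.request.body)   # None: this GET request has no body

# POST: the data travels in the request body, not in the URL
r_post = requests.post("https://httpbin.org/post", data={"username": "test"})
print(r_post.request.url)   # no query string
print(r_post.request.body)  # username=test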







Common HTTP headers:

1. Host (host and port number)

2. Connection (connection type, e.g. keep-alive)

3. Upgrade-Insecure-Requests (asks the server to respond over HTTPS)

4. User-Agent (user agent): lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on

5. Accept (types of content the client can accept)

6. Referer (the page the request came from, i.e. which page jumped to the current one): can be used for anti-crawling and hotlink protection

7. Accept-Encoding (encoding types supported by the browser, mainly gzip, compress, deflate)

8. Cookie: used for state retention and user identification

9. X-Requested-With: XMLHttpRequest (indicates an Ajax asynchronous request)
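A minimal sketch of sending some of these headers with requests (the User-Agent string and URLs are placeholders):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # pretend to be a browser
    "Referer": "https://example.com/",                          # where we supposedly came from
    "Accept-Encoding": "gzip, deflate",
}
response = requests.get("https://example.com/page", headers=headers)
print(response.request.headers)  # the headers that were actually sent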





Requests


D: an open-source HTTP library written in Python that wraps urllib and improves our efficiency


E: Sends a network request and obtains the returned response data


① Installation and import: pip3 install requests, then import requests


② Basic usage: response = requests.get(url)


③ Common attributes of the response:

1. response.status_code: gets the status code

2. response.headers: gets the response headers

3. response.request: gets the request corresponding to this response

4. response.text: gets the response body as str, with the encoding guessed by requests, so it may be garbled

5. response.content: gets the response body as bytes, which can be decoded with decode() to get a string
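Putting these attributes together in a minimal sketch (any reachable URL will do):

import requests

response = requests.get("https://example.com")
print(response.status_code)                     # e.g. 200
print(response.headers["Content-Type"])         # response headers behave like a dict
print(response.text[:200])                      # str, decoded with the guessed encoding
print(response.content.decode("utf-8")[:200])   # bytes decoded explicitly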



Note:

① Request with headers: simulate a browser so that the content obtained is consistent with what the browser sees

response = requests.get(url, headers=headers), where headers is a dictionary; at minimum a User-Agent is required

② Request with parameters: pass params=params (also a dictionary) when sending the request

③ Send a POST request: pass data=data (also a dictionary) when sending the request
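A minimal sketch combining the three notes above; the URLs and form fields are placeholders:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# ① headers and ② params: query parameters are appended to the URL
resp = requests.get("https://example.com/search",
                    headers=headers, params={"wd": "python"})

# ③ data: form fields sent in the body of a POST request
resp = requests.post("https://example.com/login",
                     headers=headers, data={"username": "test", "password": "secret"})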



IP proxy


D: a proxy server, whose main function is to obtain network information on behalf of network users; figuratively speaking, it is a transfer station for network information


E: to simulate requests from multiple users rather than one, preventing anti-crawling blocks triggered by too many requests from the same IP; and to keep our real IP address from being leaked and traced back to us


U: Note: the keys of the proxies dictionary must match the actual type of the proxy (http or https)


proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

requests.get(url, proxies=proxies)



Cookie and Session


Benefits of cookies and sessions: many sites require login (or certain permissions) before data can be requested


Disadvantages of cookies and sessions: a set of cookies and a session usually correspond to one user; requesting too fast or too many times makes it easy for the server to identify the client as a crawler, and the account may then be banned.


Usage suggestions:


1. Avoid using cookies when they are not needed


2. To obtain pages that require login, we must send requests with cookies; to keep the account safe, we should reduce the data-collection speed as much as possible


U: Cookie usage: either add the Cookie to the request headers, or build a dictionary of cookies and pass cookies=cookies when sending the request


session = requests.Session()

session.get(url)
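A minimal sketch of keeping the login state with a Session; the login URL and form fields are hypothetical:

import requests

session = requests.Session()

# log in once; the session stores the cookies returned by the server
session.post("https://example.com/login",
             data={"username": "test", "password": "secret"})

# later requests automatically carry those cookies
response = session.get("https://example.com/member/data")
print(response.status_code)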




Requests tips


1. Conversion between CookieJar and dictionary


requests.utils.dict_from_cookiejar converts a CookieJar object into cookies in dictionary format

requests.utils.cookiejar_from_dict converts cookies in dictionary format into a CookieJar object
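A minimal sketch of the conversion, assuming a URL that sets cookies:

import requests

response = requests.get("https://example.com")
# response.cookies is a CookieJar; turn it into a plain dict
cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
# and back into a CookieJar object
cookiejar = requests.utils.cookiejar_from_dict(cookies_dict)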

2. SSL certificate verification


Usage scenario: Requests verifies SSL certificates for HTTPS requests, just like a web browser. SSL verification is enabled by default, and Requests will throw an SSLError if certificate verification fails

How to use: response = requests.get("https://www.12306.cn/mormhweb/", verify=False)

3. Set the timeout


Usage scenario: if some sites or proxies respond slowly, which seriously reduces efficiency, you can set a timeout

response = requests.get(url, timeout=10)
