1. Introduction to crawlers
1.1 What is a crawler
A web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
1.2 Basic workflow of a crawler
- Send a request: use an HTTP library to send a request to the target site. A request consists of request headers and a request body.
- Get the response content: if the server responds normally, you receive a response. The response body may contain HTML, JSON, images, video, and so on.
- Parse the content: for HTML, use regular expressions or third-party parsing libraries such as BeautifulSoup and PyQuery; for JSON, use the json module; for binary data, write it to a file opened in b (binary) mode.
- Save the data: to a database or a file.
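The four steps above can be sketched with only the standard library. This is a minimal sketch rather than a full crawler: the helper names (fetch, parse_title, save_records, crawl) are assumptions made for illustration, and the title regex only copes with simple pages.

```python
import json
import re
from urllib import request


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Steps 1 and 2: send a request and return the raw response body."""
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def parse_title(html: str):
    """Step 3: pull the <title> text out of HTML with a regular expression."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save_records(records: list, path: str) -> None:
    """Step 4: save the parsed data to a file as JSON."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(records, fp, ensure_ascii=False, indent=2)


def crawl(url: str, path: str) -> None:
    """Run all four steps for a single page."""
    html = fetch(url).decode("utf-8", errors="replace")
    save_records([{"url": url, "title": parse_title(html)}], path)
```

Calling crawl with a page URL and an output path of your choice runs all four steps. Only fetch touches the network, so the parsing and saving helpers can be tried on their own.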
2. ProxyHandler handler
How a proxy works: instead of requesting the destination website directly, our code sends the request to the proxy server. The proxy server then requests the destination website, receives its data, and forwards that data back to our code.
httpbin.org: this site echoes back details of the HTTP requests it receives, which makes it easy to inspect request parameters.
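Using a proxy in code:
- Create a handler with urllib.request.ProxyHandler, passing in the proxy. The proxy is a dictionary whose key is the scheme the proxy server accepts (usually http or https) and whose value is the proxy's address in host:port form.
- Use the handler created in the previous step, together with request.build_opener, to create an opener.
- Use the opener created in the previous step and call its open function to send the request.
Example code is as follows: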
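    from urllib import request

    url = 'http://httpbin.org/ip'

    # 1. Pass the proxy to ProxyHandler to build a handler.
    #    The proxy address is missing from the source; "PROXY_HOST:PORT" is a
    #    placeholder that must be replaced with a real proxy before running.
    handler = request.ProxyHandler({"http": "PROXY_HOST:PORT"})
    # 2. Use the handler to build an opener
    opener = request.build_opener(handler)
    # 3. Use the opener to send the request
    resp = opener.open(url)
    print(resp.read())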
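One way to check that a proxy is really in use is to compare the IP address that httpbin.org reports with and without the proxy. The sketch below wraps the three steps in small helpers. The function names and the proxy address 127.0.0.1:8080 are hypothetical, so substitute a proxy you actually control.

```python
import json
from urllib import request


def build_proxy_opener(proxy_address: str, scheme: str = "http"):
    """Create an opener that routes `scheme` requests through the proxy."""
    handler = request.ProxyHandler({scheme: proxy_address})
    return request.build_opener(handler)


def parse_origin(body: bytes) -> str:
    """Extract the caller's IP from the JSON returned by httpbin.org/ip."""
    return json.loads(body.decode("utf-8"))["origin"]


def origin_ip(opener, url: str = "http://httpbin.org/ip") -> str:
    """Ask httpbin.org which IP address the request appeared to come from."""
    with opener.open(url, timeout=10) as resp:
        return parse_origin(resp.read())


# Hypothetical proxy address, for illustration only.
proxy_opener = build_proxy_opener("127.0.0.1:8080")
# An empty dictionary disables proxies, giving a direct connection to compare against.
direct_opener = request.build_opener(request.ProxyHandler({}))
```

If the proxy works, origin_ip(proxy_opener) should report the proxy's address, while origin_ip(direct_opener) reports your own.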
3. The requests library
Send a GET request:
To send a GET request, call requests.get. In general, call the method named after the type of request you want to send.
response = requests.get("https://www.baidu.com/")
Some properties of Response:
    import requests

    kw = {'wd': 'China'}
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

    # params accepts a dictionary of query parameters, which is URL-encoded automatically
    response = requests.get("http://www.baidu.com/s", params=kw, headers=headers)

    # View the response body as decoded text (str)
    print(response.text)
    # View the response body as raw bytes
    print(response.content)
    # View the full URL
    print(response.url)
    # View the character encoding of the response
    print(response.encoding)
    # View the status code
    print(response.status_code)
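To see what requests does with params without touching the network, a request can be prepared but not sent. This is a small sketch that assumes requests is installed; the query value "web crawler" is made up for illustration.

```python
import requests

kw = {"wd": "web crawler"}
prepared = requests.Request("GET", "http://www.baidu.com/s", params=kw).prepare()

# The dictionary has been URL-encoded and appended to the URL as a query string.
print(prepared.url)
```

The printed URL is the same value that response.url would show after a real requests.get call with these arguments.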
response.text and response.content:
- response.content: the data exactly as it was received from the network, without any decoding, so its type is bytes. Strings stored on disk or sent over the network are always bytes.
- response.text: a str, produced when the requests library decodes response.content. Decoding needs an encoding, and requests guesses which one to use. The guess is sometimes wrong, which produces garbled text. In that case, decode manually by calling response.content.decode() with the correct encoding (for example 'utf-8').
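The effect of a wrong guess can be reproduced without a network request, since response.content is just a bytes object. In this sketch the bytes are built locally to stand in for response.content from a server that uses GBK; that server and its encoding are assumptions for illustration.

```python
# Bytes as they might arrive from a server that encodes its pages as GBK.
raw = "中国".encode("gbk")

# Decoding with the wrong encoding produces garbled text.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the server actually used recovers the original text.
right = raw.decode("gbk")

print(repr(wrong), repr(right))
```

With a real response, the equivalent of the last step is response.content.decode('gbk'). Setting response.encoding = 'gbk' before reading response.text has the same effect.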
Send a POST request:
To send a POST request, call the requests.post method. If the server returns JSON data, you can then call response.json() to convert the JSON string into a dictionary or list.
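The following sketch shows, without sending anything, what a dictionary passed as data turns into and what response.json() amounts to. It assumes requests is installed. The form fields are invented, and sample_body is a hand-written stand-in that only resembles the kind of JSON an echo service such as httpbin.org/post returns.

```python
import json

import requests

form = {"name": "crawler", "lang": "python"}
prepared = requests.Request("POST", "http://httpbin.org/post", data=form).prepare()

# A dictionary passed as data is form-encoded into the request body.
print(prepared.body)
print(prepared.headers["Content-Type"])

# response.json() parses a JSON body much as json.loads parses this string.
sample_body = '{"form": {"name": "crawler", "lang": "python"}}'
parsed = json.loads(sample_body)
print(parsed["form"]["name"])
```

In a real call, requests.post("http://httpbin.org/post", data=form).json() would perform both halves in one go.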
Using a proxy:
Pass the proxies parameter to the request method (for example requests.get).
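A minimal sketch of the proxies parameter, assuming requests is installed. The helper names and the address 127.0.0.1:8080 are hypothetical, and get_via_proxy only works when a proxy is actually listening at the address you pass.

```python
import requests


def make_proxies(address: str) -> dict:
    """Build the proxies mapping: URL scheme -> proxy URL."""
    return {"http": f"http://{address}", "https": f"http://{address}"}


def get_via_proxy(url: str, address: str) -> str:
    """Fetch `url` through the proxy and return the decoded body."""
    response = requests.get(url, proxies=make_proxies(address), timeout=10)
    return response.text


# Hypothetical local proxy, for illustration only.
proxies = make_proxies("127.0.0.1:8080")
print(proxies)
```

Calling get_via_proxy("http://httpbin.org/ip", "127.0.0.1:8080") against a working proxy should return the proxy's IP address rather than your own.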
Processing cookies:
If you want to share cookies across multiple requests, use a session. Example code is as follows:
    import requests

    url = "http://www.renren.com/PLogin.do"
    data = {"email": "970138074@qq.com", "password": "pythonspider"}
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}

    # The session stores the cookies set by the login response
    session = requests.Session()
    session.post(url, data=data, headers=headers)

    # Later requests made through the same session send those cookies back
    response = session.get('http://www.renren.com/880151247/profile')

    with open('renren.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)
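How a session carries cookies can be observed offline by preparing requests without sending them. This sketch assumes requests is installed. The cookie name and value are invented and stand in for whatever the server would set after a successful login.

```python
import requests

session = requests.Session()
# Stand-in for a cookie the server would set after a successful login.
session.cookies.set("sessionid", "abc123")

# A request prepared through the session carries the stored cookie.
prepared = session.prepare_request(requests.Request("GET", "http://httpbin.org/cookies"))
print(prepared.headers.get("Cookie"))

# A request prepared outside the session has no Cookie header.
bare = requests.Request("GET", "http://httpbin.org/cookies").prepare()
print(bare.headers.get("Cookie"))
```

This is why the profile page request in the example above has to go through session.get rather than requests.get: only the session remembers the login cookies.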