urllib
urllib is an HTTP client library built into Python 3. It needs no extra installation, which makes it a good entry point for writing crawlers
urllib contains four modules:
- request: the module for sending requests and handling responses
- parse: the module for processing URLs
- error: the exception handling module
- robotparser: the module for parsing robots.txt
Below we go through each module in urllib in turn; for reasons of space, this article only covers the most commonly used parts of each one
For full details, refer to the official documentation: docs.python.org/3.7/library…
Use urllib
Before we start, here is a test site: www.httpbin.org/
It echoes back information about the requests it receives, which makes it ideal for practice
All right, let’s get started!
1. Request module
The request module is the most important module in urllib. It is used to send requests and receive responses
(1) The urlopen method
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The urlopen method is undoubtedly the most commonly used method in the request module. Its common parameters are described as follows:
- url: required; a string specifying the URL of the target site
- data: optional; form data to send. It defaults to None, in which case urllib sends the request with the GET method. When data is given, urllib sends the request with the POST method, carrying the form data (as bytes) in the request body
- timeout: optional; how many seconds to wait for a response. If no response arrives within that time, an exception is raised (see the sketch below)
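For example, a minimal sketch of the timeout parameter (the limit is deliberately tiny so the request almost certainly fails; exception handling is covered in more detail in the Error module section):
>>> import urllib.request
>>> import urllib.error
>>> try:
        # a deliberately tiny timeout so the request times out
        response = urllib.request.urlopen('http://www.httpbin.org/get', timeout=0.001)
    except urllib.error.URLError as e:
        print('Time out')
# Time out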
This method always returns an HTTPResponse object. Its commonly used methods are as follows:
- geturl(): returns the URL
- getcode(): returns the status code
- getheaders(): returns all response headers
- getheader(header): returns the specified response header
- read(): returns the response body as bytes, which is usually converted to str with decode('utf-8')
Example 1: Send a GET request
>>> import urllib.request
>>> url = 'http://www.httpbin.org/get'
>>> response = urllib.request.urlopen(url)
>>> type(response)
# <class 'http.client.HTTPResponse'>
>>> response.geturl()
# 'http://www.httpbin.org/get'
>>> response.getcode()
# 200
>>> response.getheaders()
# [('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Date', 'Sat, 11 Aug 2018 01:39:14 GMT'), ('Content-Type', 'application/json'), ('Content-Length', '243'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true'), ('Via', '1.1 vegur')]
>>> response.getheader('Connection')
# 'close'
>>> print(response.read().decode('utf-8'))
# {
#   "args": {},
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Host": "www.httpbin.org",
#     "User-Agent": "Python-urllib/3.7"
#   },
#   "origin": "183.6.159.80, 183.6.159.80",
#   "url": "https://www.httpbin.org/get"
# }
Example 2: Send a POST request
Two helpers are needed here:
- urllib.parse.urlencode(): converts a dict into a URL-encoded query string (str)
- encode('utf-8'): converts the str data into bytes
>>> import urllib.request
>>> import urllib.parse
>>> url = 'http://www.httpbin.org/post'
>>> params = {
        'from': 'AUTO',
        'to': 'AUTO'
}
>>> data = urllib.parse.urlencode(params).encode('utf-8')
>>> response = urllib.request.urlopen(url=url,data=data)
>>> html = response.read().decode('utf-8')
>>> print(html)
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {            <- this is the form data we set
#     "from": "AUTO",
#     "to": "AUTO"
#   },
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Connection": "close",
#     "Content-Length": "17",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "www.httpbin.org",
#     "User-Agent": "Python-urllib/3.6"
#   },
#   "json": null,
#   "origin": "116.16.107.180",
#   "url": "http://www.httpbin.org/post"
# }
(2) Request object
In fact, we can also pass a Request object to the urllib.request.urlopen() method
Why would we need a Request object? Because the parameters above give us no way to set request headers, and request headers are very important for a crawler
Many sites first examine the User-Agent field of the request header to decide whether the request was initiated by a web crawler
By changing the User-Agent field of the request header, we can disguise the crawler as a browser and easily get past this check
Lists of common User-Agent strings are easy to find online
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
The parameters are described as follows:
- url: specifies the URL of the target site
- data: form data submitted when sending a POST request; defaults to None
- headers: request headers attached to the request; defaults to {}
- origin_req_host: the host name or IP address of the original requester; defaults to None
- unverifiable: whether the request is unverifiable; defaults to False
- method: specifies the request method; defaults to None
>>> import urllib.request
>>> url = 'http://www.httpbin.org/headers'
>>> headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
>>> req = urllib.request.Request(url, headers=headers, method='GET')
>>> response = urllib.request.urlopen(req)
>>> html = response.read().decode('utf-8')
>>> print(html)
# {
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Connection": "close",
#     "Host": "www.httpbin.org",
#     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"    <- the User-Agent we set
#   }
# }
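As a quick sketch that is not part of the original examples, the data and headers parameters of Request can be combined to send a POST request with a custom User-Agent; the form encoding is the same as in Example 2 above:
>>> import urllib.request
>>> import urllib.parse
>>> url = 'http://www.httpbin.org/post'
>>> headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
>>> data = urllib.parse.urlencode({'from': 'AUTO', 'to': 'AUTO'}).encode('utf-8')
>>> req = urllib.request.Request(url, data=data, headers=headers, method='POST')
>>> response = urllib.request.urlopen(req)
>>> # the response body echoes both the form data and the custom User-Agent we set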
(3) Use cookies
What is a Cookie?
Cookie refers to data stored on a user’s local terminal by some websites for identifying the user’s identity and session tracking
1) Getting a cookie
>>> import urllib.request
>>> import http.cookiejar
>>> cookie = http.cookiejar.CookieJar()
>>> cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
>>> opener = urllib.request.build_opener(cookie_handler)
>>> response = opener.open('http://www.baidu.com')
>>> for item in cookie:
        print(item.name + '=' + item.value)
# BAIDUID=486AED46E7F22C0A7A16D9FE6E627846:FG=1
# BDRCVFR[RbWYmTxDkZm]=mk3SLVN4HKm
# BIDUPSID=486AED46E7F22C0A7A16D9FE6E627846
# H_PS_PSSID=1464_21106_26920
# PSTM=1533990197
# BDSVRTM=0
# BD_HOME=0
# delPer=0
2) Saving and loading cookies
>>> import urllib.request
>>> import http.cookiejar
>>> # Save the cookie to a file
>>> cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
>>> cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
>>> opener = urllib.request.build_opener(cookie_handler)
>>> response = opener.open('http://www.baidu.com')
>>> cookie.save(ignore_discard=True,ignore_expires=True)
>>> # Read the cookie from the file and attach it to the request
>>> cookie = http.cookiejar.MozillaCookieJar()
>>> cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
>>> cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
>>> opener = urllib.request.build_opener(cookie_handler)
>>> response = opener.open('http://www.baidu.com')
>>> # The request is now sent with the cookie read from the file
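Note that opener.open() also accepts a Request object, so cookies can be combined with the custom request headers from section (2). A minimal sketch, assuming the cookie.txt file saved above exists:
>>> import urllib.request
>>> import http.cookiejar
>>> cookie = http.cookiejar.MozillaCookieJar()
>>> cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
>>> opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
>>> headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
>>> req = urllib.request.Request('http://www.baidu.com', headers=headers)
>>> response = opener.open(req)  # sent with both the saved cookies and the custom User-Agent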
(4) Use proxies
For some websites, if the same IP sends a large number of requests in a short time, the site may decide that the IP belongs to a crawler and ban it
So it is necessary to use random proxy IPs to get around this check. Here are a few websites that list free proxy IPs:
- Xici proxy: www.xicidaili.com/nn/
- Yun proxy: www.ip3366.net/free/
- Kuai proxy: www.kuaidaili.com/free/
Note that free proxy IPs are generally unstable and expire frequently, so it is best to write a crawler to maintain your own pool of working proxies
>>> import urllib.request
>>> import random
>>> ip_list = [
        {'http': '61.135.217.7:80'},
        {'http': '182.88.161.204:8123'}
    ]
>>> proxy_handler = urllib.request.ProxyHandler(random.choice(ip_list))
>>> opener = urllib.request.build_opener(proxy_handler)
>>> response = opener.open('http://www.httpbin.org/ip')
>>> print(response.read().decode('utf-8'))
# {
#   "origin": "61.135.217.7"
# }
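If you want every subsequent urlopen() call to go through the proxy, instead of calling opener.open() each time, urllib also provides install_opener() to register the opener globally. A minimal sketch (the proxy address is just an example and may no longer work):
>>> import urllib.request
>>> proxy_handler = urllib.request.ProxyHandler({'http': '61.135.217.7:80'})
>>> opener = urllib.request.build_opener(proxy_handler)
>>> urllib.request.install_opener(opener)  # registers the opener for urlopen()
>>> response = urllib.request.urlopen('http://www.httpbin.org/ip')  # now routed through the proxy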
2. Parse module
The parse module is used to process URLs
(1) The quote method
If you use Chinese characters in a URL, the program fails with a rather cryptic error
>>> import urllib.request
>>> url = 'https://www.baidu.com/s?wd=爬虫'
>>> response = urllib.request.urlopen(url)
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)
This is where the quote method comes in handy: it percent-encodes special characters so that the URL above becomes a valid URL
>>> import urllib.parse
>>> url = 'https://www.baidu.com/s?wd=' + urllib.parse.quote('爬虫')
>>> url
# 'https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB'
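The encoded URL can now be opened without the UnicodeEncodeError. As a small additional sketch, quote() also takes a safe parameter listing characters that should be left untouched (by default only '/'), which is useful when quoting a whole URL rather than a single parameter value:
>>> import urllib.parse
>>> urllib.parse.quote('https://www.baidu.com/s?wd=爬虫', safe=':/?=&')
# 'https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB'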
(2) The urlencode method
The urlencode method already appeared earlier in this article; here it is again on its own
The urlencode method converts a dict into a URL-encoded query string (str)
>>> import urllib.parse
>>> params = {
        'from': 'AUTO',
        'to': 'AUTO'
}
>>> data = urllib.parse.urlencode(params)
>>> data
# 'from=AUTO&to=AUTO'
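urlencode() is just as useful for GET requests: the encoded string can simply be appended to the URL after a '?'. A small sketch using the test site:
>>> import urllib.request
>>> import urllib.parse
>>> params = {'page': 1, 'kw': 'urllib'}
>>> url = 'http://www.httpbin.org/get?' + urllib.parse.urlencode(params)
>>> url
# 'http://www.httpbin.org/get?page=1&kw=urllib'
>>> response = urllib.request.urlopen(url)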
(3) The urlparse method
The urlparse method parses a URL and returns a ParseResult object
This object behaves like a six-item tuple whose fields correspond to the general structure of a URL: scheme://netloc/path;params?query#fragment
>>> import urllib.parse
>>> url = 'http://www.example.com:80/python.html?page=1&kw=urllib'
>>> url_after = urllib.parse.urlparse(url)
>>> url_after
# ParseResult(scheme='http', netloc='www.example.com:80', path='/python.html', params='', query='page=1&kw=urllib', fragment='')
>>> url_after.port
# 80
3. Error module
The Error module is generally used for exception handling and contains two important classes: URLError and HTTPError
HTTPError is a subclass of URLError, so HTTPError must be handled first.
>>> import urllib.request
>>> import urllib.error
>>> import socket
>>> try:
        response = urllib.request.urlopen('http://www.httpbin.org/get', timeout=0.1)
    except urllib.error.HTTPError as e:
        print("Error Code: ", e.code)
        print("Error Reason: ", e.reason)
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('Time out')
    else:
        print('Request Successfully')
# Time out
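The HTTPError branch above is easier to see with a URL that deliberately returns an error status. httpbin exposes such endpoints under /status/, so (as a hedged sketch) requesting a 404 looks like this:
>>> import urllib.request
>>> import urllib.error
>>> try:
        response = urllib.request.urlopen('http://www.httpbin.org/status/404')
    except urllib.error.HTTPError as e:
        print("Error Code: ", e.code)
# Error Code:  404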