urllib
urllib is an HTTP client library built into Python 3. It needs no extra installation, which makes it a good entry point for writing crawlers
urllib contains four modules:
- request: the module for sending requests and handling responses
- parse: the module for processing URLs
- error: the exception handling module
- robotparser: the module for parsing robots.txt
Below we go through each module in urllib in turn; for reasons of space, this article only covers the most commonly used parts of each one
For full details, refer to the official documentation: docs.python.org/3.7/library…
Use urllib
Before we start, here is a test site: www.httpbin.org/
It echoes back information about the requests it receives, which makes it ideal for practice
All right, let’s get started!
1. Request module
The request module is the most important module in urllib. It is used to send requests and receive responses
(1) The urlopen method
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The urlopen method is undoubtedly the most commonly used method in the request module. Its common parameters are described as follows:
- url: required; a string specifying the URL of the target site
- data: optional; form data to send. It defaults to None, in which case urllib sends the request with the GET method. When data is given, urllib sends the request with the POST method, carrying the form data (as bytes) in the request body
- timeout: optional; how many seconds to wait for a response. If no response arrives within that time, an exception is raised (see the sketch below)
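For example, a minimal sketch of the timeout parameter (the limit is deliberately tiny so the request almost certainly fails; exception handling is covered in more detail in the Error module section):
>>> import urllib.request
>>> import urllib.error
>>> try:
        # a deliberately tiny timeout so the request times out
        response = urllib.request.urlopen('http://www.httpbin.org/get', timeout=0.001)
    except urllib.error.URLError as e:
        print('Time out')
# Time out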
This method always returns an HTTPResponse object. Its commonly used methods are as follows:
- geturl(): returns the URL
- getcode(): returns the status code
- getheaders(): returns all response headers
- getheader(header): returns the specified response header
- read(): returns the response body as bytes, which is usually converted to str with decode('utf-8')
Example 1: Send a GET request
>>> import urllib.request
>>> url = 'http://www.httpbin.org/get'
>>> response = urllib.request.urlopen(url)
>>> type(response)
# <class 'http.client.HTTPResponse'>
>>> response.geturl()
# 'http://www.httpbin.org/get'
>>> response.getcode()
# 200
>>> response.getheaders()
# [('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Date', 'Sat, 11 Aug 2018 01:39:14 GMT'), ('Content-Type', 'application/json'), ('Content-Length', '243'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true'), ('Via', '1.1 vegur')]
>>> response.getheader('Connection')
# 'close'
>>> print(response.read().decode('utf-8'))
# {
#   "args": {},
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Host": "www.httpbin.org",
#     "User-Agent": "Python-urllib/3.7"
#   },
#   "origin": "183.6.159.80, 183.6.159.80",
#   "url": "https://www.httpbin.org/get"
# }
Example 2: Send a POST request
Two helpers are needed here:
- urllib.parse.urlencode(): converts a dict into a URL-encoded query string (str)
- encode('utf-8'): converts the str data into bytes
>>> import urllib.request
>>> import urllib.parse
>>> url = 'http://www.httpbin.org/post'
>>> params = {
        'from': 'AUTO',
        'to': 'AUTO'
}
>>> data = urllib.parse.urlencode(params).encode('utf-8')
>>> response = urllib.request.urlopen(url=url,data=data)
>>> html = response.read().decode('utf-8')
>>> print(html)
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {            <- this is the form data we set
#     "from": "AUTO",
#     "to": "AUTO"
#   },
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Connection": "close",
#     "Content-Length": "17",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "www.httpbin.org",
#     "User-Agent": "Python-urllib/3.6"
#   },
#   "json": null,
#   "origin": "116.16.107.180",
#   "url": "http://www.httpbin.org/post"
# }
(2) Request object
In fact, we can also pass a Request object to the urllib.request.urlopen() method
Why would we need a Request object? Because the parameters above give us no way to set request headers, and request headers are very important for a crawler
Many sites first examine the User-Agent field of the request header to decide whether the request was initiated by a web crawler
By changing the User-Agent field of the request header, we can disguise the crawler as a browser and easily get past this check
Lists of common User-Agent strings are easy to find online
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
The parameters are described as follows:
- url: specifies the URL of the target site
- data: form data submitted when sending a POST request; defaults to None
- headers: request headers attached to the request; defaults to {}
- origin_req_host: the host name or IP address of the original requester; defaults to None
- unverifiable: whether the request is unverifiable; defaults to False
- method: specifies the request method; defaults to None
>>> import urllib.request
>>> url = 'http://www.httpbin.org/headers'
>>> headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
>>> req = urllib.request.Request(url, headers=headers, method='GET')
>>> response = urllib.request.urlopen(req)
>>> html = response.read().decode('utf-8')
>>> print(html)
# {
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Connection": "close",
#     "Host": "www.httpbin.org",
#     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"    <- the User-Agent we set
#   }
# }
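As a quick sketch that is not part of the original examples, the data and headers parameters of Request can be combined to send a POST request with a custom User-Agent; the form encoding is the same as in Example 2 above:
>>> import urllib.request
>>> import urllib.parse
>>> url = 'http://www.httpbin.org/post'
>>> headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
>>> data = urllib.parse.urlencode({'from': 'AUTO', 'to': 'AUTO'}).encode('utf-8')
>>> req = urllib.request.Request(url, data=data, headers=headers, method='POST')
>>> response = urllib.request.urlopen(req)
>>> # the response body echoes both the form data and the custom User-Agent we set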
(3) Use cookies
What is a Cookie?
Cookie refers to data stored on a user’s local terminal by some websites for identifying the user’s identity and session tracking
1) Getting a cookie
>>> import urllib.request
>>> import http.cookiejar
>>> cookie = http.cookiejar.CookieJar()
>>> cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
>>> opener = urllib.request.build_opener(cookie_handler)
>>> response = opener.open('http://www.baidu.com')
>>> for item in cookie:
        print(item.name + '=' + item.value)
# BAIDUID=486AED46E7F22C0A7A16D9FE6E627846:FG=1
# BDRCVFR[RbWYmTxDkZm]=mk3SLVN4HKm
# BIDUPSID=486AED46E7F22C0A7A16D9FE6E627846
# H_PS_PSSID=1464_21106_26920
# PSTM=1533990197
# BDSVRTM=0
# BD_HOME=0
# delPer=0
2) Saving and loading cookies
>>> import urllib.request
>>> import http.cookiejar
>>> # Save the cookie to a file
>>> cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
>>> cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
>>> opener = urllib.request.build_opener(cookie_handler)
>>> response = opener.open('http://www.baidu.com')
>>> cookie.save(ignore_discard=True,ignore_expires=True)
>>> # Read the cookie from the file and attach it to the request
>>> cookie = http.cookiejar.MozillaCookieJar()
>>> cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
>>> cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
>>> opener = urllib.request.build_opener(cookie_handler)
>>> response = opener.open('http://www.baidu.com')
>>> # The request is now sent with the cookie read from the file
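Note that opener.open() also accepts a Request object, so cookies can be combined with the custom request headers from section (2). A minimal sketch, assuming the cookie.txt file saved above exists:
>>> import urllib.request
>>> import http.cookiejar
>>> cookie = http.cookiejar.MozillaCookieJar()
>>> cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
>>> opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
>>> headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
>>> req = urllib.request.Request('http://www.baidu.com', headers=headers)
>>> response = opener.open(req)  # sent with both the saved cookies and the custom User-Agent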
(4) Use proxies
For some websites, if the same IP sends a large number of requests in a short time, the site may decide that the IP belongs to a crawler and ban it
So it is necessary to use random proxy IPs to get around this check. Here are a few websites that list free proxy IPs:
- Xici proxy: www.xicidaili.com/nn/
- Yun proxy: www.ip3366.net/free/
- Kuai proxy: www.kuaidaili.com/free/
Note that free proxy IPs are generally unstable and expire frequently, so it is best to write a crawler to maintain your own pool of working proxies
>>> import urllib.request
>>> import random
>>> ip_list = [
        {'http': '61.135.217.7:80'},
        {'http': '182.88.161.204:8123'}
    ]
>>> proxy_handler = urllib.request.ProxyHandler(random.choice(ip_list))
>>> opener = urllib.request.build_opener(proxy_handler)
>>> response = opener.open('http://www.httpbin.org/ip')
>>> print(response.read().decode('utf-8'))
# {
#   "origin": "61.135.217.7"
# }
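If you want every subsequent urlopen() call to go through the proxy, instead of calling opener.open() each time, urllib also provides install_opener() to register the opener globally. A minimal sketch (the proxy address is just an example and may no longer work):
>>> import urllib.request
>>> proxy_handler = urllib.request.ProxyHandler({'http': '61.135.217.7:80'})
>>> opener = urllib.request.build_opener(proxy_handler)
>>> urllib.request.install_opener(opener)  # registers the opener for urlopen()
>>> response = urllib.request.urlopen('http://www.httpbin.org/ip')  # now routed through the proxy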
2. Parse module
The parse module is used to process URLs
(1) The quote method
If you use Chinese characters in a URL, the program fails with a rather cryptic error
>>> import urllib.request
>>> url = 'https://www.baidu.com/s?wd=爬虫'
>>> response = urllib.request.urlopen(url)
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)
This is where the quote method comes in handy: it percent-encodes special characters so that the URL above becomes a valid URL
>>> import urllib.parse
>>> url = 'https://www.baidu.com/s?wd=' + urllib.parse.quote('爬虫')
>>> url
# 'https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB'
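The encoded URL can now be opened without the UnicodeEncodeError. As a small additional sketch, quote() also takes a safe parameter listing characters that should be left untouched (by default only '/'), which is useful when quoting a whole URL rather than a single parameter value:
>>> import urllib.parse
>>> urllib.parse.quote('https://www.baidu.com/s?wd=爬虫', safe=':/?=&')
# 'https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB'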
(2) The urlencode method
The urlencode method already appeared earlier in this article; here it is again on its own
The urlencode method converts a dict into a URL-encoded query string (str)
>>> import urllib.parse
>>> params = {
        'from': 'AUTO',
        'to': 'AUTO'
}
>>> data = urllib.parse.urlencode(params)
>>> data
# 'from=AUTO&to=AUTO'
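urlencode() is just as useful for GET requests: the encoded string can simply be appended to the URL after a '?'. A small sketch using the test site:
>>> import urllib.request
>>> import urllib.parse
>>> params = {'page': 1, 'kw': 'urllib'}
>>> url = 'http://www.httpbin.org/get?' + urllib.parse.urlencode(params)
>>> url
# 'http://www.httpbin.org/get?page=1&kw=urllib'
>>> response = urllib.request.urlopen(url)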
(3) The urlparse method
The urlparse method parses a URL and returns a ParseResult object
This object behaves like a six-item tuple whose fields correspond to the general structure of a URL: scheme://netloc/path;params?query#fragment
>>> import urllib.parse
>>> url = 'http://www.example.com:80/python.html?page=1&kw=urllib'
>>> url_after = urllib.parse.urlparse(url)
>>> url_after
# ParseResult(scheme='http', netloc='www.example.com:80', path='/python.html', params='', query='page=1&kw=urllib', fragment='')
>>> url_after.port
# 80
3. Error module
The Error module is generally used for exception handling and contains two important classes: URLError and HTTPError
HTTPError is a subclass of URLError, so HTTPError must be handled first.
>>> import urllib.request
>>> import urllib.error
>>> import socket
>>> try:
        response = urllib.request.urlopen('http://www.httpbin.org/get', timeout=0.1)
    except urllib.error.HTTPError as e:
        print("Error Code: ", e.code)
        print("Error Reason: ", e.reason)
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('Time out')
    else:
        print('Request Successfully')
# Time out
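The HTTPError branch above is easier to see with a URL that deliberately returns an error status. httpbin exposes such endpoints under /status/, so (as a hedged sketch) requesting a 404 looks like this:
>>> import urllib.request
>>> import urllib.error
>>> try:
        response = urllib.request.urlopen('http://www.httpbin.org/status/404')
    except urllib.error.HTTPError as e:
        print("Error Code: ", e.code)
# Error Code:  404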