Small knowledge, big challenge! This article is taking part in the "Essentials for Programmers" creative activity.
This article has also joined the "Project Diggin" event to win the creative gift package and compete for the creative incentive prize.
Preface
In the previous installment on computer networking, we covered the basics: the TCP/IP four-layer protocol model, URLs (uniform resource locators), the HTTP/HTTPS protocols, and so on.
Python is known as a high-level language that supports not only GUI programming but also network programming.
Python's standard library provides the urllib module for working with URLs; the third-party requests library offers a similar, higher-level way to talk to servers.
So in this installment of our Python crawler series, we'll learn about urllib, the most commonly used built-in module for this job. Let's go~
1. Overview of urllib module
⌨️ urllib is Python's built-in HTTP request library; it requires no installation.
urllib is very powerful and supports many features, including:
- Requesting web pages
- Handling responses
- Proxy and cookie settings
- Exception handling
- URL parsing
The urllib package contains four modules:

- request: the most basic HTTP request module, used to simulate sending a request
- error: the exception handling module, which catches exceptions when errors occur
- parse: a utility module that provides many URL handling methods, such as splitting, parsing, and merging
- robotparser: mainly used to parse a site's robots.txt file and determine which pages may be crawled
🔔 Important note
- In Python 2, you can import urllib directly
- In Python 3, you must import the specific submodule you need
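A quick sketch of the Python 3 import style: the functions live on the submodules, so each submodule must be imported explicitly.

```python
# Python 3: import each submodule you need explicitly
import urllib.request
import urllib.parse
import urllib.error
import urllib.robotparser

# The functions live on the submodules, not on the top-level package
print(urllib.parse.quote("hello world"))  # hello%20world
```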
2. urllib.request module and related methods
🎉 The urllib.request module contains methods for opening HTTP URLs in various situations, such as authentication, redirection, and cookie handling.
urllib.request is the module used most often when sending requests to a server.
The urllib.request module provides the following common methods for requesting URLs:
Method | Purpose |
---|---|
urllib.request.urlopen(url, data) | Opens the URL, which can be a string or a Request object |
urllib.request.build_opener() | Chains handlers together and returns an OpenerDirector |
urllib.request.Request(url, data) | Constructs a request, optionally sending data to the server |
urllib.request.HTTPBasicAuthHandler() | Handles basic authentication with the remote host |
urllib.request.ProxyHandler(proxies) | Routes requests through the specified proxy |
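As a sketch of how Request fits together with urllib.parse, here is a hedged example that builds a request carrying form data and a custom User-Agent header. The httpbin.org URL is only a placeholder; the request is constructed but never actually sent here.

```python
import urllib.parse
import urllib.request

# Encode form data as bytes; supplying data makes the request a POST
data = urllib.parse.urlencode({"name": "juejin"}).encode("utf-8")

req = urllib.request.Request(
    "https://httpbin.org/post",          # placeholder URL, not contacted here
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(req.get_method())                  # POST, because data was supplied
print(req.get_header("User-agent"))      # Mozilla/5.0
```

To actually send it, you would pass the object to urllib.request.urlopen(req).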
📣 Key notes
Typical usage of the urllib.request module:

- Import the module: import urllib.request
- Open the URL and read the page content with a with statement

```python
import urllib.request

# Open the URL and read the first 300 bytes of the response
req = urllib.request.urlopen("https://juejin.cn/user/211521683863847/posts")
with req as f:
    print(f.read(300).decode('utf-8'))
```
3. urllib.parse module and related methods
🎉 The urllib.parse module parses a URL string into its parts: scheme, network location, path, and so on.
It can also combine those parts back into a URL string, turning a relative URL into a complete absolute URL.
The urllib.parse module defines two groups of functions: URL parsing and URL transcoding.
urllib.parse provides the following methods for parsing URLs:
Method | Purpose |
---|---|
urllib.parse.urlparse(urlstring) | Parses a URL into its components: scheme://netloc/path;parameters?query#fragment |
urllib.parse.parse_qs(qs) | Parses a query string (application/x-www-form-urlencoded) into a dictionary |
urllib.parse.urlunparse(parts) | Builds a URL string from a tuple of components |
urllib.parse.urlsplit(urlstring) | Parses a URL and returns a tuple whose parts can be accessed by index |
urllib.parse.urljoin(base, url) | Combines a base URL and a relative URL into a complete URL |
urllib.parse.urldefrag(url) | Returns the URL without its fragment identifier |
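A minimal sketch of the parsing helpers above (the example.com URLs are placeholders):

```python
from urllib.parse import urljoin, urldefrag, parse_qs

# urljoin resolves a relative reference against a base URL
print(urljoin("https://example.com/docs/", "api.html"))  # https://example.com/docs/api.html

# parse_qs turns a query string into a dict mapping names to lists of values
print(parse_qs("page=2&sort=new&sort=hot"))  # {'page': ['2'], 'sort': ['new', 'hot']}

# urldefrag splits off the fragment identifier
url, frag = urldefrag("https://example.com/doc#section1")
print(url, frag)  # https://example.com/doc section1
```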
urllib.parse provides the following methods for transcoding URLs:

Method | Purpose |
---|---|
urllib.parse.quote(string) | Replaces special characters in string with %xx escape sequences |
urllib.parse.urlencode(query) | Converts a mapping or sequence of two-element tuples into a URL-encoded ASCII string |
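A quick sketch of the transcoding pair: quote/unquote for single strings, urlencode for mappings.

```python
from urllib.parse import quote, unquote, urlencode

# quote escapes characters that are unsafe in URLs ("/" is kept by default)
print(quote("hello world/"))           # hello%20world/

# unquote reverses the escaping
print(unquote("hello%20world"))        # hello world

# urlencode builds a query string from a mapping
print(urlencode({"q": "python urllib", "page": 1}))  # q=python+urllib&page=1
```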
urllib.parse.urlparse returns a result with these attributes:

Attribute | Description |
---|---|
scheme | URL scheme (protocol) |
netloc | Network location |
path | Hierarchical path |
query | Query component |
fragment | Fragment identifier |
username | User name |
password | Password |
hostname | Host name (lowercase) |
port | Port number |
📣 Example

```python
import urllib.parse

# Parse a URL into its components
pa = urllib.parse.urlparse("https://juejin.cn/user/211521683863847/posts")
print("Scheme:", pa.scheme)
print("Network location:", pa.netloc)
print("Hierarchical path:", pa.path)
```
4. urllib.error module and related methods
🎉 The urllib.error module handles the exceptions raised by urllib.request.
urllib.error mainly provides two exception classes: HTTPError and URLError.

Class | Purpose |
---|---|
urllib.error.HTTPError | Raised for HTTP-level errors (carries a status code) |
urllib.error.URLError | Raised when the request fails, e.g. for network or URL problems |
📣 Important note
- HTTPError is a subclass of URLError and handles HTTP error responses
- URLError is a subclass of OSError and carries a reason attribute describing why the exception was raised
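Because HTTPError is a subclass of URLError, catch it first. A minimal sketch (the .invalid hostname is reserved and never resolves, so this call reliably triggers URLError):

```python
import urllib.request
import urllib.error

def fetch_status(url):
    """Return the HTTP status, or a short description of the failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # Catch HTTPError before URLError: it is the more specific subclass
        return f"HTTP error: {e.code}"
    except urllib.error.URLError as e:
        return f"URL error: {e.reason}"

print(fetch_status("http://nonexistent.invalid/"))
```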
5. urllib.robotparser module and related methods
🎉 The urllib.robotparser module parses a site's robots.txt file to determine whether specific URLs may be crawled.

Class | Purpose |
---|---|
urllib.robotparser.RobotFileParser(url) | Provides methods to read and parse robots.txt and answer questions about URLs |
Method | Purpose |
---|---|
set_url(url) | Sets the URL pointing to the robots.txt file |
read() | Reads the robots.txt file and feeds it to the parser |
parse(lines) | Parses the given lines of a robots.txt file |
can_fetch(useragent, url) | Returns True if useragent is allowed to fetch url under the parsed robots.txt rules |
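A self-contained sketch using parse() on robots.txt rules supplied as text, so no network access is needed (the example.com URLs are placeholders):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed rules directly instead of fetching them with set_url()/read()
rp.parse("User-agent: *\nDisallow: /private/".splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```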
Conclusion
urllib.request, urllib.parse, urllib.error, and urllib.robotparser are the four modules provided by the urllib library.
That's all for this issue. Likes and comments from you big shots are welcome ღ(´･ᴗ･`)♡, see you next time~ 💖