
Preface

In the last installment on computer networking, we covered the fundamentals: the TCP/IP four-layer protocol model, the URL resource locator, the HTTP/HTTPS protocols, and so on.

Python is a high-level language that supports not only GUI programming but also network programming.

Python's built-in urllib module provides functions for working with URLs; the popular third-party requests library offers a higher-level interface for talking to servers.

So in this installment we'll study urllib, the module Python crawlers use most often. Let's go~

1. Overview of urllib module

⌨️ urllib is Python's built-in HTTP request library; no installation is required.

urllib is very powerful and supports the following features:

  1. Sending web page requests
  2. Handling responses
  3. Proxy and cookie settings
  4. Exception handling
  5. URL parsing

The urllib library contains four modules:

  • request: the most basic HTTP request module, used to simulate sending requests

  • error: the exception-handling module, which catches exceptions when errors occur

  • parse: a utility module providing many URL-handling methods, such as splitting, parsing, and joining

  • robotparser: mainly used to parse a site's robots.txt file and determine which pages may be crawled

🔔 Important note

  • In Python 2, you could simply import urllib
  • In Python 3, you must import the specific submodule, for example import urllib.request
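A quick check of the Python 3 behaviour: importing a submodule explicitly also makes it reachable through the urllib package.

```python
# Python 3: import the submodules you need explicitly
import urllib.request
import urllib.parse

# Both submodules are now available under the urllib package
print(hasattr(urllib, "request"), hasattr(urllib, "parse"))
```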

2. urllib.request module related methods

The 🎉 urllib.request module contains methods for opening HTTP URLs in various situations, such as authentication, redirection, and cookie handling.

The urllib.request module is the one you will use most when sending requests to a server.

It provides the following common methods for requesting URLs:

  • urllib.request.urlopen(url, data=None): opens the URL, which may be a string or a Request object
  • urllib.request.build_opener(handlers): chains the given handlers together and returns an OpenerDirector
  • urllib.request.Request(url, data=None): builds a request object; data is sent to the server as the request body
  • urllib.request.HTTPBasicAuthHandler(): handles basic authentication with the remote host
  • urllib.request.ProxyHandler(proxies): routes requests through the specified proxy
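As a sketch of how urlopen's companion class works, a Request object can be built and inspected without ever touching the network (the URL and header values here are just placeholders):

```python
import urllib.parse
import urllib.request

# Encode form data as bytes; supplying data turns the request into a POST
payload = urllib.parse.urlencode({"page": 1}).encode("ascii")
req = urllib.request.Request(
    "https://juejin.cn/user/211521683863847/posts",  # placeholder URL
    data=payload,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(req.get_method())  # POST, because data was supplied
print(req.full_url)
```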

📣 Key notes

The urllib.request module is commonly used as follows:

  1. Import the module: import urllib.request
  2. Open the URL with a with statement and read the page contents

import urllib.request

# urlopen returns a response object that can be used as a context manager
with urllib.request.urlopen("https://juejin.cn/user/211521683863847/posts") as f:
    print(f.read(300).decode('utf-8'))
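build_opener() and ProxyHandler can also be combined into a custom opener. The sketch below only constructs the opener; the proxy address is a made-up placeholder, and nothing is sent over the network:

```python
import urllib.request

# Chain a ProxyHandler into an OpenerDirector (hypothetical local proxy)
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib.request.build_opener(proxy)
opener.addheaders = [("User-agent", "my-crawler/0.1")]

# install_opener() would make this opener the default used by urlopen:
# urllib.request.install_opener(opener)
print(type(opener).__name__)
```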

3. urllib.parse module related methods

The 🎉 urllib.parse module parses a URL string into its components: scheme, network location, path, and so on.

It can also combine components back into a URL string, and turn a relative URL into a complete absolute URL.

The urllib.parse module defines two groups of functions: URL parsing and URL transcoding.

urllib.parse provides the following methods for parsing URLs:

  • urllib.parse.urlparse(urlstring): parses a URL into six components: scheme://netloc/path;parameters?query#fragment
  • urllib.parse.parse_qs(qs): parses a query string (application/x-www-form-urlencoded data) into a dictionary
  • urllib.parse.urlunparse(parts): constructs a URL from a six-component tuple
  • urllib.parse.urlsplit(urlstring): parses the URL into a five-component tuple whose parts can be read by index
  • urllib.parse.urljoin(base, url): combines base and url into a complete URL
  • urllib.parse.urldefrag(url): returns the URL without its fragment identifier, along with the fragment
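A quick sketch of three of these methods (the URLs and query values are illustrative):

```python
from urllib.parse import parse_qs, urljoin, urlunparse

# Join a relative URL onto a base
print(urljoin("https://juejin.cn/user/", "posts"))

# Rebuild a URL from its six components (scheme, netloc, path, params, query, fragment)
print(urlunparse(("https", "juejin.cn", "/posts", "", "sort=new", "")))

# Parse a query string into a dict mapping names to lists of values
print(parse_qs("sort=new&page=2"))
```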

urllib.parse provides the following methods for transcoding URLs:

  • urllib.parse.quote(string): replaces special characters in string with %xx escape sequences
  • urllib.parse.urlencode(query): converts a mapping or a sequence of two-element tuples into a percent-encoded ASCII query string
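A short sketch of transcoding (the sample strings are illustrative); unquote reverses what quote does:

```python
from urllib.parse import quote, unquote, urlencode

encoded = quote("python 爬虫")  # space and non-ASCII characters become %xx escapes
print(encoded)
print(unquote(encoded))         # round-trips back to the original string

# Build a query string from a dict
print(urlencode({"q": "urllib", "page": 2}))
```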

The result returned by urllib.parse.urlparse() exposes the following attributes:

  • scheme: URL scheme (protocol)
  • netloc: network location
  • path: hierarchical path
  • query: query component
  • fragment: fragment identifier
  • username: user name
  • password: password
  • hostname: host name (lowercase)
  • port: port number

📣 Example

import urllib.parse

pa = urllib.parse.urlparse("https://juejin.cn/user/211521683863847/posts")

print("Scheme:", pa.scheme)
print("Network location:", pa.netloc)
print("Hierarchical path:", pa.path)

4. urllib.error module related methods

The 🎉 urllib.error module defines the exception classes raised by urllib.request.

urllib.error mainly provides two classes, HTTPError and URLError:

  • urllib.error.HTTPError: raised when the server returns an HTTP error response (carries the status code)
  • urllib.error.URLError: raised when the request fails before any response arrives, e.g. for network problems

📣 Important note

  • HTTPError is a subclass of URLError and handles HTTP error responses
  • URLError is a subclass of OSError; its reason attribute describes why the exception was raised
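Because HTTPError is the subclass, it must be caught before URLError. A minimal sketch of the pattern, pointed at a local port that is assumed to be closed so it fails fast instead of hitting a real site:

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return response bytes, or a short description of the failure."""
    try:
        with urllib.request.urlopen(url, timeout=3) as f:
            return f.read(100)
    except urllib.error.HTTPError as e:   # subclass first: server replied with an error code
        return f"HTTP error {e.code}: {e.reason}"
    except urllib.error.URLError as e:    # no response at all: DNS failure, refused connection...
        return f"URL error: {e.reason}"

# Port 1 on localhost is assumed closed, so this returns a URLError message quickly
print(fetch("http://127.0.0.1:1/"))
```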

5. urllib.robotparser module related methods

🎉 urllib.robotparser parses a site's robots.txt file to determine whether specific URLs may be crawled

The module provides one class:

  • urllib.robotparser.RobotFileParser(url): provides methods to read and parse robots.txt and to answer questions about the given URL

Its main methods:

  • set_url(url): sets the URL pointing to the robots.txt file
  • read(): fetches the robots.txt file and feeds it to the parser
  • parse(lines): parses the given lines of a robots.txt file
  • can_fetch(useragent, url): returns True if useragent is allowed to fetch url under the parsed robots.txt rules
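parse() lets you try the rules without any network access; the robots.txt lines below are made up for illustration:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed rules directly instead of fetching robots.txt with set_url()/read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```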

Conclusion

urllib.request, urllib.parse, urllib.error, and urllib.robotparser are the four modules provided by the urllib library.

That's all for this installment. You're welcome to like and comment ღ(´･ᴗ･`), see you next time~💖