Small knowledge, big challenge! This article is taking part in the "Essentials for Programmers" creative activity.
This article has also joined the "Project Diggin" event to win the creative gift package and compete for the creative incentive prize.
Preface
In the previous installment on computer networking, we covered the basics: the TCP/IP four-layer protocol model, URLs (uniform resource locators), the HTTP/HTTPS protocols, and so on.
Python is known as a high-level language that supports not only GUI programming but also network programming.
Python's standard library provides the urllib module for working with URLs; the third-party requests library offers a similar, higher-level way to talk to servers.
So in this installment of our Python crawler series, we'll learn about urllib, the most commonly used built-in module for this job. Let's go~
1. Overview of urllib module
⌨️ urllib is Python's built-in HTTP request library; it requires no installation.
urllib is very powerful and supports many features, including:
- Requesting web pages
- Handling responses
- Proxy and cookie settings
- Exception handling
- URL parsing
The urllib package contains four modules:

- request: the most basic HTTP request module, used to simulate sending a request
- error: the exception handling module, which catches exceptions when errors occur
- parse: a utility module that provides many URL handling methods, such as splitting, parsing, and merging
- robotparser: mainly used to parse a site's robots.txt file and determine which pages may be crawled
🔔 Important note
- In Python 2, you can import urllib directly
- In Python 3, you must import the specific submodule you need
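A quick sketch of the Python 3 import style: the functions live on the submodules, so each submodule must be imported explicitly.

```python
# Python 3: import each submodule you need explicitly
import urllib.request
import urllib.parse
import urllib.error
import urllib.robotparser

# The functions live on the submodules, not on the top-level package
print(urllib.parse.quote("hello world"))  # hello%20world
```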
2. urllib.request module and related methods
🎉 The urllib.request module contains methods for opening HTTP URLs in various situations, such as authentication, redirection, and cookie handling.
urllib.request is the module used most often when sending requests to a server.
The urllib.request module provides the following common methods for requesting URLs:
Method | Purpose |
---|---|
urllib.request.urlopen(url, data) | Opens the URL, which can be a string or a Request object |
urllib.request.build_opener() | Chains handlers together and returns an OpenerDirector |
urllib.request.Request(url, data) | Constructs a request, optionally sending data to the server |
urllib.request.HTTPBasicAuthHandler() | Handles basic authentication with the remote host |
urllib.request.ProxyHandler(proxies) | Routes requests through the specified proxy |
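As a sketch of how Request fits together with urllib.parse, here is a hedged example that builds a request carrying form data and a custom User-Agent header. The httpbin.org URL is only a placeholder; the request is constructed but never actually sent here.

```python
import urllib.parse
import urllib.request

# Encode form data as bytes; supplying data makes the request a POST
data = urllib.parse.urlencode({"name": "juejin"}).encode("utf-8")

req = urllib.request.Request(
    "https://httpbin.org/post",          # placeholder URL, not contacted here
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(req.get_method())                  # POST, because data was supplied
print(req.get_header("User-agent"))      # Mozilla/5.0
```

To actually send it, you would pass the object to urllib.request.urlopen(req).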
📣 Key notes
Typical usage of the urllib.request module:

- Import the module: import urllib.request
- Open the URL and read the page content with a with statement

```python
import urllib.request

# Open the URL and read the first 300 bytes of the response
req = urllib.request.urlopen("https://juejin.cn/user/211521683863847/posts")
with req as f:
    print(f.read(300).decode('utf-8'))
```
3. urllib.parse module and related methods
🎉 The urllib.parse module parses a URL string into its parts: scheme, network location, path, and so on.
It can also combine those parts back into a URL string, turning a relative URL into a complete absolute URL.
The urllib.parse module defines two groups of functions: URL parsing and URL transcoding.
urllib.parse provides the following methods for parsing URLs:
Method | Purpose |
---|---|
urllib.parse.urlparse(urlstring) | Parses a URL into its components: scheme://netloc/path;parameters?query#fragment |
urllib.parse.parse_qs(qs) | Parses a query string (application/x-www-form-urlencoded) into a dictionary |
urllib.parse.urlunparse(parts) | Builds a URL string from a tuple of components |
urllib.parse.urlsplit(urlstring) | Parses a URL and returns a tuple whose parts can be accessed by index |
urllib.parse.urljoin(base, url) | Combines a base URL and a relative URL into a complete URL |
urllib.parse.urldefrag(url) | Returns the URL without its fragment identifier |
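A minimal sketch of the parsing helpers above (the example.com URLs are placeholders):

```python
from urllib.parse import urljoin, urldefrag, parse_qs

# urljoin resolves a relative reference against a base URL
print(urljoin("https://example.com/docs/", "api.html"))  # https://example.com/docs/api.html

# parse_qs turns a query string into a dict mapping names to lists of values
print(parse_qs("page=2&sort=new&sort=hot"))  # {'page': ['2'], 'sort': ['new', 'hot']}

# urldefrag splits off the fragment identifier
url, frag = urldefrag("https://example.com/doc#section1")
print(url, frag)  # https://example.com/doc section1
```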
urllib.parse provides the following methods for transcoding URLs:

Method | Purpose |
---|---|
urllib.parse.quote(string) | Replaces special characters in string with %xx escape sequences |
urllib.parse.urlencode(query) | Converts a mapping or sequence of two-element tuples into a URL-encoded ASCII string |
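A quick sketch of the transcoding pair: quote/unquote for single strings, urlencode for mappings.

```python
from urllib.parse import quote, unquote, urlencode

# quote escapes characters that are unsafe in URLs ("/" is kept by default)
print(quote("hello world/"))           # hello%20world/

# unquote reverses the escaping
print(unquote("hello%20world"))        # hello world

# urlencode builds a query string from a mapping
print(urlencode({"q": "python urllib", "page": 1}))  # q=python+urllib&page=1
```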
urllib.parse.urlparse returns a result with these attributes:

Attribute | Description |
---|---|
scheme | URL scheme (protocol) |
netloc | Network location |
path | Hierarchical path |
query | Query component |
fragment | Fragment identifier |
username | User name |
password | Password |
hostname | Host name (lowercase) |
port | Port number |
📣 Example

```python
import urllib.parse

# Parse a URL into its components
pa = urllib.parse.urlparse("https://juejin.cn/user/211521683863847/posts")
print("Scheme:", pa.scheme)
print("Network location:", pa.netloc)
print("Hierarchical path:", pa.path)
```
4. urllib.error module and related methods
🎉 The urllib.error module handles the exceptions raised by urllib.request.
urllib.error mainly provides two exception classes: HTTPError and URLError.

Class | Purpose |
---|---|
urllib.error.HTTPError | Raised for HTTP-level errors (carries a status code) |
urllib.error.URLError | Raised when the request fails, e.g. for network or URL problems |
📣 Important note
- HTTPError is a subclass of URLError and handles HTTP error responses
- URLError is a subclass of OSError and carries a reason attribute describing why the exception was raised
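Because HTTPError is a subclass of URLError, catch it first. A minimal sketch (the .invalid hostname is reserved and never resolves, so this call reliably triggers URLError):

```python
import urllib.request
import urllib.error

def fetch_status(url):
    """Return the HTTP status, or a short description of the failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # Catch HTTPError before URLError: it is the more specific subclass
        return f"HTTP error: {e.code}"
    except urllib.error.URLError as e:
        return f"URL error: {e.reason}"

print(fetch_status("http://nonexistent.invalid/"))
```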
5. urllib.robotparser module and related methods
🎉 The urllib.robotparser module parses a site's robots.txt file to determine whether specific URLs may be crawled.

Class | Purpose |
---|---|
urllib.robotparser.RobotFileParser(url) | Provides methods to read and parse robots.txt and answer questions about URLs |
Method | Purpose |
---|---|
set_url(url) | Sets the URL pointing to the robots.txt file |
read() | Reads the robots.txt file and feeds it to the parser |
parse(lines) | Parses the given lines of a robots.txt file |
can_fetch(useragent, url) | Returns True if useragent is allowed to fetch url under the parsed robots.txt rules |
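A self-contained sketch using parse() on robots.txt rules supplied as text, so no network access is needed (the example.com URLs are placeholders):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed rules directly instead of fetching them with set_url()/read()
rp.parse("User-agent: *\nDisallow: /private/".splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```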
Conclusion
urllib.request, urllib.parse, urllib.error, and urllib.robotparser are the four modules provided by the urllib library.
That's all for this issue. Likes and comments from you big shots are welcome ღ(´･ᴗ･`)♡, see you next time~ 💖