“This is the second day of my participation in the November Gwen Challenge. See details of the event: The last Gwen Challenge 2021”.

1. How to fetch a web page

Fetching a web page means requesting its content by URL. What we see in the browser is a nicely rendered page, but that is only the browser's interpretation; in essence the page is a piece of HTML, plus JS and CSS. If we compare a web page to a person, HTML is the skeleton, JS is the muscles, and CSS is the clothes. The most important content is in the HTML, so here is an example that grabs a web page:

from urllib.request import urlopen
 
response = urlopen("http://www.baidu.com")
print(response.read().decode())


The actual program is only two lines. Run it to see the result and get a feel for it.

Look, we have pulled down the source of this page. Pretty satisfying, isn't it?


2. Common methods

  • request.urlopen(url, data, timeout)

    • The first parameter, url, is the URL to open; the second, data, is the data to send when accessing the URL; the third, timeout, sets the timeout in seconds.

    • The second and third parameters are optional. data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.

    • The first parameter, url, is required. In this example we pass the Baidu URL. urlopen returns a response object, which holds the returned information.

  • response.read()

    • The read() method reads the entire response body and returns it as bytes.
  • response.getcode()

    • Returns the HTTP status code; 200 means the request succeeded.
  • response.geturl()

    • Returns the URL that actually served the data, which lets you detect redirects.
  • response.info()

    • Returns the HTTP headers of the server's response.
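
The four response methods listed above can be tried without touching a real site. The following is only a minimal sketch under stated assumptions: HelloHandler, inspect_response, and the throwaway server on 127.0.0.1 are hypothetical stand-ins invented for this illustration, not part of the original example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener, install_opener, urlopen


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = "<html><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response(url, timeout=5):
    """Open url and collect what the four response methods return."""
    with urlopen(url, timeout=timeout) as response:
        return {
            "body": response.read().decode("utf-8"),   # bytes -> str
            "code": response.getcode(),                 # HTTP status code
            "url": response.geturl(),                   # final URL
            "content_type": response.info().get("Content-Type"),
        }


# Ignore any system proxy so the request reaches the local demo server.
install_opener(build_opener(ProxyHandler({})))

# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = inspect_response("http://127.0.0.1:%d/" % server.server_port)
print(result)

server.shutdown()
server.server_close()
```

Replacing the local address with a real URL such as http://www.baidu.com exercises the same four calls against a live server.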

3. The Request object

The argument to urlopen can also be a request object, which is an instance of the Request class. It is constructed from the URL, data, and so on. For example, the two lines above can be rewritten like this:

from urllib.request import urlopen
from urllib.request import Request

request = Request("http://www.baidu.com")
response = urlopen(request)
print(response.read().decode())

The result is exactly the same, except that there is a Request object in between. This form is recommended because building a request often involves adding a lot of extra information. Build a request, let the server respond to it, and read the answer: the logic is clear.
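
To see what a Request object holds before anything is sent, it can be inspected directly. This is a small sketch that only constructs the object and performs no network access; the attributes shown are the ones urllib itself exposes.

```python
from urllib.request import Request

# Building a Request does not contact the server; it only records
# what should be sent once the object is handed to urlopen().
request = Request("http://www.baidu.com")

print(request.full_url)      # the URL passed to the constructor
print(request.host)          # host part parsed from the URL
print(request.get_method())  # GET, because no data was supplied
print(request.data)          # None until a request body is attached
```

Passing this object to urlopen(request) sends it exactly as described.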


4. GET request

Most of what is transferred to the browser (HTML, images, JS, CSS, …) is requested with the GET method. It is the primary way of obtaining data.

For example, a search on www.baidu.com.

The parameters of a GET request appear in the URL. If they contain Chinese characters, they must be percent-encoded first, which we can do with:

  • urllib.parse.urlencode()

  • urllib.parse.quote()
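
The two functions above can be sketched as follows. The parameter names wd and pn and the /s search path are assumptions chosen for illustration; they are not taken from the text above.

```python
from urllib.parse import quote, urlencode

keyword = "中文"

# quote() percent-encodes a single string, using UTF-8 by default.
encoded_keyword = quote(keyword)
print(encoded_keyword)  # %E4%B8%AD%E6%96%87

# urlencode() turns a dict into a key=value&key=value query string
# and percent-encodes every value along the way.
query = urlencode({"wd": keyword, "pn": 10})
print(query)  # wd=%E4%B8%AD%E6%96%87&pn=10

# Assumed search path, for illustration only.
url = "http://www.baidu.com/s?" + query
print(url)
```

quote() fits when a single path segment or value needs encoding; urlencode() is the more convenient choice when a whole set of parameters is built from a dictionary.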

5. POST request

As mentioned, the Request object has a data parameter, and this is what POST uses: the data we want to send goes into data. It starts out as a dictionary of key/value pairs, which must be URL-encoded and converted to bytes before being passed in.
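
A minimal sketch of preparing POST data, assuming a hypothetical login form: the field names and the www.example.com endpoint are made up for illustration, and nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint, used only for illustration.
form = {"username": "alice", "password": "secret"}
url = "http://www.example.com/login"

# urlencode() returns a str, but data must be bytes, so encode it.
data = urlencode(form).encode("utf-8")

request = Request(url, data=data)
print(request.get_method())  # POST, because data is not None
print(request.data)          # b'username=alice&password=secret'

# urlopen(request) would now send the form as the request body.
```

Supplying data is what switches the request from GET to POST; no separate flag is needed.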

Meaning of common request and response headers:

Name                Meaning
Accept              The client tells the server which data types it supports
Accept-Charset      The client tells the server which character encodings it supports
Accept-Encoding     The client tells the server which compression formats it supports
Accept-Language     The client tells the server its preferred languages
Host                The client tells the server the name of the host it wants to access
If-Modified-Since   The client tells the server the time its cached copy of the resource was last modified
Referer             The client tells the server which page the request came from (commonly used to prevent hotlinking)
User-Agent          The client tells the server about its software environment
Cookie              The client sends its cookie data to the server
Refresh             The server tells the browser how often it should refresh
Content-Type        The server tells the browser the type of the returned data
Content-Language    The server tells the browser the language of the returned content
Server              The server tells the browser the type of server software
Content-Encoding    The server tells the browser the compression format of the returned data
Content-Length      The server tells the browser the length of the returned data
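
Request headers like those in the table can be attached when building a Request. The sketch below is an illustration only: the header values are made-up examples, and no request is sent.

```python
from urllib.request import Request

# Example header values, chosen for illustration.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "http://www.baidu.com/",
}

request = Request("http://www.baidu.com", headers=headers)

# Headers can also be added one at a time after construction.
request.add_header("Accept", "text/html")

# Request stores header names capitalized as "Xxx-yyy",
# so look them up in that form.
print(request.get_header("User-agent"))
print(request.header_items())
```

On the response side, response.info() gives access to headers such as Content-Type and Content-Length.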

6. Response codes

Response status code

The response status code consists of three digits. The first digit defines the category of the response and has five possible values. Common status codes:

Code range   Meaning
100 ~ 199    The server has received part of the request and needs the client to submit the rest to complete the process
200 ~ 299    The server received the request and completed the entire processing. Common: 200 (OK, request succeeded)
300 ~ 399    The client must take further action to complete the request, for example because the requested resource has moved to a new address. Common: 302 (the requested page has temporarily moved to a new URL), 307, and 304 (use the cached resource)
400 ~ 499    Client-side error. Common: 404 (the server cannot find the requested page), 403 (the server denied access, insufficient permission)
500 ~ 599    Server-side error. Common: 500 (the request was not completed because the server encountered an unexpected condition)
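
The rule that the first digit defines the category can be sketched in a few lines. describe_status and the CATEGORIES mapping are hypothetical helpers written for this illustration; the reason phrases come from the standard library's http.HTTPStatus.

```python
from http import HTTPStatus

# The first digit of the status code selects the category.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    category = CATEGORIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not one of the codes the standard library knows about.
        phrase = "non-standard"
    return f"{code} {phrase} ({category})"


for code in (200, 302, 304, 404, 500):
    print(describe_status(code))
```

With urllib, a 4xx or 5xx answer does not come back as a normal response: urlopen raises urllib.error.HTTPError, and the status is available on the exception's code attribute.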

7. Ajax requests to get data

Some web content is loaded with AJAX, and AJAX endpoints usually return JSON. Sending a POST or GET request directly to the AJAX address returns the JSON data.
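
Reading JSON from such an address could look like the sketch below. fetch_json is a hypothetical helper, and the data: URL is only a stand-in for a real AJAX endpoint so that the example runs without a network connection.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_json(url, timeout=5):
    """GET url and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Stand-in for an AJAX endpoint: a data: URL that carries a JSON document.
payload = {"total": 2, "items": ["a", "b"]}
ajax_url = "data:application/json," + quote(json.dumps(payload))

data = fetch_json(ajax_url)
print(data["total"])
print(data["items"])
```

Against a real site, ajax_url would be the address the page itself calls, which can usually be found in the browser's network panel.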

8. SSL certificate verification for HTTPS requests

Sites whose URLs start with https are now everywhere. urllib verifies SSL certificates for HTTPS requests just like a web browser does: if the site's certificate is signed by a trusted CA, the site can be accessed normally, for example www.baidu.com/

If certificate verification fails, or the operating system does not trust the server's certificate, the browser shows a warning and urllib raises an error. One example is the 12306 website, such as www.12306.cn/mormhweb/… (the 12306 certificate was self-issued and not signed by a CA). In that case verification can be skipped with an unverified context:

import ssl
import urllib.request

# request is the Request object built as in section 3
context = ssl._create_unverified_context()
response = urllib.request.urlopen(request, context=context)
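
What the unverified context actually changes can be checked by comparing it with the default one. This is only a sketch; note that ssl._create_unverified_context is a private helper of the ssl module, so relying on it is an assumption that may not hold in every Python version.

```python
import ssl

# Default context: certificates and host names are both checked.
verified = ssl.create_default_context()

# Unverified context: both checks are switched off.
unverified = ssl._create_unverified_context()

print(verified.verify_mode == ssl.CERT_REQUIRED, verified.check_hostname)
print(unverified.verify_mode == ssl.CERT_NONE, unverified.check_hostname)
```

Because the unverified context accepts any certificate, it is best limited to sites that are already known, such as the self-issued case described above.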