Preface

If you want to go from zero to an experienced Python web developer, there are several stages to get through. The first is the introductory stage, usually called Python basics, and it holds by far the largest group of Python learners. Some people stay at this stage for a long time without any real breakthrough. As in the fantasy novels we read, a breakthrough always feels close, yet it stays just out of reach. You follow what the teacher says in class, so why can't you come up with it yourself, and why can't you write it?

The second stage comes after you break through the first: advanced Python. This stage really splits learners into two groups, those who struggle and those who make progress, because you'll find yourself learning much more than just Python: network programming, multithreading, database techniques, data analysis, and frameworks (Flask, Django, Scrapy…). This stage paves the way for the last one, the project stage. An enterprise project requires more than Python syntax; the language is only the foundation. That is why the advanced stage weeds out a large number of people who wanted to do Python development.

Today we'll look at some Python network programming. The knowledge points listed here are not exhaustive, but you can use them as a starting point for deeper study. Are you ready to start pulling your hair out?

Network analysis

Stateless protocol

Let's start with HTTP, the Hypertext Transfer Protocol. When computers transfer data to each other, they must follow a protocol, and HTTP is the one used on the Web.

HTTP is a stateless protocol. What does that mean? Let's explain.

A stateless protocol does not store state. In simple terms, the protocol itself keeps no information about earlier requests and responses: each new request is handled on its own and produces a new response. The advantage is that the server can handle a large number of requests quickly.

But because HTTP is stateless, a problem comes up.

Say we've built a website in Python, and a user logs in and views their personal information. From that page the user then clicks through to other pages. Obviously, these later operations only work if the login state is kept. Since HTTP is a stateless protocol, how is this solved?

This is where cookies come in: they make it possible to maintain the login state. Once the browser holds a cookie, it sends it in the Cookie field of the request header on every request.

Being stateless lets HTTP reduce the server's CPU and memory consumption, and state is managed on the client side by writing cookie information into requests and responses. Which field does the server use to hand the cookie over? As the figure shows, it is sent in Set-Cookie, which also tells the client to save the cookie.

The value of Set-Cookie can be seen in the response header of the GET request.
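To make the cookie round trip concrete, here is a minimal sketch using the standard library's http.cookies module. The cookie name sessionid, its value, and the helper names make_set_cookie and read_cookie are assumptions made up for illustration; a real site would choose its own.

```python
from http.cookies import SimpleCookie


def make_set_cookie(session_id: str) -> str:
    """Build the value a server would put in a Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id          # hypothetical cookie name
    cookie["sessionid"]["path"] = "/"         # send it back for every path
    cookie["sessionid"]["httponly"] = True    # hide it from page scripts
    return cookie["sessionid"].OutputString()


def read_cookie(header_value: str) -> dict:
    """Parse the Cookie request header the browser sends back."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


set_cookie_value = make_set_cookie("abc123")
sent_back = read_cookie("sessionid=abc123; theme=dark")
print(set_cookie_value)
print(sent_back)
```

The server looks up the session by the value it gets back, which is how a stateless protocol still ends up with a "logged in" user.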

The meaning of the response header fields:

Server: the server software and its version

Content-Type: the media type of the response body

Transfer-Encoding: the transfer encoding of the message body

Connection: whether the connection is kept open (a long, persistent connection)

X-Powered-By: the programming language or framework and its version

Cache-Control: the caching policy (here, do not cache)

Date: the date and time when the message was created

Set-Cookie: the cookie the server is sending to you
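If you want to inspect these fields in code rather than in the browser, here is a small sketch. HTTP headers use the same "Name: value" layout as email headers, so the standard email package can parse them. The header values below are invented sample data, not a capture from a real site.

```python
from email import message_from_string

# Sample response headers (made-up values, for illustration only).
RAW_HEADERS = "\n".join([
    "Server: nginx",
    "Content-Type: text/html; charset=utf-8",
    "Transfer-Encoding: chunked",
    "Connection: keep-alive",
    "Cache-Control: no-cache",
    "Set-Cookie: sessionid=abc123; Path=/",
    "Set-Cookie: theme=dark; Path=/",
    "",
    "",
])

headers = message_from_string(RAW_HEADERS)

# Header names are case-insensitive.
print(headers["content-type"])
print(headers.get_content_type(), headers.get_content_charset())
# A response may carry several Set-Cookie headers, so ask for all of them.
print(headers.get_all("Set-Cookie"))
```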

GET and POST methods

We won't introduce the GET and POST methods at length; we'll just compare them. Besides these two there are also PUT, HEAD, and others.

  • Data characteristics: A GET request normally has no request body. Its parameters travel in the URL, and browsers and servers limit URL length in practice. A POST request carries its data in a request body, whose size the protocol itself does not limit (servers may still set their own cap).

  • Security: Passing data with GET is less safe than with POST, because GET parameters are visible in the address bar. Over plain HTTP, though, neither one is encrypted.
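The comparison above can be sketched with urllib from the standard library. Nothing is sent over the network here; we only build the two request objects and look at where the data ends up. The example.com URLs and the parameter names are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters become part of the URL (the query string).
get_req = Request("http://example.com/search?" + urlencode({"q": "python", "page": "1"}))

# POST: the parameters are encoded into the request body instead.
form = urlencode({"user": "alice", "password": "secret"}).encode("ascii")
post_req = Request("http://example.com/login", data=form)

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Note that urllib picks POST automatically as soon as a body is supplied, which matches the "GET has no body, POST has one" rule of thumb.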

The HEAD method

In practice the HEAD method is rarely used. It is similar to GET, except that the server returns only the response headers and no response body. It can be used to check whether a URL is valid and when a resource was last updated.
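Here is a small experiment you can run locally to see the difference, assuming your machine allows a server on 127.0.0.1. The handler class DemoHandler, the fetch helper, and the five-byte body are all made up for this sketch; it starts a throwaway server, sends one request, and shuts the server down again.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"hello"


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)      # GET: headers plus body

    def do_HEAD(self):
        self._send_headers()        # HEAD: same headers, no body

    def log_message(self, format, *args):
        pass                        # keep the demo output quiet


def fetch(method: str):
    """Send one request to a temporary local server and return the result."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request(method, "/")
        resp = conn.getresponse()
        result = (resp.status, resp.getheader("Content-Length"), resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


print(fetch("GET"))
print(fetch("HEAD"))
```

Both calls report the same status and Content-Length, but only GET actually receives the body bytes.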

Status code

When a client requests data, the server returns a status code that tells the client the result. From the status code, the user knows whether the server processed the request properly.

There are many status codes, each made up of three digits. The first digit gives the response class, and there are five classes.

1xx Informational: the request was received and is being processed

2xx Success: the request was processed normally

3xx Redirection: further action is needed to complete the request

4xx Client error: the server cannot process the request

5xx Server error: the server failed while processing the request
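Since the class is just the first digit, it is easy to work out in code. Below is a minimal sketch; the function name status_class is made up for illustration, while HTTPStatus comes from the standard library and supplies the official reason phrases.

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the response class from the first digit of the status code."""
    return CLASSES.get(code // 100, "unknown")


for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```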

The following figure shows the common status codes

Network security

Whether you build a web back end in Python or in some other language, you always run into one common problem on the network: the security of information in transit. Information sent over the Internet can always be eavesdropped on. So how do we keep the communication secure? The answer cannot be separated from HTTPS.

HTTPS is HTTP with an S (for Secure) added: the same protocol, but encrypted. If you are not familiar with HTTP, read the Python network section first to make this easier to follow.

Looking at HTTP's pros and cons, and at the problems it can cause during communication, helps explain why HTTPS is safer.

HTTPS

Advantages of HTTP:

  • 1. Fast transmission and very flexible.
  • 2. It is a stateless protocol, which reduces the server's CPU and memory consumption.
  • 3. In its basic form, each connection handles a single request and is then closed, which reduces resource consumption.

Disadvantages of HTTP:

  • 1. It is not well suited to transferring very large amounts of data.
  • 2. Communication is in plain text, which makes eavesdropping easy and security poor.
  • 3. The identity of the other party is not verified. This is why a Python crawler can pass itself off as a browser just by adding a User-Agent (UA) header.
  • 4. The integrity of the transmitted data cannot be verified, so data can be tampered with and nobody notices.

Eavesdropping of this kind is very simple: open the packet capture tool in the browser, capture the network traffic, and you can see a large amount of data.
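Point 3 is easy to see with urllib. This sketch only builds the request and never sends it. The BROWSER_UA string is a shortened, made-up browser identifier and example.com is a placeholder; the point is that the server has no way to check the claim.

```python
from urllib.request import Request, build_opener

# What urllib announces by default if we do nothing (added when the request is sent).
default_ua = dict(build_opener().addheaders)["User-agent"]

# A hypothetical browser-like identifier; any string is accepted.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

req = Request("http://example.com/", headers={"User-Agent": BROWSER_UA})

print(default_ua)
# urllib stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))
```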

HTTP Security

Let's look at HTTP security more closely. The Internet connects networks all over the world, and with the way the TCP/IP protocol family works, the content traveling on the line can be seen by anyone along the path. Even if you encrypt it, the encrypted traffic can still be captured; it just cannot be read. This is where encryption comes in.

Encryption prevents information from being stolen. There are two common approaches: the first is encrypting the communication channel, and the second is encrypting the content.

Communication encryption is what HTTPS does. Content encryption still uses plain HTTP but encrypts the content being transmitted; some sites encrypt the interface parameters as well, depending on requirements. Once the content is encrypted, it doesn't matter if someone captures the packet, since all they see is gibberish. This is how sensitive information is usually transmitted.

Communication encryption

How is communication encryption implemented? HTTP itself has no encryption mechanism, but it can be combined with SSL (Secure Sockets Layer, whose modern successor is TLS), which provides the encryption. SSL encrypts the HTTP traffic and establishes a secure channel over which HTTP can communicate safely.

HTTPS is more secure because SSL not only provides encryption but also authenticates the other party with a certificate. The certificate is issued by a trusted third-party authority and proves that the party really exists and is not forged.

With certificates in use, the server's identity can be authenticated, and the client's identity can be confirmed as well, which reduces information leakage. You might ask: what if the certificate is forged? Forging one is technically very difficult, so the odds are low and you can rest assured.
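In Python, the certificate check is handled by the ssl module. The sketch below does not connect anywhere; it only creates the default client-side context and shows which checks it has switched on. The describe helper is a made-up name for illustration.

```python
import ssl

# The default context is what most HTTPS clients in Python start from.
context = ssl.create_default_context()


def describe(ctx: ssl.SSLContext) -> dict:
    """Summarize the two checks that authenticate the server."""
    return {
        # The server must present a certificate signed by a trusted authority.
        "verify_certificate": ctx.verify_mode == ssl.CERT_REQUIRED,
        # The certificate must also match the host name we asked for.
        "check_hostname": ctx.check_hostname,
    }


print(describe(context))
```

Turning either check off is possible, but then you are back to not knowing who is on the other end.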

Encryption of content

The other approach is to encrypt the content itself. HTTP has no encryption mechanism of its own, so the client encrypts the content before sending the request. This raises a question: if the content is encrypted before transmission, how does the server decrypt it? The two sides have to agree on the encryption and decryption method in advance. A common choice is MD5, optionally with a salt added. Strictly speaking MD5 is a one-way hash rather than encryption, so it cannot be decrypted; it is used to sign the content so that the server can recompute the value and compare. Which form of content encryption to use should be decided by the company's business needs.

That covers the security comparison between HTTP and HTTPS, how they differ, and where those differences come from. To discuss the HTTPS certificate issuance process in depth, you first need to learn the common encryption and decryption methods used on the Internet, which will be explored in the next section.
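A salted MD5 signature of the request parameters might look like the sketch below. Everything specific here is an assumption for illustration: the salt value, the rule of sorting parameters by name, and the function names sign and verify. Real sites each define their own scheme, and MD5 is considered weak today, so treat this as a picture of the idea rather than a recommendation.

```python
import hashlib
import hmac

SALT = "demo-salt"  # hypothetical secret agreed on by client and server


def sign(params: dict, salt: str = SALT) -> str:
    """Hash the sorted parameters plus the salt into a 32-character signature."""
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return hashlib.md5((canonical + salt).encode("utf-8")).hexdigest()


def verify(params: dict, signature: str, salt: str = SALT) -> bool:
    """Server side: recompute the signature and compare it with the one received."""
    return hmac.compare_digest(sign(params, salt), signature)


params = {"user": "alice", "amount": "100"}
signature = sign(params)

print(signature)
print(verify(params, signature))                              # untouched request
print(verify({"user": "alice", "amount": "999"}, signature))  # tampered request
```

Because the salt never travels with the request, someone who changes a parameter on the way cannot produce a matching signature.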

Network data analysis

Networking matters so much in development, yet many people writing crawlers have no idea what the fields in the packet capture tool mean. There is a great deal of network knowledge; in this chapter let's discuss requests and data transmission in depth.

First, look at a URL: image.baidu.com/. Open the packet capture tool and you will find a request packet. Analyze the message: it contains the request line and various header fields.

The request line shows that this is a GET request and that the protocol is HTTP/1.1. Let's briefly cover one basic question first, the format of a URL: protocol://host address/path. Here https is the protocol and image.baidu.com is the host address, which is also the domain name. After that comes the path, and after the path come the parameters. In some URLs, many people don't understand what the symbols after the path mean.

www.jd.com/?cu=true&ut…
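The symbols after the path are easier to read once the URL is taken apart. Here is a sketch with urllib.parse. The URL below is a made-up example on the image.baidu.com host: the path /search/index and the parameters tn and word are assumptions for illustration, not taken from a real capture.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL: protocol://host/path?name=value&name=value
url = "https://image.baidu.com/search/index?tn=baiduimage&word=python"

parts = urlsplit(url)
query = parse_qs(parts.query)

print(parts.scheme)   # the protocol
print(parts.netloc)   # the host address (domain name)
print(parts.path)     # the path
print(query)          # parameters: ? starts them, & separates them, = joins name and value
```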

Other parameters

With the URL analyzed, here are the meanings of the header fields in the request packet sent to the server:

1. Accept: the content types the client can handle

2. Accept-Language: the languages the client prefers (here, Chinese first)

3. User-Agent: the identifier of the browser

4. Accept-Encoding: the compression methods the client prefers

5. Host: the host name (domain) of the server being requested

6. Connection: whether to keep the connection open (a long, persistent connection)

7. Cache-Control: cache control

8. Cookie: the cookie previously issued by the server, sent back by the client
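What the packet capture tool shows is really just text: one request line followed by one "Name: value" line per header and a blank line at the end. The sketch below assembles such a message by hand. The build_request helper and all header values are made up for illustration.

```python
def build_request(method: str, path: str, headers: dict) -> bytes:
    """Assemble the request line and header fields into a raw HTTP/1.1 message."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines end with CRLF; an empty line marks the end of the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


raw = build_request("GET", "/", {
    "Host": "image.baidu.com",
    "Accept": "text/html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Accept-Encoding": "gzip",
    "Connection": "keep-alive",
    "Cookie": "sessionid=abc123",   # sample value
})

print(raw.decode("ascii"))
```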

Communication process

Now that you know a few of these fields, what do the computers actually do when a request is sent? For example, if I type Python into the search box and click search, the results I want appear almost at once. Let's explain what happens.

TCP/IP is a protocol family that is divided into the application layer, the transport layer, the network layer, and the data link layer.

First, the computer obtains the IP address from the local DNS server. On Windows you can look up the local DNS server address with the ipconfig /all command. You can also find the hosts file on your own computer.

The early Internet maintained a single database file called hosts, which mapped domain names to IP addresses. Every time a new host was added the file had to be updated, which was very troublesome. So a system was set up to manage the mappings: the DNS system.

When looking up an IP address, the computer asks the local DNS server first; if that server doesn't have the answer, the query goes on to the root DNS servers. This also explains how some attackers do harm: they tamper with DNS or spoof its answers, so that you end up visiting sites they have arranged for you.

Once we have the IP address, the Python string we typed is encoded. This processing happens in the application layer, where HTTP and DNS live: the request message with its header fields is built there and then handed down to the next layer.

Next comes the transport layer, which includes TCP and UDP. The TCP header is prepended to the application-layer data, and the result is handed to IP. The TCP header contains the source and destination port numbers, which identify the sending and receiving applications, along with information that tracks the bytes being sent and a checksum. The main job of TCP here is to split the HTTP request into segments and deliver them in order.

Then comes the network layer, where IP adds its own header to the data passed down. The IP header contains the receiver's and the sender's IP addresses, plus an important field that indicates whether the data is TCP or UDP. The packet is then handed to the network interface driver. To deliver it to a specific recipient, the recipient's MAC address must be known.

Finally, the link layer adds the Ethernet header, which contains the MAC addresses and the protocol type.

Before any data is sent, TCP first establishes the connection with a three-way handshake. The job of the IP protocol is to find the address and carry the packets to it, while TCP on the receiving side reassembles the received segments in order. That is the whole process of data transmission.
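The hosts file still exists, and it is consulted before DNS. The sketch below imitates that lookup order. The sample hosts text, the name mysite.test, and the helpers parse_hosts and resolve are assumptions for illustration; a real resolver does more (caching, IPv6, and so on).

```python
import socket

# Sample hosts-file content (made up): "IP address, then one or more names".
SAMPLE_HOSTS = """
127.0.0.1   localhost
192.0.2.7   mysite.test www.mysite.test   # a local override
"""


def parse_hosts(text: str) -> dict:
    """Turn hosts-file text into a {name: ip} table."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table


def resolve(name: str, hosts_table: dict) -> str:
    """Check the hosts table first; only then ask the system's DNS resolver."""
    ip = hosts_table.get(name.lower())
    if ip is not None:
        return ip
    return socket.gethostbyname(name)


hosts_table = parse_hosts(SAMPLE_HOSTS)
print(resolve("www.mysite.test", hosts_table))
```

This is also why a tampered hosts file or a spoofed DNS answer is dangerous: whatever address comes back is the one the request goes to.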
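The layer-by-layer wrapping can be pictured with a toy model. In this sketch each layer simply puts its header in front of what it received from the layer above. The TCP and IP headers here are heavily simplified (real ones have many more fields), and the ports, addresses, and MAC values are made-up sample data.

```python
import socket
import struct


def tcp_wrap(payload: bytes, src_port: int, dst_port: int) -> bytes:
    # Toy TCP header: only the source and destination ports.
    return struct.pack("!HH", src_port, dst_port) + payload


def ip_wrap(segment: bytes, src_ip: str, dst_ip: str) -> bytes:
    # Toy IP header: protocol number (6 means TCP), then both IP addresses.
    return struct.pack("!B", 6) + socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) + segment


def ethernet_wrap(packet: bytes, src_mac: bytes, dst_mac: bytes) -> bytes:
    # Ethernet header: destination MAC, source MAC, type (0x0800 means IPv4).
    return dst_mac + src_mac + struct.pack("!H", 0x0800) + packet


http_request = b"GET /s?wd=Python HTTP/1.1\r\nHost: example.com\r\n\r\n"
segment = tcp_wrap(http_request, 50000, 80)
packet = ip_wrap(segment, "192.0.2.10", "192.0.2.20")
frame = ethernet_wrap(packet, bytes.fromhex("020000000001"), bytes.fromhex("020000000002"))

print(len(http_request), len(segment), len(packet), len(frame))
```

The receiver does the reverse: it strips the Ethernet header, then the IP header, then the TCP header, and hands the original HTTP request to the application.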
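The "split into segments, reassemble in order" part can be sketched in a few lines. This is a toy model, not real TCP: the segment size of 8 bytes, the use of byte offsets as sequence numbers, and the function names segment and reassemble are all assumptions for illustration.

```python
import random


def segment(data: bytes, size: int) -> list:
    """Cut the data into (sequence number, chunk) pairs."""
    return [(offset, data[offset:offset + size]) for offset in range(0, len(data), size)]


def reassemble(segments: list) -> bytes:
    """Sort by sequence number and join the chunks back together."""
    return b"".join(chunk for _, chunk in sorted(segments))


message = b"GET /s?wd=Python HTTP/1.1\r\nHost: example.com\r\n\r\n"
pieces = segment(message, 8)

# Segments may arrive out of order; simulate that with a shuffle.
arrived = pieces[:]
random.Random(0).shuffle(arrived)

print(reassemble(arrived) == message)
```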

Knowledge has no boundaries, so arrange your time sensibly around your own work and study. Find your weak points, study them specifically, and use good methods, and you will see results in the short term. Still, learning is a long-term process of accumulation: don't aim too high too soon, and take it step by step.