👻 Recently, many followers have messaged me asking: what is a crawler? How do you learn to write one? 👻
😬 In fact, what I want to say is what a big shot once told me when I was a newbie: most of us have the heart to learn new things, but we are always afraid of the unknown in what we want to learn. That is why most people fail, and some even regret it for life: because they never started! 😬
😜 To borrow the words of the senior who led me into this pit years ago: when a pit lies before you, don't hesitate and wander. Be a bit bold: step forward, into the pit, head down, charge ahead, don't overthink, don't look back, and one day -> you will become someone else's senior! 😜
Today’s chicken soup has been delivered successfully. Destination: Your heart! 💗 💗 💗
Back to the subject ~~~ The original intention of this article is to achieve one effect:
Lead those who want to learn crawlers but have been slow to start, and those who are interested in crawlers and want to master this technology, officially into the pit!!
💩<-🐷 First, let me answer the first question — what is a crawler? 🐷 – > 💩
👉 In fact, you can Baidu your way to a grand official definition, but those are not friendly to newcomers. A crawler, in one sentence: **a program that simulates a browser to send a request and get a response!** 👈
🎈 As for the second question, how do you learn a crawler? You'll have to wait until you've finished reading this post to see whether you still need me to answer it. 🎈
1. Concept of crawler
(1) Concept of a crawler (professional definition):
Web crawlers, also known as web spiders, are a special class of programs that automatically download network resources in batches; that is the colloquial definition. A more professional and comprehensive definition is: a web crawler is a program that disguises itself as a client to exchange data with a server.
(2) Applications of crawlers:
- Data collection: the big data era has arrived; data is the core, data is productivity. More and more enterprises pay attention to collecting user data, and crawler technology is an important means of collecting it. For example: scraping Weibo comments (machine learning, public opinion monitoring), scraping job postings from recruitment websites (data analysis and mining), Baidu News, and so on.
- Search engines: Baidu, Google and the other search engines are all built on crawler technology. (PS: the big shots of the crawler world!) Knowledge supply station: some well-known "headlines" apps also rely on crawlers to make their fortune!
- Simulated operation: crawlers are also widely used to simulate user actions, such as test bots and forum-flooding bots.
- Automated testing: crawlers written for software testing.
- Network security: SMS bombing, web vulnerability scanning.
(3) Classification of crawlers:
🎈 Different standards give different classifications. Three common classification standards and their categories are as follows: 🎈
The first, by the number of websites crawled: ① General crawler: usually the crawler of a search engine. The general crawler is an important part of a search engine's crawling system (Baidu, Google, Yahoo, etc.). Its main purpose is to download web pages from the Internet to the local machine, forming a mirror backup of Internet content. (One big problem is that they are very limited: most of the content is useless, and different search purposes return the same content!)
② Focused crawler: a crawler for specific websites, that is, a web crawler oriented toward a specific subject requirement. The difference from a general search engine crawler is that a focused crawler processes and filters content while crawling, trying to ensure that only page information related to the requirement is captured!
The second, by the purpose of the data obtained: ① Functional crawler: for example, voting, liking… ② Data-incremental crawler: for example, recruitment information…
The third, by whether the URL address and the corresponding page content change, data-incremental crawlers can be divided into: ① crawlers where the URL address changes and the content changes with it; ② crawlers where the URL address stays the same but the content changes.
(4) General development process of crawlers:
① The simplest, single-page data crawl: URL -> send request, get response -> extract data -> save data. ② Multi-page data crawl: send request, get response -> extract URL addresses, continue requesting.
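To make the single-page flow concrete, here is a minimal sketch using only Python's standard library; the URL, User-Agent string, and output filename are placeholders for illustration:

```python
from urllib.request import urlopen, Request

url = "https://example.com/"                               # URL (placeholder)
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # simulate a browser
response = urlopen(req)                                    # send request, get response
html = response.read().decode("utf-8")                     # extract data (here: the raw HTML)

with open("page.html", "w", encoding="utf-8") as f:        # save data
    f.write(html)
```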
(5) Key and difficult points of crawler development:
The difficulties of crawler development fall into two directions:
- Data acquisition (PS: why make things hard for people!). Public network resources are prepared for human users. To avoid being harvested by crawlers, servers set up many Turing tests to prevent malicious crawling; these are the anti-crawling measures. A crawler engineer must defeat these measures during development, and a large part of our work is dealing with them.
- Collection speed. In the big data era, a huge amount of data is required, often tens of thousands or even hundreds of millions of records. If collection is slow and takes too long, it will not meet commercial requirements. Concurrency and distribution are the usual answers to the speed problem, and this is the other focus of crawler development.
Robots protocol: a website can tell us, through the robots protocol, which pages may and may not be crawled by search engines, but it is only a moral constraint.
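As a hedged illustration, Python's standard library can read a site's robots protocol before you crawl; the URLs here are only examples:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.baidu.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file
# May an arbitrary crawler ("*") fetch this path? Prints True or False.
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))
```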
2. HTTP and HTTPS
The architectures used by most business applications:
1. C/S = client/server
2. B/S = browser/server
3. M/S = mobile (mobile terminal)/server
All of these are collectively referred to as client and server!
A web crawler is a program disguised as a client to exchange data with a server. So how do clients and servers interact? Just as we Chinese can communicate because we all speak Chinese and follow Chinese grammar, client and server can only communicate if they follow a unified standard; otherwise everything would be chaos. That is why network transmission has many protocols, and HTTP is one of them.
(1) HTTP protocol
Ninety percent of the traffic on the Internet is HTTP (HTTP is an application-layer protocol). Note: make sure you know which protocol is used before you crawl the data you want! Although 90% of traffic is based on HTTP, 10% still uses other protocols. For example, bullet comments (danmaku) may use the WebSocket protocol, which our traditional crawlers cannot fetch.
HTTP stands for Hyper Text Transfer Protocol. It is a transfer protocol used to transfer hypertext from World Wide Web (WWW) servers to local browsers.
HTTP transfers data (HTML files, image files, query results, etc.) on top of the TCP/IP communication protocol. Note: TCP/IP is connection-oriented! (Meaning: it ensures data integrity.) Let's look at TCP's three-way handshake and four-way wave:
- Three-way handshake to establish a connection:
🔑 Client: Hey, server girl! I want to connect with you. (hello) 🔒 Server: OK, I'm listening. 🔑 Client: Good, let's start exchanging data (shy). (doing the shy thing ing, data interaction)
- Four-way wave to disconnect:
🔑 Client: I've finished exchanging data with you, let's disconnect! 🔒 Server: are you sure you want to disconnect? (unwilling) 🔒 Server again: fine, go ahead and disconnect! 🔑 Client: OK, disconnecting now!
(2) HTTP request process:
When you search for something in a browser, you type in a URL, and the browser automatically turns it into an HTTP request.
The basic flow of an HTTP request is: the client sends a request to the server; after receiving it, the server returns a response to the client. A complete HTTP exchange therefore consists of a request and a response.
How the browser sends an HTTP request:
1. Domain name resolution -->
2. Initiate the TCP three-way handshake -->
3. Send an HTTP request after the TCP connection is established -->
4. The server responds to the HTTP request, and the browser gets the HTML code -->
5. The browser parses the HTML code and requests the resources it references (JS, CSS, images, etc.) -->
6. The browser renders the page to the user.

(In the developer tools, Network -> Name -> Request Headers shows the parsed headers. Connection: keep-alive means the connection stays open, so you don't have to do the three-way handshake and four-way wave again!)
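Step 1 (domain name resolution) can be seen directly in code; a small sketch using the standard library:

```python
import socket

# The OS resolves the domain name to an IP address before the TCP
# handshake can even begin.
ip = socket.gethostbyname("www.baidu.com")
print(ip)  # prints an IP address, which varies by region and time
```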
Note: what the browser renders for a URL is not necessarily what a crawler receives for that URL. When extracting data, the crawler should work from the response that corresponds to the URL it actually requested.
(3) URL (the content in the browser search box!)
When sending an HTTP request, network resources are located based on the URL.
URL stands for Uniform Resource Locator: the address used to identify a resource, that is, what we commonly call a web address. A common URL is composed of: protocol + domain name (port, 80 by default) + path + parameters.
Note: 1. The default HTTP port number is 80; the default HTTPS port number is 443. (The domain name determines which computer to reach; the port number determines which application on that computer!) 2. A domain name usually maps to an IP address, and the port number is usually not written out. When we visit Baidu, for example, www.baidu.com/, the protocol here is HTTPS…
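A small sketch of splitting a URL into the components above with the standard library (the URL itself is just an illustration):

```python
from urllib.parse import urlparse

parts = urlparse("https://www.baidu.com:443/s?wd=python")
print(parts.scheme)    # protocol:    https
print(parts.hostname)  # domain name: www.baidu.com
print(parts.port)      # port:        443
print(parts.path)      # path:        /s
print(parts.query)     # parameters:  wd=python
```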
(4) HTTP request format
The request message that the client (that is, we users) sends to the server consists of four parts: the request line, the request headers, a blank line, and the request data.
General format: Note: the URL in the request line in the figure above refers to the path described in (3) URL!
1. Request method:
(1) Classification
According to the HTTP standard, HTTP requests can use multiple request methods. HTTP 1.0 defined three: GET, POST, and HEAD. HTTP 1.1 added five more: OPTIONS, PUT, DELETE, TRACE, and CONNECT.
(2) Classification explanation
The common methods are GET and POST.
GET
1. Gets data from the server. 2. Request parameters are appended to the URL and shown in the address bar. 3. The query string is limited in length (about 1024 bytes). 4. More efficient and convenient than POST.
POST
1. Submits data to the server. 2. There is no hard limit on data size (servers commonly default to about 2 MB).
2. Request header
3.HTTP request body (request data)
The request body usually carries the data sent by the POST method; a GET request has no request body.
The request body follows the headers, separated from them by a blank line.
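To see the four parts in raw bytes, here is a hedged illustration of a GET and a POST request message, written as Python byte strings like the socket examples later in this article; the host, path, and form fields are made up:

```python
# Request line, headers, blank line -- a GET request has no body:
get_request = (
    b"GET /index.html?key=value HTTP/1.1\r\n"  # request line (parameters in the URL)
    b"Host: www.example.com\r\n"               # request header
    b"\r\n"                                    # blank line, then nothing
)

# A POST request carries its data in the body, after the blank line:
post_request = (
    b"POST /login HTTP/1.1\r\n"                            # request line
    b"Host: www.example.com\r\n"                           # request headers
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 23\r\n"
    b"\r\n"                                                # blank line
    b"username=tom&password=1"                             # request body (23 bytes)
)
```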
A bridge between what came before and what comes next! Now that the request format is settled, the server can understand what we say; next we need to understand what the server says back to us.
(5) HTTP response format
The HTTP response also consists of four parts: the status line (response line), the response headers, a blank line, and the response body. General format:
1.HTTP response status code:
When the client sends a request to the server, the server includes an HTTP status code in the response it returns. (In real crawler work, we can check this status code to know whether our crawler code worked!)
The HTTP status code is a three-digit number whose first digit indicates the class of the response. There are five classes: 1xx (informational), 2xx (success), 3xx (redirection), 4xx (client error), and 5xx (server error).
Note: a redirect works like an intermediary forwarding you to the real destination. (Click here to view the full details of all HTTP response status codes!)
2.HTTP response headers
(6) Summary:
1.HTTP process summary:
2. Features of HTTP protocol:
Three points to note about HTTP:

- HTTP is connectionless: connectionless means each connection handles only one request. After the server processes the request and receives the client's reply, it disconnects. This saves transmission time.

- HTTP is media independent: any type of data can be sent over HTTP, as long as the client and the server know how to handle the data content.

- HTTP is stateless: the protocol has no memory of transaction processing. The lack of state means that if previous information is needed for later processing, it must be retransmitted, which can increase the amount of data transferred per connection. On the other hand, the server responds faster when it does not need previous information.
Stateless HTTP means the protocol has no memory of transactions; in other words, the server does not know the client's state. When we send a request, the server parses it and returns the corresponding response, and that process is completely independent: the server records no state before or after it. So if later processing needs earlier information, that information must be transmitted again, forcing extra repeated requests just to get the follow-up responses we want, which is obviously not the desired effect. To maintain state across requests we certainly cannot retransmit all previous requests every time; that would waste resources, especially on pages that require login. This is where the two techniques for keeping HTTP state come in: Sessions and Cookies, introduced below!
Note: stateless means, for example, that you enter your account and password on a web page to log in to Qzone, but because HTTP is stateless, you would have to enter the account and password again to log in to QQ Mail from inside Qzone; the login state is not remembered. Session technology solves this.
3. The HTTPS protocol:
An enhanced version of HTTP, the fighter jet of the HTTP family!! HTTPS (full name: Hyper Text Transfer Protocol over Secure Socket Layer) is HTTP running over a secure channel. Simply put: the secure version of HTTP!
HTTP runs on top of TCP/IP, while HTTPS adds SSL/TLS on top of HTTP to encrypt the data in transit.
Note: The HTTPS protocol default port is 443.
HTTPS is more secure than HTTP, because HTTP transmits plaintext while HTTPS transmits ciphertext; but its performance is lower, because encryption and decryption take time!
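For the curious, here is a minimal sketch of what HTTPS looks like at the socket level, using Python's modern ssl API; the host is only an example, not part of the original article:

```python
import socket
import ssl

ctx = ssl.create_default_context()
raw_sock = socket.create_connection(("www.baidu.com", 443))  # HTTPS default port 443
client = ctx.wrap_socket(raw_sock, server_hostname="www.baidu.com")  # TLS handshake

# The request is plain HTTP; the TLS layer encrypts it on the wire.
client.send(b"GET / HTTP/1.1\r\nHost: www.baidu.com\r\nConnection: close\r\n\r\n")
print(client.recv(4096))  # the response arrives already decrypted
client.close()
```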
3. HTTP statelessness and session technology
HTTP is stateless, so how can a server tell that consecutive requests come from the same user? This is where session technology comes in: Cookies and Sessions.
(1) Cookie
Cookie, sometimes used in the plural (Cookies), refers to data (usually encrypted) that some websites store on a user's local terminal to identify the user and track the session. The latest specification is RFC 6265.
A Cookie can be understood as a credential
- 1. The server sends special information to the client.
- 2. The information is stored in the client as a text file.
- 3. The client carries this special information with it every time it sends a request to the server.
- 4. After receiving the Cookie, the server will verify the Cookie information to identify the user’s identity.
Why do we use cookies in crawlers?
- Benefits of carrying cookies: ① they give access to pages behind a login. ② A normal browser always carries cookies when requesting a server (except on the very first request), so a server may use whether cookies are carried to judge whether we are a crawler; carrying them therefore helps evade some anti-crawling checks.
- Drawbacks of carrying cookies: ① a set of cookies usually corresponds to one user's information, and requesting too frequently makes it easier for the other side to identify us as a crawler. ② The usual workaround is to spread requests across multiple accounts.
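A hedged sketch of carrying a Cookie when requesting a page; the cookie value and URL are placeholders, since in practice you would copy them from the browser's developer tools after logging in:

```python
from urllib.request import urlopen, Request

req = Request(
    "https://example.com/profile",  # hypothetical page behind a login
    headers={
        "User-Agent": "Mozilla/5.0",
        "Cookie": "sessionid=abc123",  # placeholder credential from the browser
    },
)
print(urlopen(req).status)  # 200 if the server accepts our credential
```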
(2) Session
Session, usually translated into Chinese as 会话, originally refers to a series of actions/messages with a beginning and an end. For example, in a phone call, the whole process from picking up the phone, through dialing, to hanging up can be called a session. The word is used in many fields.
In our web & crawler world, we generally use its narrow meaning: everything a browser window does between being opened and being closed.
The purpose of a Session is that all requests a client makes between opening and closing the browser can be identified as the same user. This is implemented by generating a cookie that contains a SessionID: the ID is carried on every visit, the server recognizes it, and the data related to that SessionID is kept on the server, which is how the client's state is identified. So Sessions are built on Cookies!
A Session is the counterpart of a Cookie: Session data is stored on the server, and the user is identified only by the SessionID the client sends. Compared with Cookies, Sessions are therefore more secure.
Generally, the SessionID is discarded when the browser is closed, or the server checks Session activity: for example, if a SessionID has been inactive for 30 minutes, it is marked as invalid.
The purpose of a Session: keeping the conversation between client and server alive! Session (state) keeping: ① saving cookies; ② keeping a long-lived connection to the server.
(3) The difference between Cookie and session:
- Cookie data is stored in the client's browser; Session data is stored on the server.
- Cookies are not very secure: others can analyze cookies stored locally and perform cookie spoofing.
- Sessions are kept on the server for a certain period; as visits increase, they eat into server performance.
- A single cookie can hold no more than 4 KB of data, and many browsers limit a site to 20 cookies.
(4) A picture to digest these dry words:
When a user logs in for the first time: a session table is generated on the server, where the key is hash-generated data and the value is a list of user information. At the same time, a cookie text file containing the SessionID is generated on the client side; the SessionID's value is the hash key on the server. When the user visits again: the browser automatically carries the SessionID and its value, the server compares it with its hash keys to determine whether the user is logged in, and if it matches, the server fetches the user's login data and returns the page the user requested.
(5) Let's get hands-on and look at the attribute structure of Cookies:
(Take Qzone for example!)
Press F12 to open the browser developer tools, then follow the steps shown in the figure to view the Cookies. (You can see there are many entries; each entry can be called a Cookie.)
| Property | Explanation |
| --- | --- |
| Name | The name of the Cookie. Once created, it cannot be modified. |
| Value | The value of the Cookie. If the value is a Unicode string it must be character-encoded; if it is binary data it must be BASE64-encoded. |
| Domain | The domain that may access the Cookie. |
| Max Age | When the Cookie expires, in seconds, usually used together with Expires to compute the expiry time. If Max Age is positive, the Cookie expires after that many seconds; if negative, the Cookie lives only for the browser session and is never saved to disk. |
| Path | The path scope of the Cookie. If set to /path/, only pages under /path/ can access it; if set to /, all pages under the domain can access it. |
| Size | The size of the Cookie. |
| HTTP | The HttpOnly attribute of the Cookie. If true, the Cookie is only carried in HTTP headers and cannot be read via document.cookie. |
| Secure | Whether the Cookie is transmitted only over a secure protocol such as HTTPS or SSL, which encrypts data before sending it over the network. Default is false. |
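The attributes in the table above can also be seen by parsing a Set-Cookie header with Python's standard library; the header value here is made up for illustration:

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie.load("uid=abc123; Domain=.example.com; Path=/; Max-Age=3600; Secure; HttpOnly")
morsel = cookie["uid"]
print(morsel.key, morsel.value)  # Name and Value: uid abc123
print(morsel["domain"])          # Domain:  .example.com
print(morsel["path"])            # Path:    /
print(morsel["max-age"])         # Max Age: 3600
print(morsel["secure"], morsel["httponly"])  # Secure and HttpOnly flags
```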
4. Crawler combat: use socket to download a picture
(1) Socket learning
Socket is translated into Chinese as 套接字: it carries both the idea of 套接 (plugging together, i.e. establishing communication between networked processes) and 字 (an ordered string of characters), hence the name.

Knowledge supply station (cue a familiar line!): a socket is an inter-process communication mechanism. It provides the operating-system calls through which programs access the communication protocol, making it as easy to read and write data over the network as to read and write local files.
① Use socket to build a simple server. (Click me for another, more advanced TCP server article.)
```python
import socket

# Server object.
# Equivalent to socket.socket(socket.AF_INET, socket.SOCK_STREAM):
# socket.AF_INET uses IPv4; socket.SOCK_STREAM creates a TCP stream socket.
server = socket.socket()

# 1. Bind the server
server.bind(("0.0.0.0", 8800))  # 0.0.0.0 allows all access; 8800 is the port number

# 2. Listen
server.listen(5)

while True:
    # 3. Wait for a connection.
    # accept() is a blocking method (it won't move until a client comes!);
    # it waits for connections and creates a separate channel for each one.
    # conn: the connection channel; addr: the client's address.
    conn, addr = server.accept()

    # 4. Receive data
    data = conn.recv(1024)
    print(data)

    response = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n"

    # 5. Send data
    conn.send(response.encode())
    print("Already responded")

# 6. Close (never reached inside the infinite loop; shown for completeness)
server.close()
```
To access the server, enter 127.0.0.1:8800 in the local browser:
② Use socket to build a simple client (crawl Baidu's home page):
```python
import socket

# Printing this client socket object shows that IPv4 is used by default
# and the protocol is TCP.
client = socket.socket()

# 1. Establish a connection
client.connect(("www.baidu.com", 80))

# Construct the request message.
# Connection: close asks the server to close the connection after responding,
# so the recv loop below terminates when recv() returns b"".
data = b"GET / HTTP/1.1\r\nHost: www.baidu.com\r\nConnection: close\r\n\r\n"

# 2. Send the request
client.send(data)

res = b""
# 3. Receive data until the server closes the connection
temp = client.recv(4096)
while temp:
    print("*" * 50)
    res += temp
    temp = client.recv(4096)
print(res.decode())

# 4. Disconnect
client.close()
```
(2) Actual combat: use socket to crawl a picture of a beautiful MM:
It is said that Sogou has not set up anti-crawling defenses; we'll just pick the soft persimmon, so let's crawl it.
1. First, analyze the web page:
The URL of the image we want to crawl is in the Request URL field of the headers. The good old copy-paste method works!
2. Code:
```python
# Download a picture from Sogou
import socket
import re

# The Sogou picture
img_url = "https://i02piccdn.sogoucdn.com/a3ffebbb779e0baf"

'''
Extension: how to make an HTTPS request instead
import ssl
client = ssl.wrap_socket(socket.socket())  # ssl.wrap_socket wraps the socket in TLS
client.connect(('i02piccdn.sogoucdn.com', 443))
'''

client = socket.socket()

# Note: the URL above is HTTPS, but we can also use HTTP
# because the request is automatically redirected.
client.connect(("i02piccdn.sogoucdn.com", 80))  # the domain name maps to the server's IP address

# Construct the request message
data = "GET /a3ffebbb779e0baf HTTP/1.1\r\nHost: i02piccdn.sogoucdn.com\r\n\r\n"

# Send data
client.send(data.encode())  # the message must be sent as bytes

# Receive data
first_data = client.recv(1024)
print("first_data", first_data)
# findall returns a list, hence [0]; the response is bytes, hence the b prefix
length = int(re.findall(b"Content-Length: (.*?)\r\n", first_data)[0])
print(length)  # content length

# There may or may not be body data after the double \r\n in first_data;
# re.S lets . match newline characters too.
image_data = re.findall(b"From Inner Cluster\r\n\r\n(.*)", first_data, re.S)
if image_data:
    image_data = image_data[0]
else:
    image_data = b""

# Keep receiving until we have Content-Length bytes of image data
while True:
    temp = client.recv(1024)
    image_data += temp
    if len(image_data) >= length:
        break

# Disconnect
client.close()

# Write the file
with open("girl.jpg", "wb") as f:
    f.write(image_data)
```
3. Effect:
The console output is:

```
first_data b'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Thu, 08 Jul 2021 17:04:43 GMT\r\nExpires: Fri, 08 Jul 2022 17:04:43 GMT\r\nX-NWS-UUID-VERIFY: 1266ff4f6f6197f273f603ca87522cc9\r\nExpiration-Time: Sun, 26 Dec 2021 13:11:13 GMT\r\nX-Daa-Tunnel: hop_count=3\r\nAccept-Ranges: bytes\r\nX-Cache-Lookup: Cache Miss\r\nLast-Modified: Sun, 27 Jun 2021 01:11:13 GMT\r\nCache-Control: max-age=31536000\r\nContent-Length: 19594\r\nX-NWS-LOG-UUID: 14051802991302897940\r\nConnection: keep-alive\r\nX-Cache-Lookup: Hit From Inner Cluster\r\n\r\n'
19594
```
5. In the end!
Start now, stick with it, make a little progress every day, and in the near future you will thank yourself for your efforts!
This blogger will keep updating the crawler basics column and the crawler combat column. Friends who have read this article carefully, feel free to like, bookmark, and comment with your thoughts after reading, and follow this blogger to read more about crawlers in the days ahead!
If there are mistakes or inappropriate wording, please point them out in the comment area, thank you! If you reprint this article, please contact me first to explain, and mark the source and the blogger's name, thank you!