Table of contents
The crawler
Skills needed to write crawlers in Python
Disadvantages of universal crawler
Urllib crawls web pages
Return the status code: response.getcode()
Decoding and encoding
Mock browser
Set the timeout
HTTP request: used for message passing between client and server
The crawler
Web crawlers, also known as web spiders, web ants, or web robots, automatically browse information on the network. Of course, this browsing must follow rules that we define, and those rules are called the web crawler algorithm. With Python, it is very easy to write crawler programs that automatically retrieve information from the Internet.
- A crawler is essentially a program (a script)
- It can help us to automatically collect the text information, pictures and other resources we need
- In essence, it simulates a browser automatically visiting web pages (this covers about 99% of cases)
I have also written some earlier blog posts about crawlers, for reference:
Python crawler: crawling a web novel blog.csdn.net/qq_36171287…
Python crawler practice: crawling information blog.csdn.net/qq_36171287…
Python crawler experiment: page views blog.csdn.net/qq_36171287…
Design of a crawler (a minimal sketch follows this list):
- First, determine the URL of the page to crawl
- Obtain the corresponding HTML page via the HTTP protocol
- Extract useful data from the HTML page
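As a minimal sketch of these three steps (the URL and the regular expression here are illustrative assumptions only):

import re
import urllib.request

# 1. Determine the URL of the page to crawl (an example URL)
url = 'http://www.baidu.com'
# 2. Obtain the corresponding HTML page via the HTTP protocol
html = urllib.request.urlopen(url).read().decode('utf-8')
# 3. Extract useful data from the HTML page (here, the page title)
match = re.search(r'<title>(.*?)</title>', html)
if match:
    print(match.group(1))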
Skills needed to write crawlers in Python
- Basic Python syntax
- How to fetch a page: HTTP request handling; urllib can simulate a browser sending a request and obtain the file the server responds with
- Parse the contents of the server response: re, XPath, BeautifulSoup4, JsonPath, PyQuery. The goal is to use a descriptive syntax to extract the data matching your rules (see the sketch after this list)
- How to handle dynamic HTML and captchas: general techniques for fetching dynamic pages, plus Tesseract (an OCR engine that recognizes text in images)
- Scrapy framework: the frameworks commonly used in China are Scrapy and Pyspider. Scrapy offers high customizability and performance (it is built on the asynchronous networking framework Twisted), so data downloads are very fast, and it provides components for data storage, data download, extraction rules, and more
- Distributed strategy: scrapy-redis adds, on top of Scrapy, a set of components centered on a Redis database, giving the framework distributed support; Redis mainly handles request-fingerprint deduplication, request allocation, and data storage
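For example, here is a minimal parsing sketch using BeautifulSoup4 from the list above; it assumes the library is installed (pip install beautifulsoup4) and uses an example URL:

import urllib.request
from bs4 import BeautifulSoup

# Fetch a page and extract every link target from it
html = urllib.request.urlopen('http://www.baidu.com').read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print(a.get('href'))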
Disadvantages of universal crawler
- It can only return text-related content (HTML, Word, PDF); it cannot handle multimedia files (music, pictures, videos) or binary files (programs, scripts)
- It cannot provide different search results to different users
- It cannot understand a human's natural-language query
Urllib crawls web pages
Example:
import urllib.request
# Make a request to the specified URL and return the server's response (a file-like object)
response = urllib.request.urlopen('http://www.baidu.com')
# Read the entire response body (bytes) and assign it to a variable
data = response.read()
print(data)
# store the crawled content in a.txt
with open(r'a.txt', 'wb') as f:
    f.write(data)
Example:
import urllib.request
# urlretrieve fetches the URL and saves it directly to a local file; it leaves some cached data behind as it executes
response = urllib.request.urlretrieve('http://www.baidu.com',filename=r'a.html')
# clear cache
urllib.request.urlcleanup()
Example:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# Read the response line by line; the result is assigned to a list variable
data = response.readlines()
print(data)
Return information about the current environment: response.info()
Example:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# response properties: return information about the current environment (the response headers)
print(response.info())
Return the status code: response.getcode()
HTTP response status codes:

| Code | Meaning |
| --- | --- |
| 100 | Continue: the client should continue with its request |
| 101 | Switching Protocols: the server switches the HTTP protocol version at the client's request |
| 200 | OK: the request succeeded |
| 201 | Created: the URL of the newly created resource is returned |
| 202 | Accepted: the request was accepted for processing, but processing is not complete |
| 203 | Non-Authoritative Information: the returned information is not definitive or complete |
| 204 | No Content: the request was received, but the response body is empty |
| 205 | Reset Content: the server completed the request; the user agent must reset the currently viewed document |
| 206 | Partial Content: the server completed part of the user's GET request |
| 300 | Multiple Choices: the requested resource is available in several places |
| 301 | Moved Permanently: the requested resource has been permanently moved to a new URL |
| 302 | Found: the requested resource was found at another address |
| 303 | See Other: the client is advised to try another URL or access method |
| 304 | Not Modified: the client performed a conditional GET, but the file has not changed |
| 305 | Use Proxy: the requested resource must be accessed through the proxy specified by the server |
| 306 | Unused: a code used in a previous version of HTTP, no longer used in the current version |
| 307 | Temporary Redirect: the requested resource temporarily resides at a different URL |
| 400 | Bad Request: the request contains an error, such as a syntax error |
| 401 | Unauthorized: the request requires authentication or authorization failed |
Example:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# return the status code
print(response.getcode())
Decoding and encoding
Decoding: unquote()
url = 'https://www.so.com/s?ie=utf-8&src=hao_isearch2_3.6.9&q=%E7%A6%BB%E5%A4%A9%E5%A4%A7%E5%9C%A3&eci='
# decoding
newUrl = urllib.request.unquote(url)
print(newUrl)
Encoding: quote()
# The query parameter contains the Chinese characters 离天大圣, which need to be percent-encoded
url = 'https://www.so.com/s?ie=utf-8&src=hao_isearch2_3.6.9&q=离天大圣'
# encoding
newUrl = urllib.request.quote(url)
print(newUrl)
Mock browser
import urllib.request
url = 'http://www.baidu.com'
# Mock request headers
headers = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)'
}
# Build the request object with the custom headers
req = urllib.request.Request(url,headers=headers)
# Make the request
response = urllib.request.urlopen(req)
data = response.read().decode('utf-8')
print(data)
Set the timeout
If a page does not respond for a long time, it is judged to have timed out and cannot be crawled; setting a timeout lets the crawler skip it and move on
import urllib.request
url = 'http://www.baidu.com'
# Mock request headers
headers = {
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)'
}
# Build the request object with the custom headers
req = urllib.request.Request(url, headers=headers)
# Make the request repeatedly, skipping any attempt that times out
for i in range(1, 100):
    try:
        response = urllib.request.urlopen(req, timeout=0.5)
        print(len(response.read().decode('utf-8')))
    except:
        print('Request timed out, proceeding to the next crawl')
HTTP request: used for message passing between client and server
www.runoob.com/http/http-m…
- GET: sends information appended directly to the URL
- POST: sends data to the server; a common and relatively secure way to transfer data
- PUT: asks the server to store a resource
- DELETE: asks the server to delete a resource
- HEAD: requests only the HTTP header information of a resource
- OPTIONS: queries the request methods supported by the current URL
- GET characteristics: data is concatenated onto the end of the request path and sent to the server. Advantages: fast. Disadvantages: can carry only a small amount of data and is insecure.
- POST characteristics: data is sent in the request body. Advantages: can carry a large amount of data and is more secure (POST is recommended when modifying data on the server). Disadvantages: slower than GET. (A urllib sketch of the two methods follows the numbered list below.)
0. GET: the most common method; in essence, it sends a request to retrieve a resource from the server. The resource is returned to the client as a set of HTTP headers plus entity data (such as HTML text, an image, or a video). A GET request itself never carries entity data (a body).
1. HEAD: essentially the same as GET, except that the response contains no entity data, only HTTP headers. Some people may think this method is useless, but it is not. Imagine a business scenario: to determine whether a resource exists, we often use GET, but HEAD makes more sense.
2. PUT: relatively rare, and HTML forms do not support it. In essence, PUT and POST are very similar in that both send data to the server, with one important difference: PUT usually specifies where the resource should be stored, while POST does not. For example, take a URL for submitting a blog post, /addBlog. With PUT, the submitted URL would look like "/addBlog/abc123", where abc123 is the address of the post. With POST, the server notifies the client of the address after submission. Most blogs work that way today. Clearly PUT and POST are used differently; which one to use depends on the business scenario.
3. DELETE: deletes a resource. This is rare, but some services, such as Amazon's S3 cloud storage, use this method to delete resources.
4. POST: submits data to the server. This method is very widely used; practically all submission operations today are done with it.
5. OPTIONS: an interesting but rarely used method. It queries the methods supported by the current URL. On success, the response includes an HTTP header named "Allow" whose value lists the supported methods, such as "GET, POST".
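As a minimal sketch of the GET/POST difference using urllib (httpbin.org is assumed here purely as a public echo service for testing; any endpoint that accepts these methods would work):

import urllib.parse
import urllib.request

# GET: the parameters are concatenated onto the URL itself
params = urllib.parse.urlencode({'q': 'python'})
response = urllib.request.urlopen('http://httpbin.org/get?' + params)
print(response.getcode())

# POST: the parameters travel in the request body;
# passing data= makes urllib send a POST instead of a GET
data = urllib.parse.urlencode({'q': 'python'}).encode('utf-8')
req = urllib.request.Request('http://httpbin.org/post', data=data)
response = urllib.request.urlopen(req)
print(response.getcode())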
Let's learn and make progress together. If there are any mistakes, please point them out in the comments.