Python Series — Getting Started with Web Crawlers (1)

    • Avoiding crawler violations
    • Crawler classification by usage scenario
      • General-purpose crawlers
      • Focused crawlers
      • Incremental crawlers
    • Crawlers and anti-crawlers
      • Anti-crawl mechanisms
      • Anti-anti-crawl strategies
      • The robots.txt protocol
    • Common request and response headers
      • Request headers
        • User-Agent
        • Connection
      • Response headers
        • Content-Type
    • The requests module
      • Scraping Sogou search results for a given query (a simple web collector)
      • Scraping Baidu Translate

I’ve blogged about crawlers before; for reference: crawler IP proxy pool code notes

Python crawlers crawl web novels

Python crawlers practice crawling for information

Python crawler experiment views – cool

Crawlers (crawling web pages, simulating a browser, setting timeouts, HTTP requests)

Avoiding crawler violations

  • Optimize your program from time to time so that it does not interfere with the normal operation of the site being crawled
  • When using or distributing crawled data, review the content first; if it turns out to involve sensitive material such as user privacy or trade secrets, stop crawling or distributing it

Crawler classification by usage scenario

General-purpose crawlers

An important component of search-engine crawling systems; what they fetch is entire pages of data

Focused crawlers

Built on top of general-purpose crawlers; what they fetch is specific local content within a page

Incremental crawlers

Monitor a site for data updates, crawling only the newly updated content

Crawlers and anti-crawlers

Anti-crawl mechanisms

Portal sites can prevent crawlers from scraping their data by adopting corresponding strategies or technical measures

Anti-anti-crawl strategies

Crawlers can defeat a portal site's anti-crawl mechanisms by adopting their own strategies or technical measures, and thereby obtain the site's data

The robots.txt protocol

A gentlemen's agreement. It specifies which data on a website may, and may not, be crawled

You can append /robots.txt to a site's domain to view its robots protocol
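The rule above can be sketched with Python's standard library; the domain below is only a placeholder, not a site from this article:

```python
from urllib.parse import urljoin

# Appending /robots.txt to the site root gives the location of its robots protocol
site = 'https://www.example.com'
robots_url = urljoin(site, '/robots.txt')
print(robots_url)  # https://www.example.com/robots.txt

# It can then be fetched like any other page, e.g.:
# import requests
# print(requests.get(robots_url).text)
```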




Common request and response headers

Request headers

User-Agent

Identifies the carrier of the request, e.g. the Chrome browser

Connection

Whether to close the connection or keep it alive after the request completes

Response headers

Content-Type

The type of the data the server returns to the client
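These headers can be inspected without any network traffic by preparing a request locally with the requests module; the URL and header values here are only illustrative:

```python
import requests

# Build a request locally and inspect the headers it would send
req = requests.Request(
    'GET',
    'https://www.example.com/',
    headers={
        'User-Agent': 'Mozilla/5.0',  # identity of the request carrier
        'Connection': 'keep-alive',   # keep the connection open after the response
    },
)
prepared = req.prepare()
print(prepared.headers['User-Agent'])  # Mozilla/5.0
print(prepared.headers['Connection'])  # keep-alive

# On a real response, response.headers['Content-Type'] shows the type of data
# the server returned, e.g. 'text/html; charset=UTF-8' or 'application/json'
```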

The requests module

requests is a Python module for sending network requests; it is powerful, simple, and efficient

Function: Simulates a browser to send a request

Usage steps:

  • Specify the URL
  • Send the request
  • Get the response data
  • Persist the data

pip install requests

Example:

import requests

# Download a web page
url = 'http://www.linlida.com/0_646/'
# Simulate a browser sending an HTTP request
response = requests.get(url)
response.encoding = 'gbk'  # the site serves GBK-encoded pages
# print(response.text)
# Persistent storage
with open('./requestsDemo.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)

Effect:

Scraping Sogou search results for a given query (a simple web collector)

Code:

import requests
import random

# UA spoofing
# User-Agent identifies the carrier of a request. Portal servers inspect it:
# if it looks like a normal browser, the request is treated as a normal request.
# A pool of real browser User-Agent strings to pick from at random.
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
]

if __name__ == '__main__':
    url = 'https://www.sogou.com/web'
    # Handle urls that carry parameters, encapsulated in dictionaries
    kw = input('Enter search content:')
    param = {
        'query': kw
    }
    header = {
        'User-Agent': random.choice(user_agent_list)
    }
    # The URL carries parameters; they are processed during the request
    response = requests.get(url, params=param, headers=header)
    page_text = response.text
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(response.text)
    print(fileName, 'Saved successfully')

Effect:

Scraping Baidu Translate

Analysis: the page refreshes partially after each keystroke. Inspecting the network traffic shows that each keystroke sends an Ajax POST request to the sug endpoint

Code:

import requests
import random
import json

# A pool of real browser User-Agent strings to pick from at random
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
]

if __name__ == '__main__':
    post_url = 'https://fanyi.baidu.com/sug'
    # post request parameter processing
    data = {
        'kw': 'cat'
    }
    header = {
        'User-Agent': random.choice(user_agent_list)
    }
    response = requests.post(post_url, data=data, headers=header)
    # json() returns a Python object (only call it when the response is confirmed to be JSON)
    page_json = response.json()
    print(page_json)
    # Persistent storage
    with open('./cat.json', 'w', encoding='utf-8') as fp:
        json.dump(page_json, fp=fp, ensure_ascii=False)
    print('cat.json saved successfully')