Crawler preparation

01. A crawler simulates a browser to fetch content. The crawler trilogy: data crawling, data parsing, data storage.

Data crawling: mobile and PC pages. Data parsing: regular expressions. Data storage: files or a database.

02. Related Python libraries

Crawlers mainly require two library modules: requests and re.

1. requests

Requests is an easy-to-use HTTP library, much simpler than urllib, but it must be installed because it is a third-party library.
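It can be installed with pip:

    pip install requests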

HTTP features supported by the Requests library:

Keep-Alive and connection pooling, Cookie persistence across sessions, multipart file uploads, chunked requests, and more.

The Requests library has a number of methods, all of which are implemented on top of request(), so strictly speaking the library has only the request() method; in practice, however, request() is rarely called directly. Here are the seven main methods of the Requests library (a short equivalence sketch follows the list):

(1) requests.request()

Constructs a request; it is the basis of all the methods below.

requests.request(method, url, **kwargs)

method: request method, such as GET, POST, or PUT

url: URL of the page to retrieve

**kwargs: optional access-control parameters

(2) requests.get()

The main method for fetching an HTML page, corresponding to HTTP GET. It constructs a Request object that asks the server for resources and returns a Response object containing the server's resources.

Properties of the Response object:

| Property | Description |
| --- | --- |
| r.status_code | Return status of the HTTP request (200 means success; other values such as 404 mean failure) |
| r.text | String form of the HTTP response content, i.e. the page content at the URL |
| r.encoding | Response content encoding guessed from the HTTP headers |
| r.apparent_encoding | Response content encoding analyzed from the content itself (a fallback encoding) |
| r.content | Binary form of the HTTP response content |

res = requests.get(url)

code = res.text  # .text for text, .content for binary, .json() for JSON parsing
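A small sketch tying these Response properties together (the URL is just a placeholder):

    import requests

    res = requests.get('https://example.com')   # placeholder URL
    print(res.status_code)                       # 200 on success
    res.encoding = res.apparent_encoding         # fall back to the encoding guessed from the content
    print(res.text[:100])                        # first 100 characters of the page as text
    print(len(res.content))                      # size of the binary response body in bytes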

(3) requests.head()

Gets the headers of an HTML page, corresponding to HTTP HEAD.

res = requests.head(url)

(4) requests.post()

Submits a POST request to a web page, corresponding to HTTP POST.

res = requests.post(url)

(5) requests.put()

Submits a PUT request to a web page, corresponding to HTTP PUT.

(6) requests.patch()

Submits a partial-modification request to a web page, corresponding to HTTP PATCH.

(7) requests.delete()

Submits a DELETE request to a web page, corresponding to HTTP DELETE.
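As noted above, the other six methods are thin wrappers around request(); for example, these two calls are equivalent (the URL is a placeholder):

    import requests

    r1 = requests.get('https://example.com', params={'q': 'python'})
    r2 = requests.request('GET', 'https://example.com', params={'q': 'python'})
    print(r1.status_code, r2.status_code)  # both issue the same HTTP GET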

""" Requests operation exercise """

import requests
import re

# Data crawling
h = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get('https://movie.douban.com/chart', headers=h)
html_str = response.text

# Data parsing
pattern = re.compile('<a class="nbg".*?title="(.*?)">')  # .*? matches as few characters as possible
result = re.findall(pattern, html_str)
print(result)

2. re (regular expressions)

A regular expression is a special string of letters and symbols used to find text in the format you want.

About .*? :

* matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". Equivalent to {0,}.

? after a quantifier makes the match non-greedy. Non-greedy mode matches as little of the searched string as possible. For example, in the string "oooo", "o+?" matches a single "o", while "o+" matches all of the "o"s.

. matches any single character other than "\n". To match any character including "\n", use a pattern like "(.|\n)".

.* is greedy: it first matches as far as it can, and then backtracks as needed to satisfy the rest of the regular expression.

.*? is the opposite: once a match is found it moves on, so there is no backtracking; it matches as few characters as possible while still letting the overall pattern match.

(.*) is a greedy match and captures as many characters as possible, so in a pattern like "h(.*)l" it matches everything between the first "h" and the last "l".
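A quick demonstration of the difference on a small HTML-like string (the sample text is illustrative):

    import re

    s = '<a class="nbg" title="A"></a><a class="nbg" title="B"></a>'
    print(re.findall('title="(.*)"', s))   # greedy: ['A"></a><a class="nbg" title="B']
    print(re.findall('title="(.*?)"', s))  # non-greedy: ['A', 'B']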

03. Parsing the source code with XPath

import requests
import re
from bs4 import BeautifulSoup
from lxml import etree

# Data crawling (with some HTTP header information)
h = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get('movie.XX.com/chart', headers=h)
html_str = response.text

# Data parsing

# Regular expression parsing
def re_parse(html_str):
    pattern = re.compile('<a class="nbg".*?title="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results

# bs4 parsing
def bs4_parse(html_str):
    soup = BeautifulSoup(html_str, 'lxml')
    items = soup.find_all(class_='nbg')
    for item in items:
        print(item.attrs['title'])

# lxml (XPath) parsing
def lxml_parse(html_str):
    html = etree.HTML(html_str)
    results = html.xpath('//a[@class="nbg"]/@title')
    print(results)
    return results

re_parse(html_str)
bs4_parse(html_str)
lxml_parse(html_str)

04. Python crawler architecture

The basic crawler architecture can be divided into five components: the crawler scheduler, URL manager, HTML downloader, HTML parser, and data storage.

Here are the functions of these five components:

**① Crawler scheduler:** coordinates the calls to the other four modules; "scheduling" simply means invoking the other modules.

**② URL manager:** manages URL links, which fall into two groups, already crawled and not yet crawled; it also provides an interface for obtaining new URLs.

**③ HTML downloader:** downloads the HTML of the pages to be crawled.

**④ HTML parser:** extracts the target data from the HTML source, sends newly found URLs to the URL manager, and passes the processed data to the data storage.

**⑤ Data storage:** stores the data received from the HTML parser locally.
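A minimal sketch of how the five components might fit together; the class and method names are illustrative, not from the original article:

    # Illustrative skeleton of the five-component crawler architecture
    import requests

    class UrlManager:
        """Tracks which URLs still need crawling and which are done."""
        def __init__(self):
            self.new_urls, self.old_urls = set(), set()
        def add(self, url):
            if url and url not in self.old_urls:
                self.new_urls.add(url)
        def has_new(self):
            return bool(self.new_urls)
        def get(self):
            url = self.new_urls.pop()
            self.old_urls.add(url)
            return url

    class HtmlDownloader:
        """Downloads the HTML of a page."""
        def download(self, url):
            return requests.get(url).text

    class HtmlParser:
        """Extracts new URLs and target data from the HTML; details depend on the site."""
        def parse(self, html):
            return [], html[:100]

    class DataStorage:
        """Stores the parsed data locally."""
        def save(self, data):
            with open('result.txt', 'a', encoding='utf-8') as f:
                f.write(str(data) + '\n')

    class Scheduler:
        """Coordinates the other four modules."""
        def run(self, seed_url):
            urls, downloader, parser, store = UrlManager(), HtmlDownloader(), HtmlParser(), DataStorage()
            urls.add(seed_url)
            while urls.has_new():
                html = downloader.download(urls.get())
                new_urls, data = parser.parse(html)
                for u in new_urls:
                    urls.add(u)
                store.save(data)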

Whois crawl

Each year, millions of individuals, businesses, organizations and government agencies register domain names. Each registrant must provide identifying information and contact information, including name, address, email, contact number, administrative contact, and technical contact. This type of information is commonly known as WHOIS data

"""
whois
whois.chinaz.com/sina.com
"""

import requests
import re

h = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}

response = requests.get('https://whois.chinaz.com/' + input(), headers=h)  # type the domain to query, e.g. sina.com
print(response.status_code)
html = response.text
# print(html)

# Parse the data
pattern = re.compile('class="MoreInfo".*?>(.*?)</p>', re.S)  # re.S lets . also match newlines
result = re.findall(pattern, html)

# Method 1:
# str = re.sub('\n', ',', result[0])
# print(str)

# Method 2:
print(result[0].replace('\n', ','))

Crawl movie information

""" Crawl top 100 movie info """

import requests
import re
import time

# count = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
h = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}

response = requests.get('XX.com/board/4?off…', headers=h)
response.encoding = 'utf-8'
html = response.text
time.sleep(2)

# Capture the title, the 主演 (starring) line and the 上映时间 (release date) line
pattern = re.compile('class="name">.*?title="(.*?)".*?主演：(.*?)</p>.*?上映时间：(.*?)</p>', re.S)
# time.sleep(2)
result = re.findall(pattern, html)
print(result)

with open('maoyan.txt', 'a', encoding='utf-8') as f:
    for item in result:  # each item in result is a tuple
        for i in item:
            f.write(i.strip().replace('\n', ','))
        # print('\n')

Crawling images

""" 616pic image crawl exercise: 616pic.com/png/ ==> XX.616pic.com/ys_img/00/… """

import requests
import re
import time

# Data crawling: get the image URLs
def get_urls():
    response = requests.get('XX.com/png/')
    html_str = response.text
    # Parse the data to get the image URLs
    pattern = re.compile('<img class="lazy" data-original="(.*?)"')
    results = re.findall(pattern, html_str)
    print(results)
    return results

# Download the images
def down_load_img(urls):
    for url in urls:
        response = requests.get(url)
        with open('temp/' + url.split('/')[-1], 'wb') as f:
            f.write(response.content)
        print(url.split('/')[-1], 'downloaded successfully')

if __name__ == '__main__':
    urls = get_urls()
    down_load_img(urls)  # download the collected URLs

Crawling beauty images

""" Toutiao (headline) beauty image crawl ==== Method 1 """

import requests
import re

url = 'www.XX.com/api/search/…'

response = requests.get(url)
print(response.status_code)
html_str = response.text

# Parse "large_image_url":"(.*?)"
pattern = re.compile('"large_image_url":"(.*?)"')
urls = re.findall(pattern, html_str)
print(urls)

def down_load(urls):
    for url in urls:
        response = requests.get(url)
        with open('pic/' + url.split('/')[-1], 'wb') as f:
            f.write(response.content)
        print(url.split('/')[-1], 'downloaded successfully')

if __name__ == '__main__':
    down_load(urls)

""" Toutiao (headline) beauty image crawl ==== Method 2 """

import requests
import re
from urllib.parse import urlencode

# www.XX.com/api/search/…
def get_urls(page):
    keys = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': 20 * page,
        'keyword': 'beauty',
        'count': '20'
    }
    keys_word = urlencode(keys)
    url = 'www.XX.com/api/search/…' + keys_word
    response = requests.get(url)
    print(response.status_code)
    html_str = response.text
    # Parse "large_image_url":"(.*?)"
    pattern = re.compile('"large_image_url":"(.*?)"', re.S)
    urls = re.findall(pattern, html_str)
    return urls

# Download the images
def download_imags(urls):
    for url in urls:
        response = requests.get(url)
        with open('pic/' + url.split('/')[-1] + '.jpg', 'wb') as f:
            f.write(response.content)
        print(url.split('/')[-1] + '.jpg', 'downloaded ~~')

if __name__ == '__main__':
    for page in range(3):
        urls = get_urls(page)
        print(urls)
        download_imags(urls)
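For reference, a quick look at what urlencode does to the keys dict used above:

    from urllib.parse import urlencode

    print(urlencode({'offset': 20, 'keyword': 'beauty', 'count': '20'}))
    # -> offset=20&keyword=beauty&count=20 (ready to append to the API URL as a query string)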

05. Thread pool

Thread pooling is a form of multithreaded processing in which tasks are added to a queue and started automatically as threads are created. Thread-pool threads are background threads; each uses the default stack size, runs at the default priority, and lives in a multithreaded apartment.

""" Thread pool """

from concurrent.futures import ThreadPoolExecutor
import time
import threading

def ban_zhuang(i):
    print(threading.current_thread().name, 'started moving bricks {}'.format(i))
    time.sleep(2)
    print('Employee {} finished moving bricks, total bricks moved: {}'.format(i, 12**2))  # format() fills in the {} placeholders

if __name__ == '__main__':  # main thread
    start_time = time.time()
    print(threading.current_thread().name, 'main thread starts')
    with ThreadPoolExecutor(max_workers=5) as pool:
        for i in range(10):
            p = pool.submit(ban_zhuang, i)
    end_time = time.time()
    print('Moving bricks took {} seconds in total'.format(end_time - start_time))

A crawler with multiple threads:

""" Toutiao (headline) beauty image crawl, multithreaded """

import requests
import re
from urllib.parse import urlencode
import time
import threading

# www.XX.com/api/search/…
def get_urls(page):
    keys = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': 20 * page,
        'keyword': 'beauty',
        'count': '20'
    }
    keys_word = urlencode(keys)
    url = 'www.XX.com/api/search/…' + keys_word
    response = requests.get(url)
    print(response.status_code)
    html_str = response.text
    # Parse "large_image_url":"(.*?)"
    pattern = re.compile('"large_image_url":"(.*?)"', re.S)
    urls = re.findall(pattern, html_str)
    return urls

# Download the images
def download_imags(urls):
    for url in urls:
        try:
            response = requests.get(url)
            with open('pic/' + url.split('/')[-1] + '.jpg', 'wb') as f:
                f.write(response.content)
            print(url.split('/')[-1] + '.jpg', 'downloaded ~~')
        except Exception as err:
            print('An exception happened:', err)

if __name__ == '__main__':
    start = time.time()
    thread = []
    for page in range(3):
        urls = get_urls(page)
        # print(urls)
        # Multithreading: one thread per image URL
        for url in urls:
            th = threading.Thread(target=download_imags, args=([url],))  # download_imags expects a list
            # download_imags(urls)
            thread.append(th)
    for t in thread:
        t.start()
    for t in thread:
        t.join()
    end = time.time()
    print('Time taken:', end - start)
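As an alternative to creating one Thread per URL, the ThreadPoolExecutor from section 05 could drive the same two functions; a sketch under that assumption:

    from concurrent.futures import ThreadPoolExecutor

    def pooled_download(pages=3, workers=5):
        # Reuses get_urls() and download_imags() defined above; 5 worker threads share the work
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for page in range(pages):
                for url in get_urls(page):
                    pool.submit(download_imags, [url])  # download_imags expects a list of URLs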

Tips — Crawler protocol

The Robots protocol, also known as the crawler protocol or robot protocol, and formally the Robots Exclusion Protocol (web crawler exclusion standard), tells crawlers and search engines which pages may and may not be fetched. It is usually a robots.txt text file placed in the root directory of the website.

To read it, append /robots.txt to the site root, for example www.baidu.com/robots.txt:

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
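The standard library can check these rules before crawling; a small sketch with urllib.robotparser (Baidu's robots.txt is used only as an example):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('https://www.baidu.com/robots.txt')
    rp.read()
    # can_fetch(user_agent, url) returns True only if the rules allow that agent to fetch the URL
    print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/baidu'))  # disallowed by the rules above
    print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/'))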

Follow the crawler protocol: these scripts are only for crawling practice and fun, and remember to use a proxy. (I have changed the links in this article; if you want to practice, message me privately or find your own links… it's fun.)

07. Related links

Requests installation and usage: www.jianshu.com/p/140012f88…

Using re: www.cnblogs.com/vmask/p/636…

Other crawler-related articles: blog.csdn.net/qq\_2729739…

Crawler video course: www.imooc.com/learn/563