This article is participating in Python Theme Month. See the link for details.

Preparation

After crawling the TOP10 coldest summer cities and the other data crawls in the previous articles, we should have a rough idea of how crawling an interface or a page works. In this article we will crawl a collection of jokes for everyone ~

The local runtime environment is again based on Docker; the build details are only sketched here, so check the previous article via the portal link.

The code

Requirement analysis

Here we choose Qiushibaike. By inspecting the elements on the page, we can quickly find the location of the elements we want.

  1. First, locate the elements we want on the page and work out the pagination pattern, as sketched just after this list.
  2. Then repeat the request for each page, crawl the elements on it, and output them.
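For the text channel, the pagination pattern is simply the page number embedded in the URL path, so building the list of page URLs is a one-liner; a minimal sketch using the base_url defined in the code below:

# The page number slots straight into the URL path
base_url = 'https://www.qiushibaike.com/text/page/%s/'

# Pages 1 through 9, matching range(1, 10) in spider() below
page_urls = [base_url % page_num for page_num in range(1, 10)]
print(page_urls[0])  # https://www.qiushibaike.com/text/page/1/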

Write the code

  1. First, define the entry function that crawls the corresponding data items.
def spider():
	jokes = []
	for page_num in range(1, 10):
		# spider_page returns a list per page, so extend keeps jokes flat
		jokes.extend(spider_page(base_url % page_num))
	for joke in jokes:
		print(joke)
	print('congratulations! Crawl data complete!')

if __name__ == '__main__':
	spider()
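The entry function above assumes every request succeeds; if one page times out or the site throttles us, the whole run aborts. A variant with basic error handling and a pause between requests is safer; a minimal sketch (spider_safe and the one-second delay are assumptions of mine, not part of the original code):

import time

import requests

def spider_safe():
	jokes = []
	for page_num in range(1, 10):
		try:
			jokes.extend(spider_page(base_url % page_num))
		except requests.RequestException as err:
			# Skip a failed page instead of aborting the whole crawl
			print('page %s failed: %s' % (page_num, err))
		time.sleep(1)  # assumed polite delay; tune as needed
	return jokes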
  2. Crawl and collect the elements on the corresponding page.
def spider_page(url):
	response = requests.get(url, headers=HEADERS)
	text_raw = response.text

	# Get the data for this page
	# 1. Get the author list data
	authors_pre = re.findall(r'<div\sclass="article.*?>.*?<h2>(.*?)</h2>', text_raw, re.DOTALL)
	# 1.1 Further processing of the acquired author info [the data contains \n]
	authors = []
	for author_pre in authors_pre:
		author = re.sub(r'\n', '', author_pre)
		authors.append(author)

	# 2. Get the joke list data
	contents_pre = re.findall(r'<div\sclass="content">.*?<span>(.*?)</span>', text_raw, re.S)
	# 2.1 Strip leftover tags and newlines from the joke text
	contents = []
	for content_pre in contents_pre:
		content = re.sub(r'<.*?>|\n', '', content_pre)
		contents.append(content)

	# 3. Assemble the two lists into a new list of dicts
	jokes = []
	for temp in zip(authors, contents):
		author, content = temp
		jokes.append({
			'author': author,
			'content': content
		})

	# 4. Return the list of jokes retrieved from the current page
	return jokes
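To sanity-check the two regular expressions without hitting the live site, you can run them against a small inline HTML snippet; the sample below is a hand-written imitation of the page structure the regexes assume, not the real markup:

import re

# Hand-crafted sample imitating the assumed page structure
sample = '''
<div class="article block untagged">
<div class="author clearfix"><h2>
Some Author
</h2></div>
<div class="content"><span>First line<br/>second line</span></div>
</div>
'''

authors = [re.sub(r'\n', '', a)
           for a in re.findall(r'<div\sclass="article.*?>.*?<h2>(.*?)</h2>', sample, re.DOTALL)]
contents = [re.sub(r'<.*?>|\n', '', c)
            for c in re.findall(r'<div\sclass="content">.*?<span>(.*?)</span>', sample, re.S)]
print(list(zip(authors, contents)))  # [('Some Author', 'First linesecond line')]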

Screenshot of the result

The complete code


import re
import requests

# address to be crawled
base_url = 'https://www.qiushibaike.com/text/page/%s/'

HEADERS = {
	'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
	'Referer': 'https://www.qiushibaike.com/'
}


def spider_page(url):
	response = requests.get(url, headers=HEADERS)
	text_raw = response.text

	# Get the data for this page
	# 1. Get the author list data
	authors_pre = re.findall(r'<div\sclass="article.*?>.*?<h2>(.*?)</h2>', text_raw, re.DOTALL)
	# 1.1 Further processing of the acquired author info [the data contains \n]
	authors = []
	for author_pre in authors_pre:
		author = re.sub(r'\n', '', author_pre)
		authors.append(author)

	# 2. Get the joke list data
	contents_pre = re.findall(r'<div\sclass="content">.*?<span>(.*?)</span>', text_raw, re.S)
	# 2.1 Strip leftover tags and newlines from the joke text
	contents = []
	for content_pre in contents_pre:
		content = re.sub(r'<.*?>|\n', '', content_pre)
		contents.append(content)

	# 3. Assemble the two lists into a new list of dicts
	jokes = []
	for temp in zip(authors, contents):
		author, content = temp
		jokes.append({
			'author': author,
			'content': content
		})

	# 4. Return the list of jokes retrieved from the current page
	return jokes


def spider():
	jokes = []
	for page_num in range(1, 10):
		# spider_page returns a list per page, so extend keeps jokes flat
		jokes.extend(spider_page(base_url % page_num))
	for joke in jokes:
		print(joke)
	print('congratulations! Crawl data complete!')


if __name__ == '__main__':
	spider()
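If you want to keep the crawled jokes rather than just printing them, a few extra lines write the list to disk; a minimal sketch (the save_jokes helper and the jokes.json filename are assumptions of mine):

import json

def save_jokes(jokes, path='jokes.json'):
	# ensure_ascii=False keeps non-ASCII joke text readable in the file
	with open(path, 'w', encoding='utf-8') as f:
		json.dump(jokes, f, ensure_ascii=False, indent=2)

Call save_jokes(jokes) inside spider() after the crawl finishes to persist the results.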

Conclusion

Through the simple little example above, we should be able to grasp the workflow of a basic crawler; the specific operation still needs to be adapted to the actual scenario.

(Previous articles below)

  1. How to Get Video Resources Faster | Python Theme Month
  2. Faster Access to Tencent Job Listings | Python Theme Month
  3. TOP10 Coldest Cities in Summer | Python Theme Month