This article is participating in Python Theme Month. See the link for details.

Preparation

After crawling the TOP10 coldest summer cities and the other data crawls in the previous articles, we should have a rough idea of how crawling an interface or a page works. In this article we will crawl a collection of jokes for everyone ~

The local runtime environment is again based on Docker; the build details are only sketched here, so check the previous article via the portal link.

The code

Requirement analysis

Here we choose Qiushibaike. By inspecting the elements on the page, we can quickly find the location of the elements we want.

  1. First, locate the elements we want on the page and work out the pagination pattern, as sketched just after this list.
  2. Then repeat the request for each page, crawl the elements on it, and output them.
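For the text channel, the pagination pattern is simply the page number embedded in the URL path, so building the list of page URLs is a one-liner; a minimal sketch using the base_url defined in the code below:

# The page number slots straight into the URL path
base_url = 'https://www.qiushibaike.com/text/page/%s/'

# Pages 1 through 9, matching range(1, 10) in spider() below
page_urls = [base_url % page_num for page_num in range(1, 10)]
print(page_urls[0])  # https://www.qiushibaike.com/text/page/1/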

Write the code

  1. First, define the entry function that crawls the corresponding data items.
def spider():
	jokes = []
	for page_num in range(1, 10):
		# spider_page returns a list per page, so extend keeps jokes flat
		jokes.extend(spider_page(base_url % page_num))
	for joke in jokes:
		print(joke)
	print('congratulations! Crawl data complete!')

if __name__ == '__main__':
	spider()
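The entry function above assumes every request succeeds; if one page times out or the site throttles us, the whole run aborts. A variant with basic error handling and a pause between requests is safer; a minimal sketch (spider_safe and the one-second delay are assumptions of mine, not part of the original code):

import time

import requests

def spider_safe():
	jokes = []
	for page_num in range(1, 10):
		try:
			jokes.extend(spider_page(base_url % page_num))
		except requests.RequestException as err:
			# Skip a failed page instead of aborting the whole crawl
			print('page %s failed: %s' % (page_num, err))
		time.sleep(1)  # assumed polite delay; tune as needed
	return jokes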
  2. Crawl and collect the elements on the corresponding page.
def spider_page(url):
	response = requests.get(url, headers=HEADERS)
	text_raw = response.text

	# Get the data for this page
	# 1. Get the author list data
	authors_pre = re.findall(r'<div\sclass="article.*?>.*?<h2>(.*?)</h2>', text_raw, re.DOTALL)
	# 1.1 Further processing of the acquired author info [the data contains \n]
	authors = []
	for author_pre in authors_pre:
		author = re.sub(r'\n', '', author_pre)
		authors.append(author)

	# 2. Get the joke list data
	contents_pre = re.findall(r'<div\sclass="content">.*?<span>(.*?)</span>', text_raw, re.S)
	# 2.1 Strip leftover tags and newlines from the joke text
	contents = []
	for content_pre in contents_pre:
		content = re.sub(r'<.*?>|\n', '', content_pre)
		contents.append(content)

	# 3. Assemble the two lists into a new list of dicts
	jokes = []
	for temp in zip(authors, contents):
		author, content = temp
		jokes.append({
			'author': author,
			'content': content
		})

	# 4. Return the list of jokes retrieved from the current page
	return jokes
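To sanity-check the two regular expressions without hitting the live site, you can run them against a small inline HTML snippet; the sample below is a hand-written imitation of the page structure the regexes assume, not the real markup:

import re

# Hand-crafted sample imitating the assumed page structure
sample = '''
<div class="article block untagged">
<div class="author clearfix"><h2>
Some Author
</h2></div>
<div class="content"><span>First line<br/>second line</span></div>
</div>
'''

authors = [re.sub(r'\n', '', a)
           for a in re.findall(r'<div\sclass="article.*?>.*?<h2>(.*?)</h2>', sample, re.DOTALL)]
contents = [re.sub(r'<.*?>|\n', '', c)
            for c in re.findall(r'<div\sclass="content">.*?<span>(.*?)</span>', sample, re.S)]
print(list(zip(authors, contents)))  # [('Some Author', 'First linesecond line')]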

Screenshot of the result

The complete code


import re
import requests

# address to be crawled
base_url = 'https://www.qiushibaike.com/text/page/%s/'

HEADERS = {
	'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
	'Referer': 'https://www.qiushibaike.com/'
}


def spider_page(url):
	response = requests.get(url, headers=HEADERS)
	text_raw = response.text

	# Get the data for this page
	# 1. Get the author list data
	authors_pre = re.findall(r'<div\sclass="article.*?>.*?<h2>(.*?)</h2>', text_raw, re.DOTALL)
	# 1.1 Further processing of the acquired author info [the data contains \n]
	authors = []
	for author_pre in authors_pre:
		author = re.sub(r'\n', '', author_pre)
		authors.append(author)

	# 2. Get the joke list data
	contents_pre = re.findall(r'<div\sclass="content">.*?<span>(.*?)</span>', text_raw, re.S)
	# 2.1 Strip leftover tags and newlines from the joke text
	contents = []
	for content_pre in contents_pre:
		content = re.sub(r'<.*?>|\n', '', content_pre)
		contents.append(content)

	# 3. Assemble the two lists into a new list of dicts
	jokes = []
	for temp in zip(authors, contents):
		author, content = temp
		jokes.append({
			'author': author,
			'content': content
		})

	# 4. Return the list of jokes retrieved from the current page
	return jokes


def spider():
	jokes = []
	for page_num in range(1, 10):
		# spider_page returns a list per page, so extend keeps jokes flat
		jokes.extend(spider_page(base_url % page_num))
	for joke in jokes:
		print(joke)
	print('congratulations! Crawl data complete!')


if __name__ == '__main__':
	spider()
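If you want to keep the crawled jokes rather than just printing them, a few extra lines write the list to disk; a minimal sketch (the save_jokes helper and the jokes.json filename are assumptions of mine):

import json

def save_jokes(jokes, path='jokes.json'):
	# ensure_ascii=False keeps non-ASCII joke text readable in the file
	with open(path, 'w', encoding='utf-8') as f:
		json.dump(jokes, f, ensure_ascii=False, indent=2)

Call save_jokes(jokes) inside spider() after the crawl finishes to persist the results.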

Conclusion

Through the simple little example above, we should be able to grasp the workflow of a basic crawler; the specific operation still needs to be adapted to the actual scenario.

(Previous articles below)

  1. How to Get Video Resources Faster | Python Theme Month
  2. Faster Access to Tencent Job Listings | Python Theme Month
  3. TOP10 Coldest Cities in Summer | Python Theme Month