
Preparation

In the previous chapter we covered how to get better access to video resources, and how to extract the information we want from static pages. This time we will fetch job postings through a dynamic API instead!

The local runtime environment is again based on Docker; the build details are only sketched here, so see the previous article for the full setup.

The code

Requirement analysis

When we open the Tencent recruitment page and inspect its elements, the data we want is nowhere to be found in the HTML. Checking the network requests shows that the data is loaded through an API, so this time we will pull the information we want directly from that API.

  1. Inspect the listing page to find the pagination rules of the keyword-search `url`.
  2. Inspect the paging API's response to see which fields to extract.
  3. Inspect the detail-page API and extract the fields from its response.
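To illustrate step 2, the shape of the paging response can be mocked so the extraction logic is visible without a live request. This is a sketch: the key names (`Data`, `Posts`, `PostId`, `RecruitPostName`) come from the code later in this article, while the values are invented samples:

```python
import json

# A minimal mock of the paging API response; field names match the real API
# as used below, values are invented for illustration
sample = json.loads('''
{
  "Code": 200,
  "Data": {
    "Count": 2,
    "Posts": [
      {"PostId": "1001", "RecruitPostName": "Backend Engineer"},
      {"PostId": "1002", "RecruitPostName": "Data Analyst"}
    ]
  }
}
''')

def extract_post_ids(json_obj):
    """Collect the unique PostId values from one page of results."""
    return {item['PostId'] for item in json_obj['Data']['Posts']}

post_ids = extract_post_ids(sample)
print(sorted(post_ids))  # ['1001', '1002']
```

Each `PostId` collected here is what the detail API takes as its `postId` query parameter.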

Write the code

  1. First, define the entry function that extracts the key information from each page of paging data.
    for page_num in range(1, 2):
        print('Start crawling data from page {}'.format(page_num))
        # 1. The address for each page
        url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1625731961957&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(page_num)

        # 2. Get the post IDs of all positions on the current page
        detail_urls = get_jo_detail_urls(url)

        # 3. Parse the detail-page data one by one
        for detail_url in detail_urls:
            position = get_detail_msg(detail_url)
            positions.append(position)

        # Pause between pages to avoid hammering the server
        time.sleep(1)
  2. Next, parse the response, assemble the detail-page data, and pull the needed fields out of the returned JSON.
def get_detail_msg(detail_id):
    position = {}
    detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1625794375072&postId={}&language=zh-cn'.format(detail_id)
    response = requests.get(detail_url, headers=HEADERS)
    json_obj = json.loads(response.text)

    # [Data] Job title
    position['title'] = json_obj['Data']['RecruitPostName']

    # [Data] Location / job category
    position['location'] = json_obj['Data']['LocationName']
    position['category'] = json_obj['Data']['CategoryName']

    # [Data] Job responsibilities
    position['duty'] = json_obj['Data']['Responsibility']

    # [Data] Job requirements
    position['ask'] = json_obj['Data']['Requirement']

    return position


def get_jo_detail_urls(page_url):
    # Despite the name, this collects the PostId of each position on the page
    post_ids = set()
    response = requests.get(page_url, headers=HEADERS)
    json_obj = json.loads(response.text)
    for item in json_obj['Data']['Posts']:
        post_ids.add(item['PostId'])
    print(post_ids)
    return post_ids
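The field mapping in `get_detail_msg` can be exercised without hitting the network. Below is an offline check using a hand-made mock of the detail response; the key names match those used in the code above, but the values are invented for illustration:

```python
import json

# Mock of the detail API response; only the fields read by get_detail_msg
sample_detail = json.loads('''
{
  "Code": 200,
  "Data": {
    "RecruitPostName": "Python Crawler Engineer",
    "LocationName": "Shenzhen",
    "CategoryName": "Technology",
    "Responsibility": "Maintain data pipelines",
    "Requirement": "Familiar with requests and JSON APIs"
  }
}
''')

def build_position(json_obj):
    """Same field mapping as get_detail_msg, minus the HTTP call."""
    data = json_obj['Data']
    return {
        'title': data['RecruitPostName'],
        'location': data['LocationName'],
        'category': data['CategoryName'],
        'duty': data['Responsibility'],
        'ask': data['Requirement'],
    }

print(build_position(sample_detail)['title'])  # Python Crawler Engineer
```

Separating the parsing from the request like this also makes it easy to unit-test the extraction logic.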

Result screenshot

The complete code

import requests
import time
import json


HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'https://careers.tencent.com/search.html?keywords=python&lid=0&tid=0&start=1',
    'Cookie': 'pgv_pvi=9905274880; _ga=GA1.2.134754307.1606182211; pgv_pvid=3632371128; pgv_info=ssid=s598319774; _gcl_au=1.1.1062400509.1622338581; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22100019473226%22%2C%22first_id%22%3A%226ab28e9051a5f99e96cec737ad4367a7%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217a5f65aa69497-0a4a94eb345f15-34657601-1296000-17a5f65aa6ad9e%22%7D; loading=agree'
}


def get_detail_msg(detail_id):
    position = {}
    detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1625794375072&postId={}&language=zh-cn'.format(detail_id)
    response = requests.get(detail_url, headers=HEADERS)
    json_obj = json.loads(response.text)

    # [Data] Job title
    position['title'] = json_obj['Data']['RecruitPostName']

    # [Data] Location / job category
    position['location'] = json_obj['Data']['LocationName']
    position['category'] = json_obj['Data']['CategoryName']

    # [Data] Job responsibilities
    position['duty'] = json_obj['Data']['Responsibility']

    # [Data] Job requirements
    position['ask'] = json_obj['Data']['Requirement']

    return position



def get_jo_detail_urls(page_url):
    # Despite the name, this collects the PostId of each position on the page
    post_ids = set()
    response = requests.get(page_url, headers=HEADERS)
    json_obj = json.loads(response.text)
    for item in json_obj['Data']['Posts']:
        post_ids.add(item['PostId'])
    return post_ids


def spider():
    # 0. Job data to be returned
    positions = []

    for page_num in range(1, 2):
        print('Start crawling data from page {}'.format(page_num))
        # 1. The address for each page
        url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1625731961957&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(page_num)

        # 2. Get the post IDs of all positions on the current page
        detail_urls = get_jo_detail_urls(url)

        # 3. Parse the detail-page data one by one
        for detail_url in detail_urls:
            position = get_detail_msg(detail_url)
            positions.append(position)

        # Pause between pages to avoid hammering the server
        time.sleep(1)

    print(positions)
    print('Crawl complete!')

if __name__ == '__main__':
	spider()
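The script above only prints the results. As a possible follow-up (not part of the original article), the collected positions could be persisted to a CSV file; `save_positions` below is a hypothetical helper that assumes the dict keys built in `get_detail_msg`:

```python
import csv

# Hypothetical helper: write the crawled positions (list of dicts with the
# keys built in get_detail_msg) to a CSV file
def save_positions(positions, path='positions.csv'):
    fields = ['title', 'location', 'category', 'duty', 'ask']
    # utf-8-sig keeps Chinese text readable when the file is opened in Excel
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(positions)

save_positions([
    {'title': 'Backend Engineer', 'location': 'Shenzhen',
     'category': 'Technology', 'duty': 'Build services', 'ask': 'Python'},
])
```

In `spider()`, replacing the final `print(positions)` with `save_positions(positions)` would give a reusable data file instead of console output.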