Disclaimer: The techniques and implementation steps described in this article are intended only for learning about and practicing crawler technology. We assume no responsibility for any consequences arising from anyone's acts or omissions based on all or part of this article.
Crawl requirement: from the website [www.shixiseng.com], under any query conditions, crawl the company name, job title, salary range, work location and other information from the first 5 pages of results.
Crawl tools: Chrome, PyCharm
Python libraries: Requests, BeautifulSoup
01 Website structure analysis
Enter the URL [www.shixiseng.com] to open the Shixiseng internship website, then click the “Search” button to open the query results page.
Find the URL of each results page from the link information on the page buttons: www.shixiseng.com/interns?pag…
www.shixiseng.com/interns?pag…
www.shixiseng.com/interns?pag…
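Since only the page number changes between these URLs, the first five can be generated in a loop. The following is a minimal sketch, not the article's original code. The query parameter name is truncated in the URLs above, so the hypothetical `build_page_urls` helper takes it as an argument; the `"page"` value and the `https://` scheme used in the example call are placeholders to check against the real links.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, page_param, pages=5, extra_params=None):
    """Return one list-page URL per page number, starting at page 1."""
    urls = []
    for page in range(1, pages + 1):
        params = dict(extra_params or {})  # optional query conditions
        params[page_param] = page
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


# Placeholder parameter name: the real one is truncated in the text above.
for url in build_page_urls("https://www.shixiseng.com/interns", "page"):
    print(url)
```

Passing the search conditions through `extra_params` keeps the same helper usable for any query.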
The fields to be crawled only appear on the page that opens when you click the hyperlink on a job title. Therefore, we first need to crawl the URLs of all the job details pages. In Chrome, right-click a job title and click Inspect to locate the HTML element that holds the URL.
Analyze the HTML in Chrome's developer tools to find the key location information for obtaining the job details.
To sum up: first request the job list page, crawl the job entries in the list to get each job's details URL, and then crawl the required fields from each details URL.
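The two-step flow can be sketched as a small driver function. This is an illustrative outline under assumptions, not the article's code: `crawl` and its three callable parameters (`fetch`, `parse_list`, `parse_detail`) are hypothetical names, and the downloading and parsing logic is passed in rather than hard-coded.

```python
def crawl(list_urls, fetch, parse_list, parse_detail):
    """Visit each list page, follow every details link, and collect records.

    fetch(url) returns HTML text, parse_list(html) returns details URLs,
    and parse_detail(html) returns one record for a job.
    """
    records = []
    seen = set()
    for list_url in list_urls:
        for detail_url in parse_list(fetch(list_url)):
            if detail_url in seen:  # skip jobs listed on more than one page
                continue
            seen.add(detail_url)
            records.append(parse_detail(fetch(detail_url)))
    return records
```

In practice `fetch` could wrap a Requests call and the two parsers could use BeautifulSoup; keeping them as parameters also makes the flow easy to try out with canned HTML.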
02 Crawl the details page URLs
Create a Python project and start writing code based on the site structure analyzed above:
Job list resolution location information: .intern-wrap.intern-item
Job details URL resolution location information: .f-l.intern-detail__job a
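A minimal sketch of this step with Requests and BeautifulSoup might look like the following. It is a reconstruction under assumptions, not the original code: the function names, the `User-Agent` header and the timeout are illustrative choices, and the selectors are the ones quoted above written in lowercase, so they should be checked against the live page.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: a browser-like header; the site may need different headers.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url):
    """Download a page and return its HTML text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def extract_detail_urls(list_html):
    """Return the href of every job-title link found on a list page."""
    soup = BeautifulSoup(list_html, "html.parser")
    urls = []
    for item in soup.select(".intern-wrap.intern-item"):
        link = item.select_one(".f-l.intern-detail__job a")
        if link is not None and link.get("href"):
            urls.append(link["href"])
    return urls
```

Calling `extract_detail_urls(fetch_html(list_url))` for each results page yields the details URLs to visit next.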
Run the code and the result is as follows:
The URLs of the job details pages were crawled successfully!
03 Crawl the details page company
Continue writing code: open each details page URL and extract the company information from the page.
The HTML location information of the company information is .com_intr.com-name
Continue writing parsing code:
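The parsing step could be sketched as below. This is an assumption-laden illustration rather than the original code: `select_text` is a hypothetical helper, and because the text does not make clear whether the two class names in the company selector belong to one element or to a parent and child, the selector is passed in as an argument. The text is returned unmodified, so any surrounding whitespace from the page is still present.

```python
from bs4 import BeautifulSoup


def select_text(html, selector):
    """Return the raw text of the first element matching a CSS selector.

    Returns None when nothing matches. Whitespace is left untouched.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    return node.get_text() if node is not None else None
```

For example, `select_text(detail_html, ".com_intr .com-name")` would read the company name if the second class sits on a child element.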
Run code:
The company information obtained contains spaces and empty lines, so remove them:
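One simple way to do this cleanup uses only string methods. The helper name `clean_text` is hypothetical, and joining the remaining lines with a single space is a design choice made for this sketch.

```python
def clean_text(raw):
    """Strip each line, drop the empty ones, and join the rest with spaces."""
    lines = (line.strip() for line in raw.splitlines())
    return " ".join(line for line in lines if line)


print(repr(clean_text("\n   Example Company \n\n")))
```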
Run the code again:
Company information crawled successfully!
04 Crawl the details page job title
On the job details page, the key location information for the job title is .new_job_name span
Write code to extract the job title:
Run the code and the result is as follows:
Job title crawled successfully!
05 Crawl the details page salary range
Analyzing the details page, the key location information for the salary range is: .job_money.cutom_font
Write code to extract salary range information:
The running results are as follows:
The salary range is crawled successfully, but it displays as garbled characters. This is most likely because the website obfuscates the digits in the salary field to prevent key information from being crawled. The simplest workaround is to encode the text as UTF-8, work out which byte sequence corresponds to each of the digits 0-9, and then batch-replace those sequences in the crawled text so that the numbers display correctly.
Copy any one salary value and look at its UTF-8 bytes in code:
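A short sketch of this inspection follows. The sample string is rebuilt from the byte sequences quoted in this section, assuming the copied value has the form 300-400/天; `utf8_breakdown` is a hypothetical helper that pairs each character with its UTF-8 bytes so the unreadable characters can be told apart.

```python
def utf8_breakdown(text):
    """Pair every character with its UTF-8 byte sequence."""
    return [(ch, ch.encode("utf-8")) for ch in text]


# Assumed sample, rebuilt from the bytes quoted in this section.
sample = (
    b"\xee\xa3\xb2\xef\xa2\x9e\xef\xa2\x9e-"
    b"\xef\x8b\x8a\xef\xa2\x9e\xef\xa2\x9e/\xe5\xa4\xa9"
).decode("utf-8")

for char, raw in utf8_breakdown(sample):
    print(repr(char), raw)
```

The ASCII characters `-` and `/` encode to single bytes, which makes them convenient anchors for splitting the sequence into numbers and unit.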
By comparing the symbols “-” and “/” in the middle, the following relationship can be obtained:
300 = \xee\xa3\xb2\xef\xa2\x9e\xef\xa2\x9e
400 = \xef\x8b\x8a\xef\xa2\x9e\xef\xa2\x9e
day (天) = \xe5\xa4\xa9
0 = \xef\xa2\x9e
3 = \xee\xa3\xb2
4 = \xef\x8b\x8a
Print other numbers and units from the page in the same way to work out the remaining correspondences; the full mapping is not demonstrated in this article.
Apply the mapping inferred above in code:
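A minimal sketch of the replacement step is shown below. It covers only the three digits identified in this section (0, 3 and 4), so any other obfuscated digit stays unreadable; the names `DIGIT_BYTES` and `decode_salary` are hypothetical, and the sample value is assumed to have the form 300-400/天.

```python
# Partial mapping from obfuscated UTF-8 byte sequences to ASCII digits.
# Only the digits worked out in the text are included.
DIGIT_BYTES = {
    b"\xef\xa2\x9e": b"0",
    b"\xee\xa3\xb2": b"3",
    b"\xef\x8b\x8a": b"4",
}


def decode_salary(text, mapping=DIGIT_BYTES):
    """Replace each known obfuscated byte sequence with its digit."""
    data = text.encode("utf-8")
    for obfuscated, digit in mapping.items():
        data = data.replace(obfuscated, digit)
    return data.decode("utf-8")


# Assumed sample, rebuilt from the bytes quoted in this section.
sample = (
    b"\xee\xa3\xb2\xef\xa2\x9e\xef\xa2\x9e-"
    b"\xef\x8b\x8a\xef\xa2\x9e\xef\xa2\x9e/\xe5\xa4\xa9"
).decode("utf-8")
print(decode_salary(sample))
```

Extending the dictionary with the remaining digits is all that is needed once their byte sequences are known.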
The running results are as follows:
Some garbled characters remain because the mapping is incomplete, but the salary range information is crawled successfully!
06 Crawl the details page work place
Analyzing the details page, the key location information for the work place is: .com_position
Write code to extract:
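With the work place located, the fields from sections 04 to 06 can be gathered in one pass. The following is a sketch under assumptions rather than the original code: the `SELECTORS` dictionary and `parse_detail` are hypothetical names, and the selectors are the ones quoted in the text written in lowercase, which may not match the current page. The company name can be added to the dictionary in the same way once its selector is confirmed.

```python
from bs4 import BeautifulSoup

# Selectors quoted in the text (lowercased); verify them on the live page.
SELECTORS = {
    "job": ".new_job_name span",
    "salary": ".job_money.cutom_font",
    "location": ".com_position",
}


def parse_detail(detail_html, selectors=SELECTORS):
    """Return a dict with one stripped text value per field.

    A field is None when its selector matches nothing.
    """
    soup = BeautifulSoup(detail_html, "html.parser")
    record = {}
    for field, selector in selectors.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(strip=True) if node is not None else None
    return record
```

The salary value returned here is still obfuscated and needs the digit replacement described in section 05.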
The result of running the code is as follows:
Work place information crawled successfully!
All the required information has now been crawled, and the crawler code is complete!
All sample code can be downloaded by replying with the keyword [pachong23] to the WeChat official account!