A quick ramble

Boss Zhipin is a job site most job hunters know well; it leads the industry on selling points like "fast recruiting, accurate matching, open and transparent". Of course, I'm not here to advertise for them, I'm here to scrape them. Today we'll use the Scrapy framework together with Selenium to crawl job titles, salaries, benefits, company names and other information.

As for its anti-scraping measures: since our needs are modest, this approach is probably the simplest way around them.
Site analysis

Boss Zhipin website: www.zhipin.com

Its anti-scraping really is annoying: the listings are rendered based on cookies, the cookies expire very quickly, and rapid requests get your IP banned. A banned IP can be worked around with a proxy, but the first time you try a proxy you'll be told your IP is abnormal and dropped into a verification page, which apparently means paying for a captcha-solving platform. Ah, what a hassle. The home page itself isn't protected, though, so the plan is simply to slow down with Selenium. That's not exactly proper crawler posture, but when you don't understand the site's JS, it's a reasonable choice against this kind of anti-scraping mechanism.
Let's get started

We'll mainly capture: position, salary, company name, benefits, and requirements.
1. New crawler
Open a terminal (or CMD) and enter the following commands.
New project
scrapy startproject boss
Go to the boss directory
cd boss
New crawler
scrapy genspider boss_spider 'zhipin.com'
Open using PyCharm
2. Analyze the layout
Check the URLs:

Page 1: www.zhipin.com/c101120100/… (the trailing &ka=page-1 can be omitted)

Page 2: www.zhipin.com/c101120100/…

There are 10 pages in total.
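Given the pattern above, the 10 listing-page URLs can be generated up front. This is just a sketch: the `c101120100` city code and `query=python` parameter are taken from the spider's `start_urls` value shown later, not from anything beyond it.

```python
# Build the 10 listing-page URLs; the city code and query parameter
# come from the spider's start URL.
base = "https://www.zhipin.com/c101120100/?query=python&page={}"
page_urls = [base.format(n) for n in range(1, 11)]

print(page_urls[0])    # https://www.zhipin.com/c101120100/?query=python&page=1
print(len(page_urls))  # 10
```

In the spider below we won't precompute these; instead we follow the "next page" link, which stops by itself when the pages run out.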
Inspect the page: all the data we want lives in a ul list, so we can simply loop over its li elements.
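To make that structure concrete, here is a hand-written sketch of one list item (reconstructed from the XPath selectors used in the spider later, not copied from the site's actual markup), extracted with nothing but the standard library:

```python
import xml.etree.ElementTree as ET

# Hand-written sketch of the job list; the real page is more complex.
html = """
<div class="job-list">
  <ul>
    <li>
      <span class="job-name"><a>Python Developer</a></span>
      <span class="red">15K-25K</span>
      <h3 class="name"><a>Some Company</a></h3>
      <div class="tags"><span>1-3 years</span><span>Bachelor</span></div>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(html)
for li in root.findall("./ul/li"):
    job_name = li.find(".//span[@class='job-name']/a").text
    money = li.find(".//span[@class='red']").text
    tags = " ".join(s.text for s in li.findall(".//div[@class='tags']/span"))
    print(job_name, money, tags)  # Python Developer 15K-25K 1-3 years Bachelor
```

Scrapy's own selectors use the same `[@class='...']` predicates, just with full XPath support.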
3. Start writing

We know the pages and we know where the data sits, so we can start coding.
1. Set up middlewares and settings

Because we are going to use Selenium, we need to intercept the request in a downloader middleware and build the response object ourselves, so let's create a new class. If you are not sure about Selenium, you can check out this article. If you don't yet understand scrapy middleware, refer to this article first, or the scrapy details below will be confusing.
from scrapy import signals
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse


class CookiesMiddlewares(object):
    def __init__(self):
        print("Initializing the browser")
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Wait 5 seconds for the page to finish loading
        time.sleep(5)
        # Grab the rendered page source
        source = self.driver.page_source
        # An HtmlResponse object describes an HTTP response to scrapy
        response = HtmlResponse(url=self.driver.current_url, body=source,
                                request=request, encoding='utf-8')
        # Returning a response here skips the normal download and hands
        # the Selenium-rendered page straight to the spider
        return response
In settings.py, set the robots protocol to False:
ROBOTSTXT_OBEY = False
Enable a download delay:
DOWNLOAD_DELAY = 3
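DOWNLOAD_DELAY on its own is a fixed pause; a couple of related, standard scrapy settings are worth knowing about (an optional settings.py fragment):

```python
DOWNLOAD_DELAY = 3
# By default scrapy randomizes the actual wait to 0.5x-1.5x of DOWNLOAD_DELAY,
# which looks less robotic to the server
RANDOMIZE_DOWNLOAD_DELAY = True
# Or let scrapy adapt the delay to observed server response times
AUTOTHROTTLE_ENABLED = True
```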
Set the default request headers:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
}
Enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'boss.middlewares.CookiesMiddlewares': 543,
}
2. boss_spider.py
import scrapy
from boss.items import BossItem


class BossSpiderSpider(scrapy.Spider):
    name = 'boss_spider'
    allowed_domains = ['zhipin.com']
    start_urls = ['https://www.zhipin.com/c101120100/?query=python&page=1']
    base_url = 'https://www.zhipin.com'

    def parse(self, response):
        jobs = response.xpath("//div[@class='job-list']/ul/li")
        for i in jobs:
            job_name = i.xpath(".//span[@class='job-name']/a/text()").get()
            print(job_name)
            money = i.xpath(".//span[@class='red']/text()").get()
            name = i.xpath(".//h3[@class='name']/a/text()").get()
            tags = i.xpath(".//div[@class='tags']/span/text()").getall()
            tags = ' '.join(tags)
            info_desc = i.xpath(".//div[@class='info-desc']/text()").get()
            yield BossItem(job_name=job_name, money=money, name=name,
                           tags=tags, info_desc=info_desc)
        # Get the address of the next page; check it before concatenating,
        # since it is None on the last page
        page = response.xpath("//div[@class='page']/a[last()]/@href").get()
        if not page:
            print("Quit")
            return
        next_url = self.base_url + page
        print("Next page:", next_url)
        yield scrapy.Request(next_url)
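One detail in parse() deserves a callout: the next-page href must be checked before concatenating, because on the last page the XPath returns None and `self.base_url + None` raises a TypeError. A pure-Python sketch of that guard (build_next_url is a hypothetical helper for illustration, not part of the project):

```python
BASE_URL = 'https://www.zhipin.com'

def build_next_url(page_href):
    """Return the absolute next-page URL, or None when there is no next page.

    The check comes BEFORE the concatenation; doing it the other way
    around crashes on the last page.
    """
    if not page_href:
        return None
    return BASE_URL + page_href

print(build_next_url('/c101120100/?query=python&page=2'))
# https://www.zhipin.com/c101120100/?query=python&page=2
print(build_next_url(None))  # None
```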
There's nothing special to explain here, but if anything is unclear, check out this article for the scrapy details.
3. items.py
import scrapy
class BossItem(scrapy.Item):
job_name = scrapy.Field()
money = scrapy.Field()
name = scrapy.Field()
tags = scrapy.Field()
info_desc = scrapy.Field()
4. Run
Create a new script in the project root to launch the crawler:
from scrapy import cmdline
cmdline.execute('scrapy crawl boss_spider -o boss.csv'.split())
# -o boss.csv saves the output as CSV; writing a Pipeline works just as well,
# but don't forget to enable it in settings.py
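Since the comment mentions pipelines: here is a minimal CSV pipeline sketch as an alternative to the -o flag. The class name CsvPipeline and the hard-coded field list are my own choices; enable it via ITEM_PIPELINES in settings.py.

```python
import csv


class CsvPipeline:
    """Write each scraped item as one row of boss.csv."""

    fields = ['job_name', 'money', 'name', 'tags', 'info_desc']

    def open_spider(self, spider):
        # utf-8-sig so Excel opens the Chinese text correctly
        self.file = open('boss.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.DictWriter(self.file, fieldnames=self.fields)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # dict() works for both scrapy Items and plain dicts
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()
```

To enable it, add something like `ITEM_PIPELINES = {'boss.pipelines.CsvPipeline': 300}` to settings.py (the module path assumes the default pipelines.py file).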
5. Results

And the crawl begins. If you don't want the browser window popping up, you can run Chrome in headless mode. After a short wait, more than 400 job listings were saved.
The data is then ready for analysis.
To you who read to the end

Friends, thank you for reading all the way through. This is a newbie's write-up, so please point out anything technically immature. Thank you!

Looking forward to meeting you again. Hope you have a good time.