A quick ramble

Boss Zhipin is a job site most job hunters know well; it leads the industry on selling points like "fast recruiting, accurate matching, open and transparent". Of course, I'm not here to advertise for them, I'm here to scrape them. Today we'll use the Scrapy framework together with Selenium to crawl job titles, salaries, benefits, company names and other information.

As for its anti-scraping measures: since our needs are modest, this approach is probably the simplest way around them.
Site analysis

Boss Zhipin website: www.zhipin.com

Its anti-scraping really is annoying: the listings are rendered based on cookies, the cookies expire very quickly, and rapid requests get your IP banned. A banned IP can be worked around with a proxy, but the first time you try a proxy you'll be told your IP is abnormal and dropped into a verification page, which apparently means paying for a captcha-solving platform. Ah, what a hassle. The home page itself isn't protected, though, so the plan is simply to slow down with Selenium. That's not exactly proper crawler posture, but when you don't understand the site's JS, it's a reasonable choice against this kind of anti-scraping mechanism.
Let's get started

We'll mainly capture: position, salary, company name, benefits, and requirements.
1. New crawler
Open a terminal (or CMD) and enter the following commands.
New project
scrapy startproject boss
Go to the boss directory
cd boss
New crawler
scrapy genspider boss_spider 'zhipin.com'
Open using PyCharm
2. Analyze the layout
Check the URLs:

Page 1: www.zhipin.com/c101120100/… (the trailing &ka=page-1 can be omitted)

Page 2: www.zhipin.com/c101120100/…

There are 10 pages in total.
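Given the pattern above, the 10 listing-page URLs can be generated up front. This is just a sketch: the `c101120100` city code and `query=python` parameter are taken from the spider's `start_urls` value shown later, not from anything beyond it.

```python
# Build the 10 listing-page URLs; the city code and query parameter
# come from the spider's start URL.
base = "https://www.zhipin.com/c101120100/?query=python&page={}"
page_urls = [base.format(n) for n in range(1, 11)]

print(page_urls[0])    # https://www.zhipin.com/c101120100/?query=python&page=1
print(len(page_urls))  # 10
```

In the spider below we won't precompute these; instead we follow the "next page" link, which stops by itself when the pages run out.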
Inspect the page: all the data we want lives in a ul list, so we can simply loop over its li elements.
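To make that structure concrete, here is a hand-written sketch of one list item (reconstructed from the XPath selectors used in the spider later, not copied from the site's actual markup), extracted with nothing but the standard library:

```python
import xml.etree.ElementTree as ET

# Hand-written sketch of the job list; the real page is more complex.
html = """
<div class="job-list">
  <ul>
    <li>
      <span class="job-name"><a>Python Developer</a></span>
      <span class="red">15K-25K</span>
      <h3 class="name"><a>Some Company</a></h3>
      <div class="tags"><span>1-3 years</span><span>Bachelor</span></div>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(html)
for li in root.findall("./ul/li"):
    job_name = li.find(".//span[@class='job-name']/a").text
    money = li.find(".//span[@class='red']").text
    tags = " ".join(s.text for s in li.findall(".//div[@class='tags']/span"))
    print(job_name, money, tags)  # Python Developer 15K-25K 1-3 years Bachelor
```

Scrapy's own selectors use the same `[@class='...']` predicates, just with full XPath support.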
3. Start writing

We know the pages and we know where the data sits, so we can start coding.
1. Set up middlewares and settings

Because we are going to use Selenium, we need to intercept the request in a downloader middleware and build the response object ourselves, so let's create a new class. If you are not sure about Selenium, you can check out this article. If you don't yet understand scrapy middleware, refer to this article first, or the scrapy details below will be confusing.
from scrapy import signals
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse


class CookiesMiddlewares(object):
    def __init__(self):
        print("Initializing the browser")
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Wait 5 seconds for the page to finish loading
        time.sleep(5)
        # Grab the rendered page source
        source = self.driver.page_source
        # An HtmlResponse object describes an HTTP response to scrapy
        response = HtmlResponse(url=self.driver.current_url, body=source,
                                request=request, encoding='utf-8')
        # Returning a response here skips the normal download and hands
        # the Selenium-rendered page straight to the spider
        return response
In settings.py, set the robots protocol to False:
ROBOTSTXT_OBEY = False
Enable a download delay:
DOWNLOAD_DELAY = 3
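DOWNLOAD_DELAY on its own is a fixed pause; a couple of related, standard scrapy settings are worth knowing about (an optional settings.py fragment):

```python
DOWNLOAD_DELAY = 3
# By default scrapy randomizes the actual wait to 0.5x-1.5x of DOWNLOAD_DELAY,
# which looks less robotic to the server
RANDOMIZE_DOWNLOAD_DELAY = True
# Or let scrapy adapt the delay to observed server response times
AUTOTHROTTLE_ENABLED = True
```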
Set the default request headers:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
}
Enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'boss.middlewares.CookiesMiddlewares': 543,
}
2. boss_spider.py
import scrapy
from boss.items import BossItem


class BossSpiderSpider(scrapy.Spider):
    name = 'boss_spider'
    allowed_domains = ['zhipin.com']
    start_urls = ['https://www.zhipin.com/c101120100/?query=python&page=1']
    base_url = 'https://www.zhipin.com'

    def parse(self, response):
        jobs = response.xpath("//div[@class='job-list']/ul/li")
        for i in jobs:
            job_name = i.xpath(".//span[@class='job-name']/a/text()").get()
            print(job_name)
            money = i.xpath(".//span[@class='red']/text()").get()
            name = i.xpath(".//h3[@class='name']/a/text()").get()
            tags = i.xpath(".//div[@class='tags']/span/text()").getall()
            tags = ' '.join(tags)
            info_desc = i.xpath(".//div[@class='info-desc']/text()").get()
            yield BossItem(job_name=job_name, money=money, name=name,
                           tags=tags, info_desc=info_desc)
        # Get the address of the next page; check it before concatenating,
        # since it is None on the last page
        page = response.xpath("//div[@class='page']/a[last()]/@href").get()
        if not page:
            print("Quit")
            return
        next_url = self.base_url + page
        print("Next page:", next_url)
        yield scrapy.Request(next_url)
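One detail in parse() deserves a callout: the next-page href must be checked before concatenating, because on the last page the XPath returns None and `self.base_url + None` raises a TypeError. A pure-Python sketch of that guard (build_next_url is a hypothetical helper for illustration, not part of the project):

```python
BASE_URL = 'https://www.zhipin.com'

def build_next_url(page_href):
    """Return the absolute next-page URL, or None when there is no next page.

    The check comes BEFORE the concatenation; doing it the other way
    around crashes on the last page.
    """
    if not page_href:
        return None
    return BASE_URL + page_href

print(build_next_url('/c101120100/?query=python&page=2'))
# https://www.zhipin.com/c101120100/?query=python&page=2
print(build_next_url(None))  # None
```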
There's nothing special to explain here, but if anything is unclear, check out this article for the scrapy details.
3. items.py
import scrapy
class BossItem(scrapy.Item):
job_name = scrapy.Field()
money = scrapy.Field()
name = scrapy.Field()
tags = scrapy.Field()
info_desc = scrapy.Field()
4. Run
Create a new script in the project root to launch the crawler:
from scrapy import cmdline
cmdline.execute('scrapy crawl boss_spider -o boss.csv'.split())
# -o boss.csv saves the output as CSV; writing a Pipeline works just as well,
# but don't forget to enable it in settings.py
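Since the comment mentions pipelines: here is a minimal CSV pipeline sketch as an alternative to the -o flag. The class name CsvPipeline and the hard-coded field list are my own choices; enable it via ITEM_PIPELINES in settings.py.

```python
import csv


class CsvPipeline:
    """Write each scraped item as one row of boss.csv."""

    fields = ['job_name', 'money', 'name', 'tags', 'info_desc']

    def open_spider(self, spider):
        # utf-8-sig so Excel opens the Chinese text correctly
        self.file = open('boss.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.DictWriter(self.file, fieldnames=self.fields)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # dict() works for both scrapy Items and plain dicts
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()
```

To enable it, add something like `ITEM_PIPELINES = {'boss.pipelines.CsvPipeline': 300}` to settings.py (the module path assumes the default pipelines.py file).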
5. Results

And the crawl begins. If you don't want the browser window popping up, you can run Chrome in headless mode. After a short wait, more than 400 job listings were saved.
The data is then ready for analysis.
To you who read to the end

Friends, thank you for reading all the way through. This is a newbie's write-up, so please point out anything technically immature. Thank you!

Looking forward to meeting you again. Hope you have a good time.