For testing technical feasibility only, do not crawl in large quantities

First, install scrapy with pip. Once the installation succeeds, start today's tutorial by running scrapy startproject First. The project is created successfully when the project files are generated as shown in the figure, along with the directory structure shown in the figure. My understanding of it is:

  • The crawler .py files you write go under spiders/
  • items.py defines the Items that hold the data retrieved by crawling
  • middlewares.py holds the spider (and downloader) middleware
  • pipelines.py holds the item pipelines
  • scrapy.cfg is the project's configuration file
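
For reference, a freshly generated project usually looks roughly like this (the exact layout may vary a little between Scrapy versions):

First/
    scrapy.cfg            # project configuration file
    First/                # the project's Python package
        __init__.py
        items.py          # Item definitions
        middlewares.py    # middleware
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spider files (e.g. second.py) go here
            __init__.py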

It doesn't really matter if you don't fully understand all of this yet; you can figure it out as you go.

Spiders have to live under the spiders folder, so create the crawler file there, call it second.py, and import the module:

import scrapy

Name this crawler sean. Our target is Lagou (lagou.com), so put the Lagou URL in start_urls:

class Lagou(scrapy.Spider):
    name = "sean"
    start_urls = [
        "https://www.lagou.com/"
    ]

After the information is retrieved, parse it using the parse method

def parse(self , response):

Next we need to analyze the structure of the Lagou page to get the data out of it. I'm using Chrome: press F12 to open the developer tools, click the element-picker tool indicated by the arrow in the upper-left corner, then click the menu on the left to quickly locate the element and see its HTML tag structure as follows.

Then we need to select the content with a **selector**.

Using an XPath selector, find the div with class="menu_box", descend into its child div, then the dl, dd, and a tags, and iterate over everything that matches. The a tags are selected with the following statement:

for item in response.xpath('//div[@class="menu_box"]/div/dl/dd/a'):
            jobClass = item.xpath('text()').extract()
            jobUrl = item.xpath("@href").extract_first()

            oneItem = FirstItem()
            oneItem["jobClass"] = jobClass
            oneItem["jobUrl"] = jobUrl
            


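
If you want to sanity-check a selector before wiring it into the spider, Scrapy's interactive shell is handy. A quick illustrative session (Lagou may redirect or block bare requests, so treat this purely as a sketch of the workflow, not guaranteed output):

scrapy shell "https://www.lagou.com/"
# then, inside the shell, response holds the downloaded page:
response.xpath('//div[@class="menu_box"]/div/dl/dd/a/text()').extract()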

Just take the tag's **text()** and its **@href** attribute to get the job category and the link for that category:

jobClass = item.xpath('text()').extract()
jobUrl = item.xpath("@href").extract_first()
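
Note the difference between the two calls: extract() always returns a list of every matching string, while extract_first() returns just the first match as a plain string (or None if nothing matched). The values in the comments below are only illustrative:

jobClass = item.xpath('text()').extract()        # e.g. ['Java']  -- a one-element list
jobUrl = item.xpath("@href").extract_first()     # e.g. 'https://www.lagou.com/zhaopin/Java/'  -- a string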

To handle the data we get back, open the items.py file:

import scrapy


class FirstItem(scrapy.Item):
    jobClass = scrapy.Field()
    jobUrl = scrapy.Field()

Define jobClass and jobUrl to receive the retrieved data
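
A scrapy.Item behaves like a dict, but only keys declared with scrapy.Field() are accepted; assigning to an undeclared key raises a KeyError. That is why every new field we scrape later also has to be declared here. A small illustrative sketch (the values are made up):

oneItem = FirstItem()
oneItem["jobClass"] = ["Java"]    # fine: jobClass is a declared field
# oneItem["salary"] = "10k-20k"   # would raise KeyError: FirstItem does not support field: salary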

Go back to second.py and import FirstItem at the top of the file:

from First.items import FirstItem

A FirstItem is then instantiated in the parse method and the obtained values are put into it

oneItem = FirstItem()
oneItem["jobClass"] = jobClass
oneItem["jobUrl"] = jobUrl
yield oneItem

Then we yield oneItem; because parse uses yield it is a generator, and Scrapy keeps pulling the items it produces.

The next step is to run the crawler. Scrapy is started with its own command rather than by running the .py file directly:

**scrapy crawl sean**

sean is the name we gave the crawler above. As shown in the picture, this outputs the job categories and their corresponding URLs. We can also use

**scrapy crawl sean -o shujv.json**

to store the obtained data in shujv.json. After executing the command, a new shujv.json file appears in the current directory; double-click it to view the content. The English text displays as English, but the Chinese unexpectedly shows up as what looks like garbage. It is not actually garbled, it is just encoded (escaped) data, and we need to deal with that.
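
What you are seeing are \uXXXX escape sequences, which is how Python's json module writes non-ASCII text by default. A quick illustrative snippet (the Chinese string is just an example value):

import json

data = {"jobClass": "后端开发"}
print(json.dumps(data))                       # {"jobClass": "\u540e\u7aef\u5f00\u53d1"}
print(json.dumps(data, ensure_ascii=False))   # {"jobClass": "后端开发"}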

Reading the source code shows that the data gets encoded in the JsonLinesItemExporter class in scrapy.exporters. So we can create an xxx folder inside the First package (so it can be imported as First.xxx), create an **__init__.py** file in that folder, and write a class that inherits from JsonLinesItemExporter but does not escape non-ASCII text.

The **__init__.py** file contains the following contents

from scrapy.exporters import JsonLinesItemExporter

class chongxie(JsonLinesItemExporter):
    def __init__(self , file , **kwargs):
        # ensure_ascii=False makes the exporter write Chinese text as-is
        # instead of \uXXXX escape sequences
        super(chongxie , self).__init__(file , ensure_ascii = False)


After writing the override, you need to register it in settings.py by adding the following statement; it maps the json feed format to our exporter, so that -o xxx.json now goes through chongxie:

FEED_EXPORTERS_BASE = {
    'json' : 'First.xxx.chongxie' , 
    'jsonlines' : 'scrapy.contrib.exporter.JsonLinesItemExporter',

}   

Run the crawler again and the result is as follows: the encoding problem is solved.

The overall code is as follows

second.py

import scrapy
from First.items import FirstItem

class Lagou(scrapy.Spider):
    name = "sean"
    start_urls = [
        "https://www.lagou.com/"
    ]

    def parse(self , response):
        for item in response.xpath('//div[@class="menu_box"]/div/dl/dd/a'):
            jobClass = item.xpath('text()').extract()
            jobUrl = item.xpath("@href").extract_first()

            oneItem = FirstItem()
            oneItem["jobClass"] = jobClass
            oneItem["jobUrl"] = jobUrl

            yield oneItem

xxx/__init__.py

from scrapy.exporters import JsonLinesItemExporter

class chongxie(JsonLinesItemExporter):
    def __init__(self , file , **kwargs):
        # write non-ASCII text as-is instead of \uXXXX escapes
        super(chongxie , self).__init__(file , ensure_ascii = False)


items.py

import scrapy
class FirstItem(scrapy.Item):
    jobClass = scrapy.Field()
    jobUrl = scrapy.Field()

It's late at night, so I'll stop here for now and continue tomorrow :)

La la la, writing late at night again QAQ

Yesterday we made a first pass at grabbing jobClass and jobUrl, and jobUrl is the key piece of information for crawling further.

When we open Lagou in a browser, we can enter a sub-category by clicking where the arrow points. What you are really doing is clicking an a tag and visiting its href URL.

That href is exactly the address we stored in jobUrl, and it is the address we will use to reach the next page. For example, www.lagou.com/zhaopin/Jav… is the Java listing page, and we are going to grab the information indicated by the arrows below.

First we need to access jobUrl with the following code,

yield scrapy.Request(url = jobUrl , callback=self.parse_url)

callback is the callback function, and we will implement that method below. But there is something we need to do first: get past Lagou's anti-crawler mechanism. We do that by setting cookies on the request, so next let's get the cookies.

On the Java listing page, press F12 to open the developer tools, select the Network tab, refresh with F5, pick the Java/ request on the left, and open Headers on the right to find the Cookie. Copy it and paste it into your editor. In scrapy, though, the pasted cookie string cannot be used directly; it has to be turned into a dict.

cookie = {
    'user_trace_token':'20170823200708-9624d434-87fb-11e7-8e7c-5254005c3644',
    'LGUID':'20170823200708-9624dbfd-87fb-11e7-8e7c-5254005c3644',
    'index_location_city':'%E5%85%A8%E5%9B%BD',
    'JSESSIONID':'ABAAABAAAIAACBIB27A20589F52DDD944E69CC53E778FA9',
    'TG-TRACK-CODE':'index_code',
    'X_HTTP_TOKEN':'5c26ebb801b5138a9e3541efa53d578f',
    'SEARCH_ID':'739dffd93b144c799698d2940c53b6c1',
    '_gat':'1',
    'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6':'1511162236,1511162245,1511162248,1511166955',
    'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6':'1511166955',
    '_gid':'GA1.2.697960479.1511162230',
    '_ga':'GA1.2.845768630.1503490030',
    'LGSID':'20171120163554-d2b13687-cdcd-11e7-996a-5254005c3644',
    'PRE_UTM':'',
    'PRE_HOST':'www.baidu.com',
    'PRE_SITE':'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D7awz0WxWjKxQwJ9xplXysE6LwOiAde1dreMKkGLhWzS%26wd%3D%26eqid%3D806a75ed0001a451000000035a128181',
    'PRE_LAND':'https%3A%2F%2Fwww.lagou.com%2F',
    'LGRID':'20171120163554-d2b13811-cdcd-11e7-996a-5254005c3644'
}
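
Reformatting the cookie by hand is tedious; a small helper like the one below (my own sketch, not part of the original tutorial or of Scrapy) can turn the raw Cookie header string into a dict usable with scrapy.Request(cookies=...):

def cookie_str_to_dict(raw_cookie):
    # raw_cookie is the "name1=value1; name2=value2; ..." string copied from the browser
    cookie = {}
    for pair in raw_cookie.split(';'):
        if '=' in pair:
            name, _, value = pair.strip().partition('=')
            cookie[name] = value
    return cookie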

This is the cookie after I reformatted it; you can compare it with the original cookie string. Our request then needs to be modified to carry the cookie we just built:

yield scrapy.Request(url = jobUrl , cookies=self.cookie , callback=self.parse_url)

Below we define the parse_url method that serves as the callback. The spider so far looks like this:

import scrapy
from First.items import FirstItem

class Lagou(scrapy.Spider):
    name = "sean"
    start_urls = [
        "https://www.lagou.com/"
    ]

    cookie = {
        'user_trace_token':'20170823200708-9624d434-87fb-11e7-8e7c-5254005c3644',
        'LGUID':'20170823200708-9624dbfd-87fb-11e7-8e7c-5254005c3644',
        'index_location_city':'%E5%85%A8%E5%9B%BD',
        'JSESSIONID':'ABAAABAAAIAACBIB27A20589F52DDD944E69CC53E778FA9',
        'TG-TRACK-CODE':'index_code',
        'X_HTTP_TOKEN':'5c26ebb801b5138a9e3541efa53d578f',
        'SEARCH_ID':'739dffd93b144c799698d2940c53b6c1',
        '_gat':'1',
        'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6':'1511162236,1511162245,1511162248,1511166955',
        'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6':'1511166955',
        '_gid':'GA1.2.697960479.1511162230',
        '_ga':'GA1.2.845768630.1503490030',
        'LGSID':'20171120163554-d2b13687-cdcd-11e7-996a-5254005c3644',
        'PRE_UTM':'',
        'PRE_HOST':'www.baidu.com',
        'PRE_SITE':'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D7awz0WxWjKxQwJ9xplXysE6LwOiAde1dreMKkGLhWzS%26wd%3D%26eqid%3D806a75ed0001a451000000035a128181',
        'PRE_LAND':'https%3A%2F%2Fwww.lagou.com%2F',
        'LGRID':'20171120163554-d2b13811-cdcd-11e7-996a-5254005c3644'
    }

    def parse(self , response):
        for item in response.xpath('//div[@class="menu_box"]/div/dl/dd/a'):
            jobClass = item.xpath('text()').extract()
            jobUrl = item.xpath("@href").extract_first()

            oneItem = FirstItem()
            oneItem["jobClass"] = jobClass
            oneItem["jobUrl"] = jobUrl

            yield scrapy.Request(url = jobUrl , cookies=self.cookie , callback=self.parse_url)

    def parse_url(self , response):
        print("parse_url method")

Running the crawler again, we find the server returns status code 302 (a redirect). Lagou redirects our crawler to a different page where we cannot get the data, so we have to solve this problem; it is another of Lagou's anti-crawler mechanisms. Because our request did not identify itself with a browser User-Agent, Lagou could tell it was a crawler and redirect it away from the data we need. In the same place where we just grabbed the cookie, we can also find the User-Agent header. So how do we use it? First, set the following parameters in the settings.py file:

MY_USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'First.middlewares.MyUserAgentMiddleware': 400,
}

You can list as many User-Agents as you like here; I'm using two as an example, and we will pick one of them at random for each request.

DOWNLOADER_MIDDLEWARES is set as shown above: the built-in UserAgentMiddleware is disabled by mapping it to None, and our own MyUserAgentMiddleware in First/middlewares.py is enabled with priority 400.

Then we go to middlewares.py to implement the User-Agent selection, importing the modules we need first:

import scrapy
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random


Below, define a class that inherits from UserAgentMiddleware and sets the User-Agent:

class MyUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # read the MY_USER_AGENT list we defined in settings.py
        return cls(
            user_agent=crawler.settings.get('MY_USER_AGENT')
        )

    def process_request(self, request, spider):
        # pick a random User-Agent from the list for each outgoing request
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent

When the crawler starts again, the server returns status code 200: the request succeeds and the callback is reached. Next we need to analyze the structure of the listing page to get the data we want, again using XPath selectors:

for sel2 in response.xpath('//ul[@class="item_con_list"]/li'):
    jobName = sel2.xpath('div/div/div/a/h3/text()').extract()
    jobMoney = sel2.xpath('div/div/div/div/span/text()').extract()
    jobNeed = sel2.xpath('div/div/div/div/text()').extract()
    jobCompany = sel2.xpath('div/div/div/a/text()').extract()
    jobType = sel2.xpath('div/div/div/text()').extract()
    jobSpesk = sel2.xpath('div[@class="list_item_bot"]/div/text()').extract()

You can try this yourself; this is not the only way to select these elements, and I will optimize my selectors later. For now I just want to get the data.

Declare these fields in the items.py file:

    jobName = scrapy.Field()
    jobMoney = scrapy.Field()
    jobNeed = scrapy.Field()
    jobCompany = scrapy.Field()
    jobType = scrapy.Field()
    jobSpesk = scrapy.Field()

Back in second.py, store the values in the item and yield it:

            Item = FirstItem()
            Item["jobName"] = jobName
            Item["jobMoney"] = jobMoney
            Item["jobNeed"] = jobNeed
            Item["jobCompany"] = jobCompany
            Item["jobType"] = jobType
            Item["jobSpesk"] = jobSpesk
            yield Item

Execute scrapy crawl sean -o shujv2.json to put the data into shujv2.json. Open shujv2.json as shown in the figure: the first half looks fine, but the second half is full of newlines and extra whitespace that we need to deal with. For now we simply use strip(), and we can optimize it later. Modify the code as follows:

jobName = sel2.xpath('div/div/div/a/h3/text()').extract()
jobMoney = sel2.xpath('div/div/div/div/span/text()').extract()
jobNeed = sel2.xpath('div/div/div/div/text()').extract()
jobNeed = jobNeed[2].strip()
jobCompany = sel2.xpath('div/div/div/a/text()').extract()
jobCompany = jobCompany[3].strip()
jobType = sel2.xpath('div/div/div/text()').extract()
jobType = jobType[7].strip()
jobSpesk = sel2.xpath('div[@class="list_item_bot"]/div/text()').extract()
jobSpesk = jobSpesk[-1].strip()
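
The hard-coded indices ([2], [3], [7], [-1]) depend on the exact page markup and are fairly brittle. As one possible alternative (my own sketch, not what this tutorial ends up doing), you could strip every extracted string and drop the empty ones:

def clean(parts):
    # strip whitespace/newlines from each extracted string and discard empty results
    return [p.strip() for p in parts if p.strip()]

jobNeed = clean(sel2.xpath('div/div/div/div/text()').extract())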

Empty shujv2.json and run the crawler again: the problem is solved. The project is now basically working, but there is still one issue. So far we have only crawled the first page of each job category, and we still need to handle paging; the picture shows 30 pages, and we only got the first one. By clicking through the pages we can spot the pattern: the first page is www.lagou.com/zhaopin/Jav…, the second page is www.lagou.com/zhaopin/Jav…, the third page is www.lagou.com/zhaopin/Jav…. That makes it easy: we just need to append the numbers 1 to 30 to jobUrl. Change the code as follows:

for i in range(30):
    jobUrl2 = jobUrl + str(i+1)
    # print(jobUrl2)
    try:
        yield scrapy.Request(url = jobUrl2 , cookies=self.cookie , meta = {"jobClass":jobClass} , callback=self.parse_url)
    except:
        pass

jobUrl2 holds the concatenated URL, and we request it in a loop that runs 30 times. Because not every category actually has 30 pages, the request is wrapped in try...except so that missing pages do not break the crawl.
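
One detail in that Request is worth pointing out: meta = {"jobClass": jobClass} attaches the job category to the request, and Scrapy hands it back on the response, so the callback knows which category a listing page belongs to. A minimal sketch of reading it back (the rest of parse_url stays as shown earlier):

    def parse_url(self , response):
        # the value passed via Request(meta={"jobClass": ...}) in parse() comes back here
        jobClass = response.meta["jobClass"]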