There is relatively little material online about combining Django with Scrapy; this post only records what I learned while looking things up. If anything is wrong, I hope you will point it out so I can correct it.


This article is part of Python Theme Month; see the event link for details.

Learning points:

  1. Implementation effect
  2. Creating the Django and Scrapy projects
  3. Where and how they are wired together in settings
  4. Using scrapy_djangoitem
  5. Crawling and saving data with Scrapy
  6. Database design and problems encountered
  7. Django configuration

Implementation effect:

Django and Scrapy creation:

Creating Django:

django-admin startproject projectname
cd projectname
python manage.py startapp appname

Such as:

Scrapy creation:

# cd into the Django root directory
cd job_hnting
scrapy startproject name

# Create the crawler
scrapy genspider spidername www.xxx.com

Such as:

Setting:

In Scrapy's settings file we point at the Django project, so that Scrapy can find and use Django's models.

In Scrapy's settings.py:

import os
import django

# Point Django at the project's settings module
os.environ['DJANGO_SETTINGS_MODULE'] = 'job_hnting.settings'
# Initialize Django manually
django.setup()
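For reference, the whole top-of-file block in Scrapy's settings.py might look like the sketch below. The sys.path line is only needed when the Django project root is not already importable; the path handling here is an assumption about the directory layout, not something from the original post.

```python
# Top of the Scrapy project's settings.py (layout-dependent sketch)
import os
import sys

import django

# Assumption: the Django project root sits one directory up from this file
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

os.environ['DJANGO_SETTINGS_MODULE'] = 'job_hnting.settings'
django.setup()
```

If the two projects live side by side instead, adjust the appended path accordingly.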

Such as:

Using scrapy_djangoitem:

pip install scrapy_djangoitem

This library is used in Scrapy's items.py:

import scrapy
# Import classes from the Models file in Django's app
from app51.models import app51data
# scrapy_djangoitem bridges Scrapy items and Django models
from scrapy_djangoitem import DjangoItem


class JobprojectItem(DjangoItem):
    # Reference the model class from Django's models
    django_model = app51data

The data-saving part of the connection is explained later; for now the main framework is complete.

Scrapy:

First, write the scrapy crawler part:

We used data from the 51job recruitment website:

Crawling is divided into three functions:

  1. The main function
  2. The parsing function
  3. The total-page function

51job’s anti-scraping measures:

The data is hidden in the page structure as JSON, so the usual HTML-parsing libraries cannot pull it out on their own.

Our approach is to locate the data with a regular-expression match and then parse it with the json library:

        # Locate the data and extract the JSON
        search_pattern = r"window.__SEARCH_RESULT__ = (.*?)</script>"
        jsonText = re.search(search_pattern, response.text, re.M | re.S).group(1)
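The extraction can be tried outside the spider. A minimal, self-contained sketch follows; the sample HTML is fabricated for illustration, and the real 51job page is more complex:

```python
import json
import re

# Fabricated sample page; the real page embeds a much larger JSON blob
html = ('<script type="text/javascript">'
        'window.__SEARCH_RESULT__ = {"total_page": "3", "engine_search_result": []}'
        '</script>')

# Non-greedy match between the marker and the closing script tag
search_pattern = r"window.__SEARCH_RESULT__ = (.*?)</script>"
jsonText = re.search(search_pattern, html, re.M | re.S).group(1)
jsonObject = json.loads(jsonText)
print(jsonObject["total_page"])  # 3
```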

Get the total number of pages for the keyword:

        # Parse the JSON data
        jsonObject = json.loads(jsonText)
        number = jsonObject['total_page']

Construct the page URL in the main function and give it to the parse function:

        for number in range(1, int(numbers) + 1):
            next_page_url = self.url.format(self.name, number)
            # print(next_page_url)
            # Hand each page URL to the data_parse function
            yield scrapy.Request(url=next_page_url, callback=self.data_parse)
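The paging logic can be checked on its own. In this sketch the URL template and the attribute values are illustrative stand-ins for the spider's self.url and self.name, which the original post does not show:

```python
# Hypothetical URL template; the real 51job search URL differs
url = "https://search.example.com/list/{},{}.html"
name = "python"    # stand-in for the search keyword (self.name)
numbers = "3"      # total_page value parsed from the JSON

# Build one URL per result page, pages are numbered from 1
for number in range(1, int(numbers) + 1):
    next_page_url = url.format(name, number)
    print(next_page_url)
```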

Finally, extract the required data in the parsing function:

        for job_item in jsonObject["engine_search_result"]:
            items = JobprojectItem()
            items['job_name'] = job_item['job_name']
            items['company_name'] = job_item["company_name"]
            items['Releasetime'] = job_item['issuedate']
            items['salary'] = job_item['providesalary_text']
            items['site'] = job_item['workarea_text']
            ...

The details need to be tweaked. The full code is on GitHub.

With the data-crawling part solved, the items need to be saved in the Scrapy project's pipelines.py file:

class SeemeispiderPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item

Be sure to uncomment the pipeline entry in the settings file.
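The uncommented entry in Scrapy's settings.py would look roughly like the fragment below; the dotted path assumes the project is named job_hnting and the pipeline class above lives in pipelines.py:

```python
# In Scrapy's settings.py; the path is an assumption about the project layout
ITEM_PIPELINES = {
    'job_hnting.pipelines.SeemeispiderPipeline': 300,  # lower number runs earlier
}
```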

Configure the database:

There are two ways to configure Django's database:

Method 1: Directly add database configuration information to settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',  # database engine
        'NAME': 'mysite',       # database name
        'USER': 'root',         # database login username
        'PASSWORD': '123',      # password
        'HOST': '127.0.0.1',    # database host IP; 127.0.0.1 if left at the default
        'PORT': 3306,           # database port; 3306 if left at the default
    }
}

Method 2: Save the database configuration in a separate file and import it in settings.py (recommended).

Create a database configuration file my.cnf (the name is up to you):

[client]
database = blog
user = blog
password = blog
host = 127.0.0.1
port = 3306
default-character-set = utf8
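MySQL option files are INI-style, so the contents of my.cnf can be sanity-checked with the standard library before Django ever reads the file. A small sketch (the values mirror the example above):

```python
import configparser

# Same contents as the my.cnf example above
cnf_text = """\
[client]
database = blog
user = blog
password = blog
host = 127.0.0.1
port = 3306
default-character-set = utf8
"""

# Parse the INI-style option file and read values back
parser = configparser.ConfigParser()
parser.read_string(cnf_text)
print(parser["client"]["host"])  # 127.0.0.1
print(parser["client"]["port"])  # 3306
```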

Import the my.cnf file in settings.py:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'OPTIONS': {
            'read_default_file': 'utils/dbs/my.cnf',
        },
    }
}

Enable Django to connect to MySQL:

Install pymysql in the environment, then import it in the __init__.py of the package that contains settings.py:

import pymysql

pymysql.install_as_MySQLdb()

The model corresponding to the item written earlier goes in the app's models.py:

from django.db import models


class app51data(models.Model):
    Releasetime = models.CharField(max_length=20)
    job_name = models.CharField(max_length=20)
    salary = models.CharField(max_length=20)
    site = models.CharField(max_length=20)
    education = models.CharField(max_length=20)
    company_name = models.CharField(max_length=20)
    Workexperience = models.CharField(max_length=20)

    class Meta:
        # Specify the table name
        db_table = 'jobsql51'

    def __str__(self):
        return self.job_name

Once the table name is specified, you only need to create the database itself in the DBMS; the table is created automatically.

Each time you change the models, run the following commands:

python manage.py makemigrations
python manage.py migrate

The MySQL database is now configured.

An error encountered while configuring the database:

AttributeError: 'str' object has no attribute 'decode'

Solution:

Find the Django installation directory:

G:\env\django_job\Lib\site-packages\django\db\backends\mysql\operations.py

Edit operations.py:

Change decode on line 146 to encode

def last_executed_query(self, cursor, sql, params):
    # With MySQLdb, cursor objects have an (undocumented) "_executed"
    # attribute where the exact query sent to the database is saved.
    # See MySQLdb/cursors.py in the source distribution.
    query = getattr(cursor, '_executed', None)
    if query is not None:
        # query = query.decode(errors='replace')
        query = query.encode(errors='replace')
    return query
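The underlying cause of that error: in Python 3, str objects have encode() but no decode() (decode() lives on bytes), so calling query.decode(...) fails when the driver hands back a str. A quick illustration:

```python
s = "SELECT 1"                    # a str, as in the error above
assert hasattr(s, "encode")       # str -> bytes is fine
assert not hasattr(s, "decode")   # this is the AttributeError from above
assert s.encode(errors="replace") == b"SELECT 1"

b = b"SELECT 1"
assert hasattr(b, "decode")       # bytes -> str still exists
print("ok")
```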

Django configuration:

I won’t go into much detail about basic Django configuration, such as routing and app registration.

The main thing below is the view in the app that generates the JSON:

from django.shortcuts import render
from django.http import HttpResponse
from .models import app51data
import json


def index(request):
    # return HttpResponse("hello world")
    # return render(request, 'index.html')
    data = app51data.objects.all()
    list3 = []
    i = 1
    for var in data:
        data = {}
        data['id'] = i
        data['Releasetime'] = var.Releasetime
        data['job_name'] = var.job_name
        data['salary'] = var.salary
        data['site'] = var.site
        data['education'] = var.education
        data['company_name'] = var.company_name
        data['Workexperience'] = var.Workexperience
        list3.append(data)
        i += 1
    # a = json.dumps(data)
    # b = json.dumps(list2)
    # Convert the list of dictionaries to a JSON string
    c = json.dumps(list3)
    return HttpResponse(c)

Implementation effect:

The complete code is on GitHub; a star would be appreciated, thanks!

If you have any questions, please leave a comment.