On Friday a colleague shared a website for downloading books, so on a whim I decided to write a crawler demo to pull all of its data down. I then found a movie website with a similar structure, so the same code can be reused.

The crawler steps

  1. Analyze the characteristics of the target web page
  2. Find the data to crawl
  3. Navigate across multiple pages
  4. Store the data

1. Analyze the characteristics of the target page

This weekend the site I want to crawl is http://www.ireadweek.com/. Its page structure is very simple, so I started with requests + bs4 to fetch it. The pages are not rendered with JS and there is no anti-crawler mechanism, so scraping is straightforward.

The site has a two-tier structure: home page -> click on each book -> book details page. The data I need is on the details page.
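Before writing the spider, a quick probe with requests + bs4 is enough to confirm that the data is in the raw HTML. This is only a minimal sketch of that kind of check; the selector and the number of links printed are my own choices, not part of the final spider.

# Quick probe: if titles and links show up in the raw HTML, the page is not JS-rendered
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.ireadweek.com/')
soup = BeautifulSoup(resp.content, 'html.parser')

# Print the first few anchors to see how book detail pages are linked
for a in soup.find_all('a', href=True)[:20]:
    print(a.get('href'), a.text.strip())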

2. Find the data to crawl

        html_doc = response.body
        soup = BeautifulSoup(html_doc, 'html.parser')

        # Cover image and download link
        img_url = urljoin(CDN, soup.find('img').attrs.get('src').replace('//', '/'))
        download_url = soup.find('a', class_='downloads').attrs.get('href')
        title = soup.find_all('div', class_='hanghang-za-title')
        name = title[0].text

        # First content block is the author introduction, second is the table of contents
        content = soup.find_all('div', class_='hanghang-za-content')
        author_info = content[0].text
        directory = '\n'.join([i.text.replace("\u3000", ' ') for i in content[1].find_all('p')])

        # Author / category / Douban rating / introduction lines
        info = soup.find('div', class_='hanghang-shu-content-font').find_all('p')

        author = info[0].text.split('Author:')[1]
        category = info[1].text.split('Category:')[1]
        score = info[2].text.split('Douban rating:')[1]
        introduction = info[4].text

3. Navigating across multiple pages

This step deals with jumping between pages. I use the page-number links at the bottom of the listing page to move to the next page.

        next_url = urljoin(DOMAIN, soup.find_all('a')[-2].attrs.get('href'))
        yield scrapy.Request(next_url, callback=self.parse)

Since the next-page link has no distinctive id or class, the only option is to select it by position, as in the sketch below.
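For completeness, here is a sketch of how the listing-page callback could tie the two levels together: follow every book link to the details parser, then follow the next-page link. Only the next_url lines come from the post; the spider class, the DOMAIN constant, and the 'index.php' pattern for detail URLs are my assumptions, and parse_news is the detail-page callback shown later.

# Sketch of the listing-page callback (structure assumed; only next_url handling is from the post)
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import scrapy

DOMAIN = 'http://www.ireadweek.com/'   # assumed constant

class IreadweekSpider(scrapy.Spider):
    name = 'ireadweek'
    start_urls = [DOMAIN]

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')

        # Follow every book link on the listing page to its details page
        for a in soup.find_all('a', href=True):
            if 'index.php' in a['href']:   # assumed URL pattern for detail pages
                yield scrapy.Request(urljoin(DOMAIN, a['href']), callback=self.parse_news)

        # The next-page link has no id/class, so select it by position: second-to-last <a>
        next_url = urljoin(DOMAIN, soup.find_all('a')[-2].attrs.get('href'))
        yield scrapy.Request(next_url, callback=self.parse)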

4. Data storage

For data storage I used to write to Excel or Redis; this time I want to write to MySQL. You can write to MySQL directly with PyMySQL or MySQLdb, but I chose to go through an ORM instead. Either SQLAlchemy or the Django model layer would do, and I picked Django models.
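Going the Django route means the database connection lives in the Django project's settings.py. The post does not show it, but a minimal MySQL configuration might look like the sketch below; the database name and credentials are placeholders, and it assumes the mysqlclient driver (or a PyMySQL shim) is installed.

# Django settings.py: MySQL connection (placeholder values)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'ireadweek',      # database name (placeholder)
        'USER': 'root',           # placeholder
        'PASSWORD': 'password',   # placeholder
        'HOST': '127.0.0.1',
        'PORT': '3306',
    }
}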

# Django models.py
from django.db import models


# Create your models here.
class Book(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=255)
    author = models.CharField(max_length=255)
    category = models.CharField(max_length=255)
    score = models.CharField(max_length=100)
    img_url = models.URLField()
    download_url = models.URLField()
    introduction = models.CharField(max_length=2048)
    author_info = models.CharField(max_length=2048)
    directory = models.CharField(max_length=4096)
    create_edit = models.DateTimeField(auto_now_add=True)

    class Meta:
        managed = False
        db_table = "ireadweek"

# scrapy settings.py
import os
import sys
import django

sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'Rino_nakasone_backend.settings'

django.setup()


# scrapy pipelines.py
from ireadweek.models import Book
import datetime


class RinonakasonePipeline(object):
    def process_item(self, item, spider):
        book = Book()
        book.name = item.get('name')
        book.author = item.get('author')
        book.category = item.get('category')
        book.score = item.get('score')
        book.img_url = item.get('img_url')
        book.download_url = item.get('download_url')
        book.introduction = item.get('introduction')
        book.author_info = item.get('author_info')
        book.directory = item.get('directory')
        book.create_edit = datetime.datetime.now()  # redundant: auto_now_add already sets this on insert
        book.save()
        return item

# referenced in spider

    def parse_news(self, response):
        item = IreadweekItem()
        html_doc = response.body
        soup = BeautifulSoup(html_doc, 'html.parser')

        img_url = urljoin(CDN, soup.find('img').attrs.get('src').replace('//', '/'))
        download_url = soup.find('a', class_='downloads').attrs.get('href')
        title = soup.find_all('div', class_='hanghang-za-title')
        name = title[0].text

        content = soup.find_all('div', class_='hanghang-za-content')
        author_info = content[0].text
        directory = '\n'.join([i.text.replace("\u3000", ' ') for i in content[1].find_all('p')])

        info = soup.find('div', class_='hanghang-shu-content-font').find_all('p')

        author = info[0].text.split('Author:')[1]
        category = info[1].text.split('Category:')[1]
        score = info[2].text.split('Douban rating:')[1]
        introduction = info[4].text

        item['name'] = name
        item['img_url'] = img_url
        item['download_url'] = download_url
        item['author'] = author
        item['author_info'] = author_info
        item['category'] = category
        item['score'] = score
        item['introduction'] = introduction
        item['directory'] = directory

        return item
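The IreadweekItem class used above is not shown in the post. A minimal definition consistent with the fields the spider fills in would be the following reconstruction:

# scrapy items.py: fields matching what parse_news assigns (reconstructed, not from the post)
import scrapy

class IreadweekItem(scrapy.Item):
    name = scrapy.Field()
    img_url = scrapy.Field()
    download_url = scrapy.Field()
    author = scrapy.Field()
    author_info = scrapy.Field()
    category = scrapy.Field()
    score = scrapy.Field()
    introduction = scrapy.Field()
    directory = scrapy.Field()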

Finally, the pipeline has to be registered in the Scrapy settings.py:
ITEM_PIPELINES = {
    'RinoNakasone.pipelines.RinonakasonePipeline': 300,
}

The main technical points

  1. scrapy
  2. django
  3. beautifulsoup

With all of the above in place, I also wrote an API endpoint on top of the stored data, e.g. http://127.0.0.1:8080/api/ireadweek/list/?p=400&n=20
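The API view itself is not shown in the post. Judging from the p and n query parameters, a simple paginated list view in Django might look like this sketch; the view name, the returned fields, and the URL wiring are my assumptions.

# Django views.py: hypothetical paginated list endpoint for /api/ireadweek/list/?p=400&n=20
from django.http import JsonResponse
from ireadweek.models import Book

def book_list(request):
    page = int(request.GET.get('p', 1))    # page number
    size = int(request.GET.get('n', 20))   # page size
    start = (page - 1) * size
    books = Book.objects.all()[start:start + size]
    data = [
        {
            'name': b.name,
            'author': b.author,
            'category': b.category,
            'score': b.score,
            'img_url': b.img_url,
            'download_url': b.download_url,
        }
        for b in books
    ]
    return JsonResponse({'count': Book.objects.count(), 'results': data})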

The other website (the movie site mentioned at the beginning) follows the same approach.

My project repository: https://github.com/jacksonyoudi/Rino_nakasone_backend

All of the code is in that repository.