On Friday a colleague sent me a website for downloading books, so on a whim I decided to write a small crawler demo to pull all of its data down. Later I found a movie website with a similar structure, so the same code could be reused.
The crawler steps
- Analyze the characteristics of the target web page
- Find the data to crawl
- Jump to multiple pages of data
- Store the data
1. Analyze the characteristics of the target page
The site I want to crawl today is iReadWeek, http://www.ireadweek.com/. The page structure is very simple, so I started by poking at it with requests + bs4. The pages don't use JS and there is no anti-crawler mechanism, so it is very easy to crawl.
The site has a two-tier structure: home page -> click on each book -> book details page. The data I need is on the details page, as shown in the diagram below:
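As a quick sanity check before writing the spider, a few lines of requests + BeautifulSoup are enough to confirm that the pages are plain server-rendered HTML. This is just a minimal sketch of that check, not part of the final project:

import requests
from bs4 import BeautifulSoup

# fetch the homepage and print its links to see how book detail pages are addressed
response = requests.get('http://www.ireadweek.com/')
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.find_all('a'):
    print(a.get('href'), a.get_text(strip=True))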
2. Find the data to crawl
html_doc = response.body
soup = BeautifulSoup(html_doc, 'html.parser')

# cover image and download link (CDN is a base-URL constant defined elsewhere in the spider)
img_url = urljoin(CDN, soup.find('img').attrs.get('src').replace('//', '/'))
download_url = soup.find('a', class_='downloads').attrs.get('href')

# book title
title = soup.find_all('div', class_='hanghang-za-title')
name = title[0].text

# author introduction and table of contents
content = soup.find_all('div', class_='hanghang-za-content')
author_info = content[0].text
directory = '\n'.join([i.text.replace("\u3000", ' ') for i in content[1].find_all('p')])

# metadata: author, category, Douban rating, introduction
info = soup.find('div', class_='hanghang-shu-content-font').find_all('p')
author = info[0].text.split('Author:')[1]
category = info[1].text.split('Category:')[1]
score = info[2].text.split('Douban rating:')[1]
introduction = info[4].text
3. Jump to multiple pages of data
This step mainly deals with jumping between pages. I use the page-number link at the bottom of the page to jump to the next page.
next_url = urljoin(DOMAIN, soup.find_all('a')[-2].attrs.get('href'))
yield scrapy.Request(next_url, callback=self.parse)
Since the next-page link has no specific id or class, I can only select it by its position index.
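To put the two layers together, the home-page callback has to do two things: follow each book link to its detail page (handled by the parse_news method shown in the next section) and follow the next-page link. A rough sketch of such a parse method, where is_book_link is a hypothetical helper standing in for whatever check identifies a book detail URL on this site:

# sketch of the spider's parse callback; is_book_link is a hypothetical placeholder
def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    # follow every link that looks like a book detail page
    for a in soup.find_all('a'):
        href = a.attrs.get('href')
        if href and is_book_link(href):
            yield scrapy.Request(urljoin(DOMAIN, href), callback=self.parse_news)
    # the next-page link has no id or class, so it is picked by position
    next_url = urljoin(DOMAIN, soup.find_all('a')[-2].attrs.get('href'))
    yield scrapy.Request(next_url, callback=self.parse)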
4. Data storage
For data storage I used to write to Excel or Redis, but this time I want to write to MySQL. You can write to MySQL with PyMySQL or MySQLdb, but I chose to go through an ORM instead, either SQLAlchemy or the Django model layer. I chose the Django model.
# in the Django app (ireadweek/models.py)
from django.db import models


# Create your models here.
class Book(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=255)
    author = models.CharField(max_length=255)
    category = models.CharField(max_length=255)
    score = models.CharField(max_length=100)
    img_url = models.URLField()
    download_url = models.URLField()
    introduction = models.CharField(max_length=2048)
    author_info = models.CharField(max_length=2048)
    directory = models.CharField(max_length=4096)
    create_edit = models.DateTimeField(auto_now_add=True)

    class Meta:
        managed = False
        db_table = "ireadweek"
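One note on this model: with managed = False, Django's migrations will not create the ireadweek table, so it has to exist in MySQL already. If you want to create it from the model definition anyway, a one-off sketch run from python manage.py shell could look like this:

# create the table once from the model definition (run inside `python manage.py shell`)
from django.db import connection
from ireadweek.models import Book

with connection.schema_editor() as editor:
    editor.create_model(Book)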
# scrapy settings.py
import os
import sys
import django
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'Rino_nakasone_backend.settings'
django.setup()
# scrapy pipelines.py
from ireadweek.models import Book
import datetime
class RinonakasonePipeline(object):
    def process_item(self, item, spider):
        book = Book()
        book.name = item.get('name')
        book.author = item.get('author')
        book.category = item.get('category')
        book.score = item.get('score')
        book.img_url = item.get('img_url')
        book.download_url = item.get('download_url')
        book.introduction = item.get('introduction')
        book.author_info = item.get('author_info')
        book.directory = item.get('directory')
        book.create_edit = datetime.datetime.now()
        book.save()
        return item
# referenced in spider
def parse_news(self, response):
    item = IreadweekItem()
    html_doc = response.body
    soup = BeautifulSoup(html_doc, 'html.parser')

    img_url = urljoin(CDN, soup.find('img').attrs.get('src').replace('//', '/'))
    download_url = soup.find('a', class_='downloads').attrs.get('href')
    title = soup.find_all('div', class_='hanghang-za-title')
    name = title[0].text

    content = soup.find_all('div', class_='hanghang-za-content')
    author_info = content[0].text
    directory = '\n'.join([i.text.replace("\u3000", ' ') for i in content[1].find_all('p')])

    info = soup.find('div', class_='hanghang-shu-content-font').find_all('p')
    author = info[0].text.split('Author:')[1]
    category = info[1].text.split('Category:')[1]
    score = info[2].text.split('Douban rating:')[1]
    introduction = info[4].text

    item['name'] = name
    item['img_url'] = img_url
    item['download_url'] = download_url
    item['author'] = author
    item['author_info'] = author_info
    item['category'] = category
    item['score'] = score
    item['introduction'] = introduction
    item['directory'] = directory

    return item
Finally, there is the pipeline configuration in scrapy's settings.py:
ITEM_PIPELINES = {
    'RinoNakasone.pipelines.RinonakasonePipeline': 300,
}
The main technical points
- scrapy
- django
- beautifulsoup
All of the above should be usable as-is. I also wrote an API endpoint: http://127.0.0.1:8080/api/ireadweek/list/?p=400&n=20
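For reference, here is a minimal sketch of what that list endpoint might look like, assuming p is the page number and n is the page size; the actual view lives in the repository and may differ:

# hypothetical views.py sketch; treating p as page number and n as page size
from django.http import JsonResponse
from ireadweek.models import Book

def book_list(request):
    p = int(request.GET.get('p', 1))
    n = int(request.GET.get('n', 20))
    books = Book.objects.all()[(p - 1) * n : p * n]
    data = [{
        'name': b.name,
        'author': b.author,
        'category': b.category,
        'score': b.score,
        'img_url': b.img_url,
        'download_url': b.download_url,
    } for b in books]
    return JsonResponse({'count': Book.objects.count(), 'results': data})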
The other website (the movie site) works the same way.
The address of my project: https://github.com/jacksonyoudi/Rino_nakasone_backend
All of the code is in that repository.