
1. Preface

If you write crawlers, mastering a crawler framework is an essential skill, and for beginners I recommend Scrapy.

I won't belabor what Scrapy is or what it does (a quick search turns up plenty of introductions); time is precious, so let's get straight to the practical part: a hands-on case that walks you through using Scrapy.

Next, we'll take Bilibili ("B站") as the target of the hands-on example!

2. Scrapy introduction

1. Environment preparation

Install scrapy

pip install scrapy

This command installs the Scrapy library directly.
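You can confirm the installation with Scrapy's version command:

scrapy version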

2. Create a Scrapy project

scrapy startproject Bili

This command creates a crawler project named Bili. Run it from your desktop and the project will be created there.

The project structure

Bili/
    scrapy.cfg
    Bili/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Functions of each file

scrapy.cfg: the project's overall configuration file; it usually does not need to be modified.
Bili: the project's Python module; the program imports its code from here.
Bili/items.py: defines the Item classes used by the project. An Item class is a DTO (data transfer object), usually defined with N attributes; the developer writes this class.
Bili/pipelines.py: the project's pipeline file, which processes the collected data; the developer writes this file.
Bili/settings.py: the project's configuration file, where you configure the project.
Bili/spiders: this directory houses the spiders the project needs; they are responsible for grabbing the information the project is interested in.

3. Specify what to crawl

https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2

Take the link above (a Bilibili search page) as an example; we will crawl each video's title and URL.
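In case you're wondering, the keyword parameter is percent-encoded Chinese; decoding it in Python shows it is 课程 ("course"):

from urllib.parse import unquote

print(unquote('%E8%AF%BE%E7%A8%8B'))  # prints: 课程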

4. Define each class in the project

The Items class

import scrapy

The fields to crawl are the video's title (title) and link (url), so the item needs title and url fields.
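Building on that import, a minimal items.py with those two fields could look like the sketch below (the class name BiliItem is an assumption):

import scrapy


class BiliItem(scrapy.Item):
    # The video's title
    title = scrapy.Field()
    # The video's link
    url = scrapy.Field()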

Define the Spider class

The Spider class is where you define your own rules for scraping pages.

The syntax of Scrapy's genspider command is as follows:

scrapy genspider [options] <name> <domain>

To create a Spider, go to the Bili directory in a command line window and execute the following command:

scrapy genspider lyc "bilibili.com"

Running the command above produces a lyc.py file in the Bili/spiders directory of the Bili project. Edit lyc.py:

import scrapy
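genspider only generates a skeleton; a minimal version filled in for this crawl might look like the sketch below. The CSS selectors are guesses that you will need to adapt to Bilibili's actual search-result markup:

import scrapy
from Bili.items import BiliItem


class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    # Page 2 of the search results for the keyword 课程 ("course")
    start_urls = ['https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2']

    def parse(self, response):
        # One result card per video; 'li.video-item' is a guessed selector
        for video in response.css('li.video-item'):
            item = BiliItem()
            item['title'] = video.css('a::attr(title)').get()
            item['url'] = video.css('a::attr(href)').get()
            yield item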

Modify the pipeline class

This class performs the final processing of the crawled data and is generally responsible for writing it to a file or database. Here we simply print it to the console.

from itemadapter import ItemAdapter
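Building on that import, a minimal pipeline that just prints each item could look like this; BiliPipeline is an assumed class name and must match whatever is registered in settings.py:

from itemadapter import ItemAdapter


class BiliPipeline:
    def process_item(self, item, spider):
        # Print the collected fields to the console
        adapter = ItemAdapter(item)
        print(adapter.get('title'), adapter.get('url'))
        # Returning the item lets any later pipelines keep processing it
        return item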

Modify the settings file

BOT_NAME = 'Bili'
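Besides BOT_NAME, a sketch of the settings this project plausibly needs; the pipeline path assumes the BiliPipeline class sketched above, and the user-agent string is a placeholder:

BOT_NAME = 'Bili'

SPIDER_MODULES = ['Bili.spiders']
NEWSPIDER_MODULE = 'Bili.spiders'

# A browser-like user agent (placeholder value)
USER_AGENT = 'Mozilla/5.0'

# Ignore robots.txt for this exercise
ROBOTSTXT_OBEY = False

# Register the pipeline; 300 is its priority (lower runs first)
ITEM_PIPELINES = {
    'Bili.pipelines.BiliPipeline': 300,
}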

With that, a simple Scrapy project skeleton is complete and we can run it.

Start the project

scrapy crawl lyc
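If you want to save the items to a file instead of only printing them, Scrapy's built-in feed export can do it from the command line; videos.json here is just an example filename:

scrapy crawl lyc -o videos.json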

But this only crawls one page of content; we can also parse the following pages. Add the following code to lyc.py:

import scrapy
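One way to do this is to extend parse() so that, after yielding the items on the current page, it follows the next-page link; the next-page selector below is a guess:

    def parse(self, response):
        for video in response.css('li.video-item'):
            item = BiliItem()
            item['title'] = video.css('a::attr(title)').get()
            item['url'] = video.css('a::attr(href)').get()
            yield item

        # Follow the "next page" link if one exists (selector is a guess)
        next_page = response.css('li.page-item.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)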

Next-page crawling: when executed again, the spider follows the pagination and crawls page by page.

3. Summary

1. To make learning easier, I have uploaded the complete source code for this article; reply "scrapy framework" to the public account of the same name to get it.

2. Through the hands-on Bilibili case, we created a Scrapy project from scratch, analyzed the web page, and finally crawled the data successfully and printed (saved) it.

3. This walkthrough is suitable for beginners getting started with Scrapy; feel free to bookmark it and study it.