1. Preface
If you do any crawling, mastering a crawler framework is an essential skill, and for complete beginners I recommend Scrapy.
I won't waste words on what Scrapy is or what it does (a quick search turns up plenty of introductions). Time is precious, so let's get straight to the practical part: a hands-on case that walks you through using Scrapy.
The hands-on target will be Bilibili ("B站")!
2. Scrapy introduction
1. Environment preparation
Install Scrapy:

    pip install scrapy

This command installs the Scrapy library.
2. Create a Scrapy project
    scrapy startproject Bili

This command creates a crawler project named Bili; here it is created on the desktop.
The project structure
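The standard layout that scrapy startproject Bili generates looks like this:

    Bili/
        scrapy.cfg
        Bili/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py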
Functions of each file:

scrapy.cfg: the project's overall configuration file, usually left unmodified.
Bili/: the project's Python module; the program imports its code from here.
Bili/items.py: defines the Item classes used by the project. An Item class is a DTO (data transfer object), usually defined with a number of fields, and is written by the developer.
Bili/pipelines.py: the project's pipeline file, which processes the collected data; this file is written by the developer.
Bili/settings.py: the project's configuration file, where the project is configured.
Bili/spiders/: houses the spiders the project needs; they are responsible for scraping the information the project is interested in.
3. Specify what to crawl
    https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2

Taking the link above (a Bilibili search page) as the example, we will crawl the title and URL of each video.
4. Define each class in the project
The Items class

The fields to crawl are the video's title (title) and link (url), so the Item class defines title and url fields.
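A minimal sketch of items.py for these two fields (BiliItem is the class name Scrapy generates by default for a project named Bili):

    import scrapy

    class BiliItem(scrapy.Item):
        # the two fields collected for each video
        title = scrapy.Field()  # video title
        url = scrapy.Field()    # video link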
Define the spider class

The spider class is where you define your own rules for scraping web pages.
The syntax of Scrapy's genspider command is as follows:

    scrapy genspider [options] <name> <domain>
To create a spider, open a command-line window in the Bili directory and execute the following command:

    scrapy genspider lyc "bilibili.com"
Running the above command creates a lyc.py file in the project's Bili/spiders directory. Edit lyc.py so the spider parses the search page, along the lines of the sketch below.
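A minimal sketch of such a spider, using the BiliItem class defined above. The XPath selectors are illustrative assumptions; Bilibili's search-page markup changes over time and should be verified in the browser before use:

    import scrapy

    from Bili.items import BiliItem

    class LycSpider(scrapy.Spider):
        name = 'lyc'
        allowed_domains = ['bilibili.com']
        start_urls = ['https://search.bilibili.com/all?keyword=%E8%AF%BE%E7%A8%8B&page=2']

        def parse(self, response):
            # Assumption: each search result is a link whose title attribute
            # holds the video title; check the live page for the real markup.
            for video in response.xpath('//a[@title and @href]'):
                item = BiliItem()
                item['title'] = video.xpath('./@title').get()
                item['url'] = video.xpath('./@href').get()
                yield item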
Modify the pipeline class

This class performs the final processing of the crawled data and is usually responsible for writing the data to a file or database. Here we simply print it to the console.
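A minimal sketch of pipelines.py that prints each item (BiliPipeline is the class name Scrapy generates by default; ItemAdapter provides dict-style access to the item):

    from itemadapter import ItemAdapter

    class BiliPipeline:
        def process_item(self, item, spider):
            # print the collected fields to the console instead of persisting them
            adapter = ItemAdapter(item)
            print(adapter.get('title'), adapter.get('url'))
            return item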
Modify the settings file

settings.py is the project-wide configuration; BOT_NAME = 'Bili' is generated automatically, and a few entries need to be adjusted for the crawl to work.
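A sketch of the relevant entries. Enabling the pipeline is required for the console output above; the User-Agent value and ROBOTSTXT_OBEY = False are assumptions that are commonly needed so a demo crawl like this is not blocked:

    BOT_NAME = 'Bili'

    SPIDER_MODULES = ['Bili.spiders']
    NEWSPIDER_MODULE = 'Bili.spiders'

    # identify as a normal browser (example value, an assumption)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

    # skip robots.txt for this demo crawl
    ROBOTSTXT_OBEY = False

    # enable the pipeline defined in pipelines.py
    ITEM_PIPELINES = {
        'Bili.pipelines.BiliPipeline': 300,
    }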
With that, a simple skeleton of the Scrapy project is complete and we can run it.
Start the project:

    scrapy crawl lyc
But this crawls only one page of content; we can parse the next page as well, by adding the next-page logic to lyc.py.
Next-page crawling

Once the logic sketched below is appended to the parse method and the spider is executed again, it will crawl the results page by page.
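A sketch of that logic; the pagination selector is again an assumption to be checked against the live page. response.follow resolves the (possibly relative) URL and feeds the next page back into parse:

    # at the end of parse(): follow the "next page" link if there is one
    next_page = response.xpath('//li[contains(@class, "next")]/a/@href').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)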
3. Summary
1. To make learning easier, I have uploaded the complete source code for this article; reply "scrapy framework" to the official account of the same name to get it.
2. Through the hands-on Bilibili case, we created a Scrapy project step by step, analyzed the web page, and finally crawled the data successfully and printed it (it could also be saved).
3. This article is suitable for beginners getting started with Scrapy; feel free to bookmark it and study it.