This is my 36th day of participating in the First Challenge 2022. For details, see: First Challenge 2022.
This article focuses on the use of scrapy pipelines (Pipelines).
Target site analysis
The target site for this collection is www.zaih.com/falcon/ment… , and the target data is expert (mentor) data.
The data is saved in a MySQL database, and the table structure is designed around the target data.
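A minimal SQL sketch of the users table, consistent with the insert statement used later in the pipeline (the column names match the code; the column types and lengths are assumptions):

CREATE TABLE users (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255),      -- expert name
    city VARCHAR(255),      -- city
    industry VARCHAR(255),  -- industry
    price DECIMAL(10, 1),   -- price
    chat_nums INT,          -- number of chats
    score DECIMAL(3, 1)     -- score
);

Matching that structure, the items.py file of the scrapy project can be written directly: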
import scrapy

class ZaihangItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()       # name
    city = scrapy.Field()       # city
    industry = scrapy.Field()   # industry
    price = scrapy.Field()      # price
    chat_nums = scrapy.Field()  # number of chats
    score = scrapy.Field()      # score
Coding time
The project-creation steps can follow the previous examples; this article starts directly from developing the collection file, zh.py. This time the paging address of the target data has to be concatenated manually, so an instance variable (field) named page is declared in advance. After each response, check whether the data is empty; if it is not, increment page by 1.
The request address template is as follows:
https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=心理&page={}
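The Chinese tag name 心理 must be URL-encoded when the request is actually sent; the encoded form that appears in the spider's url_format below can be reproduced with Python's standard library:

from urllib.parse import quote

print(quote("心理"))  # prints %E5%BF%83%E7%90%86, the form used in url_format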
If the page number exceeds the maximum, the page shows an empty state, which means the data is empty. So you only need to check whether a section with class=empty exists on the page. Parsing and cleaning the data can be read directly from the following code.
import scrapy
from zaihang_spider.items import ZaihangItem

class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['www.zaih.com']
    page = 1  # start page number
    url_format = 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86&page={}'  # address template
    start_urls = [url_format.format(page)]

    def parse(self, response):
        empty = response.css("section.empty")  # check whether the data is empty
        if len(empty) > 0:
            return  # the empty tag exists, return directly

        mentors = response.css(".mentor-board a")  # hyperlinks of all experts
        for m in mentors:
            item = ZaihangItem()  # instantiate an object
            name = m.css(".mentor-card__name::text").extract_first()
            city = m.css(".mentor-card__location::text").extract_first()
            industry = m.css(".mentor-card__title::text").extract_first()
            price = self.replace_space(m.css(".mentor-card__price::text").extract_first())
            chat_nums = self.replace_space(m.css(".mentor-card__number::text").extract()[0])
            score = self.replace_space(m.css(".mentor-card__number::text").extract()[1])

            # format the data
            item["name"] = name
            item["city"] = city
            item["industry"] = industry
            item["price"] = price
            item["chat_nums"] = chat_nums
            item["score"] = score
            yield item

        # issue the next request
        self.page += 1
        next_url = self.url_format.format(self.page)
        yield scrapy.Request(url=next_url, callback=self.parse)

    def replace_space(self, in_str):
        # strip line breaks and the currency symbol
        in_str = in_str.replace("\n", "").replace("\r", "").replace("¥", "")
        return in_str.strip()
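As a quick illustration (the raw string below is hypothetical, not captured from the site), replace_space strips the line breaks and the currency symbol so that the pipeline can later convert the value to a number:

raw = "\n  ¥399\n"  # hypothetical raw text of a price node
clean = raw.replace("\n", "").replace("\r", "").replace("¥", "").strip()
print(clean)  # 399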
Enable ITEM_PIPELINES in the settings.py file:
ITEM_PIPELINES = {
    'zaihang_spider.pipelines.ZaihangMySQLPipeline': 300,
}
The pipeline class implements the classmethod from_crawler, which receives the global crawler object and uses it to read the configuration items in settings.py. In addition, there is a from_settings method, which is usually seen in official plugins, as in the following example.
@classmethod
def from_settings(cls, settings):
    host = settings.get('HOST')
    return cls(host)

@classmethod
def from_crawler(cls, crawler):
    # FIXME: for now, stats are only supported from this constructor
    return cls.from_settings(crawler.settings)
Before writing the following code, you need to add the configuration items to settings.py. The settings.py file code:
HOST = "127.0.0.1"
PORT = 3306
USER = "root"
PASSWORD = "123456"
DB = "zaihang"
The pipelines.py file code:
import pymysql

class ZaihangMySQLPipeline:
    def __init__(self, host, port, user, password, db):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.db = db
        self.conn = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('HOST'),
            port=crawler.settings.get('PORT'),
            user=crawler.settings.get('USER'),
            password=crawler.settings.get('PASSWORD'),
            db=crawler.settings.get('DB')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.password, db=self.db)

    def process_item(self, item, spider):
        # store into MySQL
        name = item["name"]
        city = item["city"]
        industry = item["industry"]
        price = item["price"]
        chat_nums = item["chat_nums"]
        score = item["score"]
        sql = "insert into users(name,city,industry,price,chat_nums,score) values ('%s','%s','%s',%.1f,%d,%.1f)" % (
            name, city, industry, float(price), int(chat_nums), float(score))
        print(sql)
        self.cursor = self.conn.cursor()  # get a cursor
        try:
            self.cursor.execute(sql)  # execute the SQL
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        if self.cursor:  # guard: the cursor is None if no item was ever processed
            self.cursor.close()
        self.conn.close()
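One caveat about the insert statement above: it is built with Python string formatting, so a quote character inside a value breaks the SQL, and the pattern is open to SQL injection. A safer variant, shown here as a sketch rather than the author's original code, is pymysql's parameter binding, which lets the driver quote the values:

# sketch: a parameterized version of the insert inside process_item
sql = ("insert into users(name, city, industry, price, chat_nums, score) "
       "values (%s, %s, %s, %s, %s, %s)")
self.cursor.execute(sql, (name, city, industry, float(price), int(chat_nums), float(score)))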
The three important methods in the pipeline file are open_spider, process_item, and close_spider.
# executed once when the spider is opened
def open_spider(self, spider):
    # instance variables can be dynamically added to the spider object here;
    # their values can then be read in the spider module, e.g. via self in parse(self, response)
    # some initialization actions
    pass

# processes the extracted data; this is where the data-saving code is written
def process_item(self, item, spider):
    pass

# executed only once, when the spider closes normally;
# close_spider is not executed if the crawler crashes abnormally while running
def close_spider(self, spider):
    # close the database and release resources
    pass
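To make the open_spider comment concrete, here is a minimal sketch; the attribute name started_at is hypothetical and chosen only for illustration:

import time

class ExamplePipeline:
    def open_spider(self, spider):
        # dynamically attach a value to the spider object
        spider.started_at = time.time()

# inside the spider module, the value is then readable through self, e.g.:
#   def parse(self, response):
#       print(self.started_at)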
Crawl results display
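The results are produced by running the spider. Assuming the project layout above, the crawl can be started with the scrapy crawl zh command, or programmatically through Scrapy's CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# run from the project root so settings.py (and the pipeline registration) is picked up
process = CrawlerProcess(get_project_settings())
process.crawl('zh')  # the spider's name attribute
process.start()      # blocks until the crawl finishes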
Written at the end
Today is day 246/365 of continuous writing. Looking forward to your follows, likes, comments, and favorites.