This is the 19th day of my participation in the November Gengwen Challenge. Check out the event details: The Last Gengwen Challenge of 2021

1.1 The Topic

Master the Item and Pipeline data serialization and output methods in Scrapy.

Use the Scrapy + XPath + MySQL storage technology route to crawl book data from Dangdang. Candidate website: www.dangdang.com/

1.2 Approach

1.2.1 settings.py

  • Enable the request headers

  • Add the database connection information

  • Set ROBOTSTXT_OBEY to False

  • Enable the pipelines (a sketch of these settings follows below)
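
A minimal settings.py sketch covering the four points above. The keys HOSTNAME, PORT, DATABASE, USERNAME and PASSWORD match what the pipeline reads later; the project name dangdang, the pipeline class name DangdangPipeline, the User-Agent string and the concrete connection values are assumptions:

# settings.py (sketch)
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # assumed UA string
}

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,   # assumed module/class name
}

# Database connection information read by the pipeline
HOSTNAME = '127.0.0.1'      # assumed host
PORT = 3306
DATABASE = 'spider_db'      # assumed database name
USERNAME = 'root'           # assumed username
PASSWORD = '123456'         # placeholder password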

1.2.2 items.py

Define the fields in items.py:

import scrapy


class DangdangItem(scrapy.Item):
    # One record per book
    title = scrapy.Field()      # book title
    author = scrapy.Field()     # author
    publisher = scrapy.Field()  # publisher
    date = scrapy.Field()       # publication date
    price = scrapy.Field()      # price
    detail = scrapy.Field()     # description text

1.2.3 db_spider.py

  • Look at the page and figure out how pagination works

(Screenshots: URLs of the second and third pages)

Comparing the URLs, it is easy to see that page_index is the pagination parameter (a sketch of the spider skeleton follows below).
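
The parse() method shown below references self.keyword, self.page_index, self.total and self.next_url; here is a minimal sketch of how the spider class could declare them. The spider name, the keyword, the import path and the exact search-URL template are assumptions:

import scrapy
from dangdang.items import DangdangItem   # assumed project/module path


class DbSpider(scrapy.Spider):
    name = 'db_spider'                     # assumed spider name
    keyword = 'python'                     # assumed search keyword
    page_index = 1                         # current page number
    total = 0                              # number of items yielded so far
    # Assumed search-URL template: %s is the keyword, %d the page number
    next_url = 'http://search.dangdang.com/?key=%s&act=input&page_index=%d'

    def start_requests(self):
        # Request the first results page
        yield scrapy.Request(self.next_url % (self.keyword, self.page_index),
                             callback=self.parse)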

  • Obtain the node information
    def parse(self, response):
        # The search results are contained in the element with id "component_59"
        lis = response.xpath('//*[@id="component_59"]')
        titles = lis.xpath(".//p[1]/a/@title").extract()
        authors = lis.xpath(".//p[5]/span[1]/a[1]/text()").extract()
        publishers = lis.xpath('.//p[5]/span[3]/a/text()').extract()
        dates = lis.xpath(".//p[5]/span[2]/text()").extract()
        prices = lis.xpath('.//p[3]/span[1]/text()').extract()
        details = lis.xpath('.//p[2]/text()').extract()
        # Pair the parallel lists into items
        for title, author, publisher, date, price, detail in zip(titles, authors, publishers, dates, prices, details):
            item = DangdangItem(
                title=title,
                author=author,
                publisher=publisher,
                date=date,
                price=price,
                detail=detail,
            )
            self.total += 1
            print(self.total, item)
            yield item
        # Move on to the next results page
        self.page_index += 1
        yield scrapy.Request(self.next_url % (self.keyword, self.page_index),
                             callback=self.next_parse)
  • Limit the number of items crawled

Crawl 102 items in total (see the sketch below for one way to enforce the cap).
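
The original code for the 102-item cap is not shown, so this is an assumption: one way to stop after 102 items is Scrapy's built-in CloseSpider extension via the CLOSESPIDER_ITEMCOUNT setting. Alternatively, parse() could simply stop yielding the next-page request once self.total reaches 102.

# settings.py -- one possible way to cap the crawl
CLOSESPIDER_ITEMCOUNT = 102   # CloseSpider extension stops the spider once 102 items are scraped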

1.2.4 pipelines.py

  • Database connection
    def __init__(self):
        # Read the host name, port, database name, username and password from settings.py
        settings = get_project_settings()   # assumes: from scrapy.utils.project import get_project_settings
        host = settings['HOSTNAME']
        port = settings['PORT']
        dbname = settings['DATABASE']
        username = settings['USERNAME']
        password = settings['PASSWORD']
        # Requires pymysql (pip install pymysql)
        self.conn = pymysql.connect(host=host, port=port, user=username, password=password, database=dbname,
                                    charset='utf8')
        self.cursor = self.conn.cursor()
  • Insert data
    def process_item(self, item, spider):
        data = dict(item)
        sql = "INSERT INTO spider_dangdang(title,author,publisher,b_date,price,detail)" \
              " VALUES (%s, %s, %s, %s, %s, %s)"
        try:
            # Execute the insert first, then commit
            self.cursor.execute(sql, [data["title"],
                                      data["author"],
                                      data["publisher"],
                                      data["date"],
                                      data["price"],
                                      data["detail"]])
            self.conn.commit()
            print("Insert successful")
        except Exception as err:
            print("Insert failed", err)
        return item
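
The pipeline opens a connection in __init__ but never releases it; here is a small close_spider sketch (an addition, not shown in the original) that closes the cursor and connection when the spider finishes:

    def close_spider(self, spider):
        # Called once when the spider shuts down; release the MySQL resources
        self.cursor.close()
        self.conn.close()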


The results show 102 records in total. I set the id column to auto-increment; because data had already been inserted during an earlier test, the IDs do not start from 1.
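
For reference, one possible definition of the spider_dangdang table that matches the INSERT statement and the auto-increment id mentioned above. Only the column names come from the article; the column types, sizes and connection values are assumptions:

import pymysql

ddl = """
CREATE TABLE IF NOT EXISTS spider_dangdang (
    id        INT AUTO_INCREMENT PRIMARY KEY,  -- auto-increment id mentioned above
    title     VARCHAR(255),
    author    VARCHAR(255),
    publisher VARCHAR(255),
    b_date    VARCHAR(64),
    price     VARCHAR(32),
    detail    TEXT
)
"""

# Assumed connection values; adjust to match your settings.py
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', database='spider_db', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.commit()
conn.close()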