For small crawling projects, the requests library alone is usually enough. In large projects, however, requests quickly becomes hard to manage on its own, so in this article we learn to use the Scrapy framework to build our crawler.

Installing Scrapy

Scrapy can be installed with `pip install Scrapy`. Then, in an empty directory, initialize a crawler project with `scrapy startproject spider_demo`, where spider_demo is the name of the project. After the command finishes, we can see the following directory structure:

```
spider_demo/
    scrapy.cfg            # deployment configuration
    spider_demo/          # the project's Python package
        __init__.py       # initialization file for the package
        items.py          # item definitions
        middlewares.py    # middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where our spiders live
            __init__.py
```

Write our first crawler

Create a baidu_spider.py file in the spiders directory:

```python
from scrapy.cmdline import execute
from scrapy.selector import Selector
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baiduNews'
    start_urls = ['http://news.baidu.com/']

    # parse the response of every crawled page
    def parse(self, response):
        # select every <li> whose class contains "hdline" (the headline list)
        li_list = response.xpath('//li[contains(@class,"hdline")]')
        for i in li_list:
            # join all the text inside the <a> tags of the current <li>
            print(''.join(Selector(text=i.get()).xpath('//a//text()').extract()))
            print('==========================')


# start the crawler from the command line
if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'baiduNews'])
```

Run the main function and our crawler will automatically crawl the Baidu News page and print the headlines.
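Depending on the target site, the default project settings may block the crawl (robots.txt rules, or a missing browser-like User-Agent). Below is a minimal sketch of the options in settings.py you may need to adjust; the exact values are assumptions, not part of the original project:

```python
# settings.py (excerpt) -- assumed adjustments, adapt to your target site

# some sites reject requests that lack a browser-like User-Agent
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36')

# the generated project obeys robots.txt by default; only disable this
# if you understand the implications for the site you are crawling
ROBOTSTXT_OBEY = False

# be polite: wait between requests
DOWNLOAD_DELAY = 1
```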

Parsing

Scrapy makes it easy to parse HTML with XPath: response.xpath() returns a list of Selector objects, get() returns the first matching result as a string, and extract() (or getall()) returns a list of strings for all matches. Learning basic XPath syntax is enough to parse most pages, and regular expressions can be applied on top of that when needed to pull out exactly the information we want.
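Here is a small standalone sketch of these calls, using scrapy.Selector on a hypothetical HTML snippet instead of a live response:

```python
from scrapy import Selector

# hypothetical HTML snippet standing in for a crawled page
html = '''
<ul>
  <li class="hdline"><a href="/a">First <b>headline</b></a></li>
  <li class="hdline"><a href="/b">Second headline</a></li>
</ul>
'''

sel = Selector(text=html)

# xpath() returns a SelectorList; get() gives the first match as a string
first_link = sel.xpath('//li[contains(@class,"hdline")]//a/@href').get()
print(first_link)    # /a

# extract() / getall() give every match as a list of strings
titles = sel.xpath('//li[contains(@class,"hdline")]//a//text()').getall()
print(''.join(titles))    # First headlineSecond headline

# re() applies a regular expression to the matched values
ids = sel.xpath('//a/@href').re(r'/(\w+)')
print(ids)    # ['a', 'b']
```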

Conclusion

By using Scrapy and writing different middlewares, we can build crawlers that go well beyond what requests alone can handle.
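As an illustration of the middleware idea, here is a minimal sketch of a downloader middleware that rotates the User-Agent header; the class name and the User-Agent list are hypothetical, not from the original project:

```python
# middlewares.py -- a minimal sketch of a downloader middleware
import random


class RandomUserAgentMiddleware:
    # hypothetical list of User-Agent strings to rotate through
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    # Scrapy calls process_request for every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # continue normal request processing
```

It would then be enabled in settings.py with something like `DOWNLOADER_MIDDLEWARES = {'spider_demo.middlewares.RandomUserAgentMiddleware': 543}`.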