An overview
In the previous article on writing a simple web crawler, we gained a preliminary understanding of what a crawler is and used a few third-party Python libraries to conveniently extract the content we wanted. In real work, however, the requirements are often much more complex, and handling them that way is very inefficient: you end up defining and implementing a lot of basic crawler infrastructure yourself, or stitching together many third-party Python libraries. But don’t worry, Python has several excellent crawler frameworks, such as Scrapy, which we will learn next. This article uses a simple example to show how to extract web content with Scrapy; for more details, refer to the Scrapy documentation.
Establishing goals
As always, you need a clear goal before you start. This time the goal is to crawl some technical articles and store them in a database, so we need a target URL and a database schema. For the database we chose MySQL, and for the target site we picked a content site called Script House (jb51.net). Let’s start with the table structure for storing articles:
CREATE TABLE `articles` (
`id` mediumint(8) AUTO_INCREMENT NOT NULL,
`title` varchar(255) DEFAULT NULL,
`content` longtext,
`add_date` int(11) DEFAULT 0,
`hits` int(11) DEFAULT '0',
`origin` varchar(500) DEFAULT '',
`tags` varchar(45) DEFAULT '',
PRIMARY KEY (`id`),
KEY `add_date` (`add_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
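As a quick sanity check of this schema, here is a minimal sketch of writing one row into the table with web.py's database layer, which the spider below also uses for storage. The connection parameters (local MySQL, database 'imchenkun', user/password 'root') are the same placeholders used later in the spider, and the origin URL is a hypothetical example; adjust everything to your own environment.

# minimal storage sketch -- assumes a local MySQL server and the placeholder credentials above
import time
import web

db = web.database(dbn='mysql', host='127.0.0.1', db='imchenkun', user='root', pw='root')

# insert one test article; add_date is stored as a unix timestamp
db.insert('articles',
          title='test article',
          content='<p>hello</p>',
          origin='http://www.jb51.net/article/1.htm',  # hypothetical url following the article url pattern
          add_date=int(time.time()),
          hits=0,
          tags='test')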
Analyzing the target structure
The entry point we need to crawl first is the “Network Programming” node; its main entry URL is (www.jb51.net/list/index_…). Opening this page, we can inspect its HTML structure with Chrome’s developer tools (or another browser’s inspector), as shown below:
The part outlined in red in the figure is what we need to extract under the “Network Programming” node: the main category entries for all articles, through which we can reach the article lists of the different categories. From this preliminary structural analysis, the crawling route of this crawler is:
Enter from the main entry -> extract all categories in the current entry -> enter the category list through the category entry -> enter the article page through the list
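That route maps directly onto a chain of Scrapy callbacks, one per hop. The skeleton below only illustrates the shape of the spider we are about to write (the class name is made up and the selector logic is left out); the full version follows later in this article.

# illustrative skeleton of the crawl route -- names are placeholders, extraction logic omitted
import scrapy
from scrapy.http import Request

class RouteSketchSpider(scrapy.Spider):
    name = "route_sketch"
    start_urls = ["http://www.jb51.net/list/index_1.htm"]  # main entry of the "Network Programming" node

    def parse(self, response):
        # main entry -> category entries
        for cate_url in []:  # XPath extraction of category links goes here
            yield Request(cate_url, callback=self.parse_list)

    def parse_list(self, response):
        # category list (with pagination) -> article pages
        for article_url in []:  # XPath extraction of article links goes here
            yield Request(article_url, callback=self.parse_article)

    def parse_article(self, response):
        # article page -> extract title, content, tags and store them
        pass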
Next, take a look at our category list. Click on a category at random and open the list as shown in the picture below:
Two main parts are outlined here: the first is the article title, the second is the pagination. The article URLs are the entries through which we crawl the article content. Pay particular attention to how pagination is handled: from the “last page” link we can tell how many pages the current category list has. With the analysis above we have basically determined every route entry of the crawler, so next we implement this in code.
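To make the pagination handling concrete: each list page URL has the form /list/list_<category id>_<page number>.htm, so the two numbers in the “last page” link are enough to generate every page URL of a category. A minimal standalone sketch, using a made-up href value for illustration:

# pagination sketch -- the href value below is hypothetical, only its format matches the real site
import re

last_page_href = "/list/list_97_279.htm"  # would be extracted from the "last page" link of the pager
cate_id, page_count = map(int, re.findall(r'\d+', last_page_href))

# build the URLs of all remaining list pages (page 1 is the one we are already on)
page_urls = ["http://www.jb51.net/list/list_%d_%d.htm" % (cate_id, page) for page in range(2, page_count + 1)]
print(len(page_urls))  # 278 remaining pages in this hypothetical category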
Implementing the crawler
Before implementing the crawler, one note: to keep this chapter brief, we will not discuss Item Pipelines for the time being.
It is not hard to see how Scrapy works from this example; we only use its most basic functionality here. The main third-party libraries we rely on are:
- web.py web framework: only its database part is used here; it will later be used for content presentation
- Scrapy crawler framework: only the most basic content extraction is used
Some XPath knowledge will also be used here, so look up the XPath syntax if you are not familiar with it.
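As a quick taste of the XPath calls used in the spider below, Scrapy's Selector can be exercised on a standalone HTML string; the fragment here is invented to mimic the category block, not copied from the real page.

# toy XPath example with Scrapy's Selector -- the HTML fragment is made up for illustration
from scrapy.selector import Selector

html = '<div class="index_con"><span><a href="/list/list_97_1.htm">ASP</a></span></div>'
sel = Selector(text=html)

print(sel.xpath('//div[@class="index_con"]/span/a/@href').extract())     # ['/list/list_97_1.htm']
print(sel.xpath('//div[@class="index_con"]/span/a/@href').re('(\\d+)'))  # ['97', '1']
print(sel.xpath('//div[@class="index_con"]/span/a/text()').extract())    # ['ASP']

With that, the full spider looks like this: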
# -*- coding:utf-8 -*-
'''by sudo rm -rf  http://imchenkun.com'''
import scrapy
from scrapy.http import Request
import web
import time

db = web.database(dbn='mysql', host='127.0.0.1', db='imchenkun', user='root', pw='root')

# allowed domain
allow_domain = "jb51.net"

base_url = "http://www.jb51.net"

# list page template: list_<category id>_<page number>.htm
list_url = "http://www.jb51.net/list/list_%d_%d.htm"

# starting page number of a list
list_page = 1

# article page template
crawl_url = "http://www.jb51.net/article/%d.htm"


class JB51Spider(scrapy.Spider):
    name = "jb51"

    start_urls = [
        "http://www.jb51.net/list/index_1.htm"
    ]

    cate_list = []

    def parse(self, response):
        # each category href contains two numbers (category id and page number),
        # so take every other regex match to keep only the category ids
        cate_id = response.selector.xpath('//div[@class="index_bor clearfix"]/div[@class="index_con"]/span/a/@href').re('(\\d+)')[::2]
        for id in cate_id:
            cate_url = list_url % (int(id), 1)
            yield Request(cate_url, callback=self.parse_page)

    def parse_page(self, response):
        """parse the first list page of a category and schedule the remaining pages"""
        # the "last page" link of the pager holds the category id and the total page count
        _params = response.selector.xpath('//div[@class="dxypage clearfix"]/a[last()]/@href').re('(\\d+)')
        cate_id = int(_params[0])  # category id
        count = int(_params[1])    # total number of pages

        # extract the article urls on the current (first) page
        article_urls = response.selector.xpath('//div[@class="artlist clearfix"]/dl/dt/a/@href').extract()

        # crawl the articles on the first page
        for article_url in article_urls:
            yield Request(base_url + article_url, callback=self.parse_article)

        # schedule the remaining list pages
        for page in range(1, count):
            url = (list_url % (cate_id, page + 1))
            yield Request(url, callback=self.parse_list)

    def parse_list(self, response):
        """parse a list page and crawl its articles"""
        article_urls = response.selector.xpath('//div[@class="artlist clearfix"]/dl/dt/a/@href').extract()
        for article_url in article_urls:
            yield Request(base_url + article_url, callback=self.parse_article)

    def parse_article(self, response):
        """parse an article page and store it in the database"""
        title = response.selector.xpath('//div[@class="title"]/h1/text()').extract()[0]
        content = response.selector.xpath('//div[@id="content"]').extract()[0]
        tags = ','.join(response.selector.xpath('//div[@class="tags mt10"]/a/text()').extract())

        # only insert the article if it has not been stored before (deduplicate by origin url)
        results = db.query('select count(0) as total from articles where origin=$origin', vars={'origin': response.url})
        if results[0].total <= 0:
            db.insert('articles',
                      title=title,
                      origin=response.url,
                      content=content,
                      add_date=int(time.time()),
                      hits=0,
                      tags=tags)
Execute this code using Scrapy:
scrapy runspider jb51_spider.py
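If you run this against a live site, it is worth throttling the requests so the server is not hammered (see the note at the end of this article). Scrapy lets you override settings from the command line with -s, for example adding a download delay:

scrapy runspider jb51_spider.py -s DOWNLOAD_DELAY=2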
The result of the run can be seen in the database, as shown below:
GitHub address
Conclusion
This article focused on the basic Scrapy Spider part, using XPath to extract content from the target site’s structure and handling pagination along the way. The goal here is to build an idea of how to write a crawler, not just how to use tools to grab data: first determine the goal, then analyze the target, then use existing tools to extract the content. Various problems will come up during extraction; solve them one by one until the crawler runs without obstacles. Next I’ll use Scrapy to explore Item definitions, Pipeline implementations, and how to use proxies.
Special note: the Script House website mentioned in this article is used only for learning and exchanging crawler techniques. Any infringement issues readers get involved in have nothing to do with me, and please do not put a heavy load on the site’s servers by crawling large amounts of content while practicing.
This article was first published on sudo rm -rf under a CC Attribution (BY) - NonCommercial (NC) - NoDerivatives (ND) license. Please credit the original author when reposting.