Basic crawler architecture

The basic crawler framework comprises five modules: the crawler scheduler, URL manager, web page downloader, web page parser, and data storage.

Crawler scheduler: starts, runs, and stops the crawler, and coordinates the work of the other modules.

URL manager: manages the crawled and uncrawled URLs and provides an interface for obtaining new URL links.

Web page downloader: downloads the web page at a URL provided by the URL manager, stores it as a string, and sends it to the web page parser.

Web page parser: receives the downloaded HTML from the web page downloader, extracts new URLs and feeds them to the URL manager, and extracts the valid data and submits it to the data storage.

Data storage: stores the data extracted by the web page parser in a file or database.
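As a rough illustration of how these five modules fit together, here is a minimal sketch in Python (the function names and the output.txt file are illustrative assumptions, not from the original):

import requests

def download(url):
    # Web page downloader: fetch the page and return it as a string.
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text

def parse(html):
    # Web page parser: return (new_urls, data); a real parser would use
    # regular expressions, as in the sections below.
    return [], html

def store(data):
    # Data storage: append the extracted data to a file (illustrative).
    with open('output.txt', 'a', encoding='utf-8') as f:
        f.write(str(data) + '\n')

def crawl(root_url):
    # Crawler scheduler: drive the URL manager (two sets of URLs) and
    # coordinate the downloader, parser, and storage modules.
    new_urls, old_urls = {root_url}, set()  # URL manager
    while new_urls:
        url = new_urls.pop()
        old_urls.add(url)
        urls, data = parse(download(url))
        new_urls.update(u for u in urls if u not in old_urls)
        store(data)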

Crawl thesaurus categories

The Sogou thesaurus classification page is pinyin.sogou.com/dict/. Inspecting the page source with the browser's developer tools shows that the href (hyperlink) structure is the same for all categories at the same level; only the ID and name differ. Each level of classification can therefore be extracted with a regular expression, and the multi-level classification can be stored in nested dictionaries.

Get big categories

An example of a big-category link on the thesaurus classification page pinyin.sogou.com/dict/:

<a href="/dict/cate/index/167?rf=dictindex&pos=dict_rcmd" target="_blank">...</a>

The category-link pattern is compiled into a Pattern instance, which is applied to the page text to capture the ID and name groups; re.findall returns the matches as a list, which is then stored as a dict.

import re
import requests

# A simple request header; the User-Agent value is an illustrative assumption.
headers = {'User-Agent': 'Mozilla/5.0'}

bigCatePattern = re.compile(r"href='/dict/cate/index/(\d+).*?>(.*?)<")
bigCateURL = 'http://pinyin.sogou.com/dict/'
response = requests.get(bigCateURL, headers=headers)
response.encoding = 'utf-8'
bigCateData = response.text
result = re.findall(bigCatePattern, bigCateData)  # list of (ID, name) tuples

There is a point of confusion in the big-category URLs at pinyin.sogou.com/dict/. In the Chrome developer tools, the big-category and small-category links look similar. The Pattern instance re.compile(r'href="/dict/cate/index/(\d+).*?>(.*?)<') matches both the big- and small-category links, consistent with what the developer tools show. The Pattern instance re.compile(r"href='/dict/cate/index/(\d+).*?>(.*?)<") matches only the big-category links, which is the goal here: in the raw HTML delivered to requests, the big-category hrefs are apparently quoted with single quotes, unlike what the developer tools display.
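Since re.findall returns (ID, name) tuples, the big categories can be stored as a dict, as noted above. A minimal sketch (the name bigCateDict is an illustrative assumption):

# Map big-category name -> ID from the (ID, name) tuples in result.
bigCateDict = {name: cateID for cateID, name in result}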

Get small categories from the big categories

An example of a small-category link on the thesaurus classification page pinyin.sogou.com/dict/:

<a href="/dict/cate/index/360?rf=dictindex" target="_blank" class="">...</a>

The small-category links are extracted from the big-category pages, whose URLs have the form http://pinyin.sogou.com/dict/cate/index/ + bigCateID, for example:

<a href="/dict/cate/index/167"></a>
...
<a class="citylist" href="/dict/cate/index/360">...</a>

The Pattern instance re.compile(r'href="/dict/cate/index/(\d+)">(.*?)<') matches both the big- and small-category links on such a page, so the big category's own ID is also captured and needs to be removed.

smallCatePattern = re.compile(r'href="/dict/cate/index/(\d+)">(.*?)<')
smallCateBaseURL = 'http://pinyin.sogou.com/dict/cate/index/'
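Putting the pieces together, here is a sketch of fetching the small categories of one big category and filtering out the big category's own ID (it reuses headers and the imports from above; the names bigCateID, matches, and smallCateDict are illustrative assumptions):

# Fetch one big-category page, e.g. the category with ID 167 shown above.
bigCateID = '167'
response = requests.get(smallCateBaseURL + bigCateID, headers=headers)
response.encoding = 'utf-8'
matches = re.findall(smallCatePattern, response.text)
# The pattern also matches the big category's own link, so drop that ID.
smallCateDict = {name: cateID for cateID, name in matches if cateID != bigCateID}
# For the multi-level classification, nest this dict under the big-category
# name, e.g. cateTree[bigCateName] = smallCateDict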
