Basic crawler architecture
The basic crawler framework comprises five modules: the crawler scheduler, URL manager, web page downloader, web page parser, and data storage.
Crawler scheduler: starts, runs, and stops the crawler, and coordinates the work of the other modules.
URL manager: manages the crawled and uncrawled URLs and provides an interface for obtaining new URL links.
Web page downloader: downloads the web page at a URL provided by the URL manager, stores it as a string, and passes it to the web page parser.
Web page parser: receives the downloaded HTML from the web page downloader, extracts new URLs and feeds them to the URL manager, and extracts the valid data and submits it to data storage.
Data storage: stores the data produced by the web page parser in a file or database.
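A minimal sketch of how these five modules might fit together; the names and structure below are illustrative, not taken from the original article:

import requests

class URLManager:                      # URL manager
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()
    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)
    def get(self):
        url = self.new_urls.pop()      # take an uncrawled URL
        self.old_urls.add(url)         # mark it as crawled
        return url

def download(url):                     # web page downloader
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    return resp.text                   # page stored as a string

def parse(html):                       # web page parser (application-specific)
    return [], html                    # returns (new URLs, extracted data)

def store(data):                       # data storage
    with open('output.txt', 'a', encoding='utf-8') as f:
        f.write(str(data) + '\n')

def schedule(seed, limit=10):          # crawler scheduler: drives the loop
    urls = URLManager()
    urls.add(seed)
    while urls.new_urls and len(urls.old_urls) < limit:
        page = download(urls.get())
        new_urls, data = parse(page)
        for u in new_urls:
            urls.add(u)
        store(data)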
Crawl thesaurus categories
The Sogou thesaurus classification page is pinyin.sogou.com/dict/. Inspecting the page source with the browser's developer tools shows that the href (hyperlink) structure is identical for categories at the same level; only the ID or name differs. The categories can therefore be extracted from the page with regular expressions, and the multi-level classification can be stored in a nested dictionary.
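For example, one possible nested-dict layout (the IDs 167 and 360 come from the page examples below; the names are placeholders, not real category names):

# Hypothetical nested-dict layout for the two-level classification
thesaurusCates = {
    '167': {                          # big-category ID
        'name': 'bigCateName',        # placeholder name
        'smallCates': {
            '360': 'smallCateName',   # small-category ID -> name
        },
    },
}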
Get big categories
Example of a main-category link on the thesaurus classification page pinyin.sogou.com/dict/:
<a href="/dict/cate/index/167? rf=dictindex& pos=dict_rcmd" target="_blank"< p style = "max-width: 100%; clear: both; min-height: 1em;Copy the code
The category-link regex is compiled into a Pattern instance, which is applied against the page text to capture the ID and name groups; the matches are returned as a list of tuples and then stored in a dict.
import re
import requests

# A minimal browser-like request header; the exact User-Agent value is illustrative
headers = {'User-Agent': 'Mozilla/5.0'}

bigCatePattern = re.compile(r"href='/dict/cate/index/(\d+).*?>(.*?)<")
bigCateURL = 'http://pinyin.sogou.com/dict/'
response = requests.get(bigCateURL, headers=headers)
response.encoding = 'utf-8'
bigCateData = response.text
result = re.findall(bigCatePattern, bigCateData)  # list of (ID, name) tuples
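The (ID, name) tuples can then be stored in a dict, as described above; a one-line sketch:

bigCateDict = dict(result)   # {bigCateID: bigCateName}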
There is a potential point of confusion in the main-category URLs at pinyin.sogou.com/dict/. In the Chrome developer tools, the main-category links and the small-category links look the same, presumably because the Elements panel shows the browser-normalized DOM while requests receives the raw HTML, in which the large-category links use single quotes. The Pattern instance re.compile(r'href="/dict/cate/index/(\d+).*?>(.*?)<') matches both the large- and small-category links, which is consistent with what the developer tools show. The Pattern instance re.compile(r"href='/dict/cate/index/(\d+).*?>(.*?)<") matches only the large-category links, which is the goal here.
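To see the effect, the single-quote pattern can be run against a synthetic fragment; the quote styles and link texts here are assumptions based on the behavior just described, not the literal page source:

# Synthetic fragment: a single-quoted large-category link and a
# double-quoted small-category link (assumed quoting, for illustration)
fragment = ("<a href='/dict/cate/index/167?rf=dictindex'>BigCate</a>"
            '<a href="/dict/cate/index/360?rf=dictindex">SmallCate</a>')
onlyBig = re.findall(r"href='/dict/cate/index/(\d+).*?>(.*?)<", fragment)
# onlyBig == [('167', 'BigCate')] -- the double-quoted link is not matched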
Get small categories from large categories
Example of a small-category link on the thesaurus classification page pinyin.sogou.com/dict/:
<a href="/dict/cate/index/360? rf=dictindex" target="_blank" class=""> the < / a >Copy the code
The small-category links are extracted from each big-category page, for example http://pinyin.sogou.com/dict/cate/index/ + bigCateID:
<a href="/dict/cate/index/167"></a>
...
<a class="citylist" href="/dict/cate/index/360"> the < / a >Copy the code
On a big-category page like the one above, re.compile(r'href="/dict/cate/index/(\d+)">(.*?)<') matches both the large- and small-category links, so the large category's own ID also appears in the matches and needs to be removed.
smallCatePattern = re.compile(r'href="/dict/cate/index/(\d+)">(.*?)<')
smallCateBaseURL = 'http://pinyin.sogou.com/dict/cate/index/'
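Putting the pieces together, a minimal sketch of fetching one big-category page and collecting its small categories; bigCateID '167' is the example ID from above, and the filter drops the large category's own ID as just noted:

# Reuses headers, smallCatePattern and smallCateBaseURL defined above
bigCateID = '167'                         # example ID from the page above
resp = requests.get(smallCateBaseURL + bigCateID, headers=headers)
resp.encoding = 'utf-8'
matches = re.findall(smallCatePattern, resp.text)
# drop the large category's own ID, keep {smallCateID: name}
smallCates = {cid: name for cid, name in matches if cid != bigCateID}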
You might want to see more
Hadoop/CDH
Hadoop Combat (1) _ Aliyun builds the pseudo-distributed environment of Hadoop2.x
Hadoop Deployment (2) _ VM deployment of Hadoop in full-distribution mode
Hadoop Deployment (3) _ Virtual machine building CDH full distribution mode
Hadoop Deployment (4) _Hadoop cluster management and resource allocation
Hadoop Deployment (5) _Hadoop operation and maintenance experience
Hadoop Deployment (6) _ Build the Eclipse development environment for Apache Hadoop
Hadoop Deployment (7) _Apache Install and configure Hue on Hadoop
Hadoop Deployment (8) _CDH Add Hive services and Hive infrastructure
Hadoop Combat (9) _Hive and UDF development
Hadoop Combat (10) _Sqoop import and extraction framework encapsulation
The WeChat official account "Data Analysis" shares the self-cultivation of a data scientist. Since we have met, let's grow together.
For reprints, please credit the WeChat official account "Data Analysis".
Reader Telegram group:
https://t.me/sspadluo