Preface
I recently worked through a Python 3 homework problem involving:
- a web crawler
- Chinese text extraction from web pages
- building a text index
- keyword search
The libraries involved are:
- Crawling: requests
- Parsing: BeautifulSoup (bs4)
- Regular expressions: re
- Chinese word segmentation: jieba
The code is released below for quick reference as a small demo.
Problem description
Search engine design and implementation
- Input: Tencent Sports page links, given as a list whose length may vary, for example:
["http://fiba.qq.com/a/20190420/001968.htm",
"http://sports.qq.com/a/20190424/000181.htm",
"http://sports.qq.com/a/20190423/007933.htm",
"http://new.qq.com/omn/SPO2019042400075107"]
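The bracketed list above is typed in as a single line at the prompt. A minimal sketch (the input string is hypothetical) showing that json.loads parses such a line directly, which is sturdier than splitting on commas and slicing quote characters off each piece:

```python
import json

# Hypothetical pasted input line, in the format the assignment describes.
raw = '["http://fiba.qq.com/a/20190420/001968.htm", "http://sports.qq.com/a/20190424/000181.htm"]'

# The bracketed, quoted list is itself valid JSON, so json.loads
# returns a clean Python list of URL strings in one step.
urls = json.loads(raw)
print(urls)
```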
- Process: crawl the pages, parse them, extract and analyze the Chinese text, and build the index. The third-party libraries from the textbook must be used, all intermediate work is done in memory, and the running time of the process is printed.
- Search: prompt the user to enter a keyword to search for.
- Output: the input links are printed in descending order of the keyword's occurrence frequency, along with auxiliary information such as word-frequency data in JSON format. Links whose documents do not contain the keyword are not printed, and the retrieval time is printed at the end, for example:
1 "http:xxxxxx.htm" 3
2 "https:xxxx.htm" 2
3 "https:xxxxx.htm" 1
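The ranked listing above can be produced from (url, count) pairs like so — a minimal sketch with made-up counts, mirroring how the search step collects and orders its results:

```python
import json

# Hypothetical [url, count] pairs, as the search step would collect them.
list_search_result = [
    ["http://sports.qq.com/a/20190423/007933.htm", 1],
    ["http://fiba.qq.com/a/20190420/001968.htm", 3],
    ["http://sports.qq.com/a/20190424/000181.htm", 2],
]

# Sort by keyword frequency, highest first, then print rank, link, count.
list_search_result.sort(key=lambda x: x[1], reverse=True)
for i, (url, count) in enumerate(list_search_result):
    print(i + 1, url, count)

# Auxiliary info in JSON; ensure_ascii=False keeps Chinese text readable.
print(json.dumps({"total_hits": len(list_search_result)}, ensure_ascii=False))
```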
Code
The main steps of code implementation are:
- Web crawling: the crawler function
- Page element cleaning (removing unneeded English text, scripts, and tags): the bs4_page_clean function
- Chinese text extraction with re: the re_chinese function
- Saving each page's Chinese words in a dict and indexing them: the jieba_create_index function
- Keyword search: the search function
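The Chinese-extraction step can be checked in isolation. A minimal sketch (the sample string is made up) using the common CJK block \u4e00-\u9fff, which is narrower and safer than the \u1100-\uFFFD range the full code below uses:

```python
import re

# Keep only runs of common CJK characters, dropping digits,
# English words, and any leftover markup fragments.
pattern = re.compile(r'[\u4e00-\u9fff]+')

text = 'NBA season 季后赛 2019 开打'
chinese = ''.join(pattern.findall(text))
print(chinese)  # 季后赛开打
```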
```python
import requests
from bs4 import BeautifulSoup
import json
import re
import jieba
import time

USER_AGENT = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) '
                            'Chrome/20.0.1092.0 Safari/536.6'}
URL_TIMEOUT = 10
SLEEP_TIME = 2

# dict_result layout:
# {
#     1: {"url": "xxxxx", "word": {"word1": x, "word2": x, "word3": x}},
#     2: {"url": "xxxxx", "word": {"word1": x, "word2": x, "word3": x}}
# }
dict_result = {}

# list_search_result layout:
# [
#     [url, count],
#     [url, count]
# ]
list_search_result = []


def crawler(list_URL):
    for i, url in enumerate(list_URL):
        print("Crawling:", url, "...")
        page = requests.get(url, headers=USER_AGENT, timeout=URL_TIMEOUT)
        page.encoding = page.apparent_encoding  # guard against wrong charset detection
        result_clean_page = bs4_page_clean(page)
        result_chinese = re_chinese(result_clean_page)
        # print("Chinese content:", result_chinese)
        dict_result[i + 1] = {"url": url, "word": jieba_create_index(result_chinese)}
        print("Done, sleeping", SLEEP_TIME, "seconds...")
        time.sleep(SLEEP_TIME)


def bs4_page_clean(page):
    print("Cleaning the page with bs4 and a regular expression...")
    soup = BeautifulSoup(page.text, "html.parser")
    [script.extract() for script in soup.findAll('script')]
    [style.extract() for style in soup.findAll('style')]
    reg1 = re.compile("<[^>]*>")
    content = reg1.sub('', soup.prettify())
    return str(content)


def re_chinese(content):
    print("Extracting Chinese text with re...")
    pattern = re.compile(u'[\u1100-\uFFFD]+?')
    result = pattern.findall(content)
    return ''.join(result)


def jieba_create_index(string):
    list_word = jieba.lcut_for_search(string)
    dict_word_temp = {}
    for word in list_word:
        if word in dict_word_temp:
            dict_word_temp[word] += 1
        else:
            dict_word_temp[word] = 1
    return dict_word_temp


def search(string):
    for k, v in dict_result.items():
        if string in v["word"]:
            list_search_result.append([v["url"], v["word"][string]])
    # rank by keyword frequency, highest first
    list_search_result.sort(key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    list_URL_sport = input("Enter the URL list: ").split(",")
    print(list_URL_sport)
    # strip the quote characters around each URL
    for i in range(len(list_URL_sport)):
        list_URL_sport[i] = list_URL_sport[i][1:-1]
    print(list_URL_sport)
    # list_URL_sport = ["http://fiba.qq.com/a/20190420/001968.htm",
    #                   "http://sports.qq.com/a/20190424/000181.htm",
    #                   "http://sports.qq.com/a/20190423/007933.htm",
    #                   "http://new.qq.com/omn/SPO2019042400075107"]
    time_start_crawler = time.time()
    crawler(list_URL_sport)
    time_end_crawler = time.time()
    print("Crawling and indexing time:", time_end_crawler - time_start_crawler)
    word = input("Enter a keyword to search: ")
    time_start_search = time.time()
    search(word)
    time_end_search = time.time()
    print("Search time:", time_end_search - time_start_search)
    for i, row in enumerate(list_search_result):
        print(i + 1, row[0], row[1])
    print("Word frequency info:", json.dumps(dict_result, ensure_ascii=False))
```
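The per-page word counting in jieba_create_index is equivalent to collections.Counter. A sketch with a hard-coded token list (standing in for jieba.lcut_for_search output, so it runs without jieba installed):

```python
from collections import Counter

# Hard-coded tokens in place of jieba.lcut_for_search(page_text).
tokens = ["篮球", "比赛", "篮球", "季后赛"]

# Counter performs the same word -> occurrence-count tally
# as the manual dict loop in jieba_create_index.
dict_word = dict(Counter(tokens))
print(dict_word)  # {'篮球': 2, '比赛': 1, '季后赛': 1}
```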
Results
Follow me
Personal blogs:
- CSDN: @Rude3Knife
- Zhihu: @Zhendong
- Jianshu: @pretty three knives a knife
- Juejin: @pretty three knife knife
- WeChat official account: Back-end technology ramble