Preface

I recently completed a Python 3 homework assignment involving:

  • Web crawler
  • Web page Chinese text extraction
  • Create a text index
  • Keyword search

The libraries involved are:

  • Crawler library: requests
  • Parsing library: BeautifulSoup (bs4)
  • Regular expressions: re
  • Chinese word segmentation: jieba

The code is posted here for quick reference as a small demo.

Problem description

Search engine design and implementation

  1. Input: a list of Tencent Sports page links; the number of links is variable, for example:

```
["http://fiba.qq.com/a/20190420/001968.htm",
 "http://sports.qq.com/a/20190424/000181.htm",
 "http://sports.qq.com/a/20190423/007933.htm",
 "http://new.qq.com/omn/SPO2019042400075107"]
```
  2. Process: crawl the pages, parse them, extract the Chinese text, and build the index. The third-party libraries from the textbook must be used, all intermediate results must be kept in memory, and the running time of the process must be printed;

  3. Search: prompt the user to enter a keyword to search;

  4. Output: print the input links ordered by keyword frequency, along with auxiliary information such as word-frequency counts in JSON format. Links whose pages do not contain the keyword are not printed, and the retrieval time is printed at the end, for example:

```
1 "http:xxxxxx.htm" 3
2 "https:xxxx.htm" 2
3 "https:xxxxx.htm" 1
```
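The JSON auxiliary output above hinges on one detail: `json.dumps` escapes non-ASCII characters by default, so Chinese word-frequency info becomes unreadable `\uXXXX` sequences unless `ensure_ascii=False` is passed. A minimal sketch (the entry below is a hypothetical index record in the same shape the code builds later):

```python
import json

# Hypothetical index entry: page number -> URL plus a word-frequency map.
dict_result = {
    1: {"url": "http://fiba.qq.com/a/20190420/001968.htm",
        "word": {"篮球": 3, "比赛": 2}}
}

# ensure_ascii=False keeps the Chinese characters readable
# instead of emitting \uXXXX escape sequences.
print(json.dumps(dict_result, ensure_ascii=False))
```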

Code

The main steps of code implementation are:

  • Web crawling: the `crawler` function
  • Page cleaning: remove unneeded English characters and tags (the `bs4_page_clean` function)
  • Chinese text extraction with re: the `re_chinese` function
  • Saving each page's Chinese words in a dict and indexing them: the `jieba_create_index` function
  • Keyword search: the `search` function
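Before the full listing, the Chinese-extraction step can be sketched in isolation. Note that the listing uses the broad range `[\u1100-\uFFFD]`, which matches more than CJK ideographs; the sketch below uses the narrower, commonly used `[\u4e00-\u9fa5]` range instead (an assumption on my part, not the exact pattern from the listing):

```python
import re

def extract_chinese(content):
    # Keep only runs of common CJK ideographs (U+4E00..U+9FA5),
    # dropping Latin letters, digits, punctuation, and markup.
    pattern = re.compile(u'[\u4e00-\u9fa5]+')
    return ''.join(pattern.findall(content))

print(extract_chinese("<p>NBA 季后赛 2019: Game 1!</p>"))  # prints 季后赛
```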
```python
import requests
from bs4 import BeautifulSoup
import json
import re
import jieba
import time

USER_AGENT = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) '
                            'Chrome/20.0.1092.0 Safari/536.6'}
URL_TIMEOUT = 10
SLEEP_TIME = 2

# dict_result structure:
# {
#     1: {"url": "xxxxx", "word": {"word1": x, "word2": x, "word3": x}},
#     2: {"url": "xxxxx", "word": {"word1": x, "word2": x, "word3": x}}
# }
dict_result = {}

# list_search_result structure:
# [
#     [url, count],
#     [url, count]
# ]
list_search_result = []


def crawler(list_URL):
    for i, url in enumerate(list_URL):
        print("Crawling page:", url, "...")
        page = requests.get(url, headers=USER_AGENT, timeout=URL_TIMEOUT)
        page.encoding = page.apparent_encoding  # prevent encoding-detection errors
        result_clean_page = bs4_page_clean(page)
        result_chinese = re_chinese(result_clean_page)
        # print("Chinese content:", result_chinese)
        dict_result[i + 1] = {"url": url, "word": jieba_create_index(result_chinese)}
        time.sleep(SLEEP_TIME)


def bs4_page_clean(page):
    print("Cleaning the page...")
    soup = BeautifulSoup(page.text, "html.parser")
    [script.extract() for script in soup.findAll('script')]
    [style.extract() for style in soup.findAll('style')]
    reg1 = re.compile("<[^>]*>")
    content = reg1.sub('', soup.prettify())
    return str(content)


def re_chinese(content):
    print("Extracting Chinese text...")
    pattern = re.compile(u'[\u1100-\uFFFD]+?')
    result = pattern.findall(content)
    return ''.join(result)


def jieba_create_index(string):
    list_word = jieba.lcut_for_search(string)
    dict_word_temp = {}
    for word in list_word:
        if word in dict_word_temp:
            dict_word_temp[word] += 1
        else:
            dict_word_temp[word] = 1
    return dict_word_temp


def search(string):
    for k, v in dict_result.items():
        if string in v["word"]:
            list_search_result.append([v["url"], v["word"][string]])
    # sort by keyword frequency, descending
    list_search_result.sort(key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    list_URL_sport = input("Enter the URL list: ").split(",")
    print(list_URL_sport)
    # strip the surrounding quotes from each URL
    for i in range(len(list_URL_sport)):
        list_URL_sport[i] = list_URL_sport[i][1:-1]
    print(list_URL_sport)
    # list_URL_sport = ["http://fiba.qq.com/a/20190420/001968.htm",
    #                   "http://sports.qq.com/a/20190424/000181.htm",
    #                   "http://sports.qq.com/a/20190423/007933.htm",
    #                   "http://new.qq.com/omn/SPO2019042400075107"]
    time_start_crawler = time.time()
    crawler(list_URL_sport)
    time_end_crawler = time.time()
    print("Crawling time:", time_end_crawler - time_start_crawler)
    word = input("Enter a keyword to search: ")
    time_start_search = time.time()
    search(word)
    time_end_search = time.time()
    print("Search time:", time_end_search - time_start_search)
    for i, row in enumerate(list_search_result):
        print(i + 1, row[0], row[1])
    print("Word frequency info:", json.dumps(dict_result, ensure_ascii=False))
```
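Two details of the listing are worth isolating. The manual counting loop in `jieba_create_index` is equivalent to `collections.Counter`, and `search` ranks pages by keyword frequency with a descending sort. A sketch using a pre-segmented token list (hypothetical words standing in for `jieba.lcut_for_search` output, so it runs without jieba installed):

```python
from collections import Counter

# Pre-segmented tokens standing in for jieba.lcut_for_search() output.
tokens = ["火箭", "勇士", "火箭", "季后赛", "火箭"]
index = Counter(tokens)   # word -> frequency, same shape as jieba_create_index's dict
print(index["火箭"])       # prints 3

# Ranking pages by keyword frequency, as search() does before printing:
pages = [["http://a.htm", 1], ["http://b.htm", 3], ["http://c.htm", 2]]
pages.sort(key=lambda x: x[1], reverse=True)
print(pages[0])            # prints ['http://b.htm', 3]
```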

Results

Follow me

Personal blogs:

  • CSDN: @Rude3Knife
  • Zhihu: @Zhendong
  • Jianshu: @pretty three knives a knife
  • Juejin: @pretty three knife knife
  • WeChat public account: Back-end technology ramble