Preface

Article plagiarism is widespread on the Internet, and many bloggers have resigned themselves to it. In recent years, as the Internet has developed, plagiarism and other unethical behavior online has intensified; copy-pasting an article and then publishing it under an "original" label is common, and some copied articles even attach the copier's own contact information so that readers go to the plagiarist for the source code and other materials. Such behavior is infuriating.

This article uses search engine results as the article library and compares them for similarity with local or Internet data to implement the duplicate check. Since the duplicate-check process is broadly similar to that of Weibo sentiment analysis, the sentiment-analysis function can be added easily (the next chapter builds on this code to complete the whole process of data collection, cleaning, and sentiment analysis).

Owing to a recent lack of time, only the main functions have been implemented for now and the details have not been optimized, but some brief design went into the code structure, which makes subsequent feature extension and upgrades more convenient. I will keep updating this tool and strive to make it more mature and practical.

Technology

To suit as many sites as possible, Selenium is used for data acquisition; by configuring the information of different search engines, fairly general search engine queries can be made without worrying too much about scraping dynamic data. The jieba library is used to segment Chinese sentences, cosine similarity is used for the text similarity comparison, and the comparison data is exported to an Excel report.
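
The Excel export itself is only hinted at later (the final Browser class imports xlwt but the report-writing code is not shown); a minimal sketch of how such a report could be written with xlwt, using hypothetical rows:

import xlwt

# hypothetical report rows: (keyword, matched URL, similarity)
rows = [('src', 'https://example.com/a', 87.5)]

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('report')
for col, title in enumerate(['keyword', 'url', 'similarity']):
    sheet.write(0, col, title)  # header row
for r, (kw, link, sim) in enumerate(rows, start=1):
    sheet.write(r, 0, kw)
    sheet.write(r, 1, link)
    sheet.write(r, 2, sim)
book.save('report.xls')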

The Weibo sentiment analysis is based on sklearn, using naive Bayes to classify the sentiment of the data; its data-capture step is implemented much like the text search here.
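
As a taste of what that looks like, a minimal scikit-learn naive Bayes sketch with toy, hypothetical data (illustrative only, not the next chapter's actual code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training data: 1 = positive, 0 = negative (hypothetical samples)
texts = ['great day', 'awful service', 'love it', 'hate this']
labels = [1, 0, 1, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(['love this day'])))  # likely [1]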

Test code acquisition

CSDN codechina code repository: codechina.csdn.net/A757291228/…

Environment

The author's environment is as follows:

  • Operating system: Windows 7 SP1 64-bit
  • Python version: 3.7.7
  • Browser: Google Chrome
  • Browser Version: 80.0.3987 (64-bit)

Corrections are welcome if anything is wrong; feel free to leave a comment.

1. Implementing the text check

1.1 Selenium Installation and configuration

Since Selenium is used, readers need to have it installed before proceeding. Install it with the pip command as follows:

pip install selenium

After installing Selenium, you also need to download a driver.

  • Google Chrome driver: the driver version must match the browser version; on the download page, pick the driver that corresponds to your browser version
  • If you are using Firefox, check your Firefox version and download the driver from the GitHub geckodriver releases page (if the English is a problem, right-click and translate; each release states the browser versions it supports, so check before downloading)

After Selenium is installed, create a new Python file called selenium_search and first import the library:

from selenium import webdriver

For readers who have not added the driver to the system PATH, the driver location can be specified explicitly (the author has already added the driver to PATH):

driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')

Create a variable url and assign it the Baidu home page link; pass it to the get method to try opening the Baidu home page. The complete code is as follows:

from selenium import webdriver

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)

Run the Python file from the command line (the little black window on Windows):



After the script runs, Google Chrome opens and navigates to the Baidu home page:



Selenium has now been used successfully to open a specified URL. Next, query the results for a specified search keyword, then traverse those results looking for similar data.

1.2 Selenium Baidu search engine keyword search

Before automatically manipulating the browser to type keywords into the search box, you need to get the search box element object. Open the Baidu home page in Google Chrome, right-click the search box and choose Inspect; the page-element (code) view pops up. Find the search box element there (as the mouse moves over element nodes, the corresponding region of the page is highlighted in blue):



In HTML code, id values are mostly unique (unless there is a typo), so the id is chosen here as the marker for getting the search box element object. Selenium provides the find_element_by_id method, which takes an id and returns the page element object.

input=driver.find_element_by_id('kw')

After obtaining the element object, use the send_keys method to pass in the values to be typed:

input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')

Here "PHP Basics Tutorial Step 11 Object Oriented" is passed in as the search keyword. Run the script to check whether the keyword is typed into the search box. The code so far is as follows:

from selenium import webdriver

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')

The browser opens and the search keyword is typed successfully:



Now just click the "Baidu Search" button to complete the search. Find the ID of the "Baidu Search" button with the same element-inspection method used for the search box:



Use the find_element_by_id method to get the element object, then use the click method to click the button:

search_btn=driver.find_element_by_id('su')
search_btn.click()

The complete code is as follows:

from selenium import webdriver

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn = driver.find_element_by_id('su')
search_btn.click()

The browser now automatically types the search keyword and performs the search:

1.3 Traversal of Search Results

Now that the search results are in the browser, the whole web page needs to be retrieved to get at them. BeautifulSoup is used here to parse the whole page and extract the search results.

BeautifulSoup is an HTML/XML parser that makes it very easy to grab information from HTML. Install it before use; the install command is as follows:

pip install beautifulsoup4

After installation, add the import at the top of the current Python file:

from bs4 import BeautifulSoup

To get the HTML text, read page_source:

html=driver.page_source

Once you have the HTML code, create a new BeautifulSoup object, pass in the HTML content, and specify a parser, in this case html.parser:

soup = BeautifulSoup(html, "html.parser")

Then inspect the search content and notice that every result is contained in an h3 tag whose class is t:



BeautifulSoup's select method is used to obtain tags; it supports lookup by class name, tag name, ID, attribute, and combinations of these. Since every result in the Baidu search page carries class="t", traversing by class name is the easiest:

search_res_list=soup.select('.t')

In the select method the class name t is preceded by a dot (.), which indicates selection by class name. After that, add a print to try printing the result:

print(search_res_list)
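
As an aside, select accepts any CSS selector, so the other lookup forms mentioned above work the same way; a few illustrative examples (not needed by this tool):

links = soup.select('a')        # by tag name
btn = soup.select('#su')        # by id
titles = soup.select('h3.t')    # tag name and class combined
hrefs = soup.select('a[href]')  # by attribute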

In general, search_res_list will print as an empty list, because the browser's page content was retrieved before the browser finished parsing the data and rendering it. There is a simple fix for this problem; it is not very efficient, but it will do for the moment and will later be replaced by something better (time must be imported at the top of the file):

time.sleep(2)
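
The more robust replacement, used in the refactored code in part two, is Selenium's explicit wait, which polls for a condition instead of sleeping blindly. A minimal sketch, assuming (as the later code does) that the pagination bar with ID "page" only appears once the results have loaded:

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 30 seconds, polling once per second, for the element with
# ID "page" to be present; raises TimeoutException if it never appears
WebDriverWait(driver, timeout=30, poll_frequency=1).until(
    EC.presence_of_element_located((By.ID, "page")))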

The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn = driver.find_element_by_id('su')
search_btn.click()
time.sleep(2)  # wait here so the browser can parse and render the results
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list = soup.select('.t')
print(search_res_list)

Running the program will output:



The result is every tag of class t, including each tag's child nodes; child elements can be accessed with the dot (.) operator. The search content obtained through the browser is all clickable links, so only the a tag under each element is needed:

for el in search_res_list:
    print(el.a)



It is clear from the result that the a tags of the search results have been obtained; what remains is to extract the href hyperlink inside each a tag. Attributes are read with list-style indexing on the element:

for el in search_res_list:
    print(el.a['href'])

Running the script successfully results in:



Careful readers may notice that the results obtained are all baidu.com URLs. These URLs are in fact "indexes" that redirect to the real page. Since these indexes are not guaranteed to stay unchanged and are not suitable for long-term storage, the real links still need to be obtained.

So we execute a JS snippet to open each of these URLs; the browser is redirected to the real site, whose URL we then read. The execute_script method runs JS code as follows:

for el in search_res_list:
    js = 'window.open("'+el.a['href']+'")'
    driver.execute_script(js)

After opening a new page, you need to obtain the new page's handle, otherwise the new page cannot be operated on. Handles are obtained as follows:

handle_all = driver.window_handles  # obtain all handles

With the handles in hand, switch the current object to the new page. Since only two pages exist after opening one link, a simple traversal finds the other handle:

handle_this = driver.current_window_handle  # the current page's handle
handle_exchange = None                      # the handle to switch to
for handle in handle_all:                   # traverse all handles
    if handle != handle_this:               # if it is not the current handle, it is the new page
        handle_exchange = handle
driver.switch_to.window(handle_exchange)    # switch

After the switch, the operation object is the newly opened page. Get the new page's URL from the current_url property:

real_url=driver.current_url
print(real_url)

Then close the current page and set the operation object back to the initial page:

driver.close()
driver.switch_to.window(handle_this)  # switch back to the initial page

Running the script successfully obtains the real URLs:



Finally, after retrieving the real URL, use a list to store the results:

real_url_list.append(real_url)

The complete code for this section is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn = driver.find_element_by_id('su')
search_btn.click()
time.sleep(2)  # wait here so the browser can parse and render the results
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list = soup.select('.t')
real_url_list = []
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
    handle_this = driver.current_window_handle  # the current page's handle
    handle_all = driver.window_handles          # all handles
    handle_exchange = None                      # the handle to switch to
    for handle in handle_all:
        if handle != handle_this:               # if it is not the current handle, it is the new page
            handle_exchange = handle
    driver.switch_to.window(handle_exchange)    # switch
    real_url = driver.current_url
    print(real_url)
    real_url_list.append(real_url)              # store the result
    driver.close()
    driver.switch_to.window(handle_this)        # switch back to the initial page
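
As a side note, opening a browser tab per result is slow. If one assumes these index URLs are plain HTTP redirects, the real URL could also be resolved without the browser by letting the requests library follow the redirect chain; this is an alternative technique, not the method used in this article:

import requests

def resolve_real_url(index_url):
    # GET follows redirects by default; resp.url is the final URL
    resp = requests.get(index_url, timeout=10)
    return resp.url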

1.4 Obtaining the Source text

Create a TXT file in the textsrc folder and save the text to be compared into it. Here the saved content is the body of the article "PHP Basics Tutorial Step 11 Object Oriented".



Write a function in your code to get text content:

def read_txt(path=''):
    f = open(path,'r')
    return f.read()
src=read_txt(r'F:\tool\textsrc\src.txt')

For testing purposes, an absolute path is used here. After obtaining the text content, write the cosine-similarity comparison method.

1.5 cosine similarity

The similarity calculation follows the article "Python implementation cosine similarity text comparison", with part of the implementation modified.

This article uses the cosine similarity algorithm for the comparison; the general steps are word segmentation -> vector calculation -> similarity calculation. Create a Python file called Analyse and a class of the same name in it; add a word-segmentation method to the class, and import jieba and collections (for counting) at the top:

from jieba import lcut
import jieba.analyse
import collections

The Count method:

# word segmentation and counting
def Count(self, text):
    tag = jieba.analyse.textrank(text, topK=20)  # segment, keeping the top 20 keywords
    word_counts = collections.Counter(tag)       # count
    return word_counts

The Count method takes a text parameter, segments it with the textrank method, and counts the words. Next add the MergeWord method, which merges the two word sets so the vector calculation afterwards is convenient:

# merge words
def MergeWord(self, T1, T2):
    MergeWord = []
    for i in T1:
        MergeWord.append(i)
    for i in T2:
        if i not in MergeWord:
            MergeWord.append(i)
    return MergeWord

The merge method is very simple, so no explanation is needed. Next add the vector-calculation method:

# calculate the vector
def CalVector(self, T1, MergeWord):
    TF1 = [0] * len(MergeWord)
    for ch in T1:
        TermFrequence = T1[ch]
        word = ch
        if word in MergeWord:
            TF1[MergeWord.index(word)] = TermFrequence
    return TF1

Finally, add the similarity calculation method:

def cosine_similarity(self, vector1, vector2):
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return 0
    else:
        return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)
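
In the usual notation, what this method computes is the standard cosine similarity of the two term-frequency vectors, scaled to a 0-100 percentage:

$$ \text{similarity}(A,B) = \frac{A \cdot B}{\|A\| \, \|B\|} \times 100 = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} \times 100 $$

with the result rounded to two decimal places, and defined as 0 when either vector is all zeros.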

The similarity method takes two vectors, calculates their similarity, and returns it. To reduce code redundancy, a simple method is added to wrap the whole calculation process:

# compute the similarity of two texts
def get_Tfidf(self, text1, text2):  # testing with local data; the search-engine comparison method comes later
    # self.correlate.word.set_this_url(url)
    T1 = self.Count(text1)
    T2 = self.Count(text2)
    mergeword = self.MergeWord(T1, T2)
    return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

The full code for the Analyse class is as follows:

from jieba import lcut
import jieba.analyse
import collections

class Analyse:
    # compute the similarity of two texts
    def get_Tfidf(self, text1, text2):  # testing with local data; the search-engine comparison method comes later
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1, T2)
        return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

    # word segmentation and counting
    def Count(self, text):
        tag = jieba.analyse.textrank(text, topK=20)  # segment, keeping the top 20 keywords
        word_counts = collections.Counter(tag)       # count
        return word_counts

    # merge words
    def MergeWord(self, T1, T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # calculate the vector
    def CalVector(self, T1, MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # calculate TF-IDF (cosine similarity)
    def cosine_similarity(self, vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0
        for a, b in zip(vector1, vector2):
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)
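
A minimal usage sketch, assuming the read_txt helper from section 1.4 is available in the calling file:

analyse = Analyse()
src = read_txt(r'F:\tool\textsrc\src.txt')  # the local source text from section 1.4
print(analyse.get_Tfidf(src, src))          # identical texts score 100.0

Comparing a text with itself returns 100.0, which is a quick sanity check of the whole pipeline.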

1.6 Similarity comparison between search results and text

In the selenium_search file, import Analyse and create an object:

from Analyse import Analyse
Analyse=Analyse()

In the loop that traverses the search results, add retrieval of the newly opened page's content:

time.sleep(5)
html_2=driver.page_source

time.sleep(5) gives the browser time to render the current page content. After obtaining the content of the newly opened page, make the similarity comparison:

Analyse.get_Tfidf(src,html_2)

Since it returns a value, print it:

print('Similarity: ', Analyse.get_Tfidf(src, html_2))

The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from Analyse import Analyse

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse = Analyse()

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn = driver.find_element_by_id('su')
search_btn.click()
time.sleep(2)  # wait here so the browser can parse and render the results
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list = soup.select('.t')
real_url_list = []
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
    handle_this = driver.current_window_handle  # the current page's handle
    handle_all = driver.window_handles          # all handles
    handle_exchange = None                      # the handle to switch to
    for handle in handle_all:
        if handle != handle_this:               # if it is not the current handle, it is the new page
            handle_exchange = handle
    driver.switch_to.window(handle_exchange)    # switch
    real_url = driver.current_url
    time.sleep(5)
    html_2 = driver.page_source
    print('Similarity: ', Analyse.get_Tfidf(src, html_2))
    print(real_url)
    real_url_list.append(real_url)
    driver.close()
    driver.switch_to.window(handle_this)

Run the script:



The results show several links with very high similarity; these are the suspected plagiarized articles.

With that, the code for the basic duplicate check is done. But the code is rather redundant and cluttered, so let's optimize it next.

2. Code optimization

From the programming above, the brief steps are: obtain the search content -> obtain the results -> calculate similarity. Three classes are created accordingly: Browser, Analyse (already created), and SearchEngine. Browser handles searching and data acquisition; Analyse handles similarity analysis and vector calculation; SearchEngine holds the basic configuration of different search engines, since the corresponding interaction is fairly consistent across most of them.

2.1 Browser class

Create a new Python file called Browser and add the initialization method:

def __init__(self, conf):
    self.browser = webdriver.Chrome()
    self.conf = conf
    self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

self.browser = webdriver.Chrome() creates a new browser object; conf is the incoming search configuration, a dictionary through which the search content is specified. self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf() obtains the configuration of the chosen search engine; since the input boxes and search buttons differ between engines, multiple search engines can be supported through different configuration records.
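
For orientation, a minimal sketch of what conf looks like at this stage (the full configuration, with target_page and white_list, appears in part three):

# minimal configuration sketch; further keys are added in part three
conf = {
    'engine': 'baidu',            # selects which EngineConf subclass to use
    'kw': 'some search keyword',  # placeholder keyword
}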

Add the keyword-input method

# input the search keyword
def send_keyword(self):
    input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
    input.send_keys(self.conf['kw'])

self.engine_conf['searchTextID'] is the ID of the search input box taken from the engine configuration, and self.conf['kw'] is the search keyword taken from the incoming configuration.

Click the search button

# click search
def click_search_btn(self):
    search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
    search_btn.click()

The ID of the search button is obtained through self.engine_conf['searchBtnID'].

Get search results and text

# get search results and page text
def get_search_res_url(self):
    res_link = {}
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    content = self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
    for el in search_res_list:
        js = 'window.open("' + el.a['href'] + '")'
        self.browser.execute_script(js)
        handle_this = self.browser.current_window_handle  # the current page's handle
        handle_all = self.browser.window_handles          # all handles
        handle_exchange = None                            # the handle to switch to
        for handle in handle_all:
            if handle != handle_this:                     # if it is not the current handle, it is the new page
                handle_exchange = handle
        self.browser.switch_to.window(handle_exchange)    # switch
        real_url = self.browser.current_url
        time.sleep(1)
        res_link[real_url] = self.browser.page_source     # save content and URL
        self.browser.close()
        self.browser.switch_to.window(handle_this)
    return res_link

The method above is similar to the earlier traversal of search results, with the addition of WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page"))): it waits until the element whose ID is "page" (the tag of the pagination buttons) is present, since if it is absent the current page has not fully loaded. The maximum wait is timeout=30 seconds, polled once per second; if the element never appears, a timeout exception is raised. res_link[real_url] = self.browser.page_source saves the page content and URL into a dictionary, which is returned for the subsequent comparison.

Open the target search engine to search

# open the target search engine and search
def search(self):
    self.browser.get(self.engine_conf['website'])  # open the search engine site
    self.send_keyword()                            # type the search keyword
    self.click_search_btn()                        # click search
    return self.get_search_res_url()               # retrieve the search results

Finally, a search method is added that performs all the previous operations in a single call, keeping the internals hidden and the interface simple. The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

class Browser:
    def __init__(self, conf):
        self.browser = webdriver.Chrome()
        self.conf = conf
        self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # input the search keyword
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click search
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # get search results and page text
    def get_search_res_url(self):
        res_link = {}
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        content = self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
        for el in search_res_list:
            js = 'window.open("' + el.a['href'] + '")'
            self.browser.execute_script(js)
            handle_this = self.browser.current_window_handle  # the current page's handle
            handle_all = self.browser.window_handles          # all handles
            handle_exchange = None                            # the handle to switch to
            for handle in handle_all:
                if handle != handle_this:                     # if it is not the current handle, it is the new page
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange)    # switch
            real_url = self.browser.current_url
            time.sleep(1)
            res_link[real_url] = self.browser.page_source     # save content and URL
            self.browser.close()
            self.browser.switch_to.window(handle_this)
        return res_link

    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])  # open the search engine site
        self.send_keyword()                            # type the search keyword
        self.click_search_btn()                        # click search
        return self.get_search_res_url()               # retrieve the search results

2.2 SearchEngine class

The SearchEngine class mainly holds the configuration of the different search engines, which makes it easier to add new engines or extend similar services.

class EngineConfManage:
    def get_Engine_conf(self, engine_name):
        if engine_name == 'baidu':
            return BaiduEngineConf()
        elif engine_name == 'qihu360':
            return Qihu360EngineConf()
        elif engine_name == 'sougou':
            return SougouEngineConf()

class EngineConf:
    def __init__(self):
        self.engineConf = {}

    def get_conf(self):
        return self.engineConf

class BaiduEngineConf(EngineConf):
    engineConf = {}

    def __init__(self):
        self.engineConf['searchTextID'] = 'kw'
        self.engineConf['searchBtnID'] = 'su'
        self.engineConf['nextPageBtnID_xpath_f'] = '//*[@id="page"]/div/a[10]'
        self.engineConf['nextPageBtnID_xpath_s'] = '//*[@id="page"]/div/a[11]'
        self.engineConf['searchContentHref_class'] = 't'
        self.engineConf['website'] = 'http://www.baidu.com'

class Qihu360EngineConf(EngineConf):
    def __init__(self):
        pass

class SougouEngineConf(EngineConf):
    def __init__(self):
        pass

Only the Baidu configuration is implemented here. All the engine classes inherit from the EngineConf base class, which gives the subclasses the get_conf method. The EngineConfManage class selects the engine configuration by the name passed in.
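
Extending to another engine only means filling in the same keys; a sketch for SougouEngineConf with placeholder values (all element IDs, XPaths, and class names below are assumptions that must be looked up in that engine's actual page source):

class SougouEngineConf(EngineConf):
    def __init__(self):
        self.engineConf = {}
        # all values below are hypothetical placeholders
        self.engineConf['searchTextID'] = 'query'               # id of the search input box
        self.engineConf['searchBtnID'] = 'stb'                  # id of the search button
        self.engineConf['nextPageBtnID_xpath_f'] = '//a[@id="sogou_next"]'
        self.engineConf['nextPageBtnID_xpath_s'] = '//a[@id="sogou_next"]'
        self.engineConf['searchContentHref_class'] = 'vrTitle'  # class wrapping each result title
        self.engineConf['website'] = 'https://www.sogou.com'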

2.3 How to Use it

Start by importing the two classes:

from Browser import Browser
from Analyse import Analyse

Create a new method to read the local file:

def read_txt(path=''):
    f = open(path,'r')
    return f.read()

Read the file and create a data-analysis object:

src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse = Analyse()

Write the configuration dictionary:

# configuration information
conf = {
    'engine': 'baidu',
    'kw': 'PHP Basics Tutorial Step 11 Object Oriented',  # the search keyword used throughout this article
}

Create a Browser object and pass in the configuration:

drvier=Browser(conf)

Get the search results and content:

url_content = drvier.search()  # get search results and page content

Traverse the results and calculate the similarity:

for k in url_content:
    print(k, 'Similarity: ', Analyse.get_Tfidf(src, url_content[k]))

The complete code is as follows:

from Browser import Browser
from Analyse import Analyse

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse = Analyse()

# configuration information
conf = {
    'engine': 'baidu',
    'kw': 'PHP Basics Tutorial Step 11 Object Oriented',
}

drvier = Browser(conf)
url_content = drvier.search()  # get search results and page content
for k in url_content:
    print(k, 'Similarity: ', Analyse.get_Tfidf(src, url_content[k]))

Doesn't that feel much better? Quite refreshing. But you think this is the end? Not at all; next, let's extend the functionality.

3. Feature extension

For now this little tool only performs the basic duplicate check, and it still has many shortcomings. For example, there is no whitelist filtering; only a single article's similarity can be checked; and for the lazy there is no automatic search from an article list, nor export of the results. Next, some of these features will be gradually added; for reasons of space they are not all covered here, and updates will continue.

3.1 Automatic text retrieval

Create a new Python file called FileHandle. This class automatically obtains the TXT files in a specified directory; each TXT file's name serves as a search keyword and its content is the article body. The class code is as follows:

import os

class FileHandle:
    # get file content
    def get_content(self, path):
        f = open(path, "r")  # set the file object
        content = f.read()   # read the whole TXT file into a string
        f.close()            # close the file
        return content

    # get file names and contents
    def get_text(self):
        file_path = os.path.dirname(__file__)  # current file directory
        txt_path = file_path + r'\textsrc'     # TXT directory
        rootdir = os.path.join(txt_path)       # target directory
        local_text = {}
        # read the TXT files
        for (dirpath, dirnames, filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1] == '.txt':
                    flag_file_path = dirpath + '\\' + filename            # file path
                    flag_file_content = self.get_content(flag_file_path)  # read the file content
                    if flag_file_content != '':
                        local_text[filename.replace('.txt', '')] = flag_file_content  # key: file name, value: content
        return local_text

There are two methods, get_content and get_text. get_text walks the directory for the paths of all TXT files, then uses get_content to obtain the detailed text content and returns local_text, whose keys are the file names and whose values are the text content.
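
A quick way to check what get_text returns, assuming a textsrc folder next to the script containing, say, src.txt:

from FileHandle import FileHandle

texts = FileHandle().get_text()
print(list(texts.keys()))  # e.g. ['src']: file names without .txt, later used as search keywords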

3.2 BrowserManage class

In the Browser class file, add a BrowserManage class that inherits from Browser, with the following method:

# open the target search engine and search
def search(self):
    self.browser.get(self.engine_conf['website'])  # open the search engine site
    self.send_keyword()                            # type the search keyword
    self.click_search_btn()                        # click search
    return self.get_search_res_url()               # retrieve the search results

Adding this class separates the Browser class's logic from the other methods, which makes extension easier.

3.3 Extension of the Browser class

Add a next-page method to the Browser class so that more content is retrieved during a search, with the number of results to retrieve taken from the configuration:

# turn to the next page
def click_next_page(self, md5):
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    # the next-page button's xpath is not the same on every page; default to the non-first-page xpath
    try:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
    except:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
    next_page_btn.click()
    # compare the page's MD5 to decide whether the page has actually changed (temporary approach)
    i = 0
    while md5 == hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest():  # md5 comparison
        time.sleep(0.3)  # prevent some errors; a forced stop keeps some stability for now
        i += 1
        if i > 100:
            return False
    return True

On Baidu the XPath of the next-page button is not the same on every page; the non-first-page XPath is the default here, and the first-page XPath is used when an exception occurs. The method then hashes the page source with MD5 and compares the digests: if the current page has not refreshed, the MD5 value does not change, so it waits briefly and checks again before continuing.
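
For clarity, the page-change test boils down to comparing two hex digests; a standalone illustration:

import hashlib

before = hashlib.md5('<html>page 1</html>'.encode(encoding='utf-8')).hexdigest()
after = hashlib.md5('<html>page 2</html>'.encode(encoding='utf-8')).hexdigest()
print(before == after)  # False: any change in the page source changes the digest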

3.4 Modifying the get_search_res_url Method

The get_search_res_url method is modified, adding the following code:

# get search results and page text
def get_search_res_url(self):
    res_link = {}
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    content = self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
    while len(res_link) < self.conf['target_page']:
        for el in search_res_list:
            js = 'window.open("' + el.a['href'] + '")'
            self.browser.execute_script(js)
            handle_this = self.browser.current_window_handle  # the current page's handle
            handle_all = self.browser.window_handles          # all handles
            handle_exchange = None                            # the handle to switch to
            for handle in handle_all:
                if handle != handle_this:                     # if it is not the current handle, it is the new page
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange)    # switch
            real_url = self.browser.current_url
            if real_url in self.conf['white_list']:           # whitelist: skip without recording
                continue
            time.sleep(1)
            res_link[real_url] = self.browser.page_source     # save content and URL
            self.browser.close()
            self.browser.switch_to.window(handle_this)
        content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest()  # md5 of the current page
        self.click_next_page(content_md5)
    return res_link

The while len(res_link) < self.conf['target_page'] loop keeps collecting until the configured number of results has been gathered. After each page's results are collected, the page is turned:

content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest()  # md5 of the current page
self.click_next_page(content_md5)

The code above takes the MD5 of the current page before turning; click_next_page then waits until the digest changes, that is, until the page has actually refreshed.

if real_url in self.conf['white_list']:  # whitelist
    continue

The code above applies the whitelist check: URLs on your own whitelist are not counted in the results.

3.5 Creating a Manage class

Create a new Python file named Manage to wrap everything once more. The code is as follows:

from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self, conf):
        self.drvier = BrowserManage(conf)
        self.textdic = FileHandle().get_text()
        self.analyse = Analyse()

    def get_local_analyse(self):
        resdic = {}
        for k in self.textdic:
            res = {}
            self.drvier.set_kw(k)
            url_content = self.drvier.search()
            for k1 in url_content:
                res[k1] = self.analyse.get_Tfidf(self.textdic[k], url_content[k1])
            resdic[k] = res
        return resdic

The initialization method above takes one parameter; inside it, a BrowserManage object and an Analyse object are created, and the local text content is obtained.

The get_local_analyse method traverses the texts, searches using each file name as the keyword, compares the similarity between the search content and the current text, and finally returns the results.

The results are as follows:



The files in the main directory of the blogger are as follows:



The similarity-analysis part above is the main content. The tool will later be published to GitHub and the csdn code repository, and headless mode will be adopted; this article covers the general implementation.

All the complete code is below.

The Analyse class:

from jieba import lcut
import jieba.analyse
import collections
from FileHandle import FileHandle

class Analyse:
    # compute the similarity of two texts
    def get_Tfidf(self, text1, text2):  # testing with local data; the search-engine comparison method comes later
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1, T2)
        return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

    # word segmentation and counting
    def Count(self, text):
        tag = jieba.analyse.textrank(text, topK=20)  # segment, keeping the top 20 keywords
        word_counts = collections.Counter(tag)       # count
        return word_counts

    # merge words
    def MergeWord(self, T1, T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # calculate the vector
    def CalVector(self, T1, MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # calculate TF-IDF (cosine similarity)
    def cosine_similarity(self, vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0
        for a, b in zip(vector1, vector2):
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)

The Browser class:

from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import hashlib
import time
import xlwt

class Browser:
    def __init__(self, conf):
        self.browser = webdriver.Chrome()
        self.conf = conf
        self.conf['kw'] = ''
        self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # set the search keyword
    def set_kw(self, kw):
        self.conf['kw'] = kw

    # input the search keyword
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click search
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # get search results and page text
    def get_search_res_url(self):
        res_link = {}
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        content = self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
        while len(res_link) < self.conf['target_page']:
            for el in search_res_list:
                js = 'window.open("' + el.a['href'] + '")'
                self.browser.execute_script(js)
                handle_this = self.browser.current_window_handle  # the current page's handle
                handle_all = self.browser.window_handles          # all handles
                handle_exchange = None                            # the handle to switch to
                for handle in handle_all:
                    if handle != handle_this:                     # if it is not the current handle, it is the new page
                        handle_exchange = handle
                self.browser.switch_to.window(handle_exchange)    # switch
                real_url = self.browser.current_url
                if real_url in self.conf['white_list']:           # whitelist: skip without recording
                    continue
                time.sleep(1)
                res_link[real_url] = self.browser.page_source     # save content and URL
                self.browser.close()
                self.browser.switch_to.window(handle_this)
            content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest()  # md5 of the current page
            self.click_next_page(content_md5)
        return res_link

    # turn to the next page
    def click_next_page(self, md5):
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        # the next-page button's xpath is not the same on every page; default to the non-first-page xpath
        try:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
        except:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
        next_page_btn.click()
        # compare the page's MD5 to decide whether the page has actually changed (temporary approach)
        i = 0
        while md5 == hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest():  # md5 comparison
            time.sleep(0.3)  # prevent some errors; a forced stop keeps some stability for now
            i += 1
            if i > 100:
                return False
        return True

class BrowserManage(Browser):
    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])  # open the search engine site
        self.send_keyword()                            # type the search keyword
        self.click_search_btn()                        # click search
        return self.get_search_res_url()               # retrieve the search results

The Manage class:

from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self, conf):
        self.drvier = BrowserManage(conf)
        self.textdic = FileHandle().get_text()
        self.analyse = Analyse()

    def get_local_analyse(self):
        resdic = {}
        for k in self.textdic:
            res = {}
            self.drvier.set_kw(k)
            url_content = self.drvier.search()
            for k1 in url_content:
                res[k1] = self.analyse.get_Tfidf(self.textdic[k], url_content[k1])
            resdic[k] = res
        return resdic

The FileHandle class:

import os

class FileHandle:
    # get file content
    def get_content(self, path):
        f = open(path, "r")  # set the file object
        content = f.read()   # read the whole TXT file into a string
        f.close()            # close the file
        return content

    # get file names and contents
    def get_text(self):
        file_path = os.path.dirname(__file__)  # current file directory
        txt_path = file_path + r'\textsrc'     # TXT directory
        rootdir = os.path.join(txt_path)       # target directory
        local_text = {}
        # read the TXT files
        for (dirpath, dirnames, filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1] == '.txt':
                    flag_file_path = dirpath + '\\' + filename            # file path
                    flag_file_content = self.get_content(flag_file_path)  # read the file content
                    if flag_file_content != '':
                        local_text[filename.replace('.txt', '')] = flag_file_content  # key: file name, value: content
        return local_text

The final usage in this article is as follows:

from Manage import Manage

white_list = ['blog.csdn.net/A757291228', 'www.cnblogs.com/1-bit', 'blog.csdn.net/csdnnews']  # whitelist
conf = {
    'engine': 'baidu',
    'target_page': 5,
    'white_list': white_list,
}
print(Manage(conf).get_local_analyse())