Preface
Article plagiarism is widespread on the Internet, and many bloggers simply put up with it. In recent years, plagiarism and other unethical behavior on the Internet has intensified: copying an article, pasting it, and republishing it labeled as original is common, and some copied articles even include contact information through which readers can obtain the source code and other materials. Such behavior is infuriating.
This article uses search engine results as the article library and compares their similarity with local or Internet data to implement duplicate checking. Since the duplicate-checking process is broadly similar to that of Weibo sentiment analysis, a sentiment analysis function can easily be added later (the next chapter will complete the whole process of data collection, cleaning, and sentiment analysis based on this code).
Due to a recent lack of time, only the main functions have been implemented for now and the details have not been optimized, but some brief design went into the code structure, which makes subsequent feature expansion and upgrades more convenient. I will keep updating this tool and strive to make it more mature and practical.
Technology
To adapt to as many sites as possible, Selenium is used for data acquisition; by configuring the parameters of different search engines, fairly general search engine queries can be achieved without worrying too much about dynamic data crawling. The jieba library is used to segment Chinese sentences, cosine similarity is used to compare text similarity, and the comparison data is exported to an Excel report.
The Weibo sentiment analysis is based on sklearn, using naive Bayes to classify the sentiment of the data; its data capture process is similar to the text search function implemented here.
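Since the sentiment analysis side is only previewed here, the following is a minimal sketch of what an sklearn naive Bayes pipeline looks like; the toy training texts and labels are my own placeholders, and the real data collection and cleaning are covered in the next chapter.
# Minimal sketch only: pre-segmented (space-joined) texts and toy labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ['这部 电影 很 好看', '糟糕 的 体验 非常 失望']  # jieba-segmented sentences
train_labels = [1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)  # bag-of-words counts
clf = MultinomialNB().fit(X, train_labels)
print(clf.predict(vectorizer.transform(['体验 非常 好看'])))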
Test code acquisition
CSDN codechina code repository: codechina.csdn.net/A757291228/…
The environment
The author's environment is as follows:
- Operating system: Windows 7 SP1 64-bit
- Python version: 3.7.7
- Browser: Google Chrome
- Browser Version: 80.0.3987 (64-bit)
If you find any mistakes, please point them out; comments and exchanges are welcome.
One, implementing text duplicate checking
1.1 Selenium Installation and configuration
Since Selenium is used, the reader needs to ensure it is installed before proceeding. Install it with the pip command as follows:
pip install selenium
After installing Selenium, you also need to download a driver.
- Google Chrome driver: the driver version must correspond to the browser version; download the driver build that matches your browser
- If you use Firefox, check your Firefox version and download the driver from the GitHub Firefox driver (geckodriver) releases page (if English is a hurdle, right-click to translate; each release notes the browser versions it supports)
After Selenium is installed, create a new Python file called selenium_search and first import the library in the code:
from selenium import webdriver
For readers who have not added the driver to the system PATH, the driver location can be specified explicitly (the author has already configured the driver into the environment):
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
Create a variable url and assign it the Baidu home page link, then pass it to the get method to try to open the Baidu home page. The complete code is as follows:
from selenium import webdriver
url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
Run the Python file from the command line (the "little black window" on Windows):
After running the script, Chrome opens and navigates to the Baidu home page:
Selenium has now successfully opened the specified URL. The next steps are to search for the specified keyword and traverse the results looking for similar data.
1.2 Selenium Baidu search engine keyword search
Before automatically typing keywords into the search box, we need the search box element object. Open the Baidu home page in Chrome, right-click the search box and choose Inspect; the page element (code) view pops up. Find the search box element (as the mouse moves over element nodes, the part of the page corresponding to the node under the cursor is highlighted in blue):
In HTML code, id values are mostly unique (unless someone made a typo), so the id is chosen here as the tag for obtaining the search box element object. Selenium provides the find_element_by_id method, which takes an id and returns the page element object:
input=driver.find_element_by_id('kw')
After obtaining the element object, use the send_keys method to pass in the value to be typed:
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
Here I pass in "PHP Basics Tutorial Step 11 Object Oriented" as the search keyword. Run the script to check whether the keyword is typed into the search box. The code so far is as follows:
from selenium import webdriver

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
Successfully opened the browser and typed the search keyword:
Now just click the "Baidu Search" button to complete the final search. Using the same element inspection method as for the search box, find the id of the "Baidu Search" button:
Use find_element_by_id to get the button element object, then call its click method to click the button:
search_btn=driver.find_element_by_id('su')
search_btn.click()
The complete code is as follows:
from selenium import webdriver

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()
The browser automatically types the search keyword and completes the search:
1.3 Traversal of Search Results
Now that the search results are in the browser, we need to retrieve the whole web page to extract them. BeautifulSoup is used here to parse the page and pull out the search results.
BeautifulSoup is an HTML/XML parser that makes it very easy to extract information from HTML. Install it before use; note that the bs4 import requires the beautifulsoup4 package:
pip install beautifulsoup4
After installation, introduce in the current Python file header:
from bs4 import BeautifulSoup
To get the HTML text, read the driver's page_source property:
html=driver.page_source
Once you have the HTML code, create a new BeautifulSoup object, pass in the HTML content and specify a parser, in this case using the html.parser:
soup = BeautifulSoup(html, "html.parser")
Then inspect the search results: each result is wrapped in an h tag whose class is t:
BeautifulSoup's select method is used to obtain tags; it supports searching by class name, tag name, id, attribute, and combinations of these. Since every Baidu search result carries class="t", traversing by class name is the easiest approach:
search_res_list=soup.select('.t')
The class name t is passed into the select method, preceded by a dot (.) to indicate that the element is selected by class name. After that, add a print to try to output the result:
print(search_res_list)
In general, search_res_list may be output as an empty list, because we retrieved the page content before the browser finished parsing the data and rendering it. There is a simple solution to this problem; it is not very efficient, but it will do for now and will later be replaced with something more efficient (time needs to be imported at the top of the file):
time.sleep(2)
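For reference, the more efficient replacement used later in this article is Selenium's explicit wait, which blocks until a given element exists instead of sleeping for a fixed time. A minimal sketch, reusing the driver object above and assuming (as the later code does) that Baidu's pager element with id "page" marks a fully rendered result page:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 30 seconds, polling once per second, until the element
# with id="page" (Baidu's pager) is present in the DOM.
WebDriverWait(driver, timeout=30, poll_frequency=1).until(
    EC.presence_of_element_located((By.ID, "page"))
)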
The complete code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

time.sleep(2) # wait here so the browser has time to parse and render
html=driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')
print(search_res_list)
Running the program will output:
The result is all the tags of class t, including each tag's child nodes, and child elements can be accessed with the dot (.) operator. The search results obtained through the browser are all links that are clicked to jump, so we only need the a tag under each element:
for el in search_res_list:
print(el.a)
The results clearly show that the a tag of each search result has been obtained; next we need to extract the href hyperlink inside each a tag. Attributes such as href are read with subscript syntax on the element:
for el in search_res_list:
print(el.a['href'])
Running the script successfully results in:
Careful readers may notice that all the obtained results are Baidu URLs. These URLs are in fact "indexes" that redirect to the real web pages. Since these indexes are not guaranteed to stay unchanged and are not suitable for long-term storage, we still need to obtain the real links.
We execute a JS script to open these URLs; the browser then redirects to the real page, and we read the current URL. Call the execute_script method to run the js code as follows:
for el in search_res_list:
js = 'window.open("'+el.a['href']+'")'
driver.execute_script(js)
After opening a new page, you need to obtain the handle of the new page, otherwise you cannot manipulate the new page. The handle can be obtained as follows:
handle_all=driver.window_handles # obtain all window handles
After obtaining the handles, switch the current object to the new page. Since there are only 2 pages after opening one, a simple traversal is enough to find the other handle:
handle_this=driver.current_window_handle # handle of the current page
handle_exchange=None # handle of the newly opened page
for handle in handle_all: # traverse all handles
    if handle != handle_this: # the one that is not the current handle is the new page
        handle_exchange = handle
driver.switch_to.window(handle_exchange) # switch
After the switch, the operation object is the page just opened. Get the URL of the new page from the current_url property:
real_url=driver.current_url
print(real_url)
Then close the current page and set the action object to the initial page:
driver.close()
driver.switch_to.window(handle_this) # switch back to the initial page
Run the script successfully to obtain the real URL:
Finally, after retrieving the real URL, use a list (initialized beforehand as real_url_list=[], as in the complete code below) to store the results:
real_url_list.append(real_url)
The complete code for this section is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

time.sleep(2) # wait here so the browser has time to parse and render
html=driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')

real_url_list=[]
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("'+el.a['href']+'")'
    driver.execute_script(js)
    handle_this=driver.current_window_handle # handle of the current page
    handle_all=driver.window_handles # all handles
    handle_exchange=None # handle of the page to switch to
    for handle in handle_all: # traverse all handles
        if handle != handle_this: # the one that is not current is the new page
            handle_exchange = handle
    driver.switch_to.window(handle_exchange) # switch
    real_url=driver.current_url
    print(real_url)
    real_url_list.append(real_url) # store the result
    driver.close()
    driver.switch_to.window(handle_this)
1.4 Obtaining the Source text
Create a txt file in the textsrc folder and save the text to be compared into it. Here I save the content of the article "PHP Basics Tutorial Step 11 Object Oriented".
Write a function in your code to get text content:
def read_txt(path=''):
f = open(path,'r')
return f.read()
src=read_txt(r'F:\tool\textsrc\src.txt')
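One caveat: open(path,'r') uses the platform's default encoding, so on Windows a Chinese text file saved as UTF-8 may raise a decoding error. A safer variant, assuming the file is UTF-8:
def read_txt(path=''):
    # explicit encoding avoids UnicodeDecodeError with the gbk default on Windows
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()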
For testing purposes, an absolute path is used here. After obtaining the text content, write the cosine similarity comparison method.
1.5 Cosine similarity
The similarity calculation refers to the article "Python implementation of cosine similarity text comparison"; I modified part of the implementation.
This article uses the cosine similarity algorithm for the comparison; the general steps are word segmentation -> vector calculation -> similarity calculation. Create a Python file called Analyse and, inside it, a class also called Analyse. Add a word segmentation method to the class, and import the jieba library and collections (for counting) at the top:
from jieba import lcut
import jieba.analyse
import collections
The Count method:
# word segmentation and counting
def Count(self,text):
    tag = jieba.analyse.textrank(text,topK=20) # top-20 keywords via TextRank
    word_counts = collections.Counter(tag) # count
    return word_counts
The Count method takes a text parameter, segments it with jieba's textrank method, and counts the resulting keywords. Then add a MergeWord method to merge the two word sets conveniently before vector calculation:
# merge words
def MergeWord(self,T1,T2):
    MergeWord = []
    for i in T1:
        MergeWord.append(i)
    for i in T2:
        if i not in MergeWord:
            MergeWord.append(i)
    return MergeWord
The merge method is very simple, so I will not explain it. Next, add the vector calculation method:
# get the word-frequency vector
def CalVector(self,T1,MergeWord):
    TF1 = [0] * len(MergeWord)
    for ch in T1:
        TermFrequence = T1[ch]
        word = ch
        if word in MergeWord:
            TF1[MergeWord.index(word)] = TermFrequence
    return TF1
Finally, add the similarity calculation method:
def cosine_similarity(self,vector1, vector2):
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return 0
    else:
        return round(dot_product / ((normA**0.5)*(normB**0.5)) * 100, 2)
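For reference, what this method computes is the standard cosine similarity of the two word-frequency vectors, scaled to a percentage:

$$ \text{similarity}(A,B) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} \times 100 $$

A value near 100 means the two texts share almost the same keyword distribution; 0 means no overlap at all.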
The similarity method takes two vectors, computes the similarity, and returns it. To reduce code redundancy, a simple wrapper is added here to complete the whole calculation process:
# compute the similarity between two texts
def get_Tfidf(self,text1,text2): # compare local data against search engine results
    # self.correlate.word.set_this_url(url)
    T1 = self.Count(text1)
    T2 = self.Count(text2)
    mergeword = self.MergeWord(T1,T2)
    return self.cosine_similarity(self.CalVector(T1,mergeword),self.CalVector(T2,mergeword))
The full code for the Analyse class is as follows:
from jieba import lcut
import jieba.analyse
import collections

class Analyse:
    # compute the similarity between two texts
    def get_Tfidf(self,text1,text2): # compare local data against search engine results
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1,T2)
        return self.cosine_similarity(self.CalVector(T1,mergeword),self.CalVector(T2,mergeword))

    # word segmentation and counting
    def Count(self,text):
        tag = jieba.analyse.textrank(text,topK=20) # top-20 keywords via TextRank
        word_counts = collections.Counter(tag) # count
        return word_counts

    # merge words
    def MergeWord(self,T1,T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # get the word-frequency vector
    def CalVector(self,T1,MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # calculate the cosine similarity
    def cosine_similarity(self,vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0
        for a, b in zip(vector1, vector2):
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            return round(dot_product / ((normA**0.5)*(normB**0.5)) * 100, 2)
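A quick sanity check (a hypothetical snippet; with textrank's topK=20, very short inputs may yield too few keywords and a score of 0):
analyse = Analyse()
text_a = read_txt(r'F:\tool\textsrc\src.txt') # the file from section 1.4
text_b = text_a # a text compared with itself should score 100.0
print(analyse.get_Tfidf(text_a, text_b))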
1.6 Similarity comparison between search results and text
Import Analyse in the selenium_search file and create a new object:
from Analyse import Analyse
Analyse=Analyse()
Inside the loop that traverses the search results, add code to fetch the content of each newly opened page:
time.sleep(5)
html_2=driver.page_source
time.sleep(5) is used to give the browser time to render the current web content. After obtaining the content of the newly opened page, run the similarity comparison:
Analyse.get_Tfidf(src,html_2)
Since it returns a value, print:
print('Similarity: ',Analyse.get_Tfidf(src,html_2))
The complete code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from Analyse import Analyse

def read_txt(path=''):
    f = open(path,'r')
    return f.read()

src=read_txt(r'F:\tool\textsrc\src.txt')
Analyse=Analyse()

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics Tutorial Step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

time.sleep(2) # wait here so the browser has time to parse and render
html=driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')

real_url_list=[]
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("'+el.a['href']+'")'
    driver.execute_script(js)
    handle_this=driver.current_window_handle # handle of the current page
    handle_all=driver.window_handles # all handles
    handle_exchange=None # handle of the page to switch to
    for handle in handle_all:
        if handle != handle_this:
            handle_exchange = handle
    driver.switch_to.window(handle_exchange) # switch
    real_url=driver.current_url

    time.sleep(5) # let the new page render
    html_2=driver.page_source
    print('Similarity: ',Analyse.get_Tfidf(src,html_2))

    print(real_url)
    real_url_list.append(real_url)
    driver.close()
    driver.switch_to.window(handle_this)
Run the script:
The results show several links with high similarity: these are the suspected plagiarized articles.
This completes the code for a basic duplicate check. However, the code is redundant and cluttered, so let's optimize it next.
Second, code optimization
From the programming above, the brief steps are: obtain search content -> obtain results -> calculate similarity. Three classes are needed: Browser, Analyse (already created), and SearchEngine. Browser is used for searching, data acquisition, and so on; Analyse is used for similarity analysis, vector calculation, and so on; SearchEngine serves as the basic configuration for different search engines, since the page structure of most search engines is fairly consistent.
2.1 Browser class
Create a new Python file called Browser and add the initialization method:
def __init__(self,conf):
self.browser=webdriver.Chrome()
self.conf=conf
self.engine_conf=EngineConfManage().get_Engine_conf(conf['engine']).get_conf()
self.browser=webdriver.Chrome() creates a new browser object; conf is the incoming search configuration, a dictionary through which the search content is specified. self.engine_conf=EngineConfManage().get_Engine_conf(conf['engine']).get_conf() obtains the search engine's configuration: the input boxes and search buttons of different search engines differ, so multiple engines can be searched based on different configuration information.
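For concreteness, the configuration dictionary used later in this article looks like this (the kw key is read by send_keyword below):
conf = {
    'engine': 'baidu', # which SearchEngine configuration to load
    'kw': 'PHP Basics Tutorial Step 11 Object Oriented', # the search keyword
}
drvier = Browser(conf)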
Add a search method
# type the search keyword
def send_keyword(self):
    input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
    input.send_keys(self.conf['kw'])
self.engine_conf['searchTextID'] is the id of the engine's search box, and self.conf['kw'] is the search keyword passed in by the caller.
Click the search button
# click search
def click_search_btn(self):
    search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
    search_btn.click()
Get the ID of the search button by using self.engine_conf[‘searchBtnID’].
Get search results and text
# obtain the search results and page text
def get_search_res_url(self):
    res_link={}
    WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    content=self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list=soup.select('.'+self.engine_conf['searchContentHref_class'])
    for el in search_res_list:
        js = 'window.open("'+el.a['href']+'")'
        self.browser.execute_script(js)
        handle_this=self.browser.current_window_handle # handle of the current page
        handle_all=self.browser.window_handles # all handles
        handle_exchange=None # handle of the page to switch to
        for handle in handle_all: # traverse all handles
            if handle != handle_this: # the one that is not current is the new page
                handle_exchange = handle
        self.browser.switch_to.window(handle_exchange) # switch
        real_url=self.browser.current_url
        time.sleep(1)
        res_link[real_url]=self.browser.page_source # save the content and URL
        self.browser.close()
        self.browser.switch_to.window(handle_this)
    return res_link
This method is similar to the earlier traversal of search results, with one addition: WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page"))) waits for the element with id "page" (the pager tag at the bottom of the results) to be present; if it is absent, the current page has not fully loaded, and the wait times out after timeout=30 seconds, after which it is skipped. res_link[real_url]=self.browser.page_source saves the contents and URLs into a dictionary, which is returned for later comparison.
Open the target search engine to search
# search
def search(self):
    self.browser.get(self.engine_conf['website']) # open the search engine site
    self.send_keyword() # type the search keyword
    self.click_search_btn() # click search
    return self.get_search_res_url() # obtain the search result data
Finally, add a search method that performs all the previous operations directly, without exposing too many details. The complete code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

class Browser:
    def __init__(self,conf):
        self.browser=webdriver.Chrome()
        self.conf=conf
        self.engine_conf=EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # type the search keyword
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click search
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # obtain the search results and page text
    def get_search_res_url(self):
        res_link={}
        WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        content=self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list=soup.select('.'+self.engine_conf['searchContentHref_class'])
        for el in search_res_list:
            js = 'window.open("'+el.a['href']+'")'
            self.browser.execute_script(js)
            handle_this=self.browser.current_window_handle # handle of the current page
            handle_all=self.browser.window_handles # all handles
            handle_exchange=None # handle of the page to switch to
            for handle in handle_all:
                if handle != handle_this:
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange) # switch
            real_url=self.browser.current_url
            time.sleep(1)
            res_link[real_url]=self.browser.page_source # save the content and URL
            self.browser.close()
            self.browser.switch_to.window(handle_this)
        return res_link

    # search
    def search(self):
        self.browser.get(self.engine_conf['website']) # open the search engine site
        self.send_keyword() # type the search keyword
        self.click_search_btn() # click search
        return self.get_search_res_url() # obtain the search result data
2.2 SearchEngine class
The SearchEngine class mainly holds the configurations of different search engines, which makes it easier to implement additional search engines or extend similar business logic.
class EngineConfManage:
    def get_Engine_conf(self,engine_name):
        if engine_name=='baidu':
            return BaiduEngineConf()
        elif engine_name=='qihu360':
            return Qihu360EngineConf()
        elif engine_name=='sougou':
            return SougouEngineConf()

class EngineConf:
    def __init__(self):
        self.engineConf={}
    def get_conf(self):
        return self.engineConf

class BaiduEngineConf(EngineConf):
    engineConf={}
    def __init__(self):
        self.engineConf['searchTextID']='kw'
        self.engineConf['searchBtnID']='su'
        self.engineConf['nextPageBtnID_xpath_f']='//*[@id="page"]/div/a[10]'
        self.engineConf['nextPageBtnID_xpath_s']='//*[@id="page"]/div/a[11]'
        self.engineConf['searchContentHref_class']='t'
        self.engineConf['website']='http://www.baidu.com'

class Qihu360EngineConf(EngineConf):
    def __init__(self):
        pass

class SougouEngineConf(EngineConf):
    def __init__(self):
        pass
Only the Baidu search engine configuration is implemented here. All the search engine configurations inherit from the EngineConf base class, which gives the subclasses the get_conf method. The EngineConfManage class dispatches to the different search engines by the engine name passed in.
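A brief illustration of the lookup chain (a hypothetical snippet):
# Look up the Baidu configuration and read one field.
engine_conf = EngineConfManage().get_Engine_conf('baidu').get_conf()
print(engine_conf['searchTextID']) # 'kw', the id of Baidu's search input box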
2.3 How to Use it
First import the two classes:
from Browser import Browser
from Analyse import Analyse
Create a new method to read the local file:
def read_txt(path=''):
f = open(path,'r')
return f.read()
Get the file and create a new data analysis class:
src=read_txt(r'F:\tool\textsrc\src.txt') # get the file content
Analyse=Analyse()
Write the configuration information dictionary:
# configuration information
conf={
    'engine':'baidu',
    'kw':'PHP Basics Tutorial Step 11 Object Oriented',
}
Create a new Browser class and pass in the configuration information:
drvier=Browser(conf)
Get search results and content
url_content=drvier.search() # get the search results and page content
Traversal results and calculation similarity:
for k in url_content:
    print(k,'Similarity: ',Analyse.get_Tfidf(src,url_content[k]))
The complete code is as follows:
from Browser import Browser
from Analyse import Analyse

def read_txt(path=''):
    f = open(path,'r')
    return f.read()

src=read_txt(r'F:\tool\textsrc\src.txt') # get the file content
Analyse=Analyse()

# configuration information
conf={
    'engine':'baidu',
    'kw':'PHP Basics Tutorial Step 11 Object Oriented',
}

drvier=Browser(conf)
url_content=drvier.search() # get the search results and page content
for k in url_content:
    print(k,'Similarity: ',Analyse.get_Tfidf(src,url_content[k]))
Doesn't that feel much better? Quite refreshing. You think this is the end? Not at all: next, let's extend the functionality.
Third, function expansion
For now this little tool only performs the basic duplicate check, and even that has many problems: there is no whitelist filtering, only the similarity of a single article can be checked, and, if you are lazy, there is no automatic search from an article list and no results export. Some of these functions are gradually improved below; due to limited space, not all of them are covered here, and the tool will be updated continuously.
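As a preview of the planned results export, here is a hedged sketch using xlwt (which the final Browser code below already imports); the function name and report layout are my own illustration, not the finished feature:
import xlwt

def export_result(resdic, path='report.xls'):
    # resdic: {local article name: {url: similarity score}}, as returned by Manage below
    book = xlwt.Workbook(encoding='utf-8')
    sheet = book.add_sheet('similarity')
    row = 0
    for kw, res in resdic.items():       # one block per local article
        for url, score in res.items():   # one row per compared URL
            sheet.write(row, 0, kw)    # local article / search keyword
            sheet.write(row, 1, url)   # compared page
            sheet.write(row, 2, score) # similarity percentage
            row += 1
    book.save(path)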
3.1 Automatic text retrieval
Create a new Python file called FileHandle. This class automatically obtains the txt files in a specified directory; each txt file's name is a keyword and its content is the corresponding article text. The class code is as follows:
import os

class FileHandle:
    # get file content
    def get_content(self,path):
        f = open(path,"r") # set the file object
        content = f.read() # read the entire txt file into a string
        f.close() # close the file
        return content

    # get file names and contents
    def get_text(self):
        file_path=os.path.dirname(__file__) # current file directory
        txt_path=file_path+r'\textsrc' # txt directory
        rootdir=os.path.join(txt_path) # target directory
        local_text={}
        # read the txt files
        for (dirpath,dirnames,filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1]=='.txt':
                    flag_file_path=dirpath+'\\'+filename # file path
                    flag_file_content=self.get_content(flag_file_path) # read the file content
                    if flag_file_content!='':
                        local_text[filename.replace('.txt','')]=flag_file_content # key: file name, value: content
        return local_text
There are two methods here: get_content and get_text. get_text walks all the txt files under the directory and calls get_content to read each file's detailed text; it returns local_text, whose keys are the file names (without .txt) and whose values are the text contents.
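For example, assuming textsrc contains two hypothetical files a.txt and b.txt:
texts = FileHandle().get_text()
print(list(texts.keys())) # ['a', 'b']; these names are later used as search keywords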
3.2 BrowserManage class
Add a BrowserManage class that inherits from Browser to the Browser class file, with the following method:
# search
def search(self):
    self.browser.get(self.engine_conf['website']) # open the search engine site
    self.send_keyword() # type the search keyword
    self.click_search_btn() # click search
    return self.get_search_res_url() # obtain the search result data
Adding this class separates the Browser class logic from other methods for easy extension.
3.3 Extension of the Browser class
Add a next-page method to the Browser class so that more content can be retrieved during a search, with the number of results specified:
# click to turn the page
def click_next_page(self,md5):
    WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    # the next-page button xpath is inconsistent; the non-first-page xpath is the default
    try:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
    except:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
    next_page_btn.click()
    # compare the md5 of the page text to determine whether the page has turned
    # (temporary solution; force stop after a while to keep some stability)
    i=0
    while md5==hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest(): # md5 comparison
        time.sleep(0.3) # prevent some errors; wait briefly between checks
        i+=1
        if i>100:
            return False
    return True
Baidu's next-page button has a different xpath on the first page than on subsequent pages, so the non-first-page xpath is the default, and if that lookup raises an exception the first-page xpath is used instead. The method then takes the MD5 of the page source and compares it in a loop: if the current page has not refreshed, the MD5 value does not change, so it waits in short intervals until the page changes, giving up after too many attempts.
3.4 Modifying the get_search_res_url method
The get_search_res_url method has been modified with the following additions:
# obtain the search results and page text
def get_search_res_url(self):
    res_link={}
    WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    content=self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list=soup.select('.'+self.engine_conf['searchContentHref_class'])
    while len(res_link)<self.conf['target_page']:
        for el in search_res_list:
            js = 'window.open("'+el.a['href']+'")'
            self.browser.execute_script(js)
            handle_this=self.browser.current_window_handle # handle of the current page
            handle_all=self.browser.window_handles # all handles
            handle_exchange=None # handle of the page to switch to
            for handle in handle_all:
                if handle != handle_this:
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange) # switch
            real_url=self.browser.current_url
            if real_url in self.conf['white_list']: # whitelist filtering
                self.browser.close() # close the skipped window first, otherwise handles accumulate
                self.browser.switch_to.window(handle_this)
                continue
            time.sleep(1)
            res_link[real_url]=self.browser.page_source # save the content and URL
            self.browser.close()
            self.browser.switch_to.window(handle_this)
        content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest() # md5 of the current result page
        if self.click_next_page(content_md5):
            # re-parse the refreshed result page so the next pass iterates fresh links
            soup = BeautifulSoup(self.browser.page_source, "html.parser")
            search_res_list=soup.select('.'+self.engine_conf['searchContentHref_class'])
    return res_link
The loop condition while len(res_link)<self.conf['target_page'] keeps turning pages and collecting results until the number of stored links reaches the configured target. Two of the additions deserve a closer look:
content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest() # md5 of the current page
self.click_next_page(content_md5)
This code takes the MD5 value of the current result page and passes it to click_next_page, which compares it after clicking so that it only proceeds once the page has actually refreshed.
if real_url in self.conf['white_list']: # whitelist
    continue
This code checks the whitelist: URLs on your own whitelist are excluded from the statistics.
3.5 Creating a Manage class
Create a new Python file named Manage as a further wrapper. The code is as follows:
from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self,conf):
        self.drvier=BrowserManage(conf)
        self.textdic=FileHandle().get_text()
        self.analyse=Analyse()

    def get_local_analyse(self):
        resdic={}
        for k in self.textdic:
            res={}
            self.drvier.set_kw(k) # the file name is the search keyword
            url_content=self.drvier.search()
            for k1 in url_content:
                res[k1]=self.analyse.get_Tfidf(self.textdic[k],url_content[k1])
            resdic[k]=res
        return resdic
The initializer method takes one parameter; it creates a BrowserManage object and an Analyse object and obtains the text contents.
get_local_analyse traverses the texts, searches with each file name as the keyword, compares the similarity of the search content with the current text, and finally returns the results.
The results are as follows:
The files in the main directory of the blogger are as follows:
The similarity analysis above is the main content. The tool will later be published to the GitHub and csdn code repositories, where headless mode is used; what is given here is the general implementation.
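For reference, headless mode in Selenium only requires passing Chrome options when creating the browser; a minimal sketch of standard Selenium usage, not yet wired into the classes above:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless') # run Chrome without a visible window
browser = webdriver.Chrome(options=chrome_options)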
All the complete code is below.
The Analyse class:
from jieba import lcut
import jieba.analyse
import collections
from FileHandle import FileHandle

class Analyse:
    # compute the similarity between two texts
    def get_Tfidf(self,text1,text2): # compare local data against search engine results
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1,T2)
        return self.cosine_similarity(self.CalVector(T1,mergeword),self.CalVector(T2,mergeword))

    # word segmentation and counting
    def Count(self,text):
        tag = jieba.analyse.textrank(text,topK=20) # top-20 keywords via TextRank
        word_counts = collections.Counter(tag) # count
        return word_counts

    # merge words
    def MergeWord(self,T1,T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # get the word-frequency vector
    def CalVector(self,T1,MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # calculate the cosine similarity
    def cosine_similarity(self,vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0
        for a, b in zip(vector1, vector2):
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            return round(dot_product / ((normA**0.5)*(normB**0.5)) * 100, 2)
Browser:
from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import hashlib
import time
import xlwt

class Browser:
    def __init__(self,conf):
        self.browser=webdriver.Chrome()
        self.conf=conf
        self.conf['kw']=''
        self.engine_conf=EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # set the search keyword
    def set_kw(self,kw):
        self.conf['kw']=kw

    # type the search keyword
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click search
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # obtain the search results and page text
    def get_search_res_url(self):
        res_link={}
        WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        content=self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list=soup.select('.'+self.engine_conf['searchContentHref_class'])
        while len(res_link)<self.conf['target_page']:
            for el in search_res_list:
                js = 'window.open("'+el.a['href']+'")'
                self.browser.execute_script(js)
                handle_this=self.browser.current_window_handle # handle of the current page
                handle_all=self.browser.window_handles # all handles
                handle_exchange=None # handle of the page to switch to
                for handle in handle_all:
                    if handle != handle_this:
                        handle_exchange = handle
                self.browser.switch_to.window(handle_exchange) # switch
                real_url=self.browser.current_url
                if real_url in self.conf['white_list']: # whitelist filtering
                    self.browser.close() # close the skipped window first, otherwise handles accumulate
                    self.browser.switch_to.window(handle_this)
                    continue
                time.sleep(1)
                res_link[real_url]=self.browser.page_source # save the content and URL
                self.browser.close()
                self.browser.switch_to.window(handle_this)
            content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest() # md5 of the current result page
            if self.click_next_page(content_md5):
                # re-parse the refreshed result page so the next pass iterates fresh links
                soup = BeautifulSoup(self.browser.page_source, "html.parser")
                search_res_list=soup.select('.'+self.engine_conf['searchContentHref_class'])
        return res_link

    # click to turn the page
    def click_next_page(self,md5):
        WebDriverWait(self.browser,timeout=30,poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        # the next-page button xpath is inconsistent; the non-first-page xpath is the default
        try:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
        except:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
        next_page_btn.click()
        # compare the md5 of the page text to determine whether the page has turned
        # (temporary solution; force stop after a while to keep some stability)
        i=0
        while md5==hashlib.md5(self.browser.page_source.encode(encoding='utf-8')).hexdigest(): # md5 comparison
            time.sleep(0.3)
            i+=1
            if i>100:
                return False
        return True

class BrowserManage(Browser):
    # search
    def search(self):
        self.browser.get(self.engine_conf['website']) # open the search engine site
        self.send_keyword() # type the search keyword
        self.click_search_btn() # click search
        return self.get_search_res_url() # obtain the search result data
The Manage class:
from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self,conf):
        self.drvier=BrowserManage(conf)
        self.textdic=FileHandle().get_text()
        self.analyse=Analyse()

    def get_local_analyse(self):
        resdic={}
        for k in self.textdic:
            res={}
            self.drvier.set_kw(k) # the file name is the search keyword
            url_content=self.drvier.search()
            for k1 in url_content:
                res[k1]=self.analyse.get_Tfidf(self.textdic[k],url_content[k1])
            resdic[k]=res
        return resdic
FileHandle class:
import os

class FileHandle:
    # get file content
    def get_content(self,path):
        f = open(path,"r") # set the file object
        content = f.read() # read the entire txt file into a string
        f.close() # close the file
        return content

    # get file names and contents
    def get_text(self):
        file_path=os.path.dirname(__file__) # current file directory
        txt_path=file_path+r'\textsrc' # txt directory
        rootdir=os.path.join(txt_path) # target directory
        local_text={}
        # read the txt files
        for (dirpath,dirnames,filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1]=='.txt':
                    flag_file_path=dirpath+'\\'+filename # file path
                    flag_file_content=self.get_content(flag_file_path) # read the file content
                    if flag_file_content!='':
                        local_text[filename.replace('.txt','')]=flag_file_content # key: file name, value: content
        return local_text
The final usage in this article is as follows:
from Manage import Manage

white_list=['blog.csdn.net/A757291228','www.cnblogs.com/1-bit','blog.csdn.net/csdnnews'] # whitelist
# configuration information
conf={
    'engine':'baidu',
    'target_page':5,
    'white_list':white_list,
}

print(Manage(conf).get_local_analyse())