Preface

Article plagiarism is rampant on the Internet, and many bloggers have come to accept it. In recent years, as the Internet has developed, plagiarism and other unethical behavior online has only intensified: copy-pasting an article and republishing it under an "original" label is common, and some copied articles even carry the plagiarist's own contact information so that readers ask them for the source code and other materials. Such behavior is infuriating.

This article uses search engine results as the article corpus and compares their similarity against local or online data to implement a duplicate check for articles. Since the duplicate-check workflow is broadly similar to that of Weibo sentiment analysis, a sentiment analysis feature can be added without much effort (the next chapter will complete a full pipeline of data collection, cleaning, and sentiment analysis based on this code).

Owing to a recent lack of time, only the main features have been implemented for now and the details are not yet optimized, but the code structure has been given some brief design so that later feature extensions and upgrades are more convenient. I will keep updating this tool and strive to make it technically more mature and practical.

Technology

To adapt to as many sites as possible, selenium is used for data acquisition; by configuring the parameters of different search engines, fairly general search engine queries can be performed without worrying too much about scraping dynamic data. The jieba library is mainly used for segmenting Chinese sentences into words, cosine similarity is used for the text similarity comparison, and the comparison data can be exported to Excel as a report.

The Weibo sentiment analysis is based on sklearn, using naive Bayes to classify the sentiment of the data; its data-capture step is implemented much like the text search here.
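As a quick preview, here is a minimal sketch of an sklearn naive Bayes sentiment classifier; the toy training data below is invented purely for illustration, while real input would be segmented Weibo posts:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only
train_texts = ['good happy great', 'bad sad terrible', 'great fun', 'awful bad']
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(train_texts)          # bag-of-words vectors
clf = MultinomialNB().fit(X, train_labels)
print(clf.predict(vec.transform(['happy great fun'])))  # expected: [1]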

Test code acquisition

CSDN CodeChina code repository: codechina.csdn.net/A757291228/…

Environment

The author's environment is as follows:

  • Operating system: Windows 7 SP1 (64-bit)
  • Python version: 3.7.7
  • Browser: Google Chrome
  • Browser Version: 80.0.3987 (64-bit)

If you spot a mistake, please point it out; comments and discussion are welcome.

1. Implementing the text duplicate check

1.1 Selenium Installation and configuration

Since Selenium is used, the reader needs to have it installed before proceeding. Install it with the pip command as follows:

pip install selenium

After installing Selenium, you also need to download a driver.

  • Google Chrome driver: the driver version must match the browser version; download the driver release that corresponds to your browser version.

  • If you use Firefox, check your Firefox version and download the matching driver (geckodriver) from its GitHub releases page; each release notes the browser versions it supports.

Once selenium is installed, create a new Python file called selenium_search and start with the import:

from selenium import webdriver

Readers who have not added the driver to the PATH can specify the driver's location explicitly (the author has the driver on the PATH):

driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')

Create a variable url, assign it the Baidu home page link, and pass it to the get method to try opening the Baidu home page. The complete code is as follows:

from selenium import webdriver

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)

Run the Python file from the command line (the Windows terminal). After the script runs, Chrome opens and navigates to the Baidu home page. This confirms that selenium can open a specified URL; next, a specified keyword will be searched and the results traversed to find similar data.

1.2 Selenium Baidu search engine keyword search

Before automatically typing keywords into the search box, you need to obtain the search box's element object. Open the Baidu home page in Chrome, right-click the search box, and choose Inspect; the page element (code) window pops up, and as the mouse moves over element nodes, the corresponding position in the page is highlighted, which makes it easy to locate the search box element. In HTML code, an id value is usually unique (unless it is a typo), so the id is chosen here as the tag for getting the search box element object. selenium provides the find_element_by_id method, which takes an id and returns the page element object.

input=driver.find_element_by_id('kw')

After obtaining the element object, use the send_keys method to pass in the values to be typed:

input.send_keys('PHP Basics step 11 Object Oriented')

Here I pass in "PHP Basics Tutorial Step 11 Object Oriented" as the search keyword. Run the script to check whether the keyword is typed into the search box. The code is as follows:

input.send_keys('PHP Basics step 11 Object Oriented')

The browser opens and the search keyword is typed successfully. Now just click the "Baidu Search" button to complete the final search. Find the id value of the "Baidu Search" button using the same element-inspection method as for the search box, then use find_element_by_id to get the element object and call its click method:

search_btn=driver.find_element_by_id('su')
search_btn.click()

The complete code is as follows:

from selenium import webdriver

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

The browser now automatically types the search keyword and performs the search.

1.3 Traversal of Search Results

Now that the search results are in the browser, the whole page needs to be retrieved so the results can be extracted from it. BeautifulSoup is used here to parse the page and get the search results.

BeautifulSoup is an HTML/XML parser that makes it very easy to extract information from HTML. It must be installed before use; note that the package name for the bs4 module is beautifulsoup4:

pip install beautifulsoup4

After installation, add the import at the top of the current Python file:

from bs4 import BeautifulSoup

To get the HTML text, read the driver's page_source property:

html=driver.page_source

Once you have the HTML code, create a new BeautifulSoup object, pass in the HTML content and specify a parser, in this case using the html.parser:

soup = BeautifulSoup(html, "html.parser")
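To see what the parser gives us, here is a toy example (the HTML is invented for illustration) using the same operations applied below: selecting by class name and reading an a tag's href:

from bs4 import BeautifulSoup

html = '<div class="t"><a href="https://example.com">title</a></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.select('.t')[0].a['href'])  # https://example.com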

Then inspect the search results: every result is wrapped in an h3 heading tag whose class is t. BeautifulSoup's select method fetches tags and supports lookup by class name, tag name, id, attribute, and combinations of these. Since every Baidu result carries class="t", selecting by that class name is the simplest approach:

search_res_list=soup.select('.t')

In the select method, the class name t is preceded by a dot (.), which indicates selection by class name. Once this is done, add a print to try outputting the result:

print(search_res_list)

More often than not, search_res_list prints as an empty list. That is because we grabbed the browser's current page content before the browser finished parsing and rendering the data. There is a simple fix for this problem, though not an efficient one; it will be used for now and replaced with something more efficient later (time must be imported at the top):

time.sleep(2)
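For reference, the more efficient replacement used later in this article is Selenium's explicit wait, which blocks only until a given element appears instead of sleeping for a fixed time; a minimal sketch (driver is the webdriver created above):

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 30 seconds, polling once per second, for Baidu's pagination bar (id="page")
WebDriverWait(driver, timeout=30, poll_frequency=1).until(
    EC.presence_of_element_located((By.ID, "page")))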

The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

time.sleep(2)  # wait here for the browser to parse and render the page

html=driver.page_source # Get web content
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')
print(search_res_list)

Running the program outputs all the tags of class t, each including its child elements. (A child tag can be accessed with the dot operator, e.g. el.a.) Since every search result obtained through the browser is a link that jumps when clicked, only the a tag under each element is needed:

for el in search_res_list:
    print(el.a)

The output shows that each search result's a tag has been obtained; next, extract the href hyperlink inside each a tag. The href is read by subscripting the element like a list:

for el in search_res_list:
    print(el.a['href'])

Running the script succeeds, but careful readers will notice that the obtained results are all baidu.com addresses. These addresses are in fact "indexes" that redirect to the real pages. Since these indexes are not guaranteed to stay unchanged, they are not suitable for long-term storage, so the real links are still needed. To get them, a JS snippet is executed to open each address; the browser then redirects to the real site, and the current URL can be read. The execute_script method runs the JS code as follows:

for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)

After opening a new page, you need to obtain the handle of the new page, otherwise you cannot manipulate the new page. The handle can be obtained as follows:

handle_this = driver.current_window_handle  # get the current handle
handle_all = driver.window_handles          # get all handles

After obtaining the handles, switch the current object to the new page. Since only two windows are open after one page is opened, a simple traversal finds the other handle:

handle_exchange = None                     # the handle to switch to
for handle in handle_all:                  # find the new handle
    if handle != handle_this:              # not equal to the current handle means the new page
        handle_exchange = handle
driver.switch_to.window(handle_exchange)   # switch

After the switch, the operating object is the newly opened page. Get the new page's URL from the current_url property:

real_url=driver.current_url
print(real_url)

Then close the current page and switch the operating object back to the initial page:

driver.close()
driver.switch_to.window(handle_this)  # return to the original window

Run the script and the real URL is obtained successfully. Finally, store each real URL in a list:

real_url_list.append(real_url)

The complete code for this section is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

time.sleep(2)  # wait here for the browser to parse and render the page

html=driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')

real_url_list=[]
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
    handle_this = driver.current_window_handle  # get the current handle
    handle_all = driver.window_handles          # get all handles
    handle_exchange = None                      # the handle to switch to
    for handle in handle_all:                   # find the new handle
        if handle != handle_this:               # not equal to the current handle means the new page
            handle_exchange = handle
    driver.switch_to.window(handle_exchange)    # switch
    real_url=driver.current_url
    print(real_url)
    real_url_list.append(real_url)# store results
    driver.close()
    driver.switch_to.window(handle_this)

1.4 Obtaining the Source text

Create a TXT file under the textsrc folder and save the text to be compared in it. Here I save the content of the article "PHP Basics Tutorial Step 11 Object Oriented". Then write a function in the code to read the text content:

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()
src = read_txt(r'F:\tool\textsrc\src.txt')

For testing purposes, an absolute path is used here. After obtaining the text content, write the cosine similarity comparison.
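As an aside, a slightly more robust variant of read_txt (my suggestion, not the article's code) uses a context manager and an explicit encoding, which avoids leaking the file handle and mis-decoding Chinese text:

def read_txt(path=''):
    # 'with' closes the file automatically; utf-8 suits Chinese text
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()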

1.5 Cosine similarity

The similarity calculation follows the article "Python implementation cosine similarity text comparison", with some modifications of my own.

The cosine similarity algorithm is used for the comparison; the rough steps are word segmentation -> vector calculation -> similarity calculation. Create a Python file called Analyse and in it a class called Analyse; add a word segmentation method to the class, and import the jieba library and collections (for counting) at the top:

from jieba import lcut
import jieba.analyse
import collections

The Count method:

# word segmentation
def Count(self, text):
    tag = jieba.analyse.textrank(text, topK=20)
    word_counts = collections.Counter(tag)  # word-count statistics
    return word_counts

The Count method takes a text argument, segments it with the textrank method, and counts the words with Counter. Then add a MergeWord method to merge the word lists, which makes the vector calculation afterwards convenient:

# word merge
def MergeWord(self, T1, T2):
    MergeWord = []
    for i in T1:
        MergeWord.append(i)
    for i in T2:
        if i not in MergeWord:
            MergeWord.append(i)
    return MergeWord
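The merge method is simple enough; a quick illustrative call (word lists invented for the example) makes its behavior concrete:

from collections import Counter

T1 = Counter({'php': 3, 'object': 2})
T2 = Counter({'object': 1, 'class': 4})
# Iterating a Counter yields its keys, so the merge is a union of the word lists
print(Analyse().MergeWord(T1, T2))  # ['php', 'object', 'class']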

Next, add the vector calculation method:

# derive the document vector
def CalVector(self, T1, MergeWord):
    TF1 = [0] * len(MergeWord)
    for ch in T1:
        TermFrequence = T1[ch]
        word = ch
        if word in MergeWord:
            TF1[MergeWord.index(word)] = TermFrequence
    return TF1

Finally, add the similarity calculation method:

def cosine_similarity(self, vector1, vector2):
    dot_product = 0.0
    normA = 0.0
    normB = 0.0

    for a, b in zip(vector1, vector2):  # pairs like [(1, 4), (2, 5), (3, 6)]
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return 0
    else:
        # similarity as a percentage, rounded to 2 decimal places
        return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)
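A quick sanity check of the math, assuming the class above is defined:

a = Analyse()
print(a.cosine_similarity([1, 2, 3], [4, 5, 6]))  # 97.46
# dot = 1*4 + 2*5 + 3*6 = 32; norms are sqrt(14) and sqrt(77)
# 32 / (sqrt(14) * sqrt(77)) ≈ 0.9746; as a percentage: 97.46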

The similarity method takes two vectors, computes the similarity, and returns it. To reduce code redundancy, a simple wrapper method is added to run the whole calculation:

def get_Tfidf(self, text1, text2):  # compare local data with a search engine result
    # self.correlate.word.set_this_url(url)
    T1 = self.Count(text1)
    T2 = self.Count(text2)
    mergeword = self.MergeWord(T1, T2)
    return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

The full code for the Analyse class is as follows:

from jieba import lcut
import jieba.analyse
import collections

class Analyse:
    def get_Tfidf(self, text1, text2):  # compare local data with a search engine result
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1, T2)
        return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

    # word segmentation
    def Count(self, text):
        tag = jieba.analyse.textrank(text, topK=20)
        word_counts = collections.Counter(tag)  # word-count statistics
        return word_counts

    # word merge
    def MergeWord(self, T1, T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # derive the document vector
    def CalVector(self, T1, MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # cosine similarity calculation
    def cosine_similarity(self, vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0

        for a, b in zip(vector1, vector2):  # pairs like [(1, 4), (2, 5), (3, 6)]
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)

1.6 Similarity comparison between search results and text

In the selenium_search file, import Analyse and create a new object:

from Analyse import Analyse
Analyse=Analyse()

Inside the loop that traverses the search results, grab the content of each newly opened page:

time.sleep(5)
html_2=driver.page_source

time.sleep(5) gives the browser time to render the new page's content. After obtaining the content of the newly opened page, run the similarity comparison:

Analyse.get_Tfidf(src,html_2)

Since the method returns a value, print it:

print('Similarity:',Analyse.get_Tfidf(src,html_2))

The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from Analyse import Analyse

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

# get the comparison file
src=read_txt(r'F:\tool\textsrc\src.txt')
Analyse=Analyse()

url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()

time.sleep(2)  # wait here for the browser to parse and render the page

html=driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')

real_url_list=[]
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
    handle_this = driver.current_window_handle  # get the current handle
    handle_all = driver.window_handles          # get all handles
    handle_exchange = None                      # the handle to switch to
    for handle in handle_all:                   # find the new handle
        if handle != handle_this:               # not equal to the current handle means the new page
            handle_exchange = handle
    driver.switch_to.window(handle_exchange)    # switch
    real_url=driver.current_url
    
    time.sleep(5)
    html_2=driver.page_source
    print('Similarity:',Analyse.get_Tfidf(src,html_2))
    
    print(real_url)
    real_url_list.append(real_url)
    driver.close()
    driver.switch_to.window(handle_this)

Run the script: several links turn out to have high similarity, so these are suspected plagiarized articles. This completes the code for a basic duplicate check; but the code is rather redundant and cluttered, so let's optimize it next.

2. Code optimization

From the programming above, the brief steps are: get the search content -> get the results -> calculate the similarity. Three classes are needed: Browser, Analyse (already created), and SearchEngine. Browser handles searching, data acquisition, and so on; Analyse handles similarity analysis, vector calculation, and so on; SearchEngine holds the basic configuration of different search engines, since the interaction flow of most engines is largely the same.

2.1 Browser class

Create a new Python file called Browser and add the initialization method:

def __init__(self, conf):
    self.browser = webdriver.Chrome()
    self.conf = conf
    self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

self.browser = webdriver.Chrome() creates a new browser object; conf is the search configuration passed in, written as a dictionary. self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf() obtains the configuration of the chosen search engine; since the input fields and search buttons differ between engines, multiple engines can be supported through different configuration entries.
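For reference, the configuration dictionary this class expects at this stage holds the keyword and the engine name (more keys are added in the extension section):

conf = {
    'kw': 'PHP Basics step 11 Object Oriented',  # search keyword
    'engine': 'baidu',                           # which engine configuration to load
}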

Add a keyword input method

# type the search content into the search engine
def send_keyword(self):
    input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
    input.send_keys(self.conf['kw'])

self.engine_conf['searchTextID'] supplies the id of the search input box, and self.conf['kw'] supplies the keyword to type into it.

Click the search button

# click the search button
def click_search_btn(self):
    search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
    search_btn.click()

Get the ID of the search button by using self.engine_conf[‘searchBtnID’].

Get search results and text

# get search results and page text
def get_search_res_url(self):
    res_link = {}
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page")))
    # parse the content with BeautifulSoup
    content = self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
    for el in search_res_list:
        js = 'window.open("' + el.a['href'] + '")'
        self.browser.execute_script(js)
        handle_this = self.browser.current_window_handle  # get the current handle
        handle_all = self.browser.window_handles          # get all handles
        handle_exchange = None                            # the handle to switch to
        for handle in handle_all:                         # find the new handle
            if handle != handle_this:                     # not the current handle means the new page
                handle_exchange = handle
        self.browser.switch_to.window(handle_exchange)    # switch
        real_url = self.browser.current_url

        time.sleep(1)
        res_link[real_url] = self.browser.page_source     # capture the result

        self.browser.close()
        self.browser.switch_to.window(handle_this)
    return res_link

The method above is similar to the earlier traversal of the search results, with one addition: WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page"))) waits for the element with id "page" (the pagination bar) to be present. If it is not present yet, the page has not finished loading; the wait lasts at most 30 seconds (timeout=30), checking once per second, and once the timeout elapses a TimeoutException is raised. res_link[real_url] = self.browser.page_source stores each page's content and URL in a dictionary, which is returned for the later comparison.

Open the target search engine to search

# open the target search engine and search
def search(self):
    self.browser.get(self.engine_conf['website'])       # open the search engine site
    self.send_keyword()                                 # type the search keyword
    self.click_search_btn()                             # click search
    return self.get_search_res_url()                    # get the search result data

Finally, add a search method that performs all the previous operations in one call, so callers are not exposed to too much detail. The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

class Browser:
    def __init__(self, conf):
        self.browser = webdriver.Chrome()
        self.conf = conf
        self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # type the search content into the search engine
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click the search button
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # get search results and page text
    def get_search_res_url(self):
        res_link = {}
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page")))
        # parse the content with BeautifulSoup
        content = self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
        for el in search_res_list:
            js = 'window.open("' + el.a['href'] + '")'
            self.browser.execute_script(js)
            handle_this = self.browser.current_window_handle  # get the current handle
            handle_all = self.browser.window_handles          # get all handles
            handle_exchange = None                            # the handle to switch to
            for handle in handle_all:                         # find the new handle
                if handle != handle_this:                     # not the current handle means the new page
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange)    # switch
            real_url = self.browser.current_url

            time.sleep(1)
            res_link[real_url] = self.browser.page_source     # capture the result

            self.browser.close()
            self.browser.switch_to.window(handle_this)
        return res_link

    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])       # open the search engine site
        self.send_keyword()                                 # type the search keyword
        self.click_search_btn()                             # click search
        return self.get_search_res_url()                    # get the search result data

2.2 SearchEngine class

The SearchEngine class is mainly used to hold configurations for different search engines, which makes adding an engine or extending similar functionality easier.

# search engine configuration
class EngineConfManage:
    def get_Engine_conf(self, engine_name):
        if engine_name == 'baidu':
            return BaiduEngineConf()
        elif engine_name == 'qihu360':
            return Qihu360EngineConf()
        elif engine_name == 'sougou':
            return SougouEngineConf()

class EngineConf:
    def __init__(self):
        self.engineConf = {}
    def get_conf(self):
        return self.engineConf

class BaiduEngineConf(EngineConf):
    engineConf = {}
    def __init__(self):
        self.engineConf['searchTextID'] = 'kw'
        self.engineConf['searchBtnID'] = 'su'
        self.engineConf['nextPageBtnID_xpath_f'] = '//*[@id="page"]/div/a[10]'
        self.engineConf['nextPageBtnID_xpath_s'] = '//*[@id="page"]/div/a[11]'
        self.engineConf['searchContentHref_class'] = 't'
        self.engineConf['website'] = 'http://www.baidu.com'


class Qihu360EngineConf(EngineConf):
    def __init__(self):
        pass


class SougouEngineConf(EngineConf):
    def __init__(self):
        pass

Only the Baidu search engine configuration is implemented here. The different engine configurations all inherit from the EngineConf base class, which gives the subclasses the get_conf method. The EngineConfManage class selects the right engine configuration by the engine name passed in.
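As an example of extending this, filling in Qihu360EngineConf might look like the sketch below; the element IDs and class name here are placeholders I made up, to be replaced with values looked up in that engine's actual page source:

class Qihu360EngineConf(EngineConf):
    def __init__(self):
        # Placeholder values for illustration; inspect the real page to find them
        self.engineConf = {
            'searchTextID': 'input',                  # hypothetical search box id
            'searchBtnID': 'search-button',           # hypothetical search button id
            'searchContentHref_class': 'res-title',   # hypothetical result class
            'website': 'https://www.so.com',
        }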

2.3 How to Use it

First import the two classes:

from Browser import Browser
from Analyse import Analyse

Create a new method to read the local file:

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

Get the file and create a new data analysis class:

src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse = Analyse()

Write the configuration dictionary:

# config information
conf = {
    'kw': 'PHP Basics step 11 Object Oriented',
    'engine': 'baidu',
}

Create a Browser object and pass in the configuration information:

drvier=Browser(conf)

Get search results and content

url_content = drvier.search()  # get search results and content

Traverse the results and calculate the similarity:

for k in url_content:
    print(k,'Similarity:',Analyse.get_Tfidf(src,url_content[k]))

The complete code is as follows:

from Browser import Browser
from Analyse import Analyse

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse=Analyse()

# config information
conf = {
    'kw': 'PHP Basics step 11 Object Oriented',
    'engine': 'baidu',
}
    
drvier=Browser(conf)
url_content=drvier.search()# Get search results and content
for k in url_content:
    print(k,'Similarity:',Analyse.get_Tfidf(src,url_content[k]))

Doesn't that feel much better? Positively refreshing. But you think this is the end? Not at all; let's extend the functionality.

3. Function expansion

For now this little tool only performs the basic duplicate check, and it has plenty of gaps. For example, there is no whitelist filtering, only a single article can be checked at a time, and there is no automatic search over a list of articles or export of the results. The following sections gradually improve some of these features; for reasons of space not everything is covered here, and updates will continue.

3.1 Automatic text retrieval

Create a new Python file called FileHandle. This class automatically collects the TXT files in a specified directory, where each TXT file's name is a search keyword and its content is the corresponding article text. The class code is as follows:

import os

class FileHandle:
    # get the content of one file
    def get_content(self, path):
        f = open(path, "r")    # open the file object
        content = f.read()     # read the whole TXT file into a string
        f.close()              # close the file
        return content

    # get the content of all files
    def get_text(self):
        file_path = os.path.dirname(__file__)    # directory of the current file
        txt_path = file_path + r'\textsrc'       # TXT directory
        rootdir = os.path.join(txt_path)         # target directory
        local_text = {}
        # read the TXT files
        for (dirpath, dirnames, filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1] == '.txt':
                    flag_file_path = dirpath + '\\' + filename               # file path
                    flag_file_content = self.get_content(flag_file_path)     # read the file
                    if flag_file_content != '':
                        local_text[filename.replace('.txt', '')] = flag_file_content  # key: file name, value: content
        return local_text

There are two methods, get_content and get_text. get_text walks the directory for the paths of all TXT files and calls get_content to read each one's text; it returns local_text, whose keys are the file names (without the .txt extension) and whose values are the text contents.
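So, assuming textsrc contains a file named "PHP Basics step 11 Object Oriented.txt", a call would look like this:

from FileHandle import FileHandle

local_text = FileHandle().get_text()
# e.g. {'PHP Basics step 11 Object Oriented': '<article text>'}
for keyword, content in local_text.items():
    print(keyword, len(content))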

3.2 BrowserManage class

In the Browser class file, add a BrowserManage class that inherits from Browser and contains the following method:

class BrowserManage(Browser):
    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])       # open the search engine site
        self.send_keyword()                                 # type the search keyword
        self.click_search_btn()                             # click search
        return self.get_search_res_url()                    # get the search result data

Adding this class separates the Browser class logic from other methods for easy extension.

3.3 Extension of the Browser class

Add a next-page method to the Browser class so that more content can be collected during the search, with the number of results to collect specified:

# go to the next page
def click_next_page(self, md5):
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page")))
    # the next-page button's xpath differs on the first page; try the non-first-page xpath first
    try:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
    except:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
    next_page_btn.click()
    # use the MD5 of the page text to check whether the page has actually turned
    i = 0
    while md5 == hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest():  # md5 comparison
        time.sleep(0.3)  # crude pause to avoid errors; kept temporarily for stability
        i += 1
        if i > 100:
            return False
    return True

The next-page button of the Baidu search engine has a different xpath on the first results page than on later pages; the non-first-page xpath is tried by default, and the first-page xpath is used when an exception occurs. After clicking, the method loops, comparing the page source's MD5 hash against the value from before the click: as long as they match, the page has not refreshed yet, so it waits briefly and checks again, giving up after 100 tries.
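The detection idea in isolation, as a tiny sketch:

import hashlib

before = hashlib.md5('<html>results page 1</html>'.encode('utf-8')).hexdigest()
after = hashlib.md5('<html>results page 2</html>'.encode('utf-8')).hexdigest()
print(before == after)  # False: the source changed, so the page has turned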

3.4 Modifying the get_search_res_url method

The get_search_res_url method is modified with the following additions:

# get search results and page text
def get_search_res_url(self):
    res_link = {}
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page")))
    # parse the content with BeautifulSoup
    content = self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
    while len(res_link) < self.conf['target_page']:
        for el in search_res_list:
            js = 'window.open("' + el.a['href'] + '")'
            self.browser.execute_script(js)
            handle_this = self.browser.current_window_handle  # get the current handle
            handle_all = self.browser.window_handles          # get all handles
            handle_exchange = None                            # the handle to switch to
            for handle in handle_all:                         # find the new handle
                if handle != handle_this:                     # not the current handle means the new page
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange)    # switch
            real_url = self.browser.current_url
            if real_url in self.conf['white_list']:           # whitelist
                continue
            time.sleep(1)
            res_link[real_url] = self.browser.page_source     # capture the result

            self.browser.close()
            self.browser.switch_to.window(handle_this)
        content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest()  # md5 of the current page
        self.click_next_page(content_md5)
    return res_link

The while len(res_link) < self.conf['target_page'] loop keeps opening results (and turning pages) until the number of collected results reaches the configured target_page value. After each pass over the current page's results, the next page is requested:

content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest()  # md5 of the current page
self.click_next_page(content_md5)

This code takes the MD5 hash of the current page source and passes it to click_next_page, which clicks the next-page button and uses the hash to detect when the page has actually refreshed.

if real_url in self.conf['white_list']:  # whitelist
    continue

This code checks each real URL against the whitelist you configure yourself; whitelisted sites are skipped and not counted in the results.

3.5 Creating a Manage class

Create a new Python file named Manage to wrap everything once more. The code is as follows:

from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self, conf):
        self.drvier = BrowserManage(conf)
        self.textdic = FileHandle().get_text()
        self.analyse = Analyse()

    def get_local_analyse(self):
        resdic = {}

        for k in self.textdic:
            res = {}
            self.drvier.set_kw(k)
            url_content = self.drvier.search()  # get search results and content
            for k1 in url_content:
                res[k1] = self.analyse.get_Tfidf(self.textdic[k], url_content[k1])
            resdic[k] = res
        return resdic

The initializer takes one parameter, conf; it creates a new BrowserManage object and an Analyse object and loads the text contents. The get_local_analyse method traverses the texts, searches with each file name as the keyword, compares the similarity between the search results and the corresponding text, and finally returns the results as a dictionary. The text files live in the textsrc directory under the project root. The similarity analysis above is the main content; the tool will later be uploaded to the GitHub and CSDN code repositories and switched to headless mode. This covers the general implementation.

All the complete code follows.

The Analyse class:

from jieba import lcut
import jieba.analyse
import collections
from FileHandle import FileHandle

class Analyse:
    def get_Tfidf(self, text1, text2):  # compare local data with a search engine result
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1, T2)
        return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

    # word segmentation
    def Count(self, text):
        tag = jieba.analyse.textrank(text, topK=20)
        word_counts = collections.Counter(tag)  # word-count statistics
        return word_counts

    # word merge
    def MergeWord(self, T1, T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # derive the document vector
    def CalVector(self, T1, MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # cosine similarity calculation
    def cosine_similarity(self, vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0

        for a, b in zip(vector1, vector2):  # pairs like [(1, 4), (2, 5), (3, 6)]
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)

The Browser class:

from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import hashlib
import time
import xlwt

class Browser:
    def __init__(self, conf):
        self.browser = webdriver.Chrome()
        self.conf = conf
        self.conf['kw'] = ''
        self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # set the search keyword
    def set_kw(self, kw):
        self.conf['kw'] = kw

    # type the search content into the search engine
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click the search button
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # get search results and page text
    def get_search_res_url(self):
        res_link = {}
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page")))
        # parse the content with BeautifulSoup
        content = self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
        while len(res_link) < self.conf['target_page']:
            for el in search_res_list:
                js = 'window.open("' + el.a['href'] + '")'
                self.browser.execute_script(js)
                handle_this = self.browser.current_window_handle  # get the current handle
                handle_all = self.browser.window_handles          # get all handles
                handle_exchange = None                            # the handle to switch to
                for handle in handle_all:                         # find the new handle
                    if handle != handle_this:                     # not the current handle means the new page
                        handle_exchange = handle
                self.browser.switch_to.window(handle_exchange)    # switch
                real_url = self.browser.current_url
                if real_url in self.conf['white_list']:           # whitelist
                    continue
                time.sleep(1)
                res_link[real_url] = self.browser.page_source     # capture the result

                self.browser.close()
                self.browser.switch_to.window(handle_this)
            content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest()  # md5 of the current page
            self.click_next_page(content_md5)
        return res_link

    # go to the next page
    def click_next_page(self, md5):
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page")))
        # the next-page button's xpath differs on the first page; try the non-first-page xpath first
        try:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
        except:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
        next_page_btn.click()
        # use the MD5 of the page text to check whether the page has actually turned
        i = 0
        while md5 == hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest():  # md5 comparison
            time.sleep(0.3)  # crude pause to avoid errors; kept temporarily for stability
            i += 1
            if i > 100:
                return False
        return True

class BrowserManage(Browser):
    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])       # open the search engine site
        self.send_keyword()                                 # type the search keyword
        self.click_search_btn()                             # click search
        return self.get_search_res_url()                    # get the search result data

The Manage class:

from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self, conf):
        self.drvier = BrowserManage(conf)
        self.textdic = FileHandle().get_text()
        self.analyse = Analyse()

    def get_local_analyse(self):
        resdic = {}

        for k in self.textdic:
            res = {}
            self.drvier.set_kw(k)
            url_content = self.drvier.search()  # get search results and content
            for k1 in url_content:
                res[k1] = self.analyse.get_Tfidf(self.textdic[k], url_content[k1])
            resdic[k] = res
        return resdic

The FileHandle class:

import os

class FileHandle:
    # get the content of one file
    def get_content(self, path):
        f = open(path, "r")    # open the file object
        content = f.read()     # read the whole TXT file into a string
        f.close()              # close the file
        return content

    # get the content of all files
    def get_text(self):
        file_path = os.path.dirname(__file__)    # directory of the current file
        txt_path = file_path + r'\textsrc'       # TXT directory
        rootdir = os.path.join(txt_path)         # target directory
        local_text = {}
        # read the TXT files
        for (dirpath, dirnames, filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1] == '.txt':
                    flag_file_path = dirpath + '\\' + filename               # file path
                    flag_file_content = self.get_content(flag_file_path)     # read the file
                    if flag_file_content != '':
                        local_text[filename.replace('.txt', '')] = flag_file_content  # key: file name, value: content
        return local_text

Finally, this article's tool is used as follows:

from Manage import Manage

white_list = ['blog.csdn.net/A757291228', 'www.cnblogs.com/1-bit', 'blog.csdn.net/csdnnews']  # whitelist
# config information
conf = {
    'engine': 'baidu',
    'target_page': 5,
    'white_list': white_list,
}

print(Manage(conf).get_local_analyse())

Original address of this article: blog.csdn.net/A757291228 (CSDN: @1_bit)