Preface
Article plagiarism is widespread on the Internet, and many bloggers simply accept it. In recent years, as the Internet has grown, plagiarism and other unethical behavior online have intensified: wholesale copy-paste reposts passed off as originals are common, and some copied articles even carry the copier's own contact information so that readers ask them for the source code and other materials. Such behavior is infuriating.
This article uses search engine results as the article library and compares their similarity against local or online data to implement a duplicate check for articles. Since the implementation of duplicate checking is broadly similar to that of Weibo sentiment analysis, the sentiment analysis function can easily be added later (the next chapter will complete the whole process of data collection, cleaning, and sentiment analysis based on this code).
Due to a recent lack of time, only the main functions have been implemented so far and the details have not been optimized, but some brief design work has gone into the code structure, which makes subsequent extension and upgrades more convenient. I will keep updating this tool and strive to make it more mature and practical.
Technology
To adapt to as many sites as possible, Selenium is used for data acquisition; by configuring the parameters of different search engines, fairly general search engine queries can be performed without worrying too much about dynamic data fetching. The jieba library is used to segment Chinese sentences into words, cosine similarity is used to compare text similarity, and the comparison data can be exported to an Excel file as a report.
The Weibo sentiment analysis is based on scikit-learn, using naive Bayes to classify the sentiment of the data; its data-capture process is similar to the text search implemented here.
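The sentiment analysis itself is left to the next chapter, but as a rough sketch of the idea, a naive Bayes text classifier in scikit-learn might look like the following; the tiny training set is invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy corpus, pre-segmented with spaces (e.g. by jieba); invented for illustration
texts = ['很 开心 非常 喜欢', '太 糟糕 了 很 讨厌', '质量 不错 很 满意', '非常 失望 很 难过']
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()        # bag-of-words features
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(['今天 很 开心'])))  # expected: [1]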
Getting the test code
CSDN CodeChina code repository: codechina.csdn.net/A757291228/…
Environment
The author's environment is as follows:
- Operating system: Windows 7 SP1, 64-bit
- Python version: 3.7.7
- Browser: Google Chrome
- Browser version: 80.0.3987 (64-bit)
If you find any mistakes, please point them out; comments and discussion are welcome.
1. Implementing text duplicate checking
1.1 Selenium Installation and configuration
Since this project uses Selenium, you need to make sure it is installed before proceeding. Install it with pip as follows:
pip install selenium
After installing Selenium, you also need to download a driver.
- Google Chrome driver: the driver version must match your browser version; download the version that corresponds to your browser from the ChromeDriver download page.
- If you are using Firefox, check your Firefox version and download the matching driver from the geckodriver releases page on GitHub (each release lists the browser versions it supports; if the English is a hurdle, right-click and translate the page).
After Selenium is installed, create a new Python file called selenium_search and import the package at the top of the file:
from selenium import webdriver
If you have not added the driver to your PATH, you can specify the driver's location explicitly (the author has already added it to PATH):
driver = webdriver.Chrome(executable_path=r'F:\python\dr\chromedriver_win32\chromedriver.exe')
Create a variable url and assign it the Baidu home page address, then pass it to the get method to open the page. The complete code is as follows:
from selenium import webdriver
url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
Run the Python file from the command line (the "little black box" on Windows). After the script runs, Chrome opens and navigates to the Baidu home page. We have now successfully used Selenium to open a specified URL; next we will search for a given keyword and then traverse the results looking for similar data.
1.2 Keyword search on Baidu with Selenium
Before automatically typing keywords into the search box, you need to get the search box element object. Open the Baidu home page in Chrome, right-click the search box, and choose Inspect; the page element (code) view pops up, and hovering over an element node highlights the corresponding part of the page in blue. In HTML code, id values are mostly unique (unless there is a mistake), so the id is chosen here to locate the search box element. Selenium provides the find_element_by_id method, which takes an id and returns the page element object:
input=driver.find_element_by_id('kw')
After obtaining the element object, use the send_keys method to pass in the value to be typed:
input.send_keys('PHP Basics step 11 Object Oriented')
Here I pass in "PHP Basics step 11 Object Oriented" as the search keyword. Run the script and check whether the keyword has been typed into the search box. The code is as follows:
input.send_keys('PHP Basics step 11 Object Oriented')
The browser opens and the search keyword is typed successfully. Now we just need to click the "Baidu Search" button to complete the search. Using the same inspection method as for the search box, find the id of the "Baidu Search" button, then use find_element_by_id to get the element object and call its click method:
search_btn=driver.find_element_by_id('su')
search_btn.click()
The complete code is as follows:
from selenium import webdriver
url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()
The browser now automatically types the search keyword and performs the search.
1.3 Traversal of Search Results
Now that the search results are displayed in the browser, you need to retrieve the page content and extract the results. BeautifulSoup is used here to parse the page.
BeautifulSoup is an HTML/XML parser that makes it very easy to extract information from HTML. Install it before use; note that the pip package for the Python 3 version is named beautifulsoup4:
pip install beautifulsoup4
After installation, import it at the top of the current Python file:
from bs4 import BeautifulSoup
To get the page's HTML, read the driver's page_source property:
html=driver.page_source
Once you have the HTML code, create a new BeautifulSoup object, pass in the HTML content and specify a parser, in this case using the html.parser:
soup = BeautifulSoup(html, "html.parser")
Inspecting the search results shows that each result is wrapped in an h3 tag whose class is t. BeautifulSoup's select method obtains elements and supports searching by class name, tag name, id, attribute, and combinations of these. Since every result in the Baidu search page carries class="t", traversing by class name is the easiest approach:
search_res_list=soup.select('.t')
Pass the class name t to the select method, preceded by a dot (.), which indicates that the element is selected by class name. After that, add a print to try to output the result:
print(search_res_list)
In general it is possible that search_res_list prints as an empty list, because we retrieved the browser's current page content before the browser had parsed the data and rendered it. There is a simple solution to this problem; it is not very efficient, but it will be used for the time being and replaced later with something more efficient (time needs to be imported at the top of the file):
time.sleep(2)
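For reference, the more efficient replacement used later in this article is Selenium's explicit wait, which polls until an element is present instead of sleeping a fixed time. A minimal sketch, assuming the driver object from the script above and waiting for Baidu's pagination bar (id "page"), which only appears once the results have loaded:

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# block for up to 30 seconds, polling every second, until the element with id "page" exists
WebDriverWait(driver, timeout=30, poll_frequency=1).until(
    EC.presence_of_element_located((By.ID, "page")))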
The complete code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
url='https://www.baidu.com'
driver=webdriver.Chrome()
driver.get(url)
input=driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn=driver.find_element_by_id('su')
search_btn.click()
time.sleep(2)  # wait for the browser to parse and render the page
html=driver.page_source # Get web content
soup = BeautifulSoup(html, "html.parser")
search_res_list=soup.select('.t')
print(search_res_list)
Running the program outputs all the tags of class t, including each tag's children. The dot (.) operator can then be used to access a tag's child nodes. Since every search result obtained in the browser is a link that is clicked to jump, we only need the a tag under each element:
for el in search_res_list:
    print(el.a)
The output shows that the a tag of each search result has been obtained; what we need is the href hyperlink inside each a tag, which can be read with subscript notation:
for el in search_res_list:
    print(el.a['href'])
Running the script succeeds, but careful readers will notice that the obtained results are all Baidu addresses. These addresses are actually redirect "indexes" that jump to the real pages; since these indexes do not necessarily stay unchanged, they are unsuitable for long-term storage, and we still need to obtain the real links. We execute a JS call to open each address in a new window, let the browser follow the redirect to the real site, and then read the resulting address. The execute_script method runs JS code as follows:
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
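As an aside, if only the final URL were needed rather than the rendered page content, these redirect links could also be resolved without a browser. A minimal sketch using the requests library; this is an alternative to the approach used in this article, not part of it:

import requests

def resolve_real_url(index_url):
    # follow the redirect chain and return the final address
    resp = requests.get(index_url, allow_redirects=True, timeout=10)
    return resp.url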
After opening a new page, you need to obtain the handle of the new page, otherwise you cannot manipulate the new page. The handle can be obtained as follows:
handle_this = driver.current_window_handle  # get the current handle
handle_all = driver.window_handles  # get all handles
After obtaining the handles, switch the current operating object to the new page. Since only two windows exist after opening one page, a simple traversal finds the new handle:
handle_exchange = None  # the handle to switch to
for handle in handle_all:  # find the new handle
    if handle != handle_this:  # any handle that is not the current one
        handle_exchange = handle
driver.switch_to.window(handle_exchange)  # switch
After the switch, the operating object is the page that was just opened. Get the URL of the new page using the current_url property:
real_url=driver.current_url
print(real_url)
Then close the current page and set the action object to the initial page:
driver.close()
driver.switch_to.window(handle_this)  # return to the original window
Running the script successfully obtains the real URL. Finally, after retrieving the real URL, store each result in a list (initialized as real_url_list = [] before the loop):
real_url_list.append(real_url)
The complete code for this section is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn = driver.find_element_by_id('su')
search_btn.click()

time.sleep(2)  # wait for the browser to parse and render the page
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list = soup.select('.t')

real_url_list = []
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
    handle_this = driver.current_window_handle  # get the current handle
    handle_all = driver.window_handles  # get all handles
    handle_exchange = None  # the handle to switch to
    for handle in handle_all:  # find the new handle
        if handle != handle_this:  # any handle that is not the current one
            handle_exchange = handle
    driver.switch_to.window(handle_exchange)  # switch
    real_url = driver.current_url
    print(real_url)
    real_url_list.append(real_url)  # store the result
    driver.close()
    driver.switch_to.window(handle_this)
1.4 Obtaining the Source text
Create a txt file in the textsrc folder and save the text to be compared into it; here I save the content of the article "PHP Basics step 11 Object Oriented". Then write a function in the code to read the text content:
def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

src = read_txt(r'F:\tool\textsrc\src.txt')
For testing purposes, absolute paths are used here. After obtaining the text content, write the comparison method of cosine similarity.
1.5 Cosine similarity
The similarity calculation refers to the article "Implementing cosine similarity text comparison in Python"; I modified part of that implementation.
This article uses the cosine similarity algorithm for the comparison; the general steps are word segmentation -> vector calculation -> similarity calculation. Create a Python file called Analyse and a class of the same name, add a word segmentation method to the class, and import jieba and collections at the top:
from jieba import lcut
import jieba.analyse
import collections
The Count method:
# word segmentation
def Count(self, text):
    tag = jieba.analyse.textrank(text, topK=20)
    word_counts = collections.Counter(tag)  # count statistics
    return word_counts
The Count method takes a text parameter, segments it with the textrank method, and counts the extracted keywords. Then add a MergeWord method to merge the two keyword sets, which makes the subsequent vector calculation convenient:
# word merge
def MergeWord(self, T1, T2):
    MergeWord = []
    for i in T1:
        MergeWord.append(i)
    for i in T2:
        if i not in MergeWord:
            MergeWord.append(i)
    return MergeWord
The merge method is very simple and I’m not going to explain it. Next add the vector calculation method:
# derive the document vector
def CalVector(self, T1, MergeWord):
    TF1 = [0] * len(MergeWord)
    for ch in T1:
        TermFrequence = T1[ch]
        word = ch
        if word in MergeWord:
            TF1[MergeWord.index(word)] = TermFrequence
    return TF1
Finally, add the similarity calculation method:
def cosine_similarity(self, vector1, vector2):
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):  # pairs elements, e.g. (1,4), (2,5), (3,6)
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return 0
    else:
        # percentage, rounded to 2 decimal places
        return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)
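For reference, the value returned above is the standard cosine similarity of the two term-frequency vectors, scaled to a percentage:

$$\text{similarity} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} \times 100$$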
The similarity method takes two vectors, calculates their similarity, and returns it. To reduce code redundancy, a simple method is added to complete the whole calculation process:
def get_Tfidf(self, text1, text2):  # compare local data against search engine results
    # self.correlate.word.set_this_url(url)
    T1 = self.Count(text1)
    T2 = self.Count(text2)
    mergeword = self.MergeWord(T1, T2)
    return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))
The full code for the Analyse class is as follows:
from jieba import lcut
import jieba.analyse
import collections

class Analyse:
    def get_Tfidf(self, text1, text2):  # compare local data against search engine results
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1, T2)
        return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

    # word segmentation
    def Count(self, text):
        tag = jieba.analyse.textrank(text, topK=20)
        word_counts = collections.Counter(tag)  # count statistics
        return word_counts

    # word merge
    def MergeWord(self, T1, T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # derive the document vector
    def CalVector(self, T1, MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # calculate cosine similarity
    def cosine_similarity(self, vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0
        for a, b in zip(vector1, vector2):  # pairs elements, e.g. (1,4), (2,5), (3,6)
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            # percentage, rounded to 2 decimal places
            return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)
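As a quick sanity check, a minimal sketch of how the class might be exercised on two made-up strings (the sample sentences are invented; very short texts may yield too few keywords, in which case 0 is returned):

from Analyse import Analyse

analyse = Analyse()
text_a = 'PHP面向对象基础教程，介绍类与对象的概念和使用方法。'
text_b = 'PHP基础教程，讲解面向对象中类与对象的概念。'
# prints a similarity percentage between 0 and 100
print(analyse.get_Tfidf(text_a, text_b))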
1.6 Similarity comparison between search results and text
In the selenium_search file, import Analyse and create a new object:
from Analyse import Analyse
Analyse=Analyse()
In the loop that traverses the search results, add code that grabs the content of each newly opened page:
time.sleep(5)
html_2=driver.page_source
time.sleep(5) gives the browser time to render the current page content. After obtaining the content of the newly opened page, perform the similarity comparison:
Analyse.get_Tfidf(src,html_2)
Since the method returns a value, print it:
print('Similarity:',Analyse.get_Tfidf(src,html_2))
The complete code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from Analyse import Analyse

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

# get the comparison file
src = read_txt(r'F:\tool\textsrc\src.txt')
Analyse = Analyse()

url = 'https://www.baidu.com'
driver = webdriver.Chrome()
driver.get(url)
input = driver.find_element_by_id('kw')
input.send_keys('PHP Basics step 11 Object Oriented')
search_btn = driver.find_element_by_id('su')
search_btn.click()

time.sleep(2)  # wait for the browser to parse and render the page
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
search_res_list = soup.select('.t')

real_url_list = []
# print(search_res_list)
for el in search_res_list:
    js = 'window.open("' + el.a['href'] + '")'
    driver.execute_script(js)
    handle_this = driver.current_window_handle  # get the current handle
    handle_all = driver.window_handles  # get all handles
    handle_exchange = None  # the handle to switch to
    for handle in handle_all:  # find the new handle
        if handle != handle_this:  # any handle that is not the current one
            handle_exchange = handle
    driver.switch_to.window(handle_exchange)  # switch
    real_url = driver.current_url
    time.sleep(5)
    html_2 = driver.page_source
    print('Similarity:', Analyse.get_Tfidf(src, html_2))
    print(real_url)
    real_url_list.append(real_url)
    driver.close()
    driver.switch_to.window(handle_this)
Run the script: it turns out there are several highly similar links, so these are the suspected plagiarized articles. This completes the basic duplicate-checking code, but the code is redundant and cluttered, so let's optimize it next.
2. Code optimization
From the steps above, the process can be summarized as: obtain search content -> obtain results -> calculate similarity. Three classes are needed: Browser, Analyse (already created), and SearchEngine. Browser is used for searching and data acquisition; Analyse is used for similarity analysis and vector calculation; SearchEngine holds the basic configuration of each search engine, since the structure of most search engines is fairly consistent.
2.1 Browser class
Create a new Python file called Browser and add the initialization method:
def __init__(self, conf):
    self.browser = webdriver.Chrome()
    self.conf = conf
    self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()
self.browser = webdriver.Chrome() creates a new browser object; conf is the incoming search configuration, written as a configuration dictionary. self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf() obtains the configuration of the chosen search engine; since the input boxes and search buttons of different engines differ, multiple search engines can be queried simply by supplying different configuration entries.
Add a keyword input method:
# write the search content into the search engine's input box
def send_keyword(self):
    input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
    input.send_keys(self.conf['kw'])
self.engine_conf['searchTextID'] supplies the id of the search input box, and self.conf['kw'] supplies the search keyword from the configuration.
Add a method that clicks the search button:
# click the search button
def click_search_btn(self):
    search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
    search_btn.click()
Get the ID of the search button by using self.engine_conf[‘searchBtnID’].
Get search results and text
# get search results and page text
def get_search_res_url(self):
    res_link = {}
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    # parse the content with BeautifulSoup
    content = self.browser.page_source
    soup = BeautifulSoup(content, "html.parser")
    search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
    for el in search_res_list:
        js = 'window.open("' + el.a['href'] + '")'
        self.browser.execute_script(js)
        handle_this = self.browser.current_window_handle  # get the current handle
        handle_all = self.browser.window_handles  # get all handles
        handle_exchange = None  # the handle to switch to
        for handle in handle_all:  # find the new handle
            if handle != handle_this:  # any handle that is not the current one
                handle_exchange = handle
        self.browser.switch_to.window(handle_exchange)  # switch
        real_url = self.browser.current_url
        time.sleep(1)
        res_link[real_url] = self.browser.page_source  # capture the result
        self.browser.close()
        self.browser.switch_to.window(handle_this)
    return res_link
The above method is similar to the earlier traversal of search results, with one addition: WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(EC.presence_of_element_located((By.ID, "page"))) waits up to 30 seconds, polling every second, for the element with id page (Baidu's pagination bar) to be present; if it is not present, the page has not finished loading, and once the 30-second timeout passes the wait is skipped. res_link[real_url] = self.browser.page_source saves each page's content and URL into a dictionary, which is returned for the later comparison.
Open the target search engine to search
# open the target search engine and search
def search(self):
    self.browser.get(self.engine_conf['website'])  # open the search engine site
    self.send_keyword()  # enter the search keyword
    self.click_search_btn()  # click search
    return self.get_search_res_url()  # get web page search data
Finally, the search method performs all the previous operations in a single call without exposing internal details. The complete code is as follows:
from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

class Browser:
    def __init__(self, conf):
        self.browser = webdriver.Chrome()
        self.conf = conf
        self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # write the search content into the search engine's input box
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click the search button
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # get search results and page text
    def get_search_res_url(self):
        res_link = {}
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        # parse the content with BeautifulSoup
        content = self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
        for el in search_res_list:
            js = 'window.open("' + el.a['href'] + '")'
            self.browser.execute_script(js)
            handle_this = self.browser.current_window_handle  # get the current handle
            handle_all = self.browser.window_handles  # get all handles
            handle_exchange = None  # the handle to switch to
            for handle in handle_all:  # find the new handle
                if handle != handle_this:  # any handle that is not the current one
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange)  # switch
            real_url = self.browser.current_url
            time.sleep(1)
            res_link[real_url] = self.browser.page_source  # capture the result
            self.browser.close()
            self.browser.switch_to.window(handle_this)
        return res_link

    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])  # open the search engine site
        self.send_keyword()  # enter the search keyword
        self.click_search_btn()  # click search
        return self.get_search_res_url()  # get web page search data
2.2 SearchEngine class
The SearchEngine class mainly holds the configuration of different search engines, making it easier to add engines or extend similar functionality.
# search engine configuration
class EngineConfManage:
    def get_Engine_conf(self, engine_name):
        if engine_name == 'baidu':
            return BaiduEngineConf()
        elif engine_name == 'qihu360':
            return Qihu360EngineConf()
        elif engine_name == 'sougou':
            return SougouEngineConf()

class EngineConf:
    def __init__(self):
        self.engineConf = {}

    def get_conf(self):
        return self.engineConf

class BaiduEngineConf(EngineConf):
    engineConf = {}

    def __init__(self):
        self.engineConf['searchTextID'] = 'kw'
        self.engineConf['searchBtnID'] = 'su'
        self.engineConf['nextPageBtnID_xpath_f'] = '//*[@id="page"]/div/a[10]'
        self.engineConf['nextPageBtnID_xpath_s'] = '//*[@id="page"]/div/a[11]'
        self.engineConf['searchContentHref_class'] = 't'
        self.engineConf['website'] = 'http://www.baidu.com'

class Qihu360EngineConf(EngineConf):
    def __init__(self):
        pass

class SougouEngineConf(EngineConf):
    def __init__(self):
        pass
Only the Baidu search engine configuration is implemented here. All the engine classes inherit from the EngineConf base class, which gives the subclasses the get_conf method. The EngineConfManage class selects which engine to use, given the engine name.
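To illustrate the extension pattern, here is a hypothetical sketch of how the Sougou configuration might be filled in. Every value below is a placeholder that would have to be verified against the actual Sougou pages; none of them are confirmed ids or xpaths:

class SougouEngineConf(EngineConf):
    def __init__(self):
        # all values below are illustrative placeholders, not verified
        self.engineConf['searchTextID'] = 'query'   # id of the search input box
        self.engineConf['searchBtnID'] = 'stb'      # id of the search button
        self.engineConf['nextPageBtnID_xpath_f'] = '//a[@id="sogou_next"]'
        self.engineConf['nextPageBtnID_xpath_s'] = '//a[@id="sogou_next"]'
        self.engineConf['searchContentHref_class'] = 'vrTitle'  # class of result titles
        self.engineConf['website'] = 'https://www.sogou.com'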
2.3 How to use it
First, import the two classes:
from Browser import Browser
from Analyse import Analyse
Create a new method to read the local file:
def read_txt(path=''):
    f = open(path, 'r')
    return f.read()
Get the file and create a new data analysis class:
src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse = Analyse()
Write the configuration dictionary:
# configuration information
conf = {
    'kw': 'PHP Basics step 11 Object Oriented',
    'engine': 'baidu',
}
Create a Browser object and pass in the configuration:
driver = Browser(conf)
Get search results and content
url_content = driver.search()  # get search results and page contents
Traverse the results and calculate the similarity:
for k in url_content:
    print(k, 'Similarity:', Analyse.get_Tfidf(src, url_content[k]))
The complete code is as follows:
from Browser import Browser
from Analyse import Analyse

def read_txt(path=''):
    f = open(path, 'r')
    return f.read()

src = read_txt(r'F:\tool\textsrc\src.txt')  # get the local text
Analyse = Analyse()

# configuration information
conf = {
    'kw': 'PHP Basics step 11 Object Oriented',
    'engine': 'baidu',
}

driver = Browser(conf)
url_content = driver.search()  # get search results and page contents
for k in url_content:
    print(k, 'Similarity:', Analyse.get_Tfidf(src, url_content[k]))
Doesn't that feel much cleaner? But don't think this is the end of it; let's extend the functionality further.
3. Feature extension
For now this small tool only performs the basic duplicate check, and it still has many shortcomings. For example, there is no whitelist filtering, it can only check the similarity of a single article, and there is no automatic search over a list of articles or export of the results. The following sections gradually improve some of these functions; for reasons of space they are not all covered here, and the tool will be updated continuously.
3.1 Automatic text retrieval
Create a new Python file called FileHandle. This class automatically collects the txt files in a specified directory, where each txt file name is a search keyword and its content is the article text. The class code is as follows:
import os

class FileHandle:
    # get the content of one file
    def get_content(self, path):
        f = open(path, "r")   # create the file object
        content = f.read()    # read the entire txt file into a string
        f.close()             # close the file
        return content

    # get all texts in the textsrc directory
    def get_text(self):
        file_path = os.path.dirname(__file__)  # directory of the current file
        txt_path = file_path + r'\textsrc'     # txt directory
        rootdir = os.path.join(txt_path)       # target directory
        local_text = {}
        # read the txt files
        for (dirpath, dirnames, filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1] == '.txt':
                    flag_file_path = dirpath + '\\' + filename  # file path
                    flag_file_content = self.get_content(flag_file_path)  # read the file
                    if flag_file_content != '':
                        local_text[filename.replace('.txt', '')] = flag_file_content  # key: file name, value: content
        return local_text
There are two methods, get_content and get_text. get_text walks all the txt file paths in the directory and calls get_content to read each file's text; it returns local_text, whose keys are the file names (without the .txt extension) and whose values are the text contents.
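A quick sketch of how FileHandle might be exercised, assuming a textsrc folder next to the script that contains, for example, a file named 'PHP Basics step 11 Object Oriented.txt':

from FileHandle import FileHandle

local_text = FileHandle().get_text()
for keyword in local_text:
    # each key is a file name that will be used as a search keyword
    print(keyword, '->', len(local_text[keyword]), 'characters')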
3.2 BrowserManage class
In the Browser file, add a BrowserManage class that inherits from Browser and contains the following method:
class BrowserManage(Browser):
    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])  # open the search engine site
        self.send_keyword()  # enter the search keyword
        self.click_search_btn()  # click search
        return self.get_search_res_url()  # get web page search data
Adding this class separates the Browser class logic from other methods for easy extension.
3.3 Extension of the Browser class
Add a next-page method to the Browser class so that more content can be retrieved when searching, with the number of results to collect specified in the configuration:
# go to the next page
def click_next_page(self, md5):
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    # the next-page button xpath differs between the first and later pages
    try:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
    except:
        next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
    next_page_btn.click()
    # use an md5 of the page text to check whether the page has actually turned
    i = 0
    while md5 == hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest():  # md5 comparison
        time.sleep(0.3)  # temporary fixed delay for stability
        i += 1
        if i > 100:
            return False
    return True
Baidu's next-page button has a different xpath on the first page than on subsequent pages, so the non-first-page xpath is used by default and the other xpath is used when an exception occurs. After clicking, the method computes the MD5 of the page source and compares it with the previous value; as long as the page has not refreshed, the MD5 does not change, so the method waits in short intervals and re-checks until it does (import hashlib at the top of the file).
3.4 Modifying the get_search_res_url method
The get_search_res_url method is modified to add the following code:
# get search results and page text
def get_search_res_url(self):
    res_link = {}
    WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
        EC.presence_of_element_located((By.ID, "page")))
    while len(res_link) < self.conf['target_page']:
        # re-parse the content of the current results page with BeautifulSoup
        content = self.browser.page_source
        soup = BeautifulSoup(content, "html.parser")
        search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
        for el in search_res_list:
            js = 'window.open("' + el.a['href'] + '")'
            self.browser.execute_script(js)
            handle_this = self.browser.current_window_handle  # get the current handle
            handle_all = self.browser.window_handles  # get all handles
            handle_exchange = None  # the handle to switch to
            for handle in handle_all:  # find the new handle
                if handle != handle_this:  # any handle that is not the current one
                    handle_exchange = handle
            self.browser.switch_to.window(handle_exchange)  # switch
            real_url = self.browser.current_url
            if any(w in real_url for w in self.conf['white_list']):  # whitelist: substring match
                self.browser.close()
                self.browser.switch_to.window(handle_this)
                continue
            time.sleep(1)
            res_link[real_url] = self.browser.page_source  # capture the result
            self.browser.close()
            self.browser.switch_to.window(handle_this)
        content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest()  # md5 of the current page
        self.click_next_page(content_md5)
    return res_link
while len(res_link) < self.conf['target_page'] keeps collecting until the number of gathered results reaches the configured target. After each results page has been processed, the following two lines turn to the next page:
content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest()  # md5 of the current page
self.click_next_page(content_md5)
These lines record the MD5 of the current page and pass it to click_next_page, which waits until the page content, and therefore its MD5, changes before returning.
if any(w in real_url for w in self.conf['white_list']):  # whitelist: substring match
    self.browser.close()
    self.browser.switch_to.window(handle_this)
    continue
The code above applies the whitelist: any result whose URL matches one of your own whitelisted addresses is closed and skipped, so your own sites are not counted.
3.5 Creating a Manage class
Create a new Python file named Manage to wrap everything once more. The code is as follows:
from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self, conf):
        self.driver = BrowserManage(conf)
        self.textdic = FileHandle().get_text()
        self.analyse = Analyse()

    def get_local_analyse(self):
        resdic = {}
        for k in self.textdic:
            res = {}
            self.driver.set_kw(k)
            url_content = self.driver.search()  # get search results and page contents
            for k1 in url_content:
                res[k1] = self.analyse.get_Tfidf(self.textdic[k], url_content[k1])
            resdic[k] = res
        return resdic
The initializer takes one parameter and creates a BrowserManage object and an Analyse object, and loads the local text contents. The get_local_analyse method traverses the texts, searches with each file name as the keyword, compares the similarity between the search results and the current text, and finally returns the results keyed by file name. (set_kw is a small setter added to Browser; see the complete code below.) The similarity analysis above is the main content; the tool will later be published to the GitHub and CSDN code repositories together with headless-mode support, as this article covers only the general implementation.
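Headless mode is not covered in this article, but as a rough sketch, Chrome can be started without a visible window by passing options when creating the driver; this assumes nothing else in the Browser class needs to change:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # run Chrome without a visible window
options.add_argument('--disable-gpu')  # commonly added alongside headless on Windows
browser = webdriver.Chrome(options=options)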
The complete code is listed below.
The Analyse class:
from jieba import lcut
import jieba.analyse
import collections
from FileHandle import FileHandle

class Analyse:
    def get_Tfidf(self, text1, text2):  # compare local data against search engine results
        # self.correlate.word.set_this_url(url)
        T1 = self.Count(text1)
        T2 = self.Count(text2)
        mergeword = self.MergeWord(T1, T2)
        return self.cosine_similarity(self.CalVector(T1, mergeword), self.CalVector(T2, mergeword))

    # word segmentation
    def Count(self, text):
        tag = jieba.analyse.textrank(text, topK=20)
        word_counts = collections.Counter(tag)  # count statistics
        return word_counts

    # word merge
    def MergeWord(self, T1, T2):
        MergeWord = []
        for i in T1:
            MergeWord.append(i)
        for i in T2:
            if i not in MergeWord:
                MergeWord.append(i)
        return MergeWord

    # derive the document vector
    def CalVector(self, T1, MergeWord):
        TF1 = [0] * len(MergeWord)
        for ch in T1:
            TermFrequence = T1[ch]
            word = ch
            if word in MergeWord:
                TF1[MergeWord.index(word)] = TermFrequence
        return TF1

    # calculate cosine similarity
    def cosine_similarity(self, vector1, vector2):
        dot_product = 0.0
        normA = 0.0
        normB = 0.0
        for a, b in zip(vector1, vector2):  # pairs elements, e.g. (1,4), (2,5), (3,6)
            dot_product += a * b
            normA += a ** 2
            normB += b ** 2
        if normA == 0.0 or normB == 0.0:
            return 0
        else:
            # percentage, rounded to 2 decimal places
            return round(dot_product / ((normA ** 0.5) * (normB ** 0.5)) * 100, 2)
The Browser class:
from selenium import webdriver
from bs4 import BeautifulSoup
from SearchEngine import EngineConfManage
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import hashlib
import time
import xlwt

class Browser:
    def __init__(self, conf):
        self.browser = webdriver.Chrome()
        self.conf = conf
        self.conf['kw'] = ''
        self.engine_conf = EngineConfManage().get_Engine_conf(conf['engine']).get_conf()

    # set the search keyword
    def set_kw(self, kw):
        self.conf['kw'] = kw

    # write the search content into the search engine's input box
    def send_keyword(self):
        input = self.browser.find_element_by_id(self.engine_conf['searchTextID'])
        input.send_keys(self.conf['kw'])

    # click the search button
    def click_search_btn(self):
        search_btn = self.browser.find_element_by_id(self.engine_conf['searchBtnID'])
        search_btn.click()

    # get search results and page text
    def get_search_res_url(self):
        res_link = {}
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        while len(res_link) < self.conf['target_page']:
            # re-parse the content of the current results page with BeautifulSoup
            content = self.browser.page_source
            soup = BeautifulSoup(content, "html.parser")
            search_res_list = soup.select('.' + self.engine_conf['searchContentHref_class'])
            for el in search_res_list:
                js = 'window.open("' + el.a['href'] + '")'
                self.browser.execute_script(js)
                handle_this = self.browser.current_window_handle  # get the current handle
                handle_all = self.browser.window_handles  # get all handles
                handle_exchange = None  # the handle to switch to
                for handle in handle_all:  # find the new handle
                    if handle != handle_this:  # any handle that is not the current one
                        handle_exchange = handle
                self.browser.switch_to.window(handle_exchange)  # switch
                real_url = self.browser.current_url
                if any(w in real_url for w in self.conf['white_list']):  # whitelist: substring match
                    self.browser.close()
                    self.browser.switch_to.window(handle_this)
                    continue
                time.sleep(1)
                res_link[real_url] = self.browser.page_source  # capture the result
                self.browser.close()
                self.browser.switch_to.window(handle_this)
            content_md5 = hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest()  # md5 of the current page
            self.click_next_page(content_md5)
        return res_link

    # go to the next page
    def click_next_page(self, md5):
        WebDriverWait(self.browser, timeout=30, poll_frequency=1).until(
            EC.presence_of_element_located((By.ID, "page")))
        # the next-page button xpath differs between the first and later pages
        try:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_s'])
        except:
            next_page_btn = self.browser.find_element_by_xpath(self.engine_conf['nextPageBtnID_xpath_f'])
        next_page_btn.click()
        # use an md5 of the page text to check whether the page has actually turned
        i = 0
        while md5 == hashlib.md5(self.browser.page_source.encode(encoding='UTF-8')).hexdigest():  # md5 comparison
            time.sleep(0.3)  # temporary fixed delay for stability
            i += 1
            if i > 100:
                return False
        return True

class BrowserManage(Browser):
    # open the target search engine and search
    def search(self):
        self.browser.get(self.engine_conf['website'])  # open the search engine site
        self.send_keyword()  # enter the search keyword
        self.click_search_btn()  # click search
        return self.get_search_res_url()  # get web page search data
The Manage class:
from Browser import BrowserManage
from Analyse import Analyse
from FileHandle import FileHandle

class Manage:
    def __init__(self, conf):
        self.driver = BrowserManage(conf)
        self.textdic = FileHandle().get_text()
        self.analyse = Analyse()

    def get_local_analyse(self):
        resdic = {}
        for k in self.textdic:
            res = {}
            self.driver.set_kw(k)
            url_content = self.driver.search()  # get search results and page contents
            for k1 in url_content:
                res[k1] = self.analyse.get_Tfidf(self.textdic[k], url_content[k1])
            resdic[k] = res
        return resdic
FileHandle class:
import os

class FileHandle:
    # get the content of one file
    def get_content(self, path):
        f = open(path, "r")   # create the file object
        content = f.read()    # read the entire txt file into a string
        f.close()             # close the file
        return content

    # get all texts in the textsrc directory
    def get_text(self):
        file_path = os.path.dirname(__file__)  # directory of the current file
        txt_path = file_path + r'\textsrc'     # txt directory
        rootdir = os.path.join(txt_path)       # target directory
        local_text = {}
        # read the txt files
        for (dirpath, dirnames, filenames) in os.walk(rootdir):
            for filename in filenames:
                if os.path.splitext(filename)[1] == '.txt':
                    flag_file_path = dirpath + '\\' + filename  # file path
                    flag_file_content = self.get_content(flag_file_path)  # read the file
                    if flag_file_content != '':
                        local_text[filename.replace('.txt', '')] = flag_file_content  # key: file name, value: content
        return local_text
The final usage is as follows:
from Manage import Manage

white_list = ['blog.csdn.net/A757291228', 'www.cnblogs.com/1-bit', 'blog.csdn.net/csdnnews']  # whitelist
# configuration information
conf = {
    'engine': 'baidu',
    'target_page': 5,
    'white_list': white_list,
}

print(Manage(conf).get_local_analyse())
Please note the original address of this article: blog.csdn.net/A757291228 (CSDN: @1_bit)