“This is the 16th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”
Main content: use Selenium and BS4 to automatically crawl Chinese breaking news.
Code structure: EventExtractor, ChinaEmergencyNewsExtractor, save_results, contentExtractor
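The snippets below rely on the following imports, which are assumed throughout (PyPI packages: selenium, beautifulsoup4, pandas, requests, retry, plus openpyxl for pandas' .xlsx I/O; the retry decorator here is the one from the retry package):
# Imports assumed by all of the code below
import os
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from retry import retry
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait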
EventExtractor
This class controls the crawler; it fetches and processes the web page information.
1. Some default arguments are configured in __init__(self):
- options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2}): disables image loading in the browser to speed up page loads
- options.add_argument('--headless'): hides the browser window
- webdriver.Chrome(): the Chrome driver is used here
- url: the URL of the home page to crawl
- html: stores the crawled HTML data
- save_dir: the file path where the crawled data is saved
# Crawler class
class EventExtractor():
    def __init__(self):
        options = webdriver.ChromeOptions()
        # Disable image loading to speed up page loads
        options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
        # Hide the browser window
        options.add_argument('--headless')
        self.browser = webdriver.Chrome("chromedriver.exe", options=options)
        self.url = 'http://www.12379.cn/html/gzaq/fmytplz/index.shtml'
        self.html = None
        self.save_dir = './chinaEmergency/chinaEEP.xlsx'

    # Open the site
    def login(self):
        self.browser.get(self.url)
        WebDriverWait(self.browser, 1000).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'main')))

    # Get the current page source
    def get_html(self):
        self.html = self.browser.page_source

    # Skip to the next page
    def next_page(self, verbose=False):
        try:
            submit_page = self.browser.find_element(by=By.XPATH, value=r"//*[@class='next']")
            submit_page.click()
            WebDriverWait(self.browser, 1000).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'main'))
            )
            curr_page = self.browser.find_element(by=By.XPATH, value=r"//*[@class='avter']").text
            if verbose and curr_page != '':
                print(int(curr_page) - 1)
        except Exception:
            print("Page turning failed; the crawl ends here")

    # Page-processing function
    def ProcessHTML(self, verbose=False, withDuplicates=True):
        """verbose: print each record while crawling; withDuplicates: stop once an already-saved record is seen."""
        duplicates_flag = False
        if not os.path.exists(self.save_dir):
            names = ['title', 'content', 'public_time', 'urls']
            history = pd.DataFrame([], columns=names)
            history.to_excel(self.save_dir, index=False)
            print("First run, crawl time: {}".format(time.strftime('%Y.%m.%d', time.localtime(time.time()))))
        else:
            history = pd.read_excel(self.save_dir)
            history_key = history["public_time"].sort_values(ascending=False).values[0]
            print("Latest saved item: {}  Crawl time: {}".format(history_key, time.strftime('%Y.%m.%d', time.localtime(time.time()))))
        soup = BeautifulSoup(self.html, "html.parser")
        tbody = soup.find("div", attrs={"class": "list_content"})
        for idx, tr in enumerate(tbody.find_all("li")):
            # Random sleep to avoid getting the IP blocked
            if random.randint(1, 2) == 2:
                time.sleep(random.randint(3, 6))
            a = tr.find("a")
            title = a.text
            href = "http://www.12379.cn" + a["href"]
            content = contentExtractor(href)
            public_time = tr.find("span").text
            results = [title, content, public_time, href]
            if verbose:
                print([public_time, title, content, href])
            # Duplicate check
            if withDuplicates and public_time in history["public_time"].values and title in history["title"].values:
                duplicates_flag = True
                print("Repeated crawl detected; subsequent crawling has been stopped.")
                break
            else:
                save_results(results)
        return duplicates_flag

    # Close the driver
    def close(self):
        self.browser.close()
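As a quick sanity check, the class can be exercised on a single page like this (a minimal sketch; the full page-turning loop appears later in ChinaEmergencyNewsExtractor):
# Single-page test of EventExtractor (sketch)
crawler = EventExtractor()
crawler.login()                    # open the list page and wait for it to load
crawler.get_html()                 # grab the rendered page source
crawler.ProcessHTML(verbose=True)  # parse the entries, deduplicate and save them
crawler.close()                    # always release the browser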
2. login(self): opens the URL; WebDriverWait(...).until(...) waits until the element with class 'main' is detected on the page before the next step runs
3. get_html(self): gets the HTML source of the current page
4. next_page(self): clicks through to the next page
5. ProcessHTML(self, verbose=False, withDuplicates=True): data processing
- On the first run a new Excel file is created; if a history file already exists, it is checked against the most recent publish time, and crawling stops once both the title and the publish time of an entry are already present.
- A random sleep is inserted between entries so as not to put too much load on the server.
- The other fields are easy to extract, so I won't go into detail.
- To extract the article body, another Selenium session is started (note that some links are dead; on a 404 the body is returned as "No"), as follows:
# Body-text extraction function
def contentExtractor(url):
    """Input: a news URL. Output: the news body text at that URL, or "No" if the page does not exist."""
    # Check for a 404 first
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
    r = requests.get(url, headers=user_agent, allow_redirects=False)
    if r.status_code == 404:
        return "No"
    else:
        # No 404, so render the page and extract the body text
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome("chromedriver.exe", options=chrome_options)
        driver.get(url)
        WebDriverWait(driver, 1000).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'content_text'))
        )
        respond = driver.page_source
        driver.quit()
        soup = BeautifulSoup(respond, "html.parser")
        text = soup.find("div", attrs={"class": "content_text"}).text
        return text
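For a quick standalone test of contentExtractor (the article path below is a hypothetical placeholder; real links are built from the list page inside ProcessHTML):
# Hypothetical article link, for illustration only
demo_url = "http://www.12379.cn/html/gzaq/fmytplz/example_article.shtml"
body = contentExtractor(demo_url)
print("page not found" if body == "No" else body[:100])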
- If no duplicate is detected, save_results(results) is called to write the current record, as follows:
def save_results(results):
    results_path = './chinaEmergency/chinaEEP.xlsx'
    names = ['title', 'content', 'public_time', 'urls']
    results = {k: v for k, v in zip(names, results)}
    if not os.path.exists(results_path):
        df1 = pd.DataFrame([], columns=names)
        df1.to_excel(results_path, index=False)
    else:
        df1 = pd.read_excel(results_path)
    new = pd.DataFrame(results, index=[1])
    # Append the new record (pd.concat, since DataFrame.append was removed in recent pandas)
    df1 = pd.concat([df1, new], ignore_index=True)
    df1.sort_values(by="public_time", ascending=True, inplace=True)
    df1.reset_index(drop=True, inplace=True)
    df1.to_excel(results_path, index=False)
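To inspect what has been collected so far, the Excel file can simply be read back with pandas (assuming the same chinaEEP.xlsx path and that openpyxl is installed):
# Peek at the saved records
df = pd.read_excel('./chinaEmergency/chinaEEP.xlsx')
print(df.shape)
print(df[['public_time', 'title']].tail())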
ChinaEmergencyNewsExtractor
This function drives the page-turning crawl:
Event = EventExtractor() creates the crawler defined above
Event.login() opens the target site
The for loop runs over the requested number of pages (page_nums, set by the caller)
Event.get_html() gets the HTML of the current page
Event.ProcessHTML(verbose=True) processes the HTML and saves the records to the Excel file, returning a Boolean that says whether a duplicate was hit: if so, page turning stops; otherwise Event.next_page() moves to the next page.
Event.close() closes the Selenium browser
# Controllable event extractor
def ChinaEmergencyNewsExtractor(page_nums=3, verbose=1):
    Event = EventExtractor()
    Event.login()
    for i in range(0, page_nums):
        Event.get_html()
        page_break = Event.ProcessHTML(verbose=verbose)
        if page_break:
            break
        Event.next_page()
    # Close the driver so no background processes are left behind
    Event.close()
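For a one-off manual run, without the timed loop described next, the function can simply be called directly:
# Crawl the first five pages once and exit
ChinaEmergencyNewsExtractor(page_nums=5, verbose=1)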
Automatic crawling and the main function
Automatic crawling uses a very simple timed strategy: the retry library makes it quick to set up the scheduled runs. The main code is as follows:
# Timed extraction
@retry(tries=30, delay=60 * 60 * 24)
def retry_data():
    ChinaEmergencyNewsExtractor(page_nums=3, verbose=1)
    # Raise on purpose so that retry restarts the function after the delay
    raise Exception("trigger the next scheduled run")
raise throws an exception on purpose so that the retry library restarts the function
tries: the number of restarts, here 30
delay: the delay before each restart, in seconds; 60 * 60 * 24 is one day
In other words, the site is crawled once every 24 hours, 30 times in total, which gives about a month of automatic data collection.
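The same schedule can also be written as a plain loop with time.sleep instead of the raise-and-retry trick; a minimal alternative sketch (not the approach used in this post):
# Alternative: run once a day for 30 days with a simple loop (sketch)
def scheduled_crawl(runs=30, interval=60 * 60 * 24):
    for _ in range(runs):
        try:
            ChinaEmergencyNewsExtractor(page_nums=3, verbose=1)
        except Exception as e:
            print("This round of crawling failed:", e)
        time.sleep(interval)  # wait a day before the next round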
Main function:
if __name__ == '__main__':
    retry_data()
Results
Figure 1: The crawler running unattended
Figure 2: The crawled Excel file
That concludes this example of automatically crawling breaking news. I am a newcomer to data mining and analysis, with limited talent and knowledge, so if there is anything wrong or incomplete, please point it out!