
Main content: use Selenium and BS4 to automatically crawl Chinese emergency news

Main code components: EventExtractor, ChinaEmergencyNewsExtractor, save_results, contentExtractor

EventExtractor

This class controls the crawler and is used to fetch and process web page information.

1. Configure some default settings in __init__(self):

  • options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2}): disables image loading in the browser to speed up page loads
  • options.add_argument('--headless'): runs the browser in headless mode, so no window is displayed
  • webdriver.Chrome(): the Chrome driver is used here
  • url: the URL of the list page to crawl
  • html: stores the crawled HTML data
  • save_dir: path of the Excel file where the crawled data is saved
# Required imports for the crawler
import os
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# Crawler class
class EventExtractor():
    def __init__(self):
        options = webdriver.ChromeOptions()
        # Disable image loading in the browser to speed up page loads
        options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
        # Run Chrome in headless mode (no visible window)
        options.add_argument('--headless')
        self.browser = webdriver.Chrome("chromedriver.exe", options=options)
        self.url = 'http://www.12379.cn/html/gzaq/fmytplz/index.shtml'
        self.html = None
        self.save_dir = './chinaEmergency/chinaEEP.xlsx'

    # Open the site
    def login(self):
        self.browser.get(self.url)
        WebDriverWait(self.browser, 1000).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'main')))

    # Get the current page source
    def get_html(self):
        self.html = self.browser.page_source

    # Skip to the next page
    def next_page(self, verbose=False):
        try:
            submit_page = self.browser.find_element(by=By.XPATH, value=r"//*[@class='next']")  # find_element_by_xpath(r"//*[@class='next']")
            submit_page.click()
            WebDriverWait(self.browser, 1000).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'main'))
            )
            curr_page = self.browser.find_element(by=By.XPATH, value=r"//*[@class='avter']").text
            if verbose and curr_page:
                print(int(curr_page) - 1)
        except Exception:
            print("Page turning ended with an exception")

    # Page-processing function
    def ProcessHTML(self, verbose=False, withDuplicates=True):
        """verbose: controls progress printing while crawling; withDuplicates: whether to stop when an already-saved item is found."""
        duplicates_flag = False
        if not os.path.exists(self.save_dir):
            names = ['title', 'content', 'public_time', 'urls']
            history = pd.DataFrame([], columns=names)
            history.to_excel(self.save_dir, index=False)
            print("First run, crawl time: {}".format(time.strftime('%Y.%m.%d', time.localtime(time.time()))))
        else:
            history = pd.read_excel(self.save_dir)
            history_key = history["public_time"].sort_values(ascending=False).values[0]
            print("Most recent item in history: {}  Crawl time: {}".format(history_key, time.strftime('%Y.%m.%d', time.localtime(time.time()))))
        soup = BeautifulSoup(self.html, "html.parser")
        tbody = soup.find("div", attrs={"class": "list_content"})
        for idx, tr in enumerate(tbody.find_all("li")):
            # Random sleep to reduce the risk of IP blocking
            if random.randint(1, 2) == 2:
                time.sleep(random.randint(3, 6))
            a = tr.find("a")
            title = a.text
            href = "http://www.12379.cn" + a["href"]
            content = contentExtractor(href)
            public_time = tr.find("span").text
            results = [title, content, public_time, href]
            if verbose:
                print([public_time, title, content, href])
            # Duplicate check
            if withDuplicates and public_time in history["public_time"].values and title in history["title"].values:
                duplicates_flag = True
                print("Duplicate item detected; stopping further crawling.")
                break
            else:
                save_results(results)
        return duplicates_flag

    # Close the driver
    def close(self):
        self.browser.close()

2. login(self): opens the URL. The until condition is a wait: execution only proceeds to the next step once an element with class main is detected on the page.

3. get_html(self): gets the HTML source of the current page

4. next_page(self, verbose=False): moves to the next page

5. ProcessHTML(self, verbose=False, withDuplicates=True): processes the page data

  • On the first run a new Excel file is created; on later runs the existing history file is read and the most recent publish date is noted. The crawl stops once an item whose title and publish time both already exist in the history is detected.
  • A random sleep is inserted between items so the crawl does not burden the server.
  • The other fields are straightforward to extract, so I won't go into detail.
  • To extract the article body, another Selenium session is started (note that some links are broken; for a 404 the body is returned as "none"), as follows:
# Article body extraction function
def contentExtractor(url):
    """Input: a news URL. Output: the news body text at that URL, or "none" if the page has no content (404)."""
    # Check for 404 first
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
    r = requests.get(url, headers=user_agent, allow_redirects=False)
    if r.status_code == 404:
        return "none"
    else:
        # No 404, so extract the body of the web page
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome("chromedriver.exe", options=chrome_options)
        driver.get(url)
        WebDriverWait(driver, 1000).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'content_text'))
        )
        respond = driver.page_source
        driver.quit()
        soup = BeautifulSoup(respond, "html.parser")
        text = soup.find("div", attrs={"class": "content_text"}).text
        return text
  • If no duplicate is detected, save_results(results) is called to write the current record, as follows:
def save_results(results):
    results_path = './chinaEmergency/chinaEEP.xlsx'
    names = ['title', 'content', 'public_time', 'urls']
    results = {k: v for k, v in zip(names, results)}

    # Create an empty table with the right columns if the file does not exist yet
    if not os.path.exists(results_path):
        df1 = pd.DataFrame([], columns=names)
    else:
        df1 = pd.read_excel(results_path)
    # Append the new record, sort by publish time and write back
    new = pd.DataFrame(results, index=[0])
    df1 = pd.concat([df1, new], ignore_index=True)
    df1.sort_values(by="public_time", ascending=True, inplace=True)
    df1.reset_index(drop=True, inplace=True)
    df1.to_excel(results_path, index=False)

ChinaEmergencyNewsExtractor

This function drives the complete page-turning crawl.

Event = EventExtractor() creates the crawler defined above

Event.login() opens the site to be crawled

The for loop iterates over the number of pages to crawl (page_nums), which you can choose yourself

Event.get_html() gets HTML data for the current page

Event.ProcessHTML(verbose=True) processes the HTML data and saves it to the specified Excel file, returning a Boolean indicating whether a duplicate crawl was detected. If a duplicate is found, page turning stops; otherwise Event.next_page() turns to the next page.

Event.close() closes the Selenium browser

# Controllable event extractor
def ChinaEmergencyNewsExtractor(page_nums=3, verbose=1):
    Event = EventExtractor()
    Event.login()
    for i in range(0, page_nums):
        Event.get_html()
        page_break = Event.ProcessHTML(verbose=True)
        if page_break:
            break
        Event.next_page()
    # Close the driver to avoid leftover background processes
    Event.close()

Automatic crawling and the main function

Automatic crawling uses a very simple timed strategy: the retry library is used to quickly implement the scheduled task. The main code is as follows:

from retry import retry

# Scheduled extraction
@retry(tries=30, delay=60 * 60 * 24)
def retry_data():
    ChinaEmergencyNewsExtractor(page_nums=3, verbose=1)
    raise  # deliberately raise so that retry restarts the function after the delay

raise deliberately throws an exception so that the retry library restarts the function

tries: the number of restarts, set to 30 here

delay: the delay before each restart, in seconds; 60 × 60 × 24 is one day

In other words, the site is crawled once every 24 hours, 30 times in total, giving roughly a month of automatic data collection.
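If you prefer not to drive the schedule through exceptions, the same 24-hour cycle can also be written as a plain loop. A minimal sketch, assuming the ChinaEmergencyNewsExtractor function defined above (scheduled_crawl is a hypothetical name used only for illustration):

import time

# Hypothetical alternative to the retry-based scheduler: run 30 times, one day apart
def scheduled_crawl(runs=30, interval=60 * 60 * 24):
    for _ in range(runs):
        ChinaEmergencyNewsExtractor(page_nums=3, verbose=1)
        time.sleep(interval)  # wait 24 hours before the next crawl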

Main function:

if __name__ == '__main__':
    retry_data()

Results

Figure 1: The crawler interface while running unattended

Figure 2: The crawled Excel file
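To spot-check what has been collected, here is a minimal sketch (assuming the save path and column names used by save_results above) that loads the Excel file and prints the most recently published records:

import pandas as pd

# Path and column names match those used by save_results above
df = pd.read_excel('./chinaEmergency/chinaEEP.xlsx')
# Show the five most recently published items
print(df.sort_values(by="public_time", ascending=False).head(5)[["public_time", "title", "urls"]])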

That concludes this example of automatically crawling emergency news. I am still new to data mining and analysis, so if there are any mistakes or shortcomings, please point them out!