This time, let’s crawl the articles on the official website of the National Health Commission.

First, the battle between the crawler and the anti-crawl mechanism

I actually wrote about the back-and-forth between me and the anti-crawl mechanism in another blog post, but somewhere along the way it tripped CSDN’s review mechanism and failed moderation. It was just a pile of failed crawler attempts anyway, not much fun to read. Anyone who is really interested can talk to me privately.

To tell the truth, the Health Commission website was harder to crawl than I imagined; its anti-crawl mechanism is really strong.

After numerous 412 errors, I found that the anti-crawl mechanism of this website has the following characteristics (based on personal experience; if the summary is inaccurate or misses something, additions are welcome).

  1. The server validates Cookies when processing requests.
  2. The Cookie values change dynamically, and a Cookie is only valid for a few dozen seconds.
  3. The Cookie changes when you visit a different page, so the idea of exploiting the time window to reuse one Cookie across multiple pages does not work.
  4. The site detects and restricts access from Selenium; all you get when visiting with Selenium is a blank page.

These characteristics meant there was no way for me to crawl the site in the normal way.
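To give a sense of what “the normal way” looks like and why it fails here, below is a minimal, hypothetical sketch of a plain requests call with no JS-generated Cookie; for me, requests like this kept coming back with HTTP 412 instead of the page content.

import requests

# Hypothetical minimal request, just to illustrate the failure mode:
# without the dynamically generated Cookie, the server rejects the request.
url = 'http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(resp.status_code)  # in my experience this was 412, not 200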

Obviously, the site’s Cookie is generated by JavaScript and contains at least three encrypted parameters.

Actually cracking the encryption algorithm to crawl the data is theoretically feasible: the encryption runs in the browser, so all of the encryption code can be seen in the developer tools. In theory, if you understand JS and put in some effort, you can crack it. (That said, I don’t recommend doing so. If it’s to hone your skills, you can research it privately, but if it’s just to write this crawler, the effort is out of all proportion to the payoff. Besides, cracking another site’s encryption algorithm puts you right at the edge of the law!)

Just when I was at a loss, an expert recommended a “magic tool” to me that got the crawl done: pyppeteer.

I won’t say much about this framework here (mainly because I’m new to it myself and my understanding isn’t deep; you can find detailed introductions online).

In short, it is like an advanced version of Selenium: more capable, more efficient, and better at slipping past anti-crawl detection.

In the spirit of giving it a try, I installed the library and ran it.

import asyncio
from pyppeteer import launch

url = 'http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml'

async def fetchUrl(url):
    # headless=False shows the browser window; dumpio=True pipes browser logs
    # to the console; autoClose=True closes the browser when the script ends.
    browser = await launch({'headless': False, 'dumpio': True, 'autoClose': True})
    page = await browser.newPage()

    await page.goto(url)
    await asyncio.wait([page.waitForNavigation()])
    html = await page.content()   # full page source after the JS has run
    await browser.close()
    print(html)

asyncio.get_event_loop().run_until_complete(fetchUrl(url))

It really worked!

So far, so good: with the help of pyppeteer, the official website’s anti-crawl mechanism could finally be bypassed.

I’ll describe the crawl process in more detail.

Second, crawling articles from the official website of the Health Commission

2.1 Installing the necessary libraries

The libraries used by our crawler are pyppeteer and BeautifulSoup. The installation commands are below (there are plenty of online tutorials if you run into trouble).

pip install pyppeteer
pip install bs4

If no error is reported when you run the following code, the installation is successful.

import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup

2.2 Analyzing the target website

Thanks to the earlier attempt, we can already bypass the site’s anti-crawl mechanism, so here we can focus directly on analyzing the page structure.

The first page: www.nhc.gov.cn/xcs/yqtb/li…

The second page: www.nhc.gov.cn/xcs/yqtb/li…

The third page: www.nhc.gov.cn/xcs/yqtb/li…

First, observe the URL pattern. Except for the first page, the URLs of the other pages follow an obvious rule: the number in the URL corresponds to the current page number. In other words, page turning is done through the URL, as the short sketch below shows.
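Concretely, here is a small sketch of the pattern, built from the full list URLs that also appear in the crawler code later in this post:

# Page 1 has no number in its URL; every later page carries its page number.
base = 'http://www.nhc.gov.cn/xcs/yqtb/'
for page in range(1, 4):
    if page == 1:
        print(base + 'list_gzbd.shtml')
    else:
        print(base + 'list_gzbd_{}.shtml'.format(page))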

Opening the developer tools with F12, we can see that all articles are stored under a ul tag with class zxxx_list, and each li tag corresponds to one article (except for the li tags with class=”line” in between, which serve as dividers).

  • The title of each article is stored in the title attribute of the a tag under the li tag.
  • The link to the article is stored in the href attribute of the a tag.
  • The publication date is stored in the text of the span tag under the li tag.

Next, analyze the article detail page: the body text is stored under a div tag with id xw_box (the div contains many p tags, but we don’t need to worry about them individually; just take the text under the div).
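Since the screenshot may not come through here, the sketch below uses a simplified, made-up snippet of HTML that mirrors the list-page structure just described (the real page has more attributes and many more entries), and shows how BeautifulSoup can pull out the three fields:

from bs4 import BeautifulSoup

# Simplified, hypothetical markup mirroring the list-page structure described above.
sample_html = '''
<ul class="zxxx_list">
  <li><a href="/xcs/yqtb/202005/example.shtml" title="Example notification title">...</a>
      <span>2020-05-29</span></li>
  <li class="line"></li>
</ul>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
for li in soup.find('ul', attrs={'class': 'zxxx_list'}).find_all('li'):
    if li.a is None:      # skip the divider <li class="line">
        continue
    print(li.a['title'], li.a['href'], li.span.text)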

At this point, the analysis of the website’s pages is complete.

2.3 Writing the crawler code

First import the required libraries.

import os
import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup

The pyppeteer operations are wrapped in the fetchUrl function, which issues the network request and returns the page source.

async def pyppteer_fetchUrl(url):
    # Same launch options as before: visible browser, browser logs piped
    # to the console, auto-close when done.
    browser = await launch({'headless': False, 'dumpio': True, 'autoClose': True})
    page = await browser.newPage()

    await page.goto(url)
    await asyncio.wait([page.waitForNavigation()])
    html = await page.content()
    await browser.close()
    return html

def fetchUrl(url):
    # Synchronous wrapper so the rest of the crawler can stay non-async
    return asyncio.get_event_loop().run_until_complete(pyppteer_fetchUrl(url))

Then the getPageUrl function constructs the URL of each page according to the URL pattern (when I crawled, the site only had 7 pages of articles, so the page range here is set to range(1, 7); adjust it to the actual situation when you crawl).

def getPageUrl():
    for page in range(1, 7):
        if page == 1:
            # The first page has no page number in its URL
            yield 'http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml'
        else:
            url = 'http://www.nhc.gov.cn/xcs/yqtb/list_gzbd_' + str(page) + '.shtml'
            yield url

The getTitleUrl function extracts the title, link, and publication date of each article from the article list on a page.

def getTitleUrl(html):

    bsobj = BeautifulSoup(html, 'html.parser')
    titleList = bsobj.find('div', attrs={"class": "list"}).ul.find_all("li")
    for item in titleList:
        link = "http://www.nhc.gov.cn" + item.a["href"]
        title = item.a["title"]
        date = item.span.text
        yield title, link, date

The getContent function extracts the body text of an article. (If no body is found, it returns “Failed to crawl!”.)

def getContent(html):

    bsobj = BeautifulSoup(html, 'html.parser')
    cnt = bsobj.find('div', attrs={"id": "xw_box"}).find_all("p")
    s = ""
    if cnt:
        for item in cnt:
            s += item.text
        return s

    return "Failed to crawl!"

Using the saveFile function, the data is saved to a local TXT file.

def saveFile(path, filename, content):

    if not os.path.exists(path):
        os.makedirs(path)

    # Save the file
    with open(path + filename + ".txt", 'w', encoding='utf-8') as f:
        f.write(content)

And then finally, the main function.

if "__main__" == __name__: 
    for url in getPageUrl():
        s =fetchUrl(url)
        for title,link,date in getTitleUrl(s):
            print(title,link)
            If the date is before January 21, exit directly
            mon = int(date.split("-") [1])
            day = int(date.split("-") [2])
            if mon <= 1 and day < 21:
                break;

            html =fetchUrl(link)
            content = getContent(html)
            print(content)
            saveFile("D:/Python/NHC_Data/", title, content)
            print("-- -- -- -- --"*20)

Let me briefly explain what happens in the main function.

  • First, the getPageUrl function generates the URL of each list page.
  • Then the fetchUrl function visits that URL and gets the page source s.
  • Then getTitleUrl parses s to get the title, link, and publication date of each article in the list on that page.
  • Then comes a check: since the site’s notifications officially started on January 21, I limit the publication date to January 21 or later.
  • Then fetchUrl and getContent visit and parse each article link to get the body content.
  • Finally, saveFile saves the content to a local file.

2.4 Data Presentation

Finally, let’s do the data presentation.

At this point, crawling the article data from the Health Commission’s official website is complete.

Third, some closing words

It’s been a long time since I wrote a blog.

Time after starting work is really not as generous as it was back in school; I can only squeeze in writing after hours. This post was written on and off over more than a week.

But I will continue to study and hone my skills.

Back to the crawler.

Judging purely by the result, I did manage to crawl the target data from the site, so it counts as a success. But to be fair, the crawler code itself is not written all that elegantly, my use of these libraries is not that skilled yet, and there is still a lot of room for improvement.

The main point of this post is to share some ideas, and I hope you will share your own crawler tips as well.

At the same time, I hope this post gives you some inspiration and direction when you are unsure how to handle websites like this.


2020.5.29 supplement

1. An important point I forgot to mention: to prevent the server from detecting WebDriver, you need to remove --enable-automation before launching, as shown below.

from pyppeteer import launcher
# Remove --enable-automation from the default args before importing launch, to avoid WebDriver detection
launcher.DEFAULT_ARGS.remove("--enable-automation")

from pyppeteer import launch

To save trouble, I simply commented out that argument directly in the pyppeteer source code (I don’t see much need for that flag anyway).

The file is located at Python37_64\Lib\site-packages\pyppeteer\launcher.py under the Python installation directory.
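If you prefer the approach from the snippet above instead of editing the library source, a quick way to confirm the flag is really gone is to print pyppeteer’s default argument list; a small sketch:

from pyppeteer import launcher

# Remove the automation flag if present, then confirm it is no longer among
# the default Chromium arguments that pyppeteer will pass at launch.
if "--enable-automation" in launcher.DEFAULT_ARGS:
    launcher.DEFAULT_ARGS.remove("--enable-automation")
print("--enable-automation" in launcher.DEFAULT_ARGS)  # expect False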

  2. In the screenshots of the data display section, the file names are the article titles; to make review easier, I batch-renamed them to the publication dates. Please understand.

If anything in this article is unclear or wrong, please leave a comment, or scan the QR code below to add me on WeChat, so we can learn from each other and improve together.