Project introduction

Background

I read web novels occasionally, but keeping up with updates is a hassle, so I figured I'd just roll my own tool.

The project address

Repository: github.com/sfyc23/Pyth…

The data source

Biquge: www.5atxt.com/

Reasons for choosing it:

  1. It is simple;
  2. It is the first result on Baidu;
  3. It has a large collection of novels;
  4. It has no anti-crawling protection.

The final function

Every half an hour (the interval is configurable), the script refreshes the novel's chapter catalog to check whether it has been updated. If there is an update, the new content is emailed to the address bound to WeChat, so it can be read directly in WeChat.

Implementation

Take the novel The Mysterious Lord for example:

Search page

Find the details page for the novel from the search page.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.5atxt.com'
novel_name = 'The Mysterious Lord'
data = {'name': novel_name}
# Search for the novel by name
resp = requests.post("https://www.5atxt.com/index.php?s=/web/index/search", data=data)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "lxml")
    # Pick the search result whose title matches the novel name exactly
    nh = soup.find('span', class_='s2', text=novel_name)
    href = nh.a.get('href')
    home_url = "{}{}".format(base_url, href)
    print(home_url)
    

Get the details page address: www.5atxt.com/1_1409/

Details page

Get the latest chapter on the details page:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.5atxt.com'
home_url = "https://www.5atxt.com/1_1409/"
resp = requests.get(home_url)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "lxml")
    # The last <p> inside div#info links to the latest chapter
    latest_chapter_a = soup.select("div#info > p:nth-last-child(1)")[0].a
    latest_chapter_name = latest_chapter_a.text.strip()  # Latest chapter name
    latest_chapter_url = latest_chapter_a['href']  # Latest chapter path
    latest_chapter_url = "{}{}".format(base_url, latest_chapter_url)
    print(latest_chapter_name, latest_chapter_url)

Result: chapter name: "Part Five summary and a leave of absence, plus a request for end-of-month monthly tickets"; chapter address: www.5atxt.com/1_1409/1458…

Novel reading page

Crawl the chapter content from the novel's reading page:

import requests
from bs4 import BeautifulSoup

novel_url = 'https://www.5atxt.com/1_1409/14584779.html'
resp = requests.get(novel_url)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "lxml")
    # The chapter text sits in <div id="content" deep="3">
    content_div = soup.find('div', id='content', deep="3")
    if content_div.p:
        # Strip the ad line at the top ("Remember this site's address in one second...")
        content_div.p.decompose()
    if content_div.div:
        # Strip the footer line ("Chapter error? Click here to report (free registration)...")
        content_div.div.decompose()
    print(content_div.text)

Email updates to your inbox

import yagmail

email_user = '[email protected]'          # sender account
email_password = 'password or authorization code'
email_host = 'smtp host'  # example: smtp.qq.com
to_emails = ['[email protected]']  # more than one recipient address can be listed
yag = yagmail.SMTP(user=email_user, password=email_password, host=email_host)
yag.send(to_emails, title, content)  # title: the chapter name, content: the chapter text

Enabling the scheduled query

from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()
# Run update_novel every 30 minutes, with up to 300 s of random jitter
# and a 600 s grace period for delayed runs
scheduler.add_job(update_novel, 'interval', minutes=30, misfire_grace_time=600, jitter=300)
scheduler.start()

The query runs every 30 minutes with a random jitter of up to 300 seconds, so the requests don't always fire at exactly the same times.
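
The update_novel job handed to the scheduler is essentially the glue between the earlier snippets. Below is a minimal sketch of what it might look like, reusing the selectors from the steps above; the credentials are placeholders, the last-seen chapter is kept in memory only, and none of the names are taken from the project's actual code (the real project persists the chapter name and handles multi-chapter updates, see the details below).

import requests
import yagmail
from bs4 import BeautifulSoup

BASE_URL = 'https://www.5atxt.com'
HOME_URL = 'https://www.5atxt.com/1_1409/'
EMAIL_USER = '[email protected]'                  # placeholder credentials, as in the mail step
EMAIL_PASSWORD = 'password or authorization code'
EMAIL_HOST = 'smtp.qq.com'
TO_EMAILS = ['[email protected]']

last_sent = ''   # name of the last chapter emailed (kept in memory for this sketch only)

def update_novel():
    global last_sent
    resp = requests.get(HOME_URL)
    if resp.status_code != 200:
        return
    soup = BeautifulSoup(resp.text, "lxml")
    # Same selector as the details-page step: the last <p> in div#info links the latest chapter
    latest = soup.select("div#info > p:nth-last-child(1)")[0].a
    name, url = latest.text.strip(), "{}{}".format(BASE_URL, latest['href'])
    if name == last_sent:
        return  # nothing new since the last check
    # Fetch the chapter text, as in the reading-page step
    chapter = BeautifulSoup(requests.get(url).text, "lxml").find('div', id='content', deep="3")
    yag = yagmail.SMTP(user=EMAIL_USER, password=EMAIL_PASSWORD, host=EMAIL_HOST)
    yag.send(TO_EMAILS, name, chapter.text)
    last_sent = name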

That's the basic flow.

Some other details:

  1. The details page URL is saved after the first request, so it can be reused directly on later runs;
  2. The latest chapter name is saved and compared against the newly fetched one to detect updates;
  3. More than one chapter may have been posted since the last check, so that case has to be handled (see the full code for details);
  4. fake_useragent is used to randomize the User-Agent as a simple counter to anti-crawling checks (a rough sketch of points 3 and 4 follows below).
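
For points 3 and 4, here is a rough sketch of how the new chapters might be collected with a randomized User-Agent. The div#list selector is an assumption about the catalog markup, and get_new_chapters is my own name, not taken from the project:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

def get_new_chapters(home_url, last_sent):
    """Return (name, url) pairs for every chapter newer than last_sent."""
    headers = {'User-Agent': ua.random}   # point 4: a random User-Agent on every request
    resp = requests.get(home_url, headers=headers)
    if resp.status_code != 200:
        return []
    soup = BeautifulSoup(resp.text, "lxml")
    # Assumed markup: the catalog lists chapters as <dd><a> entries under div#list
    chapters = [(a.text.strip(), a['href']) for a in soup.select("div#list dd a")]
    names = [n for n, _ in chapters]
    if last_sent not in names:
        return chapters[-1:]                        # first run: just take the latest chapter
    return chapters[names.index(last_sent) + 1:]    # point 3: everything after the last one sent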