The idea of climbing the novel:

  1. First get the address of the novel.
  2. Analyze the directory address structure.
  3. Join the address.
  4. Analyze chapter content structure.
  5. Gets and saves the text.
  6. The complete code

1. Get the address of the novel

Load the required package:

import re
from bs4 import BeautifulSoup as ds
import requests
Copy the code

To obtain the novel directory file, return <Response [200]>, indicating that the webpage can be climbed normally

base_url='https://www.soshuw.com/XuLiangShangYouWangFei/'
chapter_html=requests.get(base_url)
print(chapter_html)
Copy the code

2. Analyze the address structure of the novel

Parse directory web page, the output result is directory web page source code

chapter_page_html=ds(chapter_page,'lxml')
print(chapter_page)
Copy the code

Open theDirectory page, we found that there is a latest chapter table (there are nine chapters here) in front of the main body table, and the most complete table contains the latest chapter, so the latest chapter is not needed here.

Right-click on a web page and select Check (or Properties, depending on browsers, I use Internet Explorer), select the Element column and move the mouse over the code block to the right. The left page will highlight its corresponding page area to find the code block corresponding to the full directory. The diagram below:

Id =” novel108799 “; id= “novel108799”; id= “novel108799”; id= “novel108799”

Extract the full directory block

chapter_novel=chapter_page.find(id="novel108799")
print(chapter_novel)
Copy the code

The results are as follows (only partial) :

By comparing the url of the chapter content with the url of the table of contents (base_URL), we can get the complete url of the chapter content by stitching together the second half of the url of the chapter content with the url of the chapter content

3. Join the address

The second half of the address is extracted by using the regular language library

chapter_novel_str=str(chapter_novel) regx = '<dd><a href="/XuLiangShangYouWangFei(.*?) "' chapter_href_list = re.findall(regx, chapter_novel_str) print(chapter_href_list)Copy the code

Concatenated URL: Defines a list chapter_url_list to receive the full address

chapter_url_list = []
for i in chapter_href_list:
    url=base_url+i
    chapter_url_list.append(url)
print(chapter_url_list)
Copy the code

4. Analyze the structure of chapter content

Open thechapter> < span style = “box-sizing: inherit! Important; color: RGB (74, 74, 74); line-height: 22px; font-size: 14px! Important; word-break: inherit! Important;

Extract text segment

chapter_novel=chapter_page.find(id="novel108799")
print(chapter_novel)
Copy the code

Extract the body text and headings

body_html=requests.get('https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html')
body_page=ds(body_html.content,'lxml')
body = body_page.find(class_='content')
body_content=str(body)
print(body_content)
body_regx='<br/>    (.*?)\n'
content_list=re.findall(body_regx,body_content)
print(content_list)
title_regx = '<h1>(.*?)</h1>'
title = re.findall(title_regx, body_html.text)
print(title)
Copy the code

5. Save the text

with open('1.txt', 'a+') as f: f.write('\n\n') f.write(title[0] + '\n') f.write('\n\n') for e in content_list: F.w rite (e + '\ n') print (' {} crawl out. The format (title [0]))Copy the code

6. Complete code

import re from bs4 import BeautifulSoup as ds import requests base_url='https://www.soshuw.com/XuLiangShangYouWangFei' chapter_html=requests.get(base_url) chapter_page=ds(chapter_html.content,'lxml') chapter_novel=chapter_page.find(id="novel108799") #print(chapter_novel) chapter_novel_str=str(chapter_novel) regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"' chapter_href_list = re.findall(regx, chapter_novel_str) #print(chapter_href_list) chapter_url_list = [] for i in chapter_href_list: url=base_url+i chapter_url_list.append(url) #print(chapter_url_list) for u in chapter_url_list: body_html=requests.get(u) body_page=ds(body_html.content,'lxml') body = body_page.find(class_='content') body_content=str(body) # print(body_content) body_regx='<br/> (.*?)\n' content_list=re.findall(body_regx,body_content) #print(content_list) title_regx = '<h1>(.*?)</h1>' title = re.findall(title_regx, body_html.text) #print(title) with open('1.txt', 'a+') as f: f.write('\n\n') f.write(title[0] + '\n') f.write('\n\n') for e in content_list: F.w rite (e + '\ n') print (' {} crawl out. The format (title [0]))Copy the code

Recently, many friends have sent messages to ask about learning Python. For easy communication, click on blue to join yourselfDiscussion solution resource base