The plan for crawling the novel:
- First get the address of the novel.
- Analyze the directory page structure.
- Join the addresses.
- Analyze the chapter content structure.
- Get and save the text.
- The complete code
1. Get the address of the novel
Load the required package:
import re
from bs4 import BeautifulSoup as ds
import requests
Request the novel's directory page. If it returns <Response [200]>, the page can be crawled normally:
base_url='https://www.soshuw.com/XuLiangShangYouWangFei/'
chapter_html=requests.get(base_url)
print(chapter_html)
2. Analyze the address structure of the novel
Parse the directory page; the output is the page's HTML source:
chapter_page=ds(chapter_html.content,'lxml')
print(chapter_page)
Opening the directory page, we find a "latest chapters" table (nine chapters here) in front of the main table of contents. Since the full table already contains those latest chapters, the latest-chapters table is not needed.
Right-click the page and choose Inspect (the menu name varies by browser; I use Internet Explorer), select the Elements panel, and move the mouse over the code blocks on the right; the corresponding area of the page is highlighted on the left, which lets you locate the code block for the full directory. As the figure below shows:
The full directory block has id="novel108799".
Extract the full directory block
chapter_novel=chapter_page.find(id="novel108799")
print(chapter_novel)
The result is as follows (partial):
Comparing a chapter's URL with the directory URL (base_url), we see that the full chapter URL can be obtained by appending the second half of the chapter's href to base_url.
3. Join the addresses
Extract the second half of each address with the regular-expression library:
chapter_novel_str=str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
print(chapter_href_list)
Concatenate the URLs: define a list chapter_url_list to receive the full addresses.
chapter_url_list = []
for i in chapter_href_list:
url=base_url+i
chapter_url_list.append(url)
print(chapter_url_list)
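As a side note, the same joining can be done with urljoin from Python's standard library, which also handles hrefs that are already absolute paths. A minimal sketch (the sample href is the chapter page used later in this article):

```python
from urllib.parse import urljoin

base_url = 'https://www.soshuw.com/XuLiangShangYouWangFei/'

# hrefs exactly as they appear in the directory page's <a> tags
hrefs = ['/XuLiangShangYouWangFei/3647144.html']

chapter_url_list = [urljoin(base_url, h) for h in hrefs]
print(chapter_url_list[0])
# → https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html
```

Unlike plain string concatenation, this does not depend on whether base_url ends with a slash.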
4. Analyze the structure of chapter content
Open a chapter page and inspect it the same way as before; the body text sits in a block with class "content".
Extract the body text and the title:
body_html=requests.get('https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html')
body_page=ds(body_html.content,'lxml')
body = body_page.find(class_='content')   # block holding the chapter text
body_content=str(body)
print(body_content)
body_regx='<br/> (.*?)\n'                 # each paragraph sits between <br/> and a newline
content_list=re.findall(body_regx,body_content)
print(content_list)
title_regx = '<h1>(.*?)</h1>'             # chapter title
title = re.findall(title_regx, body_html.text)
print(title)
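To see what the two regexes in this step pick out without hitting the site, here is a self-contained check against a hand-made HTML snippet (the snippet is an assumption that mirrors the markup the tutorial targets, not the real page):

```python
import re

# hand-made stand-in mirroring the chapter page's markup
body_content = '<div class="content"><br/> First line.\n<br/> Second line.\n</div>'
html_text = '<h1>Chapter 1</h1>' + body_content

body_regx = '<br/> (.*?)\n'               # captures each paragraph between <br/> and a newline
content_list = re.findall(body_regx, body_content)
print(content_list)   # ['First line.', 'Second line.']

title_regx = '<h1>(.*?)</h1>'             # captures the chapter title
title = re.findall(title_regx, html_text)
print(title[0])       # Chapter 1
```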
5. Save the text
with open('1.txt', 'a+') as f:
    f.write('\n\n')
    f.write(title[0] + '\n')
    f.write('\n\n')
    for e in content_list:
        f.write(e + '\n')
print('{} crawled'.format(title[0]))
6. Complete code
import re
from bs4 import BeautifulSoup as ds
import requests

base_url = 'https://www.soshuw.com/XuLiangShangYouWangFei'
chapter_html = requests.get(base_url)
chapter_page = ds(chapter_html.content, 'lxml')
chapter_novel = chapter_page.find(id="novel108799")
# print(chapter_novel)

chapter_novel_str = str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
# print(chapter_href_list)

chapter_url_list = []
for i in chapter_href_list:
    url = base_url + i
    chapter_url_list.append(url)
# print(chapter_url_list)

for u in chapter_url_list:
    body_html = requests.get(u)
    body_page = ds(body_html.content, 'lxml')
    body = body_page.find(class_='content')
    body_content = str(body)
    # print(body_content)
    body_regx = '<br/> (.*?)\n'
    content_list = re.findall(body_regx, body_content)
    # print(content_list)
    title_regx = '<h1>(.*?)</h1>'
    title = re.findall(title_regx, body_html.text)
    # print(title)
    with open('1.txt', 'a+') as f:
        f.write('\n\n')
        f.write(title[0] + '\n')
        f.write('\n\n')
        for e in content_list:
            f.write(e + '\n')
    print('{} crawled'.format(title[0]))
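One caveat: the loop above fires requests back-to-back, which many sites throttle or block. A small refinement is to space the requests out; the fetch_all helper below is my addition, not part of the original code:

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call `fetch` on each URL, pausing `delay` seconds between
    requests so the site is not hammered."""
    results = []
    for u in urls:
        results.append(fetch(u))
        time.sleep(delay)
    return results

# demonstration with a stand-in fetch function; with the tutorial's
# setup this would be: pages = fetch_all(chapter_url_list, requests.get)
print(fetch_all(['a', 'b'], str.upper, delay=0))
```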