Learning Python takes a lot of reading and study, and most importantly, a lot of practice and application.
Statement
This article is for learning purposes only, not for commercial use! If anything in it infringes your rights, please contact me and I will delete the article!
1. Import libraries
import requests
import os
from bs4 import BeautifulSoup
The most important of these are requests, which fetches web pages, and BeautifulSoup, which parses them.
2. Construct the request header
A request header essentially simulates a browser visiting the page to get the information we need. Sending one with each request helps keep the site from refusing the crawler, since many pages carry anti-crawler measures.
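A minimal sketch of such a header, assuming a Chrome-on-macOS User-Agent (the same string the complete code below uses; in practice you would copy the one shown in your own browser's developer tools):

headers = {
    # Identify ourselves as a regular desktop browser rather than a script.
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/87.0.4280.88 Safari/537.36'
}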
3. Request the web page
Fetch the page with the get method of requests:
response = requests.get(url=url, headers=headers)
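One caveat worth knowing: response.text decodes the body using the encoding requests inferred from the response headers. If the text comes back garbled (not unusual with Chinese novel sites), a hedged workaround is to re-detect the encoding from the body itself before reading .text:

response = requests.get(url=url, headers=headers)
# apparent_encoding re-guesses the charset from the body, which helps when the
# server declares no charset (or the wrong one) in its response headers.
response.encoding = response.apparent_encoding
html = response.text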
4. Parse web pages
Parsing the returned page lets us extract the information we need, typically by working over the DOM tree. The common options are: regular expressions (intuitive; they treat the page as one big string and fuzzy-match valuable information, but extracting data from complex documents this way is very difficult), html.parser (Python's built-in parser), BeautifulSoup (a third-party library that sits on top of a parser), and lxml (a third-party library that parses XML and HTML). html.parser, BeautifulSoup, and lxml all model the page as a DOM tree.
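As a minimal sketch of the difference, BeautifulSoup accepts either backend by name; html.parser ships with Python, while lxml has to be installed separately (pip install lxml):

from bs4 import BeautifulSoup

html = '<dl><dd><a href="/1.html">Chapter 1</a></dd></dl>'
# Built-in backend: no extra dependency.
soup_builtin = BeautifulSoup(html, 'html.parser')
# lxml backend: third-party, generally faster and more forgiving of broken HTML.
soup_lxml = BeautifulSoup(html, 'lxml')
print(soup_builtin.a.text)  # Chapter 1
print(soup_lxml.a.text)     # Chapter 1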
Summary
The four steps above are the bare skeleton of a crawler; everything after them is extracting the content you actually need, and the method varies with what is being crawled.
5. Web page analysis
First, I picked a completed novel at random (PS: a finished one is more fun to read, after all) and opened its page to analyze the structure. What we want to crawl is the novel itself: its name, the chapter names, and the chapter content. Since I don't know much about HTML, my approach is to inspect the page directly. The web page is a container, and everything we need is somewhere inside it; the question is where. We peel it back layer by layer like an onion, keeping only what we need.
This part is the novel's introduction: its name, synopsis, author, and so on. Since our focus is the novel's content, I won't dwell on it here.
The content we need sits inside this element.
Peeling the structure back layer by layer exposes the tags that store the novel's chapters. Everything we need lives in <a> tags; the parent of each <a> is a <dd>, the parent of <dd> is a <dl>, and the parent of <dl> is a <div> carrying class and id attributes.
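To make that nesting concrete, here is a hedged toy fragment in the assumed shape, together with the CSS selector the extraction code below uses to pick out the <a> tags:

from bs4 import BeautifulSoup

# A simplified stand-in for the real chapter list: div > dl > dd > a.
html = '''
<div id="list">
  <dl>
    <dd><a href="/14719/1.html">Chapter 1</a></dd>
    <dd><a href="/14719/2.html">Chapter 2</a></dd>
  </dl>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# '#list dd a' means: every <a> inside a <dd> inside the element with id="list".
for a in soup.select('#list dd a'):
    print(a.text, a.get('href'))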
6. Data extraction
response = requests.get(url=url, headers=headers)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# Grab every chapter link: <a> tags inside <dd> inside the element with id="list".
novel_lists = soup.select('#list dd a')
# Per the page structure, the first 12 entries duplicate later chapters,
# so a slice filters them out.
novel_list = novel_lists[12:]
We have located the <dd> tags that store the chapter entries; the chapter names and links we need are all inside them, so the next step is extraction. soup.select() returns a list, and we just iterate over it to pull out what we need.
for chapter in novel_list:
    # The link text is the chapter title; href is the chapter's relative URL.
    novel_name = chapter.text
    novel_url = url + chapter.get('href')
Open a chapter link and analyze its structure the same way. The result: the chapter text we need sits in a div tag with the attribute id="content", so we extract once more:
response_1 = requests.get(url=novel_url, headers=headers)
html_1 = response_1.text
soup_1 = BeautifulSoup(html_1, 'html.parser')
novel_content = soup_1.find('div', id='content').text
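One defensive note: find() returns None when no matching tag exists, so a missing or malformed chapter page would make .text raise an AttributeError. A hedged variant that skips such pages:

content_div = soup_1.find('div', id='content')
if content_div is not None:
    novel_content = content_div.text
else:
    novel_content = ''  # page layout differs or request failed; skip or retry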
7. Write to file
# Append each chapter to the output file; mode 'a' keeps earlier chapters intact.
with open(path, 'a', encoding='utf-8') as file:
    file.write(novel_name)
    file.write(novel_content)
print(novel_name + ' download completed')
# After the loop over all chapters has finished:
print('Download complete!!')
8. Running results
9. Complete code
import os
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
dir_path = 'D:/crawler/novel/'
path = dir_path + 'Luper.txt'
# Create the target directory (not the file) if it does not exist yet.
if not os.path.exists(dir_path):
    os.makedirs(dir_path)
url = 'https://www.xbiquge.cc/book/14719/'
response = requests.get(url=url, headers=headers)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
novel_lists = soup.select('#list dd a')
# The first 12 entries duplicate later chapters; slice them off.
novel_list = novel_lists[12:]
for chapter in novel_list:
    novel_name = chapter.text
    novel_url = url + chapter.get('href')
    response_1 = requests.get(url=novel_url, headers=headers)
    html_1 = response_1.text
    soup_1 = BeautifulSoup(html_1, 'html.parser')
    novel_content = soup_1.find('div', id='content').text
    # Append this chapter to the output file.
    with open(path, 'a', encoding='utf-8') as file:
        file.write(novel_name)
        file.write(novel_content)
    print(novel_name + ' download completed')
print('Download complete!!')
Conclusion
Along the way I ran into plenty of problems, but there were also plenty of resources to learn from and refer to. Writing study notes from time to time makes understanding and memorizing the material more effective. Where this falls short, I welcome corrections from more experienced readers, so please don't hold back your advice!