Preface
I started learning Python this semester, and I think simplicity and ease of use are its biggest advantages: the code format is not overly strict, and the language is very approachable. What interests me most in Python is web crawling, because crawling lets me obtain more data, which in turn enables better data analysis and more value.
Ideas for crawling the novel
Open the page with F12 and check the Elements panel: the article content is stored directly in the div with id="content", which indicates that the site is static. If the page were rendered dynamically, we would need Selenium to crawl it; since the site is static, there is no need for a dynamic crawling method.
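As a quick sanity check, you can fetch the raw HTML with requests (which executes no JavaScript) and look for the content container. A minimal sketch; the chapter URL below is a placeholder, substitute a real chapter page from the catalog:

import requests

# Placeholder chapter URL; replace with a real chapter page from the catalog.
chapter_url = 'https://www.xsbiquge.com/96_96293/xxxxx.html'

r = requests.get(chapter_url, timeout=10)
r.encoding = 'utf-8'

# If the div is already in the raw response, the page is static
# and plain requests are enough; no Selenium needed.
print('<div id="content">' in r.text)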
Then, after choosing the target novel, open its catalog page. Through the Elements panel you can observe that the URLs of all the novel's chapters follow a regular pattern.
So the plan is: first crawl and save the URLs of all chapters, then complete each URL into a full address, then visit each chapter to crawl its title and body content, and finally save everything to a TXT file.
Function module implementation
With the approach clear, we can implement the functionality step by step.
1. Import the requests library and the re library for matching and cleaning
First import the two libraries:

import requests
import re

The re module is Python's own string matching module. Many of its functions are based on regular expressions, which are used to fuzzy-match strings and extract the required parts. Note:

(1) the re module is specific to Python;
(2) regular expressions themselves can be used in virtually all programming languages;
(3) both the re module and regular expressions operate on strings.
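A minimal example of this kind of fuzzy matching (the HTML snippet is made up for illustration):

import re

html = '<a href="/96_96293/1.html">Chapter 1</a> <a href="/96_96293/2.html">Chapter 2</a>'

# (.*?) is a non-greedy capture group: it grabs as little as possible,
# so each link is matched separately rather than the whole string at once.
links = re.findall(r'<a href="(.*?)">', html)
print(links)  # ['/96_96293/1.html', '/96_96293/2.html']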
2. Send a request to the target website's URL
s = requests.Session()
url = 'https://www.xsbiquge.com/96_96293/'
html = s.get(url)
html.encoding = 'utf-8'
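One practical note, not part of the original code: some sites reject the default python-requests User-Agent. If the request comes back with an error page, attaching a browser-like header to the session usually helps (the header value below is just an illustrative example):

import requests

s = requests.Session()
# Illustrative browser-like User-Agent; any realistic value works.
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})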
3. Find the URLs of all chapters on the site's catalog page
caption_title_1 = re.findall(r'<a href="(/96_96293/.*?\.html)">.*?</a>', html.text)
4. Complete the chapter URLs so each one can be requested directly
for i in caption_title_1:
    caption_title_1 = 'https://www.xsbiquge.com' + i
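The string concatenation works here because every chapter link shares the same prefix, but the standard library's urllib.parse.urljoin handles relative links more robustly. A small sketch of the same step (the sample links mimic what the re.findall above returns):

from urllib.parse import urljoin

base = 'https://www.xsbiquge.com/96_96293/'
links = ['/96_96293/1.html', '/96_96293/2.html']  # sample output of the re.findall above

chapter_urls = [urljoin(base, link) for link in links]
print(chapter_urls[0])  # https://www.xsbiquge.com/96_96293/1.html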
5. Access each URL to find the title and body content
s1 = requests.Session()
r1 = s1.get(caption_title_1)
r1.encoding = 'utf-8'

# The chapter title is stored in the page's keywords meta tag
name = re.findall(r'<meta name="keywords" content="(.*?)"/>', r1.text)[0]
print(name)  # print the title to watch the progress

# re.S lets the pattern match across line breaks, so the whole body is captured
chapters = re.findall(r'<div id="content">(.*?)</div>', r1.text, re.S)[0]
6. Clean the obtained text
# Remove ad scripts and leftover comment markers from the body
chapters = chapters.replace(' ', '')
chapters = chapters.replace('readx();', '')
chapters = chapters.replace('&lt;!--go--&gt;', '')
chapters = chapters.replace('<!--go-->', '')
chapters = chapters.replace('()', '')

# Turn <br/> into real line breaks, then strip every remaining tag
s_replace = str(chapters).replace('<br/>', "\n")
while True:
    index_begin = s_replace.find("<")
    index_end = s_replace.find(">", index_begin + 1)
    if index_begin == -1:
        break
    s_replace = s_replace.replace(s_replace[index_begin:index_end + 1], "")

# Replace the &nbsp; entity; re.I makes the match case-insensitive
pattern = re.compile(r'&nbsp;', re.I)
fiction = pattern.sub(' ', s_replace)
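The while loop strips the leftover tags one at a time. An alternative, shown here as a sketch on made-up input, is to do the same cleanup in one pass with re.sub:

import re

raw = 'First line<br/>&nbsp;&nbsp;Second line<script>readx();</script>'

text = raw.replace('<br/>', '\n')    # keep paragraph breaks
text = re.sub(r'<[^>]+>', '', text)  # remove every remaining tag in one pass
text = text.replace('&nbsp;', ' ')   # decode the non-breaking-space entity
print(text)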
7. Save the data to the preset TXT file
path = r'F:\title.txt'  # change the path as needed; 'a' opens in append mode
file_name = open(path, 'a', encoding='utf-8')
file_name.write(name)
file_name.write('\n')
file_name.write(fiction)
file_name.write('\n')
file_name.close()  # close the file
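A small robustness suggestion of my own, not in the original: opening the file in a with block guarantees it is closed even if the crawl raises an exception partway through:

# name and fiction stand in for the values produced in steps 5 and 6.
name = 'Chapter title'
fiction = 'Cleaned chapter body'

path = r'F:\title.txt'
with open(path, 'a', encoding='utf-8') as f:  # 'a' appends to the file
    f.write(name + '\n')
    f.write(fiction + '\n')
# No explicit close() needed; the with block handles it.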
The results
Program source code
import requests
import re

s = requests.Session()
url = 'https://www.xsbiquge.com/96_96293/'
html = s.get(url)
html.encoding = 'utf-8'

# Get the chapter links from the catalog page
caption_title_1 = re.findall(r'<a href="(/96_96293/.*?\.html)">.*?</a>', html.text)

# Output file; you can change the path ('a' means append)
path = r'F:\title.txt'
file_name = open(path, 'a', encoding='utf-8')

# Loop over and download every chapter
for i in caption_title_1:
    caption_title_1 = 'https://www.xsbiquge.com' + i
    s1 = requests.Session()
    r1 = s1.get(caption_title_1)
    r1.encoding = 'utf-8'

    # Chapter title from the keywords meta tag
    name = re.findall(r'<meta name="keywords" content="(.*?)"/>', r1.text)[0]
    print(name)
    file_name.write(name)
    file_name.write('\n')

    # Without re.S the pattern is matched line by line and restarts on each
    # new line; with re.S the string is treated as a whole, so the body that
    # spans many lines can be captured in one match.
    chapters = re.findall(r'<div id="content">(.*?)</div>', r1.text, re.S)[0]

    # Clean the body text
    chapters = chapters.replace(' ', '')
    chapters = chapters.replace('readx();', '')
    chapters = chapters.replace('&lt;!--go--&gt;', '')
    chapters = chapters.replace('<!--go-->', '')
    chapters = chapters.replace('()', '')

    s_replace = str(chapters).replace('<br/>', "\n")
    while True:
        index_begin = s_replace.find("<")
        index_end = s_replace.find(">", index_begin + 1)
        if index_begin == -1:
            break
        s_replace = s_replace.replace(s_replace[index_begin:index_end + 1], "")

    pattern = re.compile(r'&nbsp;', re.I)  # case-insensitive match
    fiction = pattern.sub(' ', s_replace)
    file_name.write(fiction)
    file_name.write('\n')

file_name.close()
Conclusion
Through this project, I have benefited a great deal from designing and implementing the system's functional modules, and it has improved my self-study ability, especially in data mining and data analysis.
The program implements crawling a novel and cleaning the scraped data. But it only crawls a single novel; if you wrap the functional block in a larger loop, you can collect the URLs of all the novels on the site, as sketched below. These are my thoughts and ideas; if you have other ideas, feel free to share them in the comments.
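For reference, a rough sketch of that larger loop, assuming the site has an index page listing the novels; the index URL and the link pattern below are hypothetical and must be read off the real site's HTML with F12:

import re
import requests

s = requests.Session()

# Hypothetical index page; inspect the real site to find the actual
# listing page and the pattern its novel links follow.
index = s.get('https://www.xsbiquge.com/')
index.encoding = 'utf-8'
novel_links = re.findall(r'<a href="(/\d+_\d+/)"', index.text)

for link in novel_links:
    catalog_url = 'https://www.xsbiquge.com' + link
    # ...reuse the chapter-crawling code above on each catalog_url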