Crawling a novel site with Python, part 2

Using regular expressions

The re.compile function

The compile function compiles a regular expression into a regular expression (Pattern) object, whose match() and search() methods then perform the matching.

The syntax format is:

 re.compile(pattern[, flags])

Parameters:

pattern: the regular expression, as a string.
flags: optional matching mode, such as ignoring case or multi-line mode. The available values are:
1. re.I ignores case.
2. re.L makes the special character classes \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3 it is only valid with bytes patterns).
3. re.M enables multi-line mode, in which ^ and $ also match at the start and end of every line.
4. re.S makes . match any character, including newline (by default . does not match newline).
5. re.U makes \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character properties database (in Python 3 this is already the default for str patterns).
6. re.X ignores whitespace and # comments inside the pattern, for readability.
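
The flags above can be tried on small strings. The following is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the crawler.

```python
import re

# re.I: case-insensitive matching
ignore_case = re.compile(r"python", re.I)
case_hits = ignore_case.findall("Python, PYTHON, python")

# re.M: ^ and $ also match at the start and end of every line
line_starts = re.compile(r"^\w+", re.M)
first_words = line_starts.findall("alpha one\nbeta two")

# re.S: . also matches newline characters
dot_all = re.compile(r"<p>(.+)</p>", re.S)
paragraph = dot_all.findall("<p>line 1\nline 2</p>")

# re.X: whitespace and # comments inside the pattern are ignored
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)
date_parts = verbose.findall("released 2021-07")

# Several flags are combined with the bitwise OR operator
combined = re.compile(r"^chapter", re.I | re.M)
chapter_hits = combined.findall("Chapter 1\nCHAPTER 2")

print(case_hits, first_words, paragraph, date_parts, chapter_hits)
```

Compiling once and reusing the Pattern object is mostly a readability choice here: the pattern gets a name and can be reused for every page that is downloaded.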

findall

Finds all substrings matched by the regular expression in the string and returns a list, or an empty list if no matches are found.

Note: match() and search() return at most one match; findall() returns all of them.
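
A small sketch of that difference, using a made-up input string:

```python
import re

pattern = re.compile(r"\d+")
text = "ch12 and ch34"

at_start = pattern.match(text)            # match() only tries at the start of the string
first = pattern.search(text)              # search() returns the first match anywhere
everything = pattern.findall(text)        # findall() returns every match as a list
nothing = pattern.findall("no digits")    # no match gives an empty list, not None

print(at_start, first.group(), everything, nothing)
```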

The syntax format is:

 findall(string[, pos[, endpos]])

Parameters:

string: the string to search.
pos: optional; the index in the string where the search starts. Defaults to 0.
endpos: optional; the index in the string where the search stops. Defaults to the length of the string.
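
As an illustration of pos and endpos, here is a short sketch on a made-up string. Note that these two arguments are accepted by the findall() method of a compiled pattern; the module-level re.findall() takes flags instead.

```python
import re

pattern = re.compile(r"\d+")
text = "a1 b22 c333"

whole = pattern.findall(text)           # search the entire string
from_pos = pattern.findall(text, 3)     # start at index 3, i.e. "b22 c333"
window = pattern.findall(text, 3, 6)    # only indexes 3..5, i.e. "b22"
cut_off = pattern.findall(text, 0, 9)   # stop before index 9, i.e. "a1 b22 c3"

print(whole, from_pos, window, cut_off)
```

The last call shows that endpos behaves as if the string really ended there: the final number is cut down to "3".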

Python crawler for a novel site: downloading a novel (regular expressions)

Approach:

1. Pick a novel to download from the site's home page and open the page source to analyze it (example: www.kanunu8.com/files/old/2…).
2. Work out what you want to extract, starting with the URLs. Only the last part of each chapter URL changes, so first extract each chapter's relative path and then join it to the base URL to get the URL of every chapter.
3. Fetch the content of each chapter and clean it up.
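
The URL-building step can be tried offline before touching the network. The sketch below reuses the link pattern from the source code further down; the HTML fragment and the chapter titles in it are hypothetical, written only to fit that pattern, and may not match the real page exactly.

```python
import re

base_url = "https://www.kanunu8.com/book4/10509/"

# Hypothetical fragment of the index page (an assumption, not copied from the site)
sample_html = (
    '<tr><td width="25%"><a href="184612.html">Chapter 1</a></td>\n'
    '<td><a href="184613.html">Chapter 2</a></td></tr>'
)

# Group 1: optional width attribute, group 2: relative path, group 3: chapter title
link_pattern = re.compile(r'<td( width="25%")?><a href="(.+\.html)">(.+)</a></td>')

# Join every relative path to the base URL
chapters = [
    [title, base_url + href]
    for _width, href, title in link_pattern.findall(sample_html)
]

for title, chapter_url in chapters:
    print(title, chapter_url)
```

Because .+ is greedy and does not cross newlines, this pattern assumes that each link sits on its own line of the page source; if several links shared a line, a non-greedy .+? would be the safer choice.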

The source code

 

    import re
    import requests

    # Index page of the novel to crawl
    url = 'https://www.kanunu8.com/book4/10509/'

    # Fetch the raw bytes first, then decode them as GBK.
    # .content is binary, e.g. ...<head>\r\n<title>\xd6\xd0\xb9\xfa\xba\xcf\xbb\xef\xc8\xcb
    txt = requests.get(url).content.decode('gbk')
    # print(txt)

    # Heading in the centered <strong> cell (only used for inspection)
    m1 = re.compile(r'<td colspan="4" align="center"><strong>(.+)</strong>')
    # print(m1.findall(txt))

    # Chapter links: i[1] is the relative path, i[2] is the chapter title
    m2 = re.compile(r'<td( width="25%")?><a href="(.+\.html)">(.+)</a></td>')
    raw = m2.findall(txt)
    print(raw)

    # Build a [chapter title, chapter URL] pair for every chapter
    sanguo = []
    for i in raw:
        # e.g. [<chapter title>, 'https://www.kanunu8.com/book4/10509/184616.html']
        print([i[2], url + i[1]])
        sanguo.append([i[2], url + i[1]])

    print("*" * 100)
    print(sanguo)
    # Prints 21 [title, URL] pairs (20 chapters plus the epilogue), with URLs
    # running from .../184612.html to .../184632.html

    # The body text of each chapter sits inside a <p> tag
    m3 = re.compile(r'<p>(.+)</p>', re.S)
    # <br /> is replaced with nothing
    m4 = re.compile(r'<br />')
    # &nbsp; is also replaced with nothing
    m5 = re.compile(r'&nbsp;')

    # Append every chapter to one text file (UTF-8 so it is portable)
    with open('sanguo.txt', 'a', encoding='utf-8') as f:
        for i in sanguo:
            i_url = i[1]
            print("downloading -->%s" % i[0])
            r_nr = requests.get(i_url).content.decode('gbk')
            n_nr = m3.findall(r_nr)
            if not n_nr:
                # No <p> block found on this page; skip it
                continue
            n = m4.sub('', n_nr[0])
            n2 = m5.sub('', n)
            n2 = n2.replace('\n', '')
            # i[0] is the chapter title
            f.write('\n' + i[0] + '\n')
            f.write(n2)
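
The cleanup step of the script can also be isolated and checked without downloading anything. This is a sketch under assumptions: the function name clean_chapter and the sample page are made up for illustration, while the three patterns are the ones used in the source code.

```python
import re

body_pattern = re.compile(r"<p>(.+)</p>", re.S)   # chapter body inside <p>...</p>
br_pattern = re.compile(r"<br />")                # line-break tags
nbsp_pattern = re.compile(r"&nbsp;")              # non-breaking space entities


def clean_chapter(html):
    """Return the chapter text found in html, or '' if there is no <p> block."""
    bodies = body_pattern.findall(html)
    if not bodies:
        return ""
    text = br_pattern.sub("", bodies[0])
    text = nbsp_pattern.sub("", text)
    return text.replace("\n", "")


# Hypothetical chapter page, written only to exercise the function
sample_page = "<html><p>&nbsp;&nbsp;First line.<br />\nSecond line.<br />\n</p></html>"
cleaned = clean_chapter(sample_page)
print(cleaned)
```

Like the script, this removes every newline, so paragraphs run together in the output; replacing <br /> with "\n" instead would keep the line structure.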
