Using regular expressions
The re.compile function
The compile function compiles a regular expression into a regular expression (Pattern) object, whose methods such as match(), search(), and findall() then perform the matching.
The syntax format is:
re.compile(pattern[, flags])
Parameters:
- pattern: a regular expression in string form.
- flags: optional. The matching mode, such as case-insensitive or multi-line matching. Available flags include:
  - re.I: ignores case.
  - re.L: makes the special character classes \w, \W, \b, \B, \s, \S depend on the current locale.
  - re.M: multi-line mode.
  - re.S: makes . match any character, including a newline (by default . does not match a newline).
  - re.U: makes the special character classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database (already the default for str patterns in Python 3).
  - re.X: verbose mode; whitespace in the pattern and comments after # are ignored, which makes patterns easier to read.
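As a rough illustration of the flags listed above, the following sketch compiles a few small patterns and compares their results. The sample strings and variable names are made up for the example.

```python
import re

text = "Hello\nhello\nHELLO world"

# re.I: case-insensitive matching
ci = re.compile(r"hello", re.I)
ci_matches = ci.findall(text)

# re.M: ^ and $ also match at the start and end of each line
ml = re.compile(r"^hello$", re.I | re.M)
ml_matches = ml.findall(text)

# re.S: . also matches a newline
dot_default = re.compile(r"a.b")
dot_all = re.compile(r"a.b", re.S)
default_hit = dot_default.search("a\nb")   # no match without re.S
dotall_hit = dot_all.search("a\nb")        # matches across the newline

# re.X: whitespace and # comments inside the pattern are ignored
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)
year_month = verbose.search("released 2021-07").groups()

print(ci_matches, ml_matches, default_hit, dotall_hit.group(), year_month)
```

Flags are combined with the | operator, as in re.I | re.M above.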
findall
Finds all substrings matched by the regular expression in the string and returns a list, or an empty list if no matches are found.
Note: match and search return only the first match; findall returns all matches.
The syntax format is:
findall(string[, pos[, endpos]])
Parameters:
- string: the string to search.
- pos: optional. The position in the string at which the search starts. Defaults to 0.
- endpos: optional. The position in the string at which the search stops. Defaults to the length of the string.
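A minimal sketch of how these parameters behave, using a made-up sample string. Note that pos and endpos belong to the findall method of a compiled pattern; the module-level re.findall(pattern, string) does not accept them. The last lines also show that a pattern with several groups yields a list of tuples, one element per group.

```python
import re

digits = re.compile(r"\d+")
s = "a1 b22 c333 d4444"

all_numbers = digits.findall(s)        # search the whole string
from_pos = digits.findall(s, 3)        # start searching at index 3
window = digits.findall(s, 3, 10)      # search only s[3:10]; "333" is cut short

# No match gives an empty list, not None
no_match = digits.findall("no digits here")

# With more than one group, each match becomes a tuple of the groups
pairs = re.compile(r"(\w+)=(\d+)").findall("x=1, y=20")

print(all_numbers, from_pos, window, no_match, pairs)
```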
Python crawler: downloading a novel from a novel site (regular expressions)
Ideas:
- Find a novel to download on the home page, then open the page source and analyze it (example: www.kanunu8.com/files/old/2…)
- Analyze the content you want to get. Start with the URLs: only the last part changes from chapter to chapter, so first extract each chapter's relative path, then join it with the base URL to form the URL of each chapter.
- Get the content of each chapter and clean it up.
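The URL-building step from the list above can be tried offline before touching the network. The sketch below is only an illustration: sample_html is a hypothetical fragment of a table-of-contents page, link_pattern is a simplified pattern chosen for the example, and urljoin from the standard library stands in for plain string concatenation.

```python
import re
from urllib.parse import urljoin

base_url = "https://www.kanunu8.com/book4/10509/"

# Hypothetical fragment of a table-of-contents page, for illustration only
sample_html = (
    '<td width="25%"><a href="184612.html">Chapter 1</a></td>\n'
    '<td><a href="184613.html">Chapter 2</a></td>\n'
)

# Group 1 is the relative path, group 2 is the chapter name
link_pattern = re.compile(r'<a href="([^"]+\.html)">([^<]+)</a>')

# Join every relative path with the base URL
chapters = [
    (title, urljoin(base_url, href))
    for href, title in link_pattern.findall(sample_html)
]

for title, chapter_url in chapters:
    print(title, chapter_url)
```

Using [^"]+ and [^<]+ instead of .+ keeps each match inside a single link even when several links share one line.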
The source code
import re
import requests

# Site to crawl (the novel's table-of-contents page)
url = 'https://www.kanunu8.com/book4/10509/'

# Fetch the raw bytes first, then decode them; the page is GBK-encoded.
# Undecoded, .content looks like: <head>\r\n<title>\xd6\xd0\xb9\xfa\xba\xcf\xbb\xef\xc8\xcb
txt = requests.get(url).content.decode('gbk')
# print(txt)

# Centered heading cell (presumably the novel title)
m1 = re.compile(r'<td colspan="4" align="center"><strong>(.+)</strong>')
# print(m1.findall(txt))

# Chapter links: group 2 is the relative path, group 3 is the chapter name
m2 = re.compile(r'<td( width="25%")?><a href="(.+\.html)">(.+)</a></td>')
raw = m2.findall(txt)
print(raw)

# Build the [chapter name, chapter URL] pair for each chapter
sanguo = []
for i in raw:
    print([i[2], url + i[1]])
    sanguo.append([i[2], url + i[1]])

print("*" * 100)
print(sanguo)
# Output is a list of [chapter name, URL] pairs, with URLs running from
# 'https://www.kanunu8.com/book4/10509/184612.html' to
# 'https://www.kanunu8.com/book4/10509/184632.html'

# The body of each chapter sits inside a <p> tag; re.S lets . match newlines
m3 = re.compile(r'<p>(.+)</p>', re.S)
# <br /> tags are replaced with nothing
m4 = re.compile(r'<br />')
# Assumed pattern: &nbsp; entities are replaced as well
m5 = re.compile(r'&nbsp;')

# Append every chapter to one text file (file name assumed: sanguo.txt)
with open('sanguo.txt', 'a', encoding='utf-8') as f:
    for i in sanguo:
        i_url = i[1]
        print("downloading -->%s" % i[0])
        r_nr = requests.get(i_url).content.decode('gbk')
        n_nr = m3.findall(r_nr)
        # print(n_nr)
        n = m4.sub('', n_nr[0])
        n2 = m5.sub(' ', n)
        # n2 = n2.replace('\n', '')
        # Write to the file; i[0] is the chapter name
        f.write('\n' + i[0] + '\n')
        f.write(n2)
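The cleanup step of the script can be checked without any network access. The sketch below is an illustration under assumptions: sample_page is a hypothetical chapter fragment, and clean_chapter is a helper name introduced here, not part of the original script. It also guards against pages where no <p> block is found, a case in which indexing the findall result directly would raise an IndexError.

```python
import re

# Hypothetical chapter page fragment, for illustration only
sample_page = (
    "<html><body><p>&nbsp;&nbsp;First line.<br />\n"
    "&nbsp;&nbsp;Second line.<br />\n</p></body></html>"
)

body_pattern = re.compile(r"<p>(.+)</p>", re.S)   # re.S: . also matches newlines
br_pattern = re.compile(r"<br />")
nbsp_pattern = re.compile(r"&nbsp;")


def clean_chapter(html):
    """Return the text inside <p>...</p> with <br /> removed and &nbsp; turned into spaces."""
    found = body_pattern.findall(html)
    if not found:
        # No <p> block on the page: return an empty string instead of failing
        return ""
    text = br_pattern.sub("", found[0])
    return nbsp_pattern.sub(" ", text)


cleaned = clean_chapter(sample_page)
print(repr(cleaned))
```

Because .+ is greedy, the pattern captures everything from the first <p> to the last </p> on the page, which suits pages whose whole body sits in one paragraph block.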