Preface
Using urllib and re regular expressions to get the movie detail-page links
For the most part, Python’s built-in urllib and re modules are good enough for everyday needs, but why settle when something better is within reach? Hence requests and BeautifulSoup, the golden combination for crawlers. That is the beauty of Python: you can find plenty of third-party libraries so easy to use that they make your heart flutter.
One. Installation and quick start
1. Installation
Both can be installed easily with pip:
pip install requests
pip install beautifulsoup4
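To confirm the installation worked, importing both packages and printing their versions is enough; a quick, optional check:

import requests, bs4
print(requests.__version__)
print(bs4.__version__)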
2. Getting started with requests
import requests
## GET request
r = requests.get('https://github.com/timeline.json')
r.json()    ## JSON response content: r.json() automatically converts the JSON result to a dict
r.content   ## binary response content
headers = {'user-agent': 'my-app/0.0.1'}   ## custom request headers
r = requests.get('https://github.com/timeline.json', headers=headers)
payload = {'key1': 'value1', 'key2': 'value2'}   ## pass URL parameters
r = requests.get("http://httpbin.org/get", params=payload)

## POST request
payload = {'key1': 'value1', 'key2': 'value2'}   ## POST data
r = requests.post("http://httpbin.org/post", data=payload)

## Upload files
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
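As a quick sanity check, the pieces above can be combined into one small, self-contained GET request with a custom header, URL parameters, and a timeout. This is only an illustrative sketch against httpbin.org; the header value and the 5-second timeout are arbitrary choices, not part of the original example.

import requests

headers = {'user-agent': 'my-app/0.0.1'}           # custom request header
payload = {'key1': 'value1', 'key2': 'value2'}     # URL parameters

# a timeout keeps the script from hanging forever on a dead server
r = requests.get('http://httpbin.org/get', headers=headers, params=payload, timeout=5)

print(r.status_code)      # 200 on success
print(r.json()['args'])   # httpbin echoes the query parameters back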
For more details, check the official Chinese documentation; it is easy to understand and authoritative.
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
3. Getting started with BeautifulSoup
BeautifulSoup gives you a quick way to locate elements in an HTML document. HTML is the hypertext markup language used to describe web pages, not a programming language; if you have no HTML background, spend a day learning the basics. Beginner tutorial: HTML
http://www.runoob.com/html/html-tutorial.html
Points to note
- If you know HTML, you know that a page is built from layer upon layer of tags: each tag node has its own class (or other) attributes, its own parent tag, child tags and sibling tags, and BeautifulSoup uses exactly these relationships. It lets you easily pinch out the murderer, oh no, the target node, sparing you the tedious nightmare of writing regular expressions.
- BeautifulSoup relies on third-party parsers to interpret HTML, including html.parser, lxml, lxml-xml and html5lib, each with its own strengths and weaknesses. In my experience, the soup BeautifulSoup returns is sometimes missing tag nodes; if you are sure the content is not loaded dynamically, it means the parser could not interpret those tags and simply skipped them. In that case, switch to a different parser and try again.
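If you suspect the parser is dropping nodes, a quick way to check is to build the same soup with several parsers and compare what each one sees. A minimal sketch, reusing the Douban homepage from the example below (html5lib must be installed separately with pip install html5lib):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.douban.com')
# build the same soup with different parsers and compare a rough node count
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(r.content, parser)
    print(parser, len(soup.find_all('a')))   # how many <a> nodes each parser sees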
Common methods
I will only use find and find_all as examples here; for anything else, refer to the official documentation. After all, simply porting the docs over is tiring and pointless.
http://beautifulsoup.readthedocs.io/zh_CN/latest/
import requests
from bs4 import BeautifulSoup

## fetch the page with requests
r = requests.get('http://www.douban.com')
## a pot of soup to be processed
soup = BeautifulSoup(r.content, 'lxml')   ## use the lxml parser
print(soup)
That prints the soup we will work with; next we locate the target <a> block (the one in the red box of the screenshot).
Locate a node by its attributes (find)
a = soup.find('a', attrs={'class': 'lnk-book'})
print(a)
print('Link: ' + a['href'])
print('Text: ' + a.text)
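As an aside, the same node can usually be reached with the class_ keyword argument or a CSS selector, which are often terser than the attrs dict; both lines below are meant as equivalents of the find call above (assuming the lnk-book class is present on the page):

a = soup.find('a', class_='lnk-book')   # class_ keyword instead of attrs
a = soup.select_one('a.lnk-book')       # CSS selector via select_one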
Return a list of all matching nodes (find_all)
a_s = soup.find_all('a')
print(a_s)
Tip: sometimes you need to narrow the search range layer by layer to reach the target node easily; see the sketch below.
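A minimal sketch of that layer-by-layer narrowing; the container class name here is purely hypothetical and only illustrates the pattern of searching inside a previously found tag:

## first narrow down to a container div, then search only inside it
container = soup.find('div', attrs={'class': 'side-links'})   # hypothetical class name
if container is not None:
    links = container.find_all('a')                           # search only within the container
    print([a.get('href') for a in links])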
Two. Crawling the Douban Books Top 250
Analyzing the pages
1. Clicking the page numbers at the bottom, we see that the start parameter in each page's URL is a multiple of 25, starting from 0, so we can construct the URL of every page (see the sketch after this list).
2. On each listing page, we locate the book entries and extract the URL of each book's detail page; the figure shows the detail-page URLs crawled from one page.
3. On the detail page, we locate the elements holding the book's title, rating, and number of ratings, and extract them.
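A quick sketch of point 1: building the ten listing-page URLs from the start parameter.

# start goes 0, 25, 50, ..., 225 for the ten pages of the Top 250
page_urls = ['https://book.douban.com/top250?start=%d' % (25 * i) for i in range(10)]
print(page_urls[:3])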
The code
# -*- coding:utf-8 -*-
#author:waiwen
#email:[email protected]
#time: 2017/12/3 12:27
from bs4 import BeautifulSoup
import requests
import random
# user-agent pool; one is picked at random to reduce the chance of being banned
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
# consolidated request code for fetching a page
def get_response(url):
    # pick a request header at random from the pool
    headers = {'user-agent': random.choice(USER_AGENT_LIST)}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, 'lxml')
    return soup
# Find links to each book
def get_book_url(page):
    if page > 10:
        return []
    num = (page - 1) * 25
    url = 'https://book.douban.com/top250?start=%s' % str(num)
    soup = get_response(url)
    book_div = soup.find('div', attrs={'class': 'indent'})
    books = book_div.find_all('tr', attrs={'class': 'item'})
    urls = [book.td.a['href'] for book in books]
    print('Got page %s' % page, urls)
    return urls
# Get information about each book
def get_book_info(book_url):
    soup = get_response(book_url)

    div_info = soup.find('div', attrs={'id': 'info'})
    book_author = div_info.a.text.split(' ')[-1]   # strip the leading label and whitespace
    book = soup.find('div', attrs={'class': 'rating_wrap clearbox'})
    book_name = soup.find('span', attrs={'property': 'v:itemreviewed'}).text

    book_grade = book.find('strong', attrs={'class': 'll rating_num '}).text
    book_man = book.find('a', attrs={'class': 'rating_people'}).span.text

    book_info = {}
    book_info['name'] = book_name
    book_info['author'] = book_author
    book_info['rating_num'] = int(book_man)
    book_info['grade'] = float(book_grade)
    print(book_info)
    return book_info
if __name__ == '__main__':

    all_urls = []
    # crawl pages 1 to 10 and splice the links together
    for page in range(1, 11):
        urls = get_book_url(page)
        all_urls = all_urls + urls
    print('Number of links obtained:', len(all_urls))

    out = ''
    for url in all_urls:
        try:
            info = get_book_info(url)
        except Exception as e:
            print(e)
            continue
        out = out + str(info) + '\n'

    with open('douban_book_top250.txt', 'w') as f:   # write the output to a txt file
        f.write(out)
Conclusion
Parse the page, find the tag node of the target element, note its attributes, locate that region with BeautifulSoup's find and find_all, and extract the data accordingly. One thing worth pointing out: crawling 250 book pages in a single blocking thread is slow; I plan to speed it up with multiple threads later, so please look forward to it.
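For the curious, here is a minimal sketch of what that multi-threaded improvement might look like, using concurrent.futures.ThreadPoolExecutor on top of the get_book_info function above; this is only an assumed direction, not code from this post:

from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_all(urls, max_workers=8):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(get_book_info, url): url for url in urls}
        for future in as_completed(futures):
            try:
                results.append(future.result())     # collect each book's info dict
            except Exception as e:                  # keep going even if one page fails
                print(futures[future], e)
    return results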