Preface:
This article is beginner-friendly: even with zero background you can pick it up quickly. If you have any questions, please leave a comment and I will reply as soon as possible. This article is hosted on GitHub.
I. Why crawlers matter
If the Internet is a spider's web, then a crawler is the spider that moves across it. A web spider finds pages through their link addresses: it starts from one page of a site (usually the home page), reads its content, finds the other links in that page, follows those links to the next pages, and repeats the cycle until it has crawled every page of the site.
1. Before buying a house in Beijing, when housing prices were starting to soar, Lianjia and similar sites only exposed a small slice of their price data, far from meeting my needs. So I spent a few evening hours writing a crawler that scraped the information of every residential community in Beijing, together with all of their historical transaction records.
2. My wife is a salesperson at an Internet company. She needs to collect all kinds of company information and make cold calls. So I wrote a scraping script that grabbed a pile of data for her to use, while her colleagues kept sorting out the same data by hand until midnight every day.
II. Practice: scraping movie detail pages from Movie Heaven
1. Analyze the page and crawl the detail-page URLs from the first page
Open the latest-movies listing on Movie Heaven. You can see that the URL of the first page is www.ygdy8.net/html/gndy/dyzz/list_23_1.html, the second page is www.ygdy8.net/html/gndy/dyzz/list_23_2.html, and pages 3 and 4 follow the same pattern: only the number before .html changes.
from lxml import etree
import requests
url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
response = requests.get(url, headers=headers)
# response.text lets requests guess the encoding, but here the guess is wrong and
# the output is garbled. So we go the other way: take the raw bytes in
# response.content and decode them ourselves with the encoding the site uses (GBK).
# print(response.text)
# print(response.content.decode('gbk'))
text = response.content.decode(encoding='gbk', errors='ignore')
print(text)
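Why does response.text come out garbled? requests guesses the encoding from the HTTP headers, while the page is actually GBK-encoded. A minimal sketch for inspecting both guesses (these attributes come from the requests library itself):

response = requests.get(url, headers=headers)
print(response.encoding)           # the encoding requests guessed from the headers
print(response.apparent_encoding)  # the encoding detected from the bytes themselves
# overriding the wrong guess makes response.text usable too:
response.encoding = 'gbk'
print(response.text[:200])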
Taking the first page as an example, it prints data like this:
Analyzing the Movie Heaven HTML source shows that each table tag holds exactly one movie.
Use XPath to extract the detail-page URL of each movie:
html = etree.HTML(text)
detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
for detail_url in detail_urls:
    print(detail_url)  # these are relative paths; the domain still needs to be prepended
The results are as follows:
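The hrefs above are site-relative paths. Besides plain string concatenation (used in step 3 below), the standard library's urllib.parse.urljoin is a robust way to absolutize them; a small sketch:

from urllib.parse import urljoin

BASE_DOMAIN = 'http://www.ygdy8.net'
detail_urls = [urljoin(BASE_DOMAIN, u) for u in detail_urls]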
2. Clean up the code and crawl the movie-list URLs of the first 7 pages
from lxml import etree
import requests
# the domain name
BASE_DOMAIN = 'http://www.ygdy8.net'
# url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
def spider():
    base_url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    for x in range(1, 8):
        url = base_url.format(x)
        print(url)  # the movie-list URL of each page, e.g. http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html
if __name__ == '__main__':
    spider()
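A quick note on range(1, 8): the stop value is exclusive, so the loop visits pages 1 through 7, which is why the heading says the first 7 pages:

print(list(range(1, 8)))  # [1, 2, 3, 4, 5, 6, 7]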
3. Crawl the detail-page address of each movie
def get_detail_urls(url):
    response = requests.get(url, headers=HEADERS)
    # response.text lets requests guess the encoding; the guess is wrong here,
    # so decode the raw bytes (response.content) with GBK ourselves
    # print(response.text)
    # print(response.content.decode('gbk'))
    text = response.content.decode(encoding='gbk', errors='ignore')
    # get the detail-page URL of each movie via XPath
    html = etree.HTML(text)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    # prepend the domain to every URL in the list
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)
    # the map call above replaces an explicit loop like this one:
    # def abc(url):
    #     return BASE_DOMAIN + url
    # for index, detail_url in enumerate(detail_urls):
    #     detail_urls[index] = abc(detail_url)
    return detail_urls
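One caveat: in Python 3, map() returns a lazy iterator rather than a list, so the result can be consumed only once and cannot be indexed. If the caller needs a real list, convert it explicitly; a list comprehension is the more idiomatic spelling:

detail_urls = list(map(lambda u: BASE_DOMAIN + u, detail_urls))
# or, equivalently and more idiomatically:
detail_urls = [BASE_DOMAIN + u for u in detail_urls]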
4. Grab the data of the movie details page
# Parse the details page
def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=HEADERS)
    text = response.content.decode('gbk', errors='ignore')
    html = etree.HTML(text)
    # title = html.xpath("//div[@class='title_all']//font[@color='#07519a']")
    # print(title)
    # the XPath above returns element objects; to display them we would have to
    # serialize each one with an encoding:
    # for x in title:
    #     print(etree.tostring(x, encoding='utf-8').decode('utf-8'))
    # we only want the text, so add /text() to the XPath instead:
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title
    zoomE = html.xpath("//div[@id='Zoom']")[0]  # locate the common parent container once; all later lookups start from it
    imgs = zoomE.xpath(".//img/@src")  # the poster and the screenshot
    cover = imgs[0]
    movie['cover'] = cover
    if len(imgs) > 1:
        screenshot = imgs[1]
        movie['screenshot'] = screenshot
    infos = zoomE.xpath(".//text()")  # all text nodes under the container; the field labels live here
    for index, info in enumerate(infos):
        # each metadata field starts with a full-width label such as ◎年　代 (year)
        if info.startswith('◎年　代'):
            info = info.replace('◎年　代', '').strip()  # strip() removes the surrounding whitespace
            movie['year'] = info
        elif info.startswith('◎产　地'):
            info = info.replace('◎产　地', '').strip()
            movie['country'] = info
        elif info.startswith('◎类　别'):
            info = info.replace('◎类　别', '').strip()
            movie['category'] = info
        elif info.startswith('◎豆瓣评分'):
            info = info.replace('◎豆瓣评分', '').strip()
            movie['douban_rating'] = info
        elif info.startswith('◎片　长'):
            info = info.replace('◎片　长', '').strip()
            movie['duration'] = info
        elif info.startswith('◎导　演'):
            info = info.replace('◎导　演', '').strip()
            movie['director'] = info
        elif info.startswith('◎主　演'):
            # there are many leading actors and, because of how Movie Heaven lays
            # them out, each one is a separate text node, so walk the following
            # nodes and collect every actor
            actors = []
            actor = info.replace('◎主　演', '').strip()
            actors.append(actor)
            for x in range(index + 1, len(infos)):  # start right after the ◎主　演 node
                actor = infos[x].strip()
                if actor.startswith('◎'):  # the next ◎ label ends the actor list
                    break
                actors.append(actor)
            movie['actor'] = actors
        elif info.startswith('◎简　介'):
            # the synopsis sits in the nodes between this label and ◎获奖情况
            for x in range(index + 1, len(infos)):
                if infos[x].startswith('◎获奖情况'):
                    break
                profile = infos[x].strip()
                movie['profile'] = profile
        elif info.startswith('◎获奖情况'):
            awards = []
            for x in range(index + 1, len(infos)):
                if infos[x].startswith('【下载地址】'):
                    break
                award = infos[x].strip()
                awards.append(award)
            movie['awards'] = awards
    download_url = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")[0]
    movie['download_url'] = download_url
    return movie
The code above scrapes the data of every single movie. I have saved the page HTML as it was at the time of writing, as "movie.html", and put it on GitHub so readers can compare the page format.
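The article stops at the parsing function, so here is a minimal sketch of how the pieces could be wired together; the loop structure is an assumption, but it only calls the functions defined above:

def spider():
    base_url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    movies = []
    for x in range(1, 8):
        url = base_url.format(x)          # the movie-list URL of page x
        for detail_url in get_detail_urls(url):
            movie = parse_detail_page(detail_url)
            movies.append(movie)
            print(movie)

if __name__ == '__main__':
    spider()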
Final result:
For more exciting content, follow my WeChat official account "BigDeveloper".