Preface:

This article is beginner-friendly: even with zero background you can pick it up quickly. If you have any questions, leave a comment and I will reply as soon as possible. This article is also hosted on GitHub.

I. Why crawlers matter

If the Internet is a spider's web, then a crawler is a spider moving across that web. A web spider finds pages through their link addresses: it starts from one page of a site (usually the home page), reads its content, extracts the other links on that page, follows those links to fetch more pages, and repeats the cycle until it has crawled every page of the site.
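The crawl loop described above can be sketched in a few lines. This is a minimal, hypothetical example: the `pages` dict stands in for a real website, mapping each page to the links it contains.

```python
from collections import deque

def crawl(pages, start):
    """Breadth-first crawl: visit start, then every page reachable from it."""
    visited = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        # "read the page" and queue up every link we have not seen yet
        for link in pages.get(page, []):
            if link not in visited:
                queue.append(link)
    return visited

# A toy site: the home page links to /a and /b, and /a links to /b again
site = {'/': ['/a', '/b'], '/a': ['/b'], '/b': []}
print(crawl(site, '/'))  # visits all three pages exactly once
```

A real crawler would replace the dict lookup with an HTTP request plus link extraction, but the loop structure stays the same.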

1. Before buying a house in Beijing, just as housing prices were starting to soar, I found that Lianjia and similar sites exposed only a small slice of their price data, far from enough for my analysis. So I spent a few evening hours writing a crawler that pulled down the information for every residential community in Beijing, along with the complete historical transaction records of each one.

2. My wife is a salesperson at an Internet company; she has to collect all kinds of company information and then make calls. With a collection script I grabbed a big batch of data for her to use, while her colleagues kept compiling their data by hand every day until midnight.

II. Practice: crawl the Movie Heaven (Dytt) movie details pages

1. Analyze the page and crawl the detail-page URLs from the first page

Open the "latest movies" listing on Movie Heaven. The first page's URL is www.ygdy8.net/html/gndy/d… , the second page's is www.ygdy8.net/html/gndy/d… , and pages 3 and 4 follow the same pattern.

from lxml import etree
import requests


url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

response = requests.get(url,headers=headers)

# response.text uses the encoding requests guesses for the page. Unfortunately the guess is wrong here and produces mojibake, so we go the other way: take response.content (the raw bytes) and decode it ourselves with the right codec
# print(response.text)
# print(response.content.decode('gbk'))
print(response.content.decode(encoding="gbk", errors="ignore"))

Take the first page as an example to print the following data:
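To see why specifying the codec matters, here is a small standalone check. The bytes below are the GBK encoding of the word "电影" ("movie"): decoding them as UTF-8 fails outright, while decoding them as GBK recovers the text.

```python
raw = b'\xb5\xe7\xd3\xb0'  # GBK bytes for "电影" ("movie")

# UTF-8 cannot decode these bytes
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 decode failed')

# GBK decodes them correctly
print(raw.decode('gbk'))  # 电影
```

The `errors="ignore"` flag used above goes one step further: it silently drops any byte sequence the codec cannot handle instead of raising.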

Analyzing the Movie Heaven HTML source shows that each movie sits in its own table tag.

Use XPath to extract the detail URL of each movie:

text = response.content.decode(encoding="gbk", errors="ignore")
html = etree.HTML(text)
detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
for detail_url in detail_urls:
    print(detail_url)  # these are relative paths; the domain still needs to be prepended

The results are as follows:
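The XPath above can be verified offline against a small hand-written HTML fragment shaped like the listing page. The class name `tbspan` matches what Movie Heaven uses; the file paths here are made up for illustration.

```python
from lxml import etree

snippet = """
<div>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/1.html">Movie A</a></td></tr></table>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/2.html">Movie B</a></td></tr></table>
  <table class="other"><tr><td><a href="/skip.html">not a movie</a></td></tr></table>
</div>
"""

html = etree.HTML(snippet)
# only anchors inside tables with class="tbspan" are matched
urls = html.xpath("//table[@class='tbspan']//a/@href")
print(urls)  # ['/html/gndy/dyzz/1.html', '/html/gndy/dyzz/2.html']
```

Note how the `@class='tbspan'` predicate filters out the third table, so navigation links elsewhere on the page never leak into the results.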

2. Clean up the code and crawl the movie-list URLs of the first 7 pages

from lxml import etree
import requests

# the domain name
BASE_DOMAIN = 'http://www.ygdy8.net'
# url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

def spider():
    base_url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    for x in range(1, 8):
        url = base_url.format(x)
        print(url)  # the URL of each page's movie list, e.g. http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html

if __name__ == '__main__':
    spider()

3. Crawl the detail-page address of each movie

def get_detail_urls(url):
    response = requests.get(url, headers=HEADERS)

    # response.text uses the encoding requests guesses, which is wrong here and produces mojibake; decode response.content (the raw bytes) ourselves instead
    # print(response.text)
    # print(response.content.decode('gbk'))
    # print(response.content.decode(encoding="gbk", errors="ignore"))
    text = response.content.decode(encoding="gbk", errors="ignore")

    # Get the detail URL of each movie via XPath
    html = etree.HTML(text)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")

    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)  # prepend the domain to every relative URL
    # Equivalent explicit version:
    # full_urls = []
    # for detail_url in detail_urls:
    #     full_urls.append(BASE_DOMAIN + detail_url)
    # detail_urls = full_urls

    return detail_urls
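One detail worth knowing: in Python 3, `map()` returns a lazy iterator, so the value returned by `get_detail_urls` can only be iterated once. Wrapping it in `list()` avoids surprises. A small self-contained illustration (the paths are example values):

```python
BASE_DOMAIN = 'http://www.ygdy8.net'
paths = ['/html/gndy/dyzz/1.html', '/html/gndy/dyzz/2.html']

lazy = map(lambda url: BASE_DOMAIN + url, paths)
urls = list(lazy)   # materialize the iterator into a real list
print(urls[0])      # http://www.ygdy8.net/html/gndy/dyzz/1.html
print(list(lazy))   # [] -- the map iterator is already exhausted
```

In the function above this is harmless because the caller loops over the result exactly once, but returning `list(detail_urls)` would be the safer choice.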

4. Grab the data of the movie details page

# Parse the details page
def parse_detail_page(url):
    movie = {}
    response = requests.get(url,headers = HEADERS)
    text = response.content.decode('gbk', errors='ignore')
    html = etree.HTML(text)
    # title = html.xpath("//div[@class='title_all']//font[@color='#07519a']")
    # print(title)  # this prints a list of lxml Element objects, not the text itself

    # To inspect the elements, serialize them back to strings:
    # for x in title:
    #     print(etree.tostring(x, encoding='utf-8').decode('utf-8'))

    # what we actually want is the text node, so append /text() to the XPath and take the first match
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title

    zoomE = html.xpath("//div[@id='Zoom']")[0]  # grab the common parent container once to simplify the lookups below
    imgs = zoomE.xpath(".//img/@src")  # pick out the poster and the screenshots
    cover = imgs[0]
    if len(imgs) > 1:
        screenshot = imgs[1]
        movie['screenshot'] = screenshot
    # print(cover)
    movie['cover'] = cover

    infos = zoomE.xpath(".//text()")

    for index, info in enumerate(infos):
        # the "◎" labels below contain full-width spaces; they must match the page source exactly
        if info.startswith("◎年　代"):
            info = info.replace("◎年　代", "").strip()  # strip() removes surrounding whitespace
            movie['year'] = info
        elif info.startswith("◎产　地"):
            info = info.replace("◎产　地", "").strip()
            movie["country"] = info
        elif info.startswith("◎类　别"):
            info = info.replace("◎类　别", "").strip()
            movie["category"] = info
        elif info.startswith("◎豆瓣评分"):
            info = info.replace("◎豆瓣评分", "").strip()
            movie["douban_rating"] = info
        elif info.startswith("◎片　长"):
            info = info.replace("◎片　长", "").strip()
            movie["duration"] = info
        elif info.startswith("◎导　演"):
            info = info.replace("◎导　演", "").strip()
            movie["director"] = info
        elif info.startswith("◎主　演"):
            actors = []
            actor = info.replace("◎主　演", "").strip()
            actors.append(actor)
            # There are many leading actors and, given how Movie Heaven lays them out,
            # we have to keep walking the following lines to pick up each one
            for x in range(index + 1, len(infos)):  # start right after the "◎主　演" line
                actor = infos[x].strip()
                if actor.startswith("◎"):  # the next "◎" label ends the actor list
                    break
                actors.append(actor)
            movie['actor'] = actors
        elif info.startswith("◎简　介"):
            for x in range(index + 1, len(infos)):
                if infos[x].startswith("◎获奖情况"):
                    break
                profile = infos[x].strip()
                movie['profile'] = profile
            # print(movie)
        elif info.startswith("◎获奖情况"):
            awards = []
            for x in range(index + 1, len(infos)):
                if infos[x].startswith("【下载地址】"):
                    break
                award = infos[x].strip()
                awards.append(award)
            movie['awards'] = awards
            # print(awards)

    download_url = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")[0]
    movie['download_url'] = download_url
    return movie

The code above crawls the data of every single movie. I saved the page's HTML at the time of writing as "movie.html" and put it on GitHub, to make it easier for readers to compare against the format being parsed.
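To wire the pieces together, the final spider might look like the sketch below. The two helpers are passed in as parameters purely so the loop can be exercised with stand-ins and no network access; in the real script you would pass the `get_detail_urls` and `parse_detail_page` defined above.

```python
def spider(get_detail_urls, parse_detail_page, pages=range(1, 8)):
    base_url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    movies = []
    for x in pages:
        list_url = base_url.format(x)           # one listing page per iteration
        for detail_url in get_detail_urls(list_url):
            movies.append(parse_detail_page(detail_url))
    return movies

# Exercise the loop with fakes: two fabricated detail URLs per listing page
fake_list = lambda url: [url + '#a', url + '#b']
fake_parse = lambda url: {'title': url}
result = spider(fake_list, fake_parse, pages=range(1, 3))
print(len(result))  # 4 movies: 2 pages x 2 detail URLs each
```

Injecting the two functions like this keeps the crawling loop separate from the parsing logic, which makes each part easier to test and modify on its own.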

Final result:




For more content, please follow my WeChat public account "BigDeveloper".