
Author: yangrq1018

The original link: segmentfault.com/a/119000001…

Small projects tend to involve a scattered, miscellaneous mix of techniques and skills, so I'm writing this short post as a record to help keep them fresh.

Motivation: I often watch movies on Tencent Video, and the movie library has a "Douban praise" (highly rated on Douban) section, which is where I usually pick movies from. But there are a lot of movies and no index, so you have to keep scrolling down to make the JavaScript load more items, and after finishing one film it takes a long time to find the next. So I wrote a crawler to pull all the movies under "Douban praise" into a table, which makes choosing much easier.

Project address: github.com/yangrq1018/…

Dependencies

The following Python packages are required:

  • requests

  • bs4 – Beautiful Soup

  • pandas

  • lxml (used below as the parser for Beautiful Soup)

That's it. No complex crawler framework is needed; these simple, common packages are enough.

Crawl movie information

First, look at the movie channel page; it turns out to be loaded asynchronously. Use the Network tab of the inspector in Firefox (or Chrome) to filter through the candidate APIs. It quickly becomes clear that the endpoint URL has this format:

    base_url = 'https://v.qq.com/x/bu/pagesheet/list?_all=1&append=1&channel=movie&listpage=2&offset={offset}&pagesize={page_size}&sort={sort}'

Here offset is the starting position of the requested page, pagesize is the number of items requested per page, and sort is the category. sort=21 is the "Douban praise" category we need. pagesize cannot exceed 30: larger values still return only 30 items, while values below 30 return the requested number.

    import bs4
    import requests
    import pandas as pd
    from collections import OrderedDict
    from pandas import DataFrame

    # The URL is very long; raise pandas' column width limit so the full links can be displayed later
    pd.set_option('display.max_colwidth', -1)

    base_url = 'https://v.qq.com/x/bu/pagesheet/list?_all=1&append=1&channel=movie&listpage=2&offset={offset}&pagesize={page_size}&sort={sort}'

    # The "Douban praise" sort type
    DOUBAN_BEST_SORT = 21
    NUM_PAGE_DOUBAN = 167

A quick loop reveals that there are 167 pages of "Douban praise" movies, with 30 items per page.
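That loop can look something like the sketch below. The stopping condition, a page that contains no list_item divs once the offset runs past the last film, is my assumption about the API's behaviour rather than something stated in the original:

    # Rough probe for the page count; uses base_url and DOUBAN_BEST_SORT defined above.
    import bs4
    import requests

    def count_pages(page_size=30, sort=DOUBAN_BEST_SORT):
        page = 0
        while True:
            url = base_url.format(offset=page * page_size, page_size=page_size, sort=sort)
            soup = bs4.BeautifulSoup(requests.get(url).content.decode('utf-8'), 'lxml')
            if not soup.find_all('div', class_='list_item'):
                return page
            page += 1

    # count_pages() gave 167 at the time of writing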

We use the requests library to fetch the page. get_soup requests the page at index page_idx, and BeautifulSoup parses response.content into a DOM-like object that makes it easy to find the elements we need. Each movie entry is contained in a div with class list_item, so we also write a helper that extracts all such divs and returns them as a list.

    def get_soup(page_idx, page_size=30, sort=DOUBAN_BEST_SORT):
        url = base_url.format(offset=page_idx * page_size, page_size=page_size, sort=sort)
        res = requests.get(url)
        soup = bs4.BeautifulSoup(res.content.decode('utf-8'), 'lxml')
        return soup

    def find_list_items(soup):
        return soup.find_all('div', class_='list_item')

We then iterate over every page and return a list of all the item tags that bs4 has parsed.

    def douban_films():
        rel = []
        for p in range(NUM_PAGE_DOUBAN):
            print('Getting page {}'.format(p))
            soup = get_soup(p)
            rel += find_list_items(soup)
        return rel

Here’s the HTML code for one of the movies:

    <div __wind="" class="list_item">
      <a class="figure" data-float="j3czmhisqin799r" href="https://v.qq.com/x/cover/j3czmhisqin799r.html" tabindex="-1" target="_blank" title="Farewell My Concubine">
        <img alt="Farewell My Concubine" class="figure_pic" onerror="picerr(this,'v')" src="//puui.qpic.cn/vcover_vt_pic/0/j3czmhisqin799rt1444885520.jpg/220"/>
        <img alt="VIP" class="mark_v" onerror="picerr(this)" src="//i.gtimg.cn/qqlive/images/mark/mark_5.png" srcset="//i.gtimg.cn/qqlive/images/mark/mark_5@2x.png 2x"/>
        <div class="figure_caption"></div>
        <div class="figure_score">9.6</div>
      </a>
      <div class="figure_detail figure_detail_two_row">
        <a class="figure_title figure_title_two_row bold" href="https://v.qq.com/x/cover/j3czmhisqin799r.html" target="_blank" title="Farewell My Concubine">Farewell My Concubine</a>
        <div class="figure_desc" title="Cast: Leslie Cheung Zhang Fengyi Gong Li Ge You"></div>
        <div class="figure_count"><svg class="svg_icon svg_icon_play_sm" height="16" viewbox="0 0 16 16" width="16"><use xlink:href="#svg_icon_play_sm"></use></svg>46.71 million</div>
      </div>
    </div>

It is not hard to see that for the movie Farewell My Concubine, the name, play URL, cover image, score, leading cast, whether membership (VIP) is required, and the play count are all inside this one div. In an interactive environment like IPython, it is easy to experiment and work out how to extract them with bs4. One trick I use is to keep a spyder.py file open, write the required functions in it, and turn on IPython's autoreload option; then I debug code in the console, copy it into the file, and the functions in the IPython session are updated automatically. This is much more convenient than editing code directly inside IPython. To turn on autoreload:

    %load_ext autoreload
    %autoreload 2   # Reload all modules every time before executing Python code
    %autoreload 0   # Disable automatic reloading
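A typical session then looks roughly like this; the module name spyder comes from the file mentioned above, and the calls are only illustrative:

    # With autoreload enabled as above, in the same IPython console:
    import spyder                          # the file that holds the crawler functions
    soup = spyder.get_soup(0)              # experiment interactively
    items = spyder.find_list_items(soup)   # edits saved to spyder.py take effect without restarting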

The parse_films function extracts the information using two common bs4 methods:

  • find
  • find_all
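As a quick reminder of how the two differ, here is a minimal sketch on a trimmed-down, made-up fragment in the shape of the listing above (the href is a placeholder, not a real cover page):

    import bs4

    html = '''
    <div class="list_item">
      <a class="figure" href="https://v.qq.com/x/cover/xxxx.html">
        <div class="figure_score">9.6</div>
      </a>
      <div class="figure_desc" title="Cast: Leslie Cheung Zhang Fengyi"></div>
    </div>
    '''
    soup = bs4.BeautifulSoup(html, 'lxml')

    item = soup.find('div', class_='list_item')            # first match only (or None)
    divs = item.find_all('div')                            # list of every matching tag
    score = item.find('div', class_='figure_score').text   # '9.6'
    link = item.find('a', class_='figure')['href']         # attributes are read like a dict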

Douban's public API has shut down its search endpoint, and scraping the search pages directly gets caught by Douban's anti-crawler measures, so the feature that looked up each title on Douban to attach its Douban score was abandoned.

OrderedDict accepts a list of (key, value) pairs and remembers the order of the keys. This will be useful later when we build the pandas DataFrame, because the column order follows the key order.
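A tiny, made-up example of that behaviour (the film names here are placeholders):

    from collections import OrderedDict
    from pandas import DataFrame

    rows = [
        OrderedDict([('title', 'Film A'), ('vqq_score', 9.6), ('need_vip', True)]),
        OrderedDict([('title', 'Film B'), ('vqq_score', 8.9), ('need_vip', False)]),
    ]
    print(DataFrame(rows).columns.tolist())   # ['title', 'vqq_score', 'need_vip']

Back to the actual parser: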

    def parse_films(films):
        '''films is a list of `bs4.element.Tag` objects'''
        rel = []
        for i, film in enumerate(films):
            title = film.find('a', class_="figure_title")['title']
            print('Parsing film %d: ' % i, title)
            link = film.find('a', class_="figure")['href']
            img_link = film.find('img', class_="figure_pic")['src']
            # test if need VIP
            need_vip = bool(film.find('img', class_="mark_v"))
            score = getattr(film.find('div', class_='figure_score'), 'text', None)
            if score: score = float(score)
            cast = film.find('div', class_="figure_desc")
            if cast:
                cast = cast.get('title', None)
            play_amt = film.find('div', class_="figure_count").get_text()
            # db_score, db_link = search_douban(title)
            # Store key order
            dict_item = OrderedDict([
                ('title', title),
                ('vqq_score', score),
                # ('db_score', db_score),
                ('need_vip', need_vip),
                ('cast', cast),
                ('play_amt', play_amt),
                ('vqq_play_link', link),
                # ('db_discuss_link', db_link),
                ('img_link', img_link),
            ])
            rel.append(dict_item)
        return rel

Export

Finally, we call the functions we have written from the main program.

The parsed result, a list of dictionaries, can be passed directly to the DataFrame constructor. We sort by score with the highest first, then wrap the play links in HTML anchor tags so they look nicer and can be opened directly.

Note that the CSV file pandas generates contains Chinese characters and does not open correctly in Excel by default. The fix is to use the utf_8_sig encoding, which writes a byte-order mark so that Excel decodes it properly.

Pickle is Python's serialization format; we also save the DataFrame as a .pkl file. Finally, calling the DataFrame's to_html method saves an HTML file. Make sure escape is set to False, otherwise the hyperlinks cannot be opened directly.

    if __name__ == '__main__':
        df = DataFrame(parse_films(douban_films()))

        # Sort by score, highest first
        df.sort_values(by="vqq_score", inplace=True, ascending=False)

        # Format links
        df['vqq_play_link'] = df['vqq_play_link'].apply(lambda x: '<a href="{0}">Film link</a>'.format(x))
        df['img_link'] = df['img_link'].apply(lambda x: '<img src="{0}">'.format(x))

        # Chinese characters in Excel must be encoded with utf_8_sig
        df.to_csv('vqq_douban_films.csv', index=False, encoding='utf_8_sig')

        # Pickle
        df.to_pickle('vqq_douban_films.pkl')

        # HTML, render hyperlinks
        df.to_html('vqq_douban_films.html', escape=False)
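As a small usage note (not part of the original script), the saved pickle can be read back later without re-crawling:

    import pandas as pd

    df = pd.read_pickle('vqq_douban_films.pkl')
    print(df.head())   # highest-rated films first, as sorted above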

Project management

That's it for the code. Once the code is written, archive it somewhere so it is easy to manage; I chose to put it on GitHub.

GitHub provides a command-line tool called hub (not Git itself, but an extension of Git). macOS users can install it with:

    brew install hub

hub offers a much more concise syntax than plain git. Here we mainly use:

    hub create -d "Create repo for our proj" vqq-douban-film

How cool is it to create a repo straight from the command line, without opening a browser at all? You may then be prompted to register your SSH public key (for authentication) on GitHub. If you do not have one, generate it with ssh-keygen and copy the contents of the .pub file into GitHub Settings.

The project directory may contain __pycache__ and .DS_Store files that you do not want to track, and writing a .gitignore by hand is tedious. Is there a tool for that? Python has a package:

    pip install git-ignore
    git-ignore python   # generate a Python template
    # then add .DS_Store to it by hand

Pure command-line operation. Very satisfying.