Python crawler: Using Requests and BeautifulSoup, we used Requests for web requests, we took the web data and BeautifulSoup parsed it, and just a little while ago, Requests author kennethreitz came out with a new library, Pythonic HTML Parsing for Humans™, which can be used to parse HTML documents. Requests – HTML is based on the existing framework PyQuery, Requests, LXML and other libraries for secondary encapsulation, more convenient for developers to call.

The installation

Mac:

pip3 install requests-html
Copy the code

Windows:

pip install requests-html
Copy the code

The instance

Code quality is much, will let us look at the girl, crawled web site I choose is http://www.win4000.com/zt/xinggan.html, open the web site, observed that this is a list, is a thumbnail images, if you want to save the image to the local, of course need high clearly the figure, so details into the list, the further analysis, The complete code is as follows:

from requests_html import HTMLSession
import requests
import time

session = HTMLSession()


# Parse the list of images
def get_girl_list(a):
    Return a Response object
    response = session.get('http://www.win4000.com/zt/xinggan.html')  Number of seconds

    content = response.html.find('div.Left_bar', first=True)

    li_list = content.find('li')

    for li in li_list:
        url = li.find('a', first=True).attrs['href']
        get_girl_detail(url)


# Parse the picture in detail
def get_girl_detail(url):
    Return a Response object
    response = session.get(url)  Number of seconds
    content = response.html.find('div.scroll-img-cont', first=True)
    li_list = content.find('li')
    for li in li_list:
        img_url = li.find('img', first=True).attrs['data-original']
        img_url = img_url[0:img_url.find('_')] + '.jpg'
        print(img_url + '.jpg')
        save_image(img_url)


# Keep the big picture
def save_image(img_url):
    img_response = requests.get(img_url)
    t = int(round(time.time() * 1000))  # millisecond timestamp
    f = open('/Users/wuxiaolong/Desktop/Girl/%d.jpg' % t, 'ab')  B (binary) is required to store images and multimedia files.
    f.write(img_response.content)  # Multimedia stores content
    f.close()


if __name__ == '__main__':
    get_girl_list()

Copy the code

That’s all there is to it, doesn’t it feel easy?

Description:

1. Unlike BeautifulSoup, pleads-html can be found directly on the label, as follows: SomeClass # someClass # someClass # someClass # someClass # someClass # someClass # someClass http://html.python-requests.org/;

2, here to save local path/Users/wuxiaolong/Desktop/Girl/I write dead, need the reader to own, if is the file name directly, save the path will be a project directory.

legacy

Example of the site is paginated, not done, can be periodic cycle to climb the girl oh, interested readers themselves to play.

reference

requests-html

Today I used the Requests-HTML library (Python crawler)

The public,

My official account: Wu Xiaolong, welcome to exchange ~

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Python crawler Exercise 2: Use requests- HTML

The installation

The instance

legacy

reference

The public,

Python crawler Exercise 2: Use requests- HTML

The installation

The instance

legacy

reference

The public,

Related Posts

Java collection LinkedBlockingQueue source analysis

JVM tuning practices in high concurrency scenarios

Design of concurrent purchase system architecture for Internet giants