1. Introduction
Hello, I’m Anguo!
Here’s another lightweight, lesser-known crawler library: MechanicalSoup
MechanicalSoup is another handy crawling tool! It is written in pure Python, built on top of Beautiful Soup and Requests, and is used to automate interaction with web pages and crawl data
Project Address:
github.com/MechanicalS…
2. Installation and common usage
First, install the dependency library:
```bash
pip3 install MechanicalSoup
```
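To quickly verify that the installation worked, you can import the package (a minimal sketch; mechanicalsoup exposes a __version__ attribute):

```python
import mechanicalsoup

# Print the installed version to confirm the installation succeeded
print(mechanicalsoup.__version__)
```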
Common operations are as follows:
2-1 Instantiate the browser object
You can instantiate a browser object using MechanicalSoup's built-in StatefulBrowser() class
```python
import mechanicalsoup

# Instantiate a browser object
browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
```
PS: When instantiating, the parameters can specify the User Agent and the data parser; the parser defaults to lxml
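As a minimal sketch, both can be passed at instantiation time (soup_config is MechanicalSoup's keyword for Beautiful Soup options; the 'lxml' shown here is already the default):

```python
import mechanicalsoup

# Explicitly specify the User Agent and the parser used by Beautiful Soup
browser = mechanicalsoup.StatefulBrowser(
    user_agent='MechanicalSoup',
    soup_config={'features': 'lxml'},
)
```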
2-2 Open the website and return value
Calling open(url) on the browser instance object opens a web page; the return type is requests.models.Response
```python
result = browser.open("http://httpbin.org/")
print(result)

# requests.models.Response
print(type(result))
```
As you can see from the return value, opening a site with the browser object is equivalent to making a request to that site with the Requests library
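Because of this, the usual Requests response attributes are available on the return value. A minimal sketch:

```python
result = browser.open("http://httpbin.org/")

# The familiar Requests attributes work as expected
print(result.status_code)             # e.g. 200
print(result.headers['Content-Type'])
print(result.text[:100])              # first 100 characters of the raw HTML
```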
2-3 Page elements and the current URL
You can obtain the URL of the current page through the browser object's url attribute; the page attribute is used to get all of the page's element content
Since MechanicalSoup is built on top of BS4, all of BS4's syntax applies to MechanicalSoup
```python
# URL of the current page
url = browser.url
print(url)
```
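For example, since page is a BeautifulSoup object, BS4 methods such as find() and select() can be called on it directly. A minimal sketch:

```python
# browser.page is a BeautifulSoup object, so BS4 syntax works directly
page = browser.page

print(page.find('title'))     # the <title> element
print(len(page.select('a')))  # number of links on the page
```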
2-4 Form operations
The browser object's built-in select_form(selector) method is used to get the Form element of the current page
If the current page has only one Form, the parameter can be omitted
```python
browser.select_form('form[action="/post"]')
```
form.print_summary() is used to print out all the elements in the form
```python
form = browser.select_form()

# Print all the elements in the form
form.print_summary()
```
For the ordinary input boxes, radio buttons, and checkboxes in the form, values can be set directly on the browser object:
```python
# Ordinary input box
browser["norm_input"] = "input value"

# Radio buttons: select a single value
# <input name="size" type="radio" value="small"/>
# <input name="size" type="radio" value="medium"/>
# <input name="size" type="radio" value="large"/>
browser["size"] = "medium"

# Checkboxes: multiple values can be selected
# <input name="topping" type="checkbox" value="bacon"/>
# <input name="topping" type="checkbox" value="cheese"/>
# <input name="topping" type="checkbox" value="onion"/>
# <input name="topping" type="checkbox" value="mushroom"/>
browser["topping"] = ("bacon", "cheese")
```
The submit_selected(btnName) method of the browser object is used to submit the form
Note that the return value after submitting the form is of type requests.models.Response
```python
response = browser.submit_selected()
print("Submission result:", response.text)

# requests.models.Response
print(type(response))
```
2-5 Debugging tools
The browser object provides a launch_browser() method, which launches a real web browser to visually display the current state of the web page. This is very intuitive and useful when debugging an automated process
PS: Rather than actually opening the live web page, it creates a temporary file containing the page content and points the browser at that file
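A minimal usage sketch:

```python
# Dump the current page to a temporary file and open it in a real web browser
browser.launch_browser()
```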
More functions can be found at:
mechanicalsoup.readthedocs.io/en/stable/t…
3. A hands-on example
Let's take "search WeChat articles and crawl the article titles and link addresses" as an example
3-1 Open the target website and specify a random UA
Since many websites use the User Agent for anti-crawling, we randomly generate a UA and set it in
PS: As you can see from the MechanicalSoup source code, the UA is set into the Requests request headers
```python
import mechanicalsoup
from faker import Factory

home_url = 'https://weixin.sogou.com/'

# Instantiate a browser object
# user_agent: specify the UA
f = Factory.create()
ua = f.user_agent()
browser = mechanicalsoup.StatefulBrowser(user_agent=ua)

# Open the target site
result = browser.open(home_url)
```
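A quick sanity check (a minimal sketch, assuming the UA lands in the underlying Requests session headers, as the source code suggests):

```python
# The randomly generated UA should now appear in the Requests session headers
print(browser.session.headers['User-Agent'])
```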
3-2 Submit the form and search
Use the browser object to get the form element from the web page, then set a value for the input field in the form, and finally simulate the form submission
```python
# Select the form element
browser.select_form()

# Print all the elements in the form
browser.form.print_summary()

# Enter the keyword in the search input box
browser["query"] = "Python"

# Submit the form
response = browser.submit_selected()
```
3-3 Crawl the data
The data-crawling part is simple, and the syntax is similar to BS4, so it is not explained in detail here
```python
# Search results
search_results = browser.get_current_page().select('.news-list li .txt-box')
print('Number of search results:', len(search_results))

for result in search_results:
    # Element containing the title and link
    element_a = result.select('a')[0]

    # Build the full link address
    href = "https://mp.weixin.qq.com" + element_a.attrs['href']
    text = element_a.text
    print("Title:", text)
    print("Link address:", href)
```
3-4 Dealing with anti-crawling
In addition to the UA, MechanicalSoup can also set proxies through the browser object's session.proxies attribute
```python
# Proxy IPs
proxies = {
    'https': 'https_ip',
    'http': 'http_ip',
}

# Set the proxy IPs
browser.session.proxies = proxies
```
4. Conclusion
This article used the WeChat article search example to walk through a combined automation and crawling operation with MechanicalSoup
The major difference from Selenium is that Selenium can interact with JavaScript, while MechanicalSoup cannot
But for simple automation scenarios, MechanicalSoup is a simple and lightweight solution
I have uploaded the complete source code to the backend; follow the public account "AirPython" and reply "MS" in the background to get it
If you think the article is good, please like, share, and leave a comment, because that is my strongest motivation to keep producing more high-quality articles!