1. Introduction
Hello, I’m Anguo!
Here’s another lightweight, lesser-known crawler library: MechanicalSoup
MechanicalSoup is another handy crawling tool! It is written in pure Python, built on top of Beautiful Soup and Requests, and is used to automate interaction with web pages and crawl data
Project Address:
github.com/MechanicalS…
2. Installation and common usage
First, install the dependency library:
```bash
pip3 install MechanicalSoup
```
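To quickly verify that the installation worked, you can import the package (a minimal sketch; mechanicalsoup exposes a __version__ attribute):

```python
import mechanicalsoup

# Print the installed version to confirm the installation succeeded
print(mechanicalsoup.__version__)
```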
Common operations are as follows:
2-1 Instantiate the browser object
You can instantiate a browser object using MechanicalSoup's built-in StatefulBrowser() class
```python
import mechanicalsoup

# Instantiate a browser object
browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
```
PS: When instantiating, the parameters can specify the User Agent and the data parser; the parser defaults to lxml
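As a minimal sketch, both can be passed at instantiation time (soup_config is MechanicalSoup's keyword for Beautiful Soup options; the 'lxml' shown here is already the default):

```python
import mechanicalsoup

# Explicitly specify the User Agent and the parser used by Beautiful Soup
browser = mechanicalsoup.StatefulBrowser(
    user_agent='MechanicalSoup',
    soup_config={'features': 'lxml'},
)
```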
2-2 Open the website and return value
Calling open(url) on the browser instance object opens a web page; the return type is requests.models.Response
```python
result = browser.open("http://httpbin.org/")
print(result)

# requests.models.Response
print(type(result))
```
As you can see from the return value, opening a site with the browser object is equivalent to making a request to that site with the Requests library
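Because of this, the usual Requests response attributes are available on the return value. A minimal sketch:

```python
result = browser.open("http://httpbin.org/")

# The familiar Requests attributes work as expected
print(result.status_code)             # e.g. 200
print(result.headers['Content-Type'])
print(result.text[:100])              # first 100 characters of the raw HTML
```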
2-3 Page elements and the current URL
You can obtain the URL of the current page through the browser object's url attribute; the page attribute is used to get all of the page's element content
Since MechanicalSoup is built on top of BS4, all of BS4's syntax applies to MechanicalSoup
```python
# URL of the current page
url = browser.url
print(url)
```
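For example, since page is a BeautifulSoup object, BS4 methods such as find() and select() can be called on it directly. A minimal sketch:

```python
# browser.page is a BeautifulSoup object, so BS4 syntax works directly
page = browser.page

print(page.find('title'))     # the <title> element
print(len(page.select('a')))  # number of links on the page
```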
2-4 Form operations
The browser object's built-in select_form(selector) method is used to get the Form element of the current page
If the current page has only one Form, the parameter can be omitted
```python
browser.select_form('form[action="/post"]')
```
form.print_summary() is used to print out all the elements in the form
```python
form = browser.select_form()

# Print all the elements in the form
form.print_summary()
```
For the ordinary input boxes, radio buttons, and checkboxes in the form, values can be set directly on the browser object:
```python
# Ordinary input box
browser["norm_input"] = "input value"

# Radio buttons: select a single value
# <input name="size" type="radio" value="small"/>
# <input name="size" type="radio" value="medium"/>
# <input name="size" type="radio" value="large"/>
browser["size"] = "medium"

# Checkboxes: multiple values can be selected
# <input name="topping" type="checkbox" value="bacon"/>
# <input name="topping" type="checkbox" value="cheese"/>
# <input name="topping" type="checkbox" value="onion"/>
# <input name="topping" type="checkbox" value="mushroom"/>
browser["topping"] = ("bacon", "cheese")
```
The submit_selected(btnName) method of the browser object is used to submit the form
Note that the return value after submitting the form is of type requests.models.Response
```python
response = browser.submit_selected()
print("Submission result:", response.text)

# requests.models.Response
print(type(response))
```
2-5 Debugging tools
The browser object provides a launch_browser() method, which launches a real web browser to visually display the current state of the web page. This is very intuitive and useful when debugging an automated process
PS: Rather than actually opening the live web page, it creates a temporary file containing the page content and points the browser at that file
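A minimal usage sketch:

```python
# Dump the current page to a temporary file and open it in a real web browser
browser.launch_browser()
```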
More functions can be found at:
mechanicalsoup.readthedocs.io/en/stable/t…
3. A hands-on example
Let's take "search WeChat articles and crawl the article titles and link addresses" as an example
3-1 Open the target website and specify a random UA
Since many websites use the User Agent for anti-crawling, we randomly generate a UA and set it in
PS: As you can see from the MechanicalSoup source code, the UA is set into the Requests request headers
```python
import mechanicalsoup
from faker import Factory

home_url = 'https://weixin.sogou.com/'

# Instantiate a browser object
# user_agent: specify the UA
f = Factory.create()
ua = f.user_agent()
browser = mechanicalsoup.StatefulBrowser(user_agent=ua)

# Open the target site
result = browser.open(home_url)
```
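A quick sanity check (a minimal sketch, assuming the UA lands in the underlying Requests session headers, as the source code suggests):

```python
# The randomly generated UA should now appear in the Requests session headers
print(browser.session.headers['User-Agent'])
```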
3-2 Submit the form and search
Use the browser object to get the form element from the web page, then set a value for the input field in the form, and finally simulate the form submission
```python
# Select the form element
browser.select_form()

# Print all the elements in the form
browser.form.print_summary()

# Enter the keyword in the search input box
browser["query"] = "Python"

# Submit the form
response = browser.submit_selected()
```
3-3 Crawl the data
The data-crawling part is simple, and the syntax is similar to BS4, so it is not explained in detail here
```python
# Search results
search_results = browser.get_current_page().select('.news-list li .txt-box')
print('Number of search results:', len(search_results))

for result in search_results:
    # Element containing the title and link
    element_a = result.select('a')[0]

    # Build the full link address
    href = "https://mp.weixin.qq.com" + element_a.attrs['href']
    text = element_a.text
    print("Title:", text)
    print("Link address:", href)
```
3-4 Dealing with anti-crawling
In addition to the UA, MechanicalSoup can also set proxies through the browser object's session.proxies attribute
```python
# Proxy IPs
proxies = {
    'https': 'https_ip',
    'http': 'http_ip',
}

# Set the proxy IPs
browser.session.proxies = proxies
```
4. Conclusion
This article used the WeChat article search example to walk through a combined automation and crawling operation with MechanicalSoup
The major difference from Selenium is that Selenium can interact with JavaScript, while MechanicalSoup cannot
But for simple automation scenarios, MechanicalSoup is a simple and lightweight solution
I have uploaded the complete source code to the backend; follow the public account "AirPython" and reply "MS" in the background to get it
If you think the article is good, please like, share, and leave a comment, because that is my strongest motivation to keep producing more high-quality articles!