
  1. 👻👻 After my last few blog posts on crawler techniques, I believe many of you can now build your own crawler projects independently!! The road to crawling is open! 👻👻

 

  1. 😬😬 But a few days ago a follower messaged me on WeChat with this question: “The source code I see in the browser through the developer tools is completely different from the source code I fetched with the Requests library! What’s going on here? The methods you taught cannot solve this!” 😬😬

This actually involves front-end knowledge. My time and energy are limited, so for now I have only published one article on essential HTML. Follow this blog and I will keep working on articles about CSS and JavaScript! 💊 The HTML article is part of the series on front-end knowledge every crawler developer must know. The rest will be out soon!

  1. ⏰⏰ We need to understand why this happens before we can do something about it. The first thing to remember is that Requests retrieves the raw HTML document, whereas the page in the browser is the result of JavaScript processing data. That data can come from several sources: it may be loaded via Ajax, it may be included in the HTML document, or it may be generated by JavaScript with a specific algorithm. ⏰⏰

In the first case, the data is loaded asynchronously. The original page does not contain the data at first; after the original page has loaded, it requests an interface to fetch the data from the server, and the data is then processed and rendered on the page. This is simply sending an Ajax request (one of the ways JavaScript renders a page!).

In the third case, the data is generated by JavaScript with a specific algorithm. It is not in the raw HTML code, and no Ajax request is involved.
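
To make the difference concrete, here is a minimal sketch that uses only the standard library. The HTML string is a made-up stand-in for what Requests would receive from an Ajax-driven page (the `/api/items` endpoint and the `item-list` id are hypothetical). The list the browser would show is empty in the raw document, because the script that fills it has never run.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as an HTTP client such as Requests would receive it.
RAW_HTML = """
<html><body>
  <ul id="item-list"></ul>
  <script>
    fetch('/api/items').then(r => r.json()).then(items => {
      for (const item of items) {
        const li = document.createElement('li');
        li.textContent = item.name;
        document.getElementById('item-list').appendChild(li);
      }
    });
  </script>
</body></html>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element present in the markup."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def list_items(html):
    """Return the text of all <li> nodes found in an HTML string."""
    parser = ListItemCollector()
    parser.feed(html)
    return parser.items


# The <li> nodes only come into existence after JavaScript runs in a browser.
print(list_items(RAW_HTML))
```

A parser sees only what is in the document it is given, which is why the same page looks "complete" in the developer tools and "empty" in a crawler.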

  1. 📻📻 Now that we know the principle, how do we actually solve the problem? 📻📻
The first method (for the first case above) is to analyze the Ajax requests that the page sends to the back-end interface and use the Requests library to simulate them, so that the data can be fetched successfully!
The second method (which applies to every crawling scenario, especially the third case or cases where the Ajax interface parameters are very complicated) is to simulate a running browser directly. That way, whatever you see in the browser is what you scrape: what you see is what you can crawl. We then do not need JS reverse engineering or any other series of complex operations!
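
As a rough sketch of the first method: once the interface has been found in the browser’s Network panel, the crawler sends the same request itself. Everything specific here is an assumption for illustration (the URL, the `page` and `size` parameters, and the JSON shape). The sketch uses the standard library so that it runs without extra packages; with Requests the equivalent call would be `requests.get(url, params=..., headers=...)`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical interface spotted in the browser's Network panel.
API_URL = "https://example.com/api/items"


def build_ajax_request(page):
    """Build the same GET request the page's JavaScript would send."""
    query = urlencode({"page": page, "size": 20})
    headers = {
        # Many back ends use this header to recognize Ajax calls.
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",
    }
    return Request(f"{API_URL}?{query}", headers=headers)


def parse_items(body):
    """Pull the item names out of a JSON response body (assumed shape)."""
    payload = json.loads(body)
    return [item["name"] for item in payload["data"]]


def fetch_items(page):
    """Send the request and parse the response (needs a real endpoint)."""
    with urlopen(build_ajax_request(page), timeout=10) as response:
        return parse_items(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body:
sample = '{"data": [{"name": "cat"}, {"name": "dog"}]}'
print(build_ajax_request(2).full_url)
print(parse_items(sample))
```

When the real interface needs signed or encrypted parameters, building this request by hand gets hard quickly, and that is where the second method comes in.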

  1. 🔩🔩 This blog post takes you through the second method: using a simulated browser directly to solve the problems above! I try to make technical articles as easy and interesting as possible, so that everyone who reads this one finds it rewarding. Of course, if you think this article is decent and you really learned something, I hope you will give it a “like” and a “favorite”. This is very important to me, thank you! 🔩🔩

 


Here we go! Here we go!! 💗 💗 💗

Our great Python provides a number of libraries that simulate a running browser. One of the most powerful is Selenium. This blog post takes you into the world of Selenium!


 

🎅 Part one — Getting to know Selenium!

 

🍏1. What is Selenium?

Selenium is an automated testing tool that can drive the browser to perform specific actions, such as clicking and scrolling down. It can also get the source code of the page the browser is currently showing, so what you see is what you can crawl. For pages that are rendered dynamically by JS, this way of scraping is very effective!

🍒2. Operating environment

Selenium tests run directly in the browser, just like a real user, and support most major browsers, including Internet Explorer (7, 8, 9, 10, 11), Firefox, Safari, Chrome, Opera, and more. We can use it to simulate a user clicking through a website and to get past some complex authentication scenarios. Because the combination of Selenium plus a driven browser renders and executes JS directly, it sidesteps most parameter construction and many anti-crawling measures.

🍓 3. Installation

My browser is Chrome. If you do not have Google Chrome, get it from your app store or download it from the official website.

1️⃣ Installing the Selenium library:

Install directly with pip (one command, one step):

pip install selenium

2️⃣ Installing ChromeDriver:

(Install the browser driver that matches your browser version.) Steps:

  1. Get the current browser version (Chrome: Help > About Google Chrome).
  2. Visit the download address and download the matching driver. (Note that browser and driver version numbers are not identical. Check which driver version corresponds to your browser.)
  3. Unzip to get the executable: chromedriver.exe on Windows, chromedriver on Linux/Mac.
  4. Put it on the PATH: ① on Windows, add the directory containing chromedriver.exe to the Path variable, or drag the chromedriver.exe file into Python’s Scripts folder; ② on Linux/Mac OS, add the directory containing chromedriver to the PATH environment variable.
  5. Verify the installation: enter chromedriver in a CMD window. If the driver starts and prints its startup information, the installation succeeded.
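
The same check can also be done from Python. This is only a small sketch, assuming the executable is named `chromedriver` and accepts the `--version` switch; the helper names are my own.

```python
import shutil
import subprocess


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


def driver_version(path):
    """Ask the driver binary for its version string."""
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return result.stdout.strip()


path = find_driver()
if path is None:
    print("chromedriver is not on PATH")
else:
    print(path, "->", driver_version(path))
```

If the first branch prints, go back to step 4 and check the PATH setting before trying to start Selenium.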

 

🍌4. The role and working principle of Selenium

1️⃣ Role:

  1. Automated testing: we can write programs that simulate operating a web interface in the browser, for example clicking a button or typing text into a text box.
  2. Getting information: extract information from web pages, for example job listings on recruitment websites or stock prices on financial websites, and then analyze and process it with a program.

2️⃣ Working principle:

Use a browser with a visible interface (headed) during development and a browser without an interface (headless), such as PhantomJS, for deployment.

Notes:

Newer versions of Selenium no longer support PhantomJS, and its original author has stopped maintaining the project. Also, try to avoid this approach for crawlers when you can: the Selenium + browser combination is slow, cannot handle high-volume or concurrent crawls, and eats computer resources.
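
Since PhantomJS is gone, one common way to run without an interface is Chrome’s own headless mode. The sketch below is an assumption on my part, not something from the original setup: the helper names are made up, and the exact switches depend on your Chrome version.

```python
def headless_arguments(width=1920, height=1080):
    """Command-line switches that make Chrome run without a window."""
    return [
        # Recent Chrome versions; older builds use plain "--headless".
        "--headless=new",
        # There is no real window, so set the viewport size explicitly.
        f"--window-size={width},{height}",
        "--disable-gpu",
    ]


def make_headless_chrome():
    """Start Chrome in headless mode (requires selenium and chromedriver)."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


print(headless_arguments())
```

Typical use would be `driver = make_headless_chrome()`, followed by the usual `driver.get(...)` and `driver.quit()`. Keeping a visible browser during development and switching to this only for deployment matches the principle described above.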

🍠5. A simple practical example:

⛔① Code:

from selenium import webdriver                 # module that controls the browser
from selenium.webdriver.common.by import By    # locator strategies (By.ID, By.XPATH, ...)
import time         # add sleeps, otherwise it runs too fast to see (or record) the effect!

# Declare the browser object -- for Firefox: driver = webdriver.Firefox()
driver = webdriver.Chrome()   # get the Chrome control object -- a WebDriver object

# 1. Request a URL
driver.get('http://www.baidu.com')
time.sleep(1)
# 2. Locate the search box tag
#    (find_element_by_id was removed in Selenium 4.3; use find_element(By.ID, ...))
input_tag = driver.find_element(By.ID, 'kw')
# 3. Type the search term into the search box
input_tag.send_keys('Cat pictures')
# 4. Locate the search button on Baidu
submit_tag = driver.find_element(By.ID, 'su')
time.sleep(1)
# 5. Click the search button
submit_tag.click()
time.sleep(5)
# 6. Quit! If you do not quit, leftover processes will remain!!
driver.quit()

⛔② Explanation:

  1. webdriver.Chrome(): if the driver is not on the PATH environment variable, you must tell Chrome where it is by passing the path of the downloaded chromedriver file (the executable_path argument in Selenium 3; a Service object in Selenium 4).
  2. Locating the tag whose id attribute value is 'kw' and calling send_keys('Cat pictures') on it types the string 'Cat pictures' into that tag.
  3. Locating the tag whose id attribute value is 'su' and calling click() on it clicks that tag.
  4. The click() function fires the tag’s JS click event.

⛔③ Effect display:


Note: if find_element finds no match, it throws an exception (NoSuchElementException), but if find_elements finds no match, it returns an empty list!
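
This difference can be put to use. As a small sketch (the helper name `find_or_none` is my own, not part of Selenium), calling find_elements and checking for an empty list gives a lookup that never raises:

```python
def find_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches.

    Relies on find_elements() returning an empty list instead of raising.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


# Usage with a real driver (not executed here):
#     from selenium.webdriver.common.by import By
#     box = find_or_none(driver, By.ID, "kw")
#     if box is None:
#         print("search box not found")
```

This is handy for optional elements such as pop-ups, where "not there" is a normal outcome rather than an error.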

The second method — By object lookup:

In addition to the lookup methods above, there are two general methods that integrate all of them and are more convenient to use!

Method                                      Role
find_element(By.XPATH, '//button/span')     Find one element by XPath
find_elements(By.XPATH, '//button/span')    Find multiple elements by XPath

The first parameter selects the lookup strategy, written as By.XXX. The available strategies are listed below (note that the By object must be imported: from selenium.webdriver.common.by import By):

  • ID = "id"
  • XPATH = "xpath"
  • LINK_TEXT = "link text"
  • PARTIAL_LINK_TEXT = "partial link text"
  • NAME = "name"
  • TAG_NAME = "tag name"
  • CLASS_NAME = "class name"
  • CSS_SELECTOR = "css selector"
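
Because each By constant is just one of the strings listed above, a locator is nothing more than a (strategy, value) pair. The sketch below keeps such pairs in one table and unpacks them into find_element / find_elements. The table, the helper names, and the `h3 a` selector are hypothetical; the ids 'kw' and 'su' are the Baidu ones used earlier.

```python
# By.ID is the string "id" and By.CSS_SELECTOR is "css selector",
# so each locator is simply a (strategy, value) tuple.
LOCATORS = {
    "search_box": ("id", "kw"),                 # same as (By.ID, "kw")
    "search_button": ("id", "su"),              # same as (By.ID, "su")
    "result_titles": ("css selector", "h3 a"),  # hypothetical selector
}


def find(driver, name):
    """Find one element by its name in the LOCATORS table."""
    return driver.find_element(*LOCATORS[name])


def find_all(driver, name):
    """Find every element matching the named locator."""
    return driver.find_elements(*LOCATORS[name])
```

Keeping the locators in one place means that when the site changes an id, only the table needs editing.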

 

🐞3. Node interaction:

Selenium can drive the browser to perform operations, that is, it can make the browser simulate a user’s actions on page nodes.

🚀 (1) Common Usage:

Method        Role
send_keys()   Enter text
clear()       Clear the text
click()       Click the element
submit()      Submit the form

🚀 (2) Example operations:

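from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Locate the user name field (assumes an existing driver with the page loaded)
element = driver.find_element(By.ID, "userA")
# Enter a user name
element.send_keys("admin1")
# Press Backspace (deletes the last character entered)
element.send_keys(Keys.BACK_SPACE)
# Type the new user name (appended to what remains; call element.clear() first to replace it)
element.send_keys("admin_new")
# Select all
element.send_keys(Keys.CONTROL, 'a')
# Copy
element.send_keys(Keys.CONTROL, 'c')
# Paste into the password field
driver.find_element(By.ID, 'passwordA').send_keys(Keys.CONTROL, 'v')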
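
The example above does not use clear() or submit(), so here is a small sketch of how they usually fit together. The helper names are my own, the ids "userA" and "passwordA" are taken from the example above, and the sketch assumes both fields sit inside one form.

```python
def fill_field(element, text):
    """Replace whatever the input currently holds with `text`."""
    element.clear()          # empty the field first; send_keys() only appends
    element.send_keys(text)


def log_in(driver, username, password):
    """Fill both fields and submit the form they belong to."""
    user_box = driver.find_element("id", "userA")          # "id" is By.ID
    password_box = driver.find_element("id", "passwordA")
    fill_field(user_box, username)
    fill_field(password_box, password)
    password_box.submit()    # submits the enclosing <form>
```

Calling clear() before send_keys() avoids the "old text plus new text" surprise when a field is pre-filled.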

 

🐻4. Action Chain:

👑 (1) Explanation:

  • In Selenium, besides simple click actions there are also some slightly more complex actions, and these need the ActionChains sub-module.

  • ActionChains can carry out complex page interactions, such as dragging elements, moving the mouse, hovering, and interacting with context menus. When you call an ActionChains method, it is not executed immediately. Instead, all the actions are stored in a queue, and when you call the perform() method they are executed in the order in which they were queued.

  • Import ActionChains:

from selenium.webdriver.common.action_chains import ActionChains

👑 (2) Method:

Method                                                       Role
click(on_element=None)                                       Left-click the element passed in
double_click(on_element=None)                                Double-click with the left mouse button
context_click(on_element=None)                               Right-click
click_and_hold(on_element=None)                              Press and hold the left mouse button
release(on_element=None)                                     Release the left mouse button on an element
drag_and_drop(source, target)                                Drag the source element onto the target element and release
drag_and_drop_by_offset(source, xoffset, yoffset)            Drag the source element by the given offset and release
move_to_element(to_element)                                  Move the mouse over an element (hover)
move_by_offset(xoffset, yoffset)                             Move the mouse by the given x, y offset from its current position
move_to_element_with_offset(to_element, xoffset, yoffset)    Move the mouse to an offset relative to an element
perform()                                                    Perform all the actions in the chain

👑 (3) Example:

# 1. Import the package
from selenium.webdriver.common.action_chains import ActionChains
# 2. Instantiate an ActionChains object
action = ActionChains(driver)
# 3. Queue a right-click (username is a previously located element)
element = action.context_click(username)
# 4. Execute the queued actions
element.perform()

 

🐮5. Extract node text content and attribute values:

♥ (1) Get text content:

  • element.text

Get the text content through the text property of the tag object you located.

♥ (2) Get attribute values:

  • element.get_attribute('attribute name')

Get the value of an attribute by calling the get_attribute() function of the tag object you located and passing in the attribute name.
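
Putting the two together, a minimal sketch for collecting every link on the current page could look like this. The function name and the returned dictionary shape are my own choices; it assumes a driver that has already loaded a page.

```python
def extract_links(driver):
    """Return the text and href of every <a> element on the current page."""
    links = []
    # "tag name" is the string behind By.TAG_NAME.
    for anchor in driver.find_elements("tag name", "a"):
        links.append({
            "text": anchor.text.strip(),
            "href": anchor.get_attribute("href"),
        })
    return links
```

Because the browser has already run the page’s JavaScript, links that were added dynamically show up here as well.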

🐒6. Execute JavaScript code:

For some operations, such as scrolling down the page, Selenium does not provide an API. But the great creators of Selenium have given us another, more convenient way: run JavaScript directly with the execute_script() method!

⚜ Hands-on demonstration:

📌① Code:

import time
from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E6%96%87%E5%BA%93/4928294?fr=aladdin')

# Execute JS code: scroll to the bottom of the page!
js = 'window.scrollTo(0, document.body.scrollHeight)'
browser.execute_script(js)

# Execute JS code: pop up an alert with a prompt text!
browser.execute_script('alert("Bottom!")')

time.sleep(3)
# Quit so that no browser process is left behind
browser.quit()
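
On pages that load more content as you scroll, one scroll is often not enough. A possible sketch (the function name, the pause length, and the round limit are my own assumptions) keeps scrolling until the page height stops growing, reading the height back through the return value of execute_script():

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

The `max_rounds` limit is there so that a page with truly endless content cannot keep the crawler scrolling forever.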

📌② Effect display:

Switch between tabs by index into the list of tab handles (driver.window_handles).

🔆② Code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://www.baidu.com/')

time.sleep(1)
driver.find_element(By.ID, 'kw').send_keys('python')
time.sleep(1)
driver.find_element(By.ID, 'su').click()
time.sleep(1)

# Open a new tab by executing JS
js = "window.open('https://www.sougou.com');"
driver.execute_script(js)
time.sleep(1)

# 1. Get the handles of all current windows
windows = driver.window_handles

time.sleep(2)
# 2. Switch by index into the handle list
driver.switch_to.window(windows[0])
time.sleep(2)
driver.switch_to.window(windows[1])

time.sleep(6)
driver.quit()
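
Indexing works fine in the example above. As an alternative sketch that does not depend on the order of the handle list (the helper name is my own), you can remember the handles you already know and switch to whichever one is new:

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle not in `known_handles`.

    Returns the new handle, or None if no new window has opened.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None


# Usage with a real driver (not executed here):
#     before = driver.window_handles
#     driver.execute_script("window.open('https://www.sougou.com');")
#     switch_to_new_window(driver, before)
```

This is mostly useful when several tabs are open and it is no longer obvious which index the new one received.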

🔆③ Effect display:


🏃8. Hands-on: Selenium slider CAPTCHA:

Coming soon, stay tuned!

🔮 Part 3 — In The End!

Start now and stick with it. Make a little progress every day, and in the near future you will thank yourself for your efforts!

This blogger will keep updating the crawler basics column and the crawler hands-on column. If you read this article carefully, feel free to like it, save it, and share your thoughts in the comments. You can also follow this blog to read more crawler articles in the days ahead!

If there are mistakes or poorly chosen words, please point them out in the comment area, thank you!

If you want to repost this article, please contact me for my consent and credit the source and the blogger’s name, thank you!