I believe many of you can now build your own crawler projects independently after my last few blog posts on crawler technology! The road to web crawling is open!
But a few days ago a fan asked me this question on WeChat: "The source code I see in the browser's developer tools is completely different from the source code I crawl with the Requests library! What's going on here? The methods you taught can't solve it!"
In fact, this involves front-end knowledge. My time and energy are limited, so for now I have only published one article on essential HTML knowledge. Follow this blog, and I will work hard to keep publishing articles on CSS and JavaScript!
Front-end knowledge every crawler developer must know, part 1: HTML explained. The rest will be out soon!
We need to understand why this happens before we can do something about it. The first thing to remember is that Requests retrieves the raw HTML document, whereas the page in the browser is the result of JavaScript processing data from a variety of sources: the data may be loaded via Ajax, included in the HTML document, or generated by JavaScript using a specific algorithm.
In the first case, the data is loaded asynchronously: the original page does not contain some of the data at first. After the original page has loaded, it requests an interface on the server to get the data, which is then processed and rendered on the page. This is in fact an Ajax request (one of the ways JavaScript renders a page!).
In the third case, the data is generated by JavaScript using a specific algorithm. It is not part of the raw HTML code, and no Ajax request is involved.
Now that the principle is clear, the next question is: how do we actually solve this?
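To make the first case concrete, here is a minimal sketch that uses only the standard library. The HTML string, the `movie-list` id, and the JSON payload are all invented for illustration: they stand in for what Requests would receive for the page and for what the Ajax interface would return.

```python
import json
from html.parser import HTMLParser

# Hypothetical raw HTML, as Requests would receive it: the list is empty
# because the browser only fills it in later with JavaScript.
RAW_HTML = (
    '<html><body><ul id="movie-list"></ul>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical body of the Ajax response requested by the page's script.
AJAX_BODY = '{"movies": [{"title": "Movie A"}, {"title": "Movie B"}]}'


class ListItemCounter(HTMLParser):
    """Counts the <li> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_list_items(html):
    parser = ListItemCounter()
    parser.feed(html)
    return parser.count


def titles_from_ajax(body):
    return [movie["title"] for movie in json.loads(body)["movies"]]


print(count_list_items(RAW_HTML))   # number of <li> items in the raw HTML
print(titles_from_ajax(AJAX_BODY))  # titles carried by the Ajax response
```

The raw document has no list items at all, while the interface response carries the data, which is why the Requests result and the developer-tools view differ.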
Newer versions of Selenium no longer support PhantomJS, and its original authors have stopped maintaining the project. Also, try to avoid this approach when writing crawlers: the Selenium + browser combination is slow, cannot handle high-volume or concurrent crawls, and eats computer resources.
5. Simple use of practical examples:
from selenium import webdriver  # Module that controls the browser
import time  # Sleep between steps, or it runs too fast to see the effect!
# Declare the browser object; for Firefox use: driver = webdriver.Firefox()
driver = webdriver.Chrome()  # Get the Chrome control object (a WebDriver object)
# Note: find_element_by_id was removed in Selenium 4.3;
# there, use driver.find_element(By.ID, 'kw') instead.
# 1. Request a URL
driver.get('http://www.baidu.com')
time.sleep(1)
# 2. Locate the search box tag
input_tag = driver.find_element_by_id('kw')
# 3. Type your search into the search box
input_tag.send_keys('Cat pictures')
# 4. Locate the search button on Baidu
submit_tag = driver.find_element_by_id('su')
time.sleep(1)
# 5. Click the search button
submit_tag.click()
time.sleep(5)
# 6. Quit! If you do not quit, leftover processes will remain!!
driver.quit()
- webdriver.Chrome(): if the driver is not on the PATH environment variable, pass the executable_path parameter to Chrome(), set to the path of the downloaded chromedriver file.
- driver.find_element_by_id('kw').send_keys('cat picture'): locate the tag whose id attribute is 'kw' and type the string 'cat picture' into it;
- driver.find_element_by_id('su').click(): locate the tag whose id attribute is 'su' and click it;
- The click function fires the tag's JS click event.
③ Effect display:
Note: If find_element does not match, an exception is thrown, but if find_elements does not match, an empty list is returned!
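Since find_elements returns an empty list instead of raising, it can be used to write an "optional" lookup. The helper below is only a sketch: `find_first_or_none` is a name made up here, and it relies on the general `find_elements(by, value)` form described in the next section.

```python
def find_first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    matches = driver.find_elements(by, value)  # empty list when no match
    return matches[0] if matches else None


# Hypothetical usage with a real WebDriver object:
#     box = find_first_or_none(driver, "id", "kw")
#     if box is not None:
#         box.send_keys("python")
```

This avoids wrapping every lookup in try/except when a missing element is an expected situation rather than an error.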
The second method: lookup through the By object:
In addition to the various lookup methods above, there are two general methods that combine all of them, which makes them more convenient to use!
Method | Role |
---|---|
find_element(By.XPATH, '//button/span') | Find one element by XPath |
find_elements(By.XPATH, '//button/span') | Find multiple elements by XPath |
The first parameter selects the lookup strategy and is written as By.XXX. The available strategies are listed below (note: the By object must be imported first: from selenium.webdriver.common.by import By):
- ID = “id”
- XPATH = “xpath”
- LINK_TEXT = “link text”
- PARTIAL_LINK_TEXT = “partial link text”
- NAME = “name”
- TAG_NAME = “tag name”
- CLASS_NAME = “class name”
- CSS_SELECTOR = “css selector”
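Because each By constant is just the string listed above, a locator can be stored as a `(strategy, value)` tuple and unpacked into find_element. The sketch below assumes the Baidu ids `kw` and `su` used earlier in this article; `run_search` and `demo` are illustrative names, and plain strings are used instead of By so that the helper can be read without Selenium installed.

```python
# Locators kept in one place as (strategy, value) tuples.
# "id" is the value of By.ID; with Selenium imported you would normally
# write (By.ID, "kw") instead.
SEARCH_BOX = ("id", "kw")
SEARCH_BUTTON = ("id", "su")


def run_search(driver, query):
    """Type `query` into the search box and click the search button."""
    driver.find_element(*SEARCH_BOX).send_keys(query)
    driver.find_element(*SEARCH_BUTTON).click()


def demo():
    """Not called automatically: needs Selenium, Chrome and chromedriver."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get("http://www.baidu.com")
        run_search(driver, "Cat pictures")
    finally:
        driver.quit()  # always quit so no leftover processes remain
```

Keeping the locators in named constants means that when a page changes its ids, only one line has to be updated.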
3. Node interaction:
Selenium can drive the browser to perform operations; in other words, it can make the browser simulate a user's actions.
(1) Common Usage:
Method | Role |
---|---|
send_keys() | Enter text |
clear() | Clear the text |
click() | Click the element |
submit() | Submit the form |
(2) Example operations:
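from selenium.webdriver.common.keys import Keys  # Special keys such as Backspace and Ctrl
# Locate the user name field
element = driver.find_element_by_id("userA")
# Enter a user name
element.send_keys("admin1")
# Press Backspace to delete the last character that was entered
element.send_keys(Keys.BACK_SPACE)
# Type more text (it is appended to what is left in the field)
element.send_keys("admin_new")
# Select all
element.send_keys(Keys.CONTROL, 'a')
# Copy
element.send_keys(Keys.CONTROL, 'c')
# Paste into the password field
driver.find_element_by_id('passwordA').send_keys(Keys.CONTROL, 'v')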
4. Action Chain:
(1) Explanation:
- In Selenium, besides simple click actions there are also more complex actions, which require the ActionChains submodule.
- ActionChains can perform complex page interactions such as dragging elements, moving the mouse, hovering, and interacting with context menus. Calling an ActionChains method does not execute it immediately; instead, all of your actions are stored in a queue. When you call perform(), the actions run in the order in which they were queued.
- Import the ActionChains class:
from selenium.webdriver.common.action_chains import ActionChains
(2) Method:
Method provided by ActionChains | Role |
---|---|
click(on_element=None) | Left-click the element passed in |
double_click(on_element=None) | Double-click with the left mouse button |
context_click(on_element=None) | Right-click |
click_and_hold(on_element=None) | Press and hold the left mouse button |
release(on_element=None) | Release the left mouse button on an element |
drag_and_drop(source, target) | Drag the source element onto the target element and release |
drag_and_drop_by_offset(source, xoffset, yoffset) | Drag the source element by the given offsets and release |
move_to_element(to_element) | Move the mouse over an element |
move_by_offset(xoffset, yoffset) | Move the mouse by the given x and y offsets from its current position |
move_to_element_with_offset(to_element, xoffset, yoffset) | Move the mouse to a given offset relative to an element |
perform() | Perform all actions in the chain |
(3) Example:
1. Import the class: from selenium.webdriver.common.action_chains import ActionChains
2. Instantiate an ActionChains object: action = ActionChains(driver)
3. Call the right-click method: element = action.context_click(username)
4. Execute: element.perform()
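The "queue first, run on perform()" behaviour described above can be pictured with a toy model. The class below is not Selenium's ActionChains and does not touch a browser; `MiniActionChain`, `move_to` and the log strings are invented purely to illustrate the deferred-execution idea.

```python
class MiniActionChain:
    """Toy model of a deferred action queue (not Selenium's real class)."""

    def __init__(self):
        self._queue = []  # actions waiting to run
        self.log = []     # what actually ran, in order

    def move_to(self, name):
        self._queue.append(lambda: self.log.append(f"move to {name}"))
        return self       # returning self allows chained calls

    def context_click(self, name):
        self._queue.append(lambda: self.log.append(f"right-click {name}"))
        return self

    def perform(self):
        # Run every queued action in the order it was added.
        for action in self._queue:
            action()


chain = MiniActionChain().move_to("username").context_click("username")
print(chain.log)  # still empty: nothing has run yet
chain.perform()
print(chain.log)  # both actions, in the order they were queued
```

With the real library the shape is the same: build the chain from the methods in the table, then call perform() once at the end.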
5. Extract node text content and attribute values:
♥️ (1) Get text content:
- element.text
Get the text content through the text property of the located tag object.
♥️ (2) Get attribute values:
- element.get_attribute('attribute name')
Get an attribute value by calling the get_attribute() function of the located tag object and passing in the attribute name.
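Putting the two together, a small helper can turn a list of located elements into plain Python data. This is a sketch only: `extract_links` is a made-up name, and it assumes the elements are link tags that carry an `href` attribute.

```python
def extract_links(elements):
    """Collect the text and the href attribute of each located element."""
    return [
        {"text": el.text, "href": el.get_attribute("href")}
        for el in elements
    ]


# Hypothetical usage with a real WebDriver object:
#     links = extract_links(driver.find_elements("tag name", "a"))
#     for link in links:
#         print(link["text"], link["href"])
```

Copying the values out right away is convenient because the resulting dictionaries stay usable after the browser has been closed.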
6. Execute JavaScript code:
For some operations, such as scrolling down the page, Selenium does not provide an API. Fortunately, the creators of Selenium have given us another convenient option: running JavaScript directly with the execute_script() method!
Hands-on demonstration:
① Code:
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E6%96%87%E5%BA%93/4928294?fr=aladdin')
# Execute JS code, slide to the bottom of the page!
js = 'window.scrollTo(0, document.body.scrollHeight)'
browser.execute_script(js)
# Execute JS code, popup prompt text!
browser.execute_script('alert("Bottom!")')
time.sleep(3)
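On pages that keep loading content as you scroll, a single scrollTo call may not reach the real bottom. One possible approach, sketched below, is to scroll repeatedly until the page height stops changing. `scroll_to_bottom`, `pause` and `max_rounds` are names chosen here for illustration; the only Selenium call used is execute_script(), which returns the value of a JavaScript `return` statement.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # height is stable, so this is the bottom
        last_height = new_height
    return max_rounds      # gave up after max_rounds scrolls


# Hypothetical usage with a real WebDriver object:
#     scroll_to_bottom(browser, pause=2.0)
```

The `max_rounds` limit keeps the loop from running forever on pages that never stop loading.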
② Effect display:
Tabs are switched by index into the list of tab page handles.
② code:
import time
from selenium import webdriver
driver=webdriver.Chrome()
driver.get('https://www.baidu.com/')
time.sleep(1)
driver.find_element_by_id('kw').send_keys('python')
time.sleep(1)
driver.find_element_by_id('su').click()
time.sleep(1)
# Open a new tab by executing JS
js = "window.open('https://www.sougou.com');"
driver.execute_script(js)
time.sleep(1)
# 1. Get all current windows
windows = driver.window_handles
time.sleep(2)
# 2. Switch according to window index
driver.switch_to.window(windows[0])
time.sleep(2)
driver.switch_to.window(windows[1])
time.sleep(6)
driver.quit()
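Instead of relying on fixed index positions, the new tab's handle can be found by comparing window_handles before and after opening it. The helper below is a sketch under that assumption; `open_in_new_tab` is a made-up name, and the URL is passed to the script through `arguments[0]`, which is how execute_script() exposes its extra arguments to JavaScript.

```python
def open_in_new_tab(driver, url):
    """Open `url` in a new tab, switch to it, and return its handle."""
    before = set(driver.window_handles)
    driver.execute_script("window.open(arguments[0]);", url)
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new tab was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


# Hypothetical usage with a real WebDriver object:
#     first_tab = driver.current_window_handle
#     open_in_new_tab(driver, "https://www.sougou.com")
#     driver.switch_to.window(first_tab)  # go back to the first tab
```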
③ Effect display:
8. Hands-on: Selenium slider CAPTCHA:
Coming soon, stay tuned!
Part 3 — In The End!
Start now, stick with it, and make a little progress every day. In the near future, you will thank yourself for your efforts!
I will keep updating the crawler basics column and the crawler hands-on column. If you read this article carefully, feel free to like it, bookmark it, and leave a comment with your thoughts. You can also follow this blog to read more crawler articles in the days ahead!
If there are mistakes or poorly chosen words, please point them out in the comment area. Thank you!
If you want to repost this article, please contact me for my consent and credit the source and the name of the blogger. Thank you!