Hello everyone, I'm Xiao Cai, a man who wants to become someone who talks architecture! If you want to become that person too, give a follow and keep Xiao Cai company, so Xiao Cai is no longer lonely!
This article focuses on getting started with Python
Refer to it as needed
If it helps, don't forget to like and share!
The WeChat official account "the vegetable farmer says" is now live; students who haven't followed it yet, remember to follow!
Hello, everybody. This is Xiao Cai, formerly known as "the vegetable farmer says". Don't get lost just because the name or the profile picture changed!
Recently, to broaden my language toolbox, I spent this week getting to know how Python is played. After learning it, I realized: wow, it smells good. When you first learn a language, have you ever felt that it's genuinely fun, and you want to try everything with it?
When Python comes up, people's first reactions are usually crawlers and automated testing; they seldom talk about using Python for web development. Comparatively speaking, Java is still the mainstream web development language in China, but that does not mean Python is unsuited for web development. As far as I know, its most common web frameworks are Django, Flask, and so on.
Django is a heavyweight framework that provides many handy tools and encapsulates a great deal, so you don't have to reinvent the wheel yourself.
Flask's advantage is that it is small, but that is also its drawback: being flexible means building more wheels yourself, or spending more time on configuration.
But this article is about neither web development in Python nor a full Python introduction; it is about getting started with automated testing and crawlers in Python.
In my opinion, if you have development experience in another language, Xiao Cai suggests learning straight from examples, reading code while you learn: the syntax is mostly the same (later posts will study Python content alongside Java), and you can basically read the code in one pass. Students without any development experience should learn Python from scratch; videos and books are good choices. I recommend Liao Xuefeng's blog, which hosts a good Python tutorial.
I. Automated testing
Python can do many things, and many interesting things.
To learn a language, of course, you need to find something fun to do with it; you'll learn faster that way. For example, crawling the pictures or videos of some website, right?
What is automated testing? In short, you write a script (a .py file) that automatically runs your testing process in the background. A great tool that can help you with automated testing is Selenium.
Selenium is a web automation testing tool that makes it easy to simulate a real user operating the browser. It supports all major browsers, such as Internet Explorer, Chrome, Firefox, Safari, Opera, and so on. Here we demonstrate with Python, but Selenium is not limited to Python; it has client drivers for multiple programming languages.
1) Preparation
For the demo to go smoothly, we need some preparation in advance; otherwise the browser may fail to open properly.
Step 1
Check the browser version. We use Edge here: enter edge://version in the address bar to see the browser version, then go to the driver store and install the driver matching that version: Microsoft Edge WebDriver (Windows.net).
Step 2
Then we will unzip the downloaded driver file into the Scripts folder in your Python installation directory
2) Browser operation
With the preparation done, let's look at the following simple script (a minimal sketch, assuming the Edge setup above):
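```python
from selenium import webdriver

# Create the browser object with the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu homepage
driver.get("http://baidu.com")
```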
Counting the import, that is only four lines of code. Enter `python autotest.py` in the terminal and you get the following demonstration:
You can see that the script automatically opens the browser, maximizes the window, and opens the Baidu page: three automated operations, and our learning moves one step forward. Doesn't it feel a little interesting? Let yourself sink in, step by step!
Here are a few common approaches to browser manipulation:
Method | Description |
---|---|
webdriver.xxx() | Create the browser object |
maximize_window() | Maximize the window |
get_window_size() | Get the browser window size |
set_window_size() | Set the browser window size |
get_window_position() | Get the browser window position |
set_window_position(x, y) | Set the browser window position |
close() | Close the current tab/window |
quit() | Close all tabs/windows |
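Here is a quick sketch exercising a few of these methods (a minimal example, again assuming the Edge driver from above):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.set_window_size(1024, 768)     # resize the window
print(driver.get_window_size())       # e.g. {'width': 1024, 'height': 768}
driver.set_window_position(0, 0)      # move the window to the top-left corner
print(driver.get_window_position())   # e.g. {'x': 0, 'y': 0}
driver.quit()                         # close all windows and end the session
```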
These are, of course, the basic general operations of Selenium, and there are better ones to come
When we open the browser, we naturally want to do more than just open a web page; after all, a programmer's ambition is infinite! We also want to operate on page elements automatically, and that brings us to Selenium's locating operations.
3) Locate elements
Locating page elements is nothing strange to front-end developers; with JS it is easy to locate elements, for example:

- Locate by ID

`document.getElementById("id")`

- Locate by name

`document.getElementsByName("name")`

- Locate by tag name

`document.getElementsByTagName("tagName")`

- Locate by class name

`document.getElementsByClassName("className")`

- Locate by CSS selector

`document.querySelectorAll("css selector")`
Selenium, as an automated testing tool, provides eight ways to locate page elements, as follows:
- Locate by ID

`driver.find_element_by_id("id")`

When we open the Baidu page, we can see (via F12) that the ID of the search input box is kw.

Once we know the element's ID, we can locate the element by it, as follows:
```python
from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu page
driver.get("http://baidu.com")
# Locate the element by ID
i = driver.find_element_by_id("kw")
# Type a value into the input box
i.send_keys("The vegetable farmer said.")
```
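One caveat: in newer Selenium releases (4.3 and later), the find_element_by_* helpers were removed in favor of a unified find_element API. If your Selenium version complains, the equivalent call looks like this:

```python
from selenium.webdriver.common.by import By

i = driver.find_element(By.ID, "kw")
```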
- Locate by name attribute value

`driver.find_element_by_name("name")`

Locating by name is similar to locating by ID: find the element's name value, then call the corresponding API. Usage:
```python
from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu page
driver.get("http://baidu.com")
# Locate the element by name
i = driver.find_element_by_name("wd")
# Type a value into the input box
i.send_keys("The vegetable farmer said.")
```
- Locate by class name

`driver.find_element_by_class_name("className")`

Locating works the same way as with ID and name: find the element's className, then locate it, as sketched below ~
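A minimal sketch, assuming Baidu's search box carries the class s_ipt (this class name is an assumption here; inspect the page with F12 to confirm):

```python
# Assumes the search box has class "s_ipt"; verify with F12 first
i = driver.find_element_by_class_name("s_ipt")
i.send_keys("The vegetable farmer said.")
```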
- Locate by tag name

`driver.find_element_by_tag_name("tagName")`

We rarely use this approach in daily work, because HTML defines functionality through tags: an input is input, a table is table, and so on. Every element is really a tag, and one tag usually marks a whole class of functionality, so a page may hold multiple divs, inputs, tables, and the like. That makes it hard to locate a specific element accurately through tags.
- CSS selectors

`driver.find_element_by_css_selector("cssValue")`

This approach requires familiarity with the five kinds of CSS selectors.

The five CSS selectors:
- Element selector
The most common CSS selector is the element selector, which in an HTML document usually refers to an HTML element, for example:

```css
html { background-color: black; }
p { font-size: 30px; background-color: gray; }
h2 { background-color: red; }
```
- Class selectors
A dot (.) followed by the class name forms a class selector, for example:

```css
.deadline { color: red; }
span.deadline { font-style: italic; }
```
- The ID selector

The ID selector is somewhat similar to the class selector, but the difference is significant: an element can carry multiple classes in its class attribute, yet it can have only one unique ID. An ID selector is the ID value prefixed with a hash #, for example:

```css
#top { ... }
```
- Property selector
We can select elements based on their attributes and attribute values, for example:

```css
a[href][title] { ... }
```
- Derived selector
Also known as the context selector, it selects elements using the document's DOM structure, for example:

```css
body li { ... }
h1 span { ... }
```
Of course, this is only a brief introduction to selectors; consult the documentation yourself for more ~

Now that we know the selectors, we can happily locate elements with CSS selectors:
```python
from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu page
driver.get("http://baidu.com")
# Locate the element with an ID selector
i = driver.find_element_by_css_selector("#kw")
# Type a value into the input box
i.send_keys("The vegetable farmer said.")
```
- Locate by link text

`driver.find_element_by_link_text("linkText")`

This approach is made specifically for locating text links; for example, Baidu's homepage contains link elements such as News, hao123, Map, and so on.

So we can locate them by their link text:
```python
from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu page
driver.get("http://baidu.com")
# Locate the element by its link text, then click it
driver.find_element_by_link_text("hao123").click()
```
- Locate by partial link text

`driver.find_element_by_partial_link_text("partialLinkText")`

This approach complements link_text. Sometimes a hyperlink's text is very long, and typing all of it would be both cumbersome and ugly.

In fact, we only need to give a fragment of the string for Selenium to understand which link we mean, using the partial_link_text method, as sketched below.
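A minimal sketch (reusing the hao123 link from the previous example, matched here by only a fragment of its text):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://baidu.com")
# "hao" is enough of a fragment to match the "hao123" link
driver.find_element_by_partial_link_text("hao").click()
```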
- Locate by XPath expression

`driver.find_element_by_xpath("xpathName")`

Ideally, every element would have a unique id, name, class, or link-text attribute, and we could locate it by that unique value. But sometimes the element we want to locate has no id, name, or class attribute; or multiple elements share the same attribute values; or the values change every time the page refreshes. In those cases we can only locate by XPath or CSS. And of course you don't have to work out the XPath yourself: open the page, find the element with F12, then right-click it and choose Copy XPath.

Then use it in the code:
```python
from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu page
driver.get("http://www.baidu.com")
# Locate the input box by XPath and type into it
driver.find_element_by_xpath("//*[@id='kw']").send_keys("The vegetable farmer said.")
```
4) Element operation
Of course we don't just want to select elements; we want to act on them after selecting. The demos above already used the click() and send_keys("value") operations; here are a few more:
Method | Description |
---|---|
click() | Click the element |
send_keys("value") | Simulate keyboard input |
clear() | Clear the element's content, such as an input field |
submit() | Submit the form |
text | Get the element's text content |
is_displayed() | Check whether the element is visible |
Does this give you a sense of déjà vu? These are exactly the basic DOM operations from JS ~!
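A quick sketch combining several of these operations (a minimal example on the Baidu homepage):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://baidu.com")

i = driver.find_element_by_id("kw")
i.send_keys("python")
print(i.is_displayed())  # True if the input box is visible
i.clear()                # wipe what we just typed
i.send_keys("selenium")
i.submit()               # submit the surrounding search form
```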
5) Practical exercises
Having learned the operations above, we can simulate a shopping flow on the Xiaomi Mall. The code is as follows:
```python
from selenium import webdriver

item_url = "https://www.mi.com/buy/detail?product_id=10000330"

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the shopping page
driver.get(item_url)
# Implicit wait, in case a congested network delays page loading
driver.implicitly_wait(30)
# Choose the delivery address
driver.find_element_by_xpath("//*[@id='app']/div[3]/div/div/div/div[2]/div[2]/div[3]/div/div/div[1]/a").click()
driver.implicitly_wait(10)
# Click to pick the address manually
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div["
                             "1]/div/div/div[2]/span[1]").click()
# Select the province (Fujian)
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[13]").click()
driver.implicitly_wait(10)
# Select the city
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(10)
# Select the district
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(10)
# Select the street
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(20)
# Click "add to cart"
driver.find_element_by_class_name("sale-btn").click()
driver.implicitly_wait(20)
# Go to the cart for checkout
driver.find_element_by_xpath("//*[@id='app']/div[2]/div/div[1]/div[2]/a[2]").click()
driver.implicitly_wait(20)
# Click "check out"
driver.find_element_by_xpath("//*[@id='app']/div[2]/div/div/div/div[1]/div[4]/span/a").click()
driver.implicitly_wait(20)
# Click to accept the agreement
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div[3]/button[1]").click()
```
The effect is as follows:
This puts our learning into practice. Of course, when a flash sale comes around, you might as well write a script to practice with ~ 💥 And if the item is out of stock, we can add a while loop to poll the page!
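A rough sketch of that polling idea (reusing item_url and the sale-btn class from the script above; the 3-second retry interval is an arbitrary choice):

```python
import time

# Keep refreshing the product page until "add to cart" succeeds
while True:
    driver.get(item_url)
    try:
        driver.find_element_by_class_name("sale-btn").click()
        break  # in stock: clicked add-to-cart, stop polling
    except Exception:
        time.sleep(3)  # out of stock or not rendered yet: wait and retry
```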
II. Crawlers
Above we demonstrated how to use Selenium for automated testing. Next, let's get to know another powerful use of Python: crawlers.
Before learning about crawlers, we need to understand a few necessary tools
1) Page loader
The Python standard library already provides modules for HTTP requests, such as urllib, urllib2, and httplib, but their APIs are not elegant enough ~ They demand a huge amount of work, including overriding various methods, to complete even the simplest tasks. Naturally, programmers could not tolerate that, so all parties developed a variety of excellent third-party libraries for us to use ~
- requests

Requests is an Apache2-licensed HTTP library written in Python. It is a high-level wrapper over Python's built-in modules, letting users easily perform, in a network request, every operation available in the browser.
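A minimal sketch of a GET request with requests:

```python
import requests

resp = requests.get("https://www.baidu.com")
print(resp.status_code)   # 200 on success
resp.encoding = "utf-8"   # set the decoding explicitly
print(resp.text[:100])    # first 100 characters of the HTML
```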
- scrapy

The difference between requests and Scrapy is that Scrapy is a heavier framework: it is a site-level crawler, whereas requests is a page-level library, with concurrency and performance below Scrapy's.
2) Page parsers
- BeautifulSoup
BeautifulSoup is a module that takes an HTML or XML string, formats it, and then lets you quickly find specified elements through the methods it provides, making element extraction from HTML or XML easy.
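A minimal parsing sketch with BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello, crawler</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
# Find the first <p> carrying class "intro" and read its text
print(soup.find("p", class_="intro").text)  # Hello, crawler
```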
- scrapy.Selector
Scrapy's Selector is an advanced wrapper built on Parsel; it selects some part of an HTML document specified by an XPath or CSS expression. It is built on top of the lxml library, which means it is very similar to lxml in speed and parsing accuracy.
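A minimal Selector sketch:

```python
from scrapy.selector import Selector

html = "<html><body><span>Hello, crawler</span></body></html>"
sel = Selector(text=html)
print(sel.xpath("//span/text()").get())  # Hello, crawler
print(sel.css("span::text").get())       # Hello, crawler
```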
See the Scrapy documentation for details
3) Data storage
Once we have crawled the content down, we need a corresponding storage backend to keep it in
Detailed database operations will be covered in a future Web development blog post
- TXT text
Handled with Python's ordinary file operations
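For example, a minimal sketch of appending crawled text to a file:

```python
# Append a line of crawled content to a text file
with open("result.txt", "a", encoding="utf-8") as f:
    f.write("crawled content\n")
```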
- sqlite3
SQLite is a lightweight, ACID-compliant relational database management system contained in a relatively small C library
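A minimal sketch using the standard-library sqlite3 module (the file and table names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect("books.db")  # creates the file if it doesn't exist
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS chapter (title TEXT, content TEXT)")
cur.execute("INSERT INTO chapter VALUES (?, ?)", ("intro", "..."))
conn.commit()
conn.close()
```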
- mysql
No introduction needed; those who know, know. Web development's old flame
4) Practical exercises
A web crawler, more formally called web data collection, is easier to understand that way: it programmatically requests data (HTML pages) from a web server, then parses the HTML to extract the data you want.
We can simply divide it into three steps:
- Get the HTML data from the given URL
- Parse the HTML and extract the target data
- Store the data
Of course, all this requires that you know Python's simple syntax and HTML's basic workings
Let's do an exercise with the combination requests + BeautifulSoup + text storage. Suppose we want to crawl Liao Xuefeng's Python tutorial ~
```python
# Import the requests library
import requests
# Import file-handling libraries
import codecs
import os
from bs4 import BeautifulSoup
import sys
import importlib

importlib.reload(sys)

# Give the request a header to mimic a Chrome browser
global headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
server = 'https://www.liaoxuefeng.com/'
# Address of Liao Xuefeng's Python tutorial
book = 'https://www.liaoxuefeng.com/wiki/1016959663602400'
# Define the storage location
global save_path
save_path = 'D:/books/python'
if os.path.exists(save_path) is False:
    os.makedirs(save_path)

# Get a chapter's content
def get_contents(chapter):
    req = requests.get(url=chapter, headers=headers)
    html = req.content
    html_doc = str(html, 'utf8')
    bf = BeautifulSoup(html_doc, 'html.parser')
    texts = bf.find_all(class_="x-wiki-content")
    # Take the text of the content div; \xa0 is a non-breaking space
    content = texts[0].text.replace('\xa0' * 4, '\n')
    return content

# Write a chapter to a text file
def write_txt(chapter, content, code):
    with codecs.open(chapter, 'a', encoding=code) as f:
        f.write(content)

# Main method
def main():
    res = requests.get(book, headers=headers)
    html = res.content
    html_doc = str(html, 'utf8')
    # Parse the HTML
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Get all chapter links
    a = soup.find('div', id='1016959663602400').find_all('a')
    print('Total entries: %d' % len(a))
    for each in a:
        try:
            chapter = server + each.get('href')
            content = get_contents(chapter)
            chapter = save_path + "/" + each.string.replace("?", "") + ".txt"
            write_txt(chapter, content, 'utf8')
        except Exception as e:
            print(e)

if __name__ == '__main__':
    main()
```
When we run the program, we can find the crawled tutorial content under D:/books/python!
With that, we have implemented a simple crawler. But do crawl with care ~!
This article has explored Python along two dimensions: automated testing and crawlers
Stop talking, stop slacking. Come join Xiao Cai, a program ape who brags about architecture ~ Give a follow and keep each other company, so Xiao Cai is no longer lonely. See you later!
Work a little harder today, and tomorrow you'll have one less favor to beg for!
I am Xiao Cai, a man who grows stronger with you. 💋
The WeChat official account "the vegetable farmer says" is live; students who haven't followed it yet, remember to follow!