Hello, I’m Yue Chuang.
Many of you may have a knee-jerk reaction when you hear "Python" or "programming": hard. Today's Python class is an exception, because the skills you'll learn here don't require you to understand computer internals or complex programming patterns. Even non-developers can use them by simply swapping in their own links and file paths.
These practical tips are simply everyday best practices for letting Python help you, such as:
- Crawl documents, tables, and learning materials;
- Play with charts and generate data visualizations;
- Batch-rename files to automate office work;
- Batch-process images: add watermarks and adjust sizes.
Next, we'll go through these one by one in Python. The code I provide is general-purpose: just replace the web page links, file locations, and photos with the ones you want to work with.
If you don't have Python and the related environment installed, you can refer to my previous articles:
- Data analysis environment doesn’t match? Look at this!
- Python 3 Web Crawler Systematic One-on-One Teaching (Environment Installation)
Since data from different sections may be cross-referenced, I recommend first creating a working folder on your desktop, then experimenting with a separate Python file for each section. For example, you could create a new PyTips directory and, inside it, a folder for each tip holding the corresponding .py file. (Your folder layout may well differ from mine; adjust to taste.)
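If you would rather script the setup, here is a minimal sketch that creates such a layout on the desktop. The folder names are purely illustrative, not prescribed by this tutorial:

```python
import os

# Create a PyTips working directory on the desktop,
# with one sub-folder per tip (names are illustrative)
base = os.path.join(os.path.expanduser("~"), "Desktop", "PyTips")
for tip in ("tips_1", "tips_2", "tips_3"):
    os.makedirs(os.path.join(base, tip), exist_ok=True)
print("Working folders created under", base)
```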
1. Use Python crawlers skillfully to achieve financial freedom
First, you can use Python to write crawlers. What is a crawler? In simple terms, it grabs data from the web: documents, information, pictures, and so on. For example, you might crawl documents and study materials for the postgraduate entrance exam, grab tables from the web for analysis, or batch-download images.
Let’s take a look at how to do that.
1.1 Obtaining Documents and Learning Materials
First of all, decide which site you want to crawl and what you want from it. For example, suppose Xiao Yue wants to crawl the "Application Guide" section of the Qingyan Gang website, collecting the titles and hyperlinks of all the articles on the current page to make later browsing easier.
Link to crawl: zkaoy.com/sions/exam. Objective: collect the titles and hyperlinks of all articles currently on this page.
In Python, you can follow the two-step code template below (note: the third-party libraries urllib3 and beautifulsoup4 are required). Install them first:
```
pip install urllib3 beautifulsoup4
```
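As a quick sanity check that the installation worked, you can run the following in the Python interpreter (the version numbers you see will vary):

```python
# Verify the two dependencies import cleanly and print their versions
import urllib3
import bs4

print("urllib3:", urllib3.__version__)
print("beautifulsoup4:", bs4.__version__)
```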
The first step is to download the page and save it to a file, with the following code. **PS:** For clarity, I've split this into two code files; later I'll merge them into one.
```python
# urllib3 method
# file_name: Crawler_urllib3.py
import urllib3


def download_content(url):
    """Download a web page and return its content.

    The url parameter is the URL of the page to download.
    """
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    response_data = response.data
    html_content = response_data.decode()
    return html_content


def save_to_file(filename, content):
    """Save string content to a file.

    The first argument is the file name to save to; the second is the
    string content to save.
    """
    fo = open(filename, "w", encoding="utf-8")
    fo.write(content)
    fo.close()


def main():
    # Download the application guide page
    url = "https://zkaoy.com/sions/exam"
    result = download_content(url)
    save_to_file("tips1.html", result)


if __name__ == '__main__':
    main()
```
```python
# requests method
# file_name: Crawler_requests.py
import requests


def download_content(url):
    """Download a web page and return its content.

    The url parameter is the URL of the page to download.
    """
    response = requests.get(url).text
    return response


def save_to_file(filename, content):
    """Save string content to a file.

    The first argument is the file name to save to; the second is the
    string content to save.
    """
    with open(filename, mode="w", encoding="utf-8") as f:
        f.write(content)


def main():
    # Download the application guide page
    url = "https://zkaoy.com/sions/exam"
    result = download_content(url)
    save_to_file("tips1.html", result)


if __name__ == '__main__':
    main()
```
The second step is to parse the web page and extract the links and titles of the articles.
```python
# file_name: html_parse.py
# Parsing method one: use the built-in parser
from bs4 import BeautifulSoup


def create_doc_from_filename(filename):
    """Take the name of an HTML file to analyze and return the
    corresponding BeautifulSoup object."""
    with open(filename, "r", encoding='utf-8') as f:
        html_content = f.read()
    doc = BeautifulSoup(html_content, "html.parser")
    return doc


def parse(doc):
    post_list = doc.find_all("div", class_="post-info")
    for post in post_list:
        link = post.find_all("a")[1]
        print(link.text.strip())
        print(link["href"])


def main():
    filename = "tips1.html"
    doc = create_doc_from_filename(filename)
    parse(doc)


if __name__ == '__main__':
    main()
```
```python
# file_name: html_parse_lxml.py
# Parsing method two: specify the lxml parser
from bs4 import BeautifulSoup


def create_doc_from_filename(filename):
    """Take the name of an HTML file to analyze and return the
    corresponding BeautifulSoup object."""
    with open(filename, "r", encoding='utf-8') as f:
        html_content = f.read()
    soup = BeautifulSoup(html_content, "lxml")
    return soup


def parse(soup):
    post_list = soup.find_all("div", class_="post-info")
    for post in post_list:
        link = post.find_all("a")[1]
        print(link.text.strip())
        print(link["href"])


def main():
    filename = "tips1.html"
    soup = create_doc_from_filename(filename)
    parse(soup)


if __name__ == '__main__':
    main()
```
**PS:** The two files are almost identical; the only difference is that the second explicitly specifies the lxml parser (which requires the lxml package: pip install lxml).
After executing the code, you can see that the titles and links from the web page are printed to the screen:
```
On the blackboard! Students from these provinces cannot pre-register!
https://zkaoy.com/15123.html
Do repeat test-takers have to return to their registered residence to take the exam?
https://zkaoy.com/15103.html
These students cannot take part in pre-registration! Miss the notice and your registration may fail!
https://zkaoy.com/15093.html
...
How to read the admissions brochure and catalogue for the postgraduate entrance exam?
https://zkaoy.com/13924.html
```

(Output truncated; the full run prints one title and one link for every article on the page.)
Above, the code was split into pieces; merged into a single code file, it looks like this:
```python
# file_name: Crawler.py
import requests
from bs4 import BeautifulSoup


def download_content(url):
    """Download a web page and return its content.

    The url parameter is the URL of the page to download.
    """
    response = requests.get(url).text
    return response


def save_to_file(filename, content):
    """Save string content to a file.

    The first argument is the file name to save to; the second is the
    string content to save.
    """
    with open(filename, mode="w", encoding="utf-8") as f:
        f.write(content)


def create_doc_from_filename(filename):
    """Take the name of an HTML file to analyze and return the
    corresponding BeautifulSoup object."""
    with open(filename, "r", encoding='utf-8') as f:
        html_content = f.read()
    soup = BeautifulSoup(html_content, "lxml")
    return soup


def parse(soup):
    post_list = soup.find_all("div", class_="post-info")
    for post in post_list:
        link = post.find_all("a")[1]
        print(link.text.strip())
        print(link["href"])


def main():
    # Download the application guide page
    url = "https://zkaoy.com/sions/exam"
    filename = "tips1.html"
    result = download_content(url)
    save_to_file(filename, result)
    soup = create_doc_from_filename(filename)
    parse(soup)


if __name__ == '__main__':
    main()
```
Code file: github.com/AndersonHJB… (Universal code templates: 10 must-learn practical tips / 1.1 Use Python crawlers skillfully)
So what if you want to crawl other pages? You only need to replace a few things, as shown below.
- Replace the url with the address of the page you want to download;
- Replace the file name with the one you want the page saved as;
- BeautifulSoup is the function we use to parse the HTML structure and extract what we want. Here, we find all div tags whose class attribute is "post-info", then extract the text of the second a tag inside each. If the page you're parsing has a different structure, see how BeautifulSoup works in this tutorial: www.aiyc.top/673.html#… (a small runnable sketch follows this list).
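To get a feel for that find_all call, here is a minimal, self-contained sketch. The HTML snippet is made up to mimic the post-info structure described above; it is illustrative, not the real page:

```python
from bs4 import BeautifulSoup

# A made-up snippet mimicking the post-info structure of the page
html = """
<div class="post-info">
  <a href="/category/exam">Category</a>
  <a href="https://zkaoy.com/15123.html">Article title here</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for post in soup.find_all("div", class_="post-info"):  # every matching div
    link = post.find_all("a")[1]  # the second <a> holds the article link
    print(link.text.strip())
    print(link["href"])
```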
1.2 Grab tables and do data analysis
When we surf the web, we often come across useful tables that we would like to save for later. Copying them straight into Excel, however, easily distorts, garbles, or misformats them. With Python we can easily save tables from a web page. Install the dependencies urllib3 and pandas (if to_excel later complains about a missing engine, also pip install openpyxl):

```
pip install urllib3 pandas
```
Take the CMB (China Merchants Bank) foreign exchange page as an example. The Python code is as follows:
```python
# file_name: excel_crawler_urllib3.py
import urllib3
import pandas as pd


def download_content(url):
    # Create a PoolManager object and name it http
    http = urllib3.PoolManager()
    # Call the request method of the http object, passing the string
    # "GET" as the first argument and the URL to download as the second.
    # The request method returns an HTTPResponse object, named response.
    response = http.request("GET", url)
    # Get the data attribute of the response object and store it in
    # the variable response_data
    response_data = response.data
    # Decode response_data to get the content of the web page,
    # stored in the html_content variable
    html_content = response_data.decode()
    return html_content


def save_excel():
    html_content = download_content("http://fx.cmbchina.com/Hq/")
    # Call the read_html function, passing in the content of the web
    # page; it returns a list of DataFrames
    cmb_table_list = pd.read_html(html_content)
    # By printing each list element, we verified that the one we need
    # is the second, at index 1
    cmb_table_list[1].to_excel("tips2.xlsx")


def main():
    save_excel()


if __name__ == '__main__':
    main()
```
```python
# file_name: excel_crawler_requests.py
import requests
import pandas as pd
from requests.exceptions import RequestException


def download_content(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        else:
            return None  # non-200 status: no usable content
    except RequestException:
        return None  # network error: no usable content


def save_excel(filename):
    html_content = download_content("http://fx.cmbchina.com/Hq/")
    # Call the read_html function, passing in the content of the web
    # page; it returns a list of DataFrames
    cmb_table_list = pd.read_html(html_content)
    # By printing each list element, we verified that the one we need
    # is the second, at index 1
    # print(cmb_table_list)
    cmb_table_list[1].to_excel(filename)


def main():
    filename = "tips2.xlsx"
    save_excel(filename)


if __name__ == '__main__':
    main()
```
After execution, tips2.xlsx is generated in the directory where the code file resides; opened in Excel, it looks like the figure below. When you want to grab a table of your own, replace the following three things:

- the name of the Excel file to save;
- the url of the page that contains the table you want;
- the table index, i.e. which of the page's tables to grab (see the sketch after this list for how to find it).
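If you are not sure which index your table is at, a minimal sketch like this prints a preview of every table read_html finds, so you can pick the right subscript (it uses requests directly and assumes the CMB page from the example; swap in your own url):

```python
import pandas as pd
import requests

html_content = requests.get("http://fx.cmbchina.com/Hq/").text
# read_html returns one DataFrame per <table> found in the page
tables = pd.read_html(html_content)
print(len(tables), "tables found")
for i, table in enumerate(tables):
    print(f"--- table {i} ---")
    print(table.head())  # preview the first rows of each table
```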
Code link: github.com/AndersonHJB…
1.3 Download images in batches
When we come across a web page with many pictures we like, saving them one by one is inefficient.
We can also download images quickly with Python. Taking the Duitang website as an example, we see a page full of pictures that would be nice to download in one go. The approach is basically the same as in 1.1: first download the web page, then parse out the img tags in it, and then download the images. Before running anything, create a folder tips_3 in the working directory to hold the downloaded images.
First, download the web page, as before. The Python code is as follows.
```python
# -*- coding: utf-8 -*-
# @Author: AI Yue Chuang
# @Date: 2021-09-13 20:16:07
# @Last Modified by: aiyc
# @Last Modified time: 2021-09-13 21:02:58
import urllib3


def download_content(url):
    """Download a web page and return its content.

    The url parameter is the URL of the page to download.
    The code is the same as before.
    """
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    response_data = response.data
    html_content = response_data.decode()
    return html_content


def save_to_file(filename, content):
    """Save string content to a file.

    The first argument is the file name to save to; the second is the
    string content to save.
    """
    fo = open(filename, "w", encoding="utf-8")
    fo.write(content)
    fo.close()


url = "https://www.duitang.com/search/?kw=AI悦创&type=feed"
result = download_content(url)
save_to_file("tips3.html", result)
```
Then extract the img tags and download the images.
```python
from bs4 import BeautifulSoup
from urllib.request import urlretrieve


def create_doc_from_filename(filename):
    """Take the name of an HTML file to analyze and return the
    corresponding BeautifulSoup object."""
    fo = open(filename, "r", encoding='utf-8')
    html_content = fo.read()
    fo.close()
    doc = BeautifulSoup(html_content, "lxml")
    return doc


doc = create_doc_from_filename("tips3.html")
images = doc.find_all("img")
for i in images:
    src = i["src"]
    filename = src.split("/")[-1]
    # print(i["src"])
    urlretrieve(src, "tips_3/" + filename)
```
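Real pages often contain img tags that the simple loop above chokes on: src values that are protocol-relative (starting with //), inline data: URIs, or URLs carrying query strings. Here is a more defensive variant of the same loop, my own addition rather than part of the original code:

```python
# A hedged sketch: same download loop, but skips unusable src values.
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

with open("tips3.html", "r", encoding="utf-8") as fo:
    doc = BeautifulSoup(fo.read(), "lxml")

for img in doc.find_all("img"):
    src = img.get("src", "")
    if not src or src.startswith("data:"):
        continue  # skip missing src and inline base64 images
    if src.startswith("//"):
        src = "https:" + src  # complete protocol-relative URLs
    filename = src.split("/")[-1].split("?")[0]  # drop any query string
    if filename:
        urlretrieve(src, "tips_3/" + filename)
```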
After the code runs, open the tips_3 directory and you can see that the images have been downloaded. The replacement instructions are as follows.

- Replace the file name you want the web page saved as;
- Replace the url of the page you want to download from;
- Replace the folder where you want the pictures saved, and create that folder first.
In addition, some images on a page are only displayed after dynamic loading by JavaScript; the code above cannot download those (see the sketch below for one workaround). Code link: github.com/AndersonHJB…
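If you do need dynamically loaded images, one common approach is to render the page in a real browser first, for example with Selenium. This is my own hedged sketch, not part of this tutorial; it assumes pip install selenium and a working Chrome/ChromeDriver setup:

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.duitang.com/search/?kw=AI悦创&type=feed")
# page_source is the HTML after JavaScript has run, so dynamically
# loaded <img> tags are present; feed it to BeautifulSoup as before
html_content = driver.page_source
driver.quit()
```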
AI Yue Chuang ·V: Jiabcdefh