Python basics
I wrote the "Python 3 Minimalist Tutorial" (PDF), which is suitable for a quick start if you already have some programming basics; after reading it you will be able to write interfaces on your own.
requests
Requests is a Python HTTP library, roughly the equivalent of Android's Retrofit. It includes keep-alive and connection pooling, cookie persistence, automatic content decoding, HTTP proxies, SSL verification, connection timeouts, sessions, and many other features, and it is compatible with both Python 2 and Python 3. Github.com/requests/re… .
Installation
Mac:
pip3 install requests
Windows:
pip install requests
Sending requests
HTTP request methods include GET, POST, PUT, and DELETE.
import requests
# get request
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')
# post request
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert')
# put request
response = requests.put('http://127.0.0.1:1024/developer/api/v1.0/update')
# delete request
response = requests.delete('http://127.0.0.1:1024/developer/api/v1.0/delete')
Each request returns a Response object, which encapsulates the HTTP response data the server sends back. The main elements of the response include the status code, reason phrase, response headers, response URL, response encoding, response body, and so on.
# status code
print(response.status_code)
# response URL
print(response.url)
# reason phrase
print(response.reason)
# Response content
print(response.json())
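If you only care whether a call succeeded, a common pattern is to let Requests raise an exception for 4xx/5xx status codes via raise_for_status(); here is a minimal sketch against the same local endpoint (not part of the original example):
import requests

response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')
try:
    # raises requests.exceptions.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    print(response.json())
except requests.exceptions.HTTPError as e:
    print('request failed: %s' % e)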
Custom request headers
To add HTTP headers to a request, simply pass a dict to the headers keyword argument.
header = {'Application-Id': '19869a66c6',
          'Content-Type': 'application/json'}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all/', headers=header)
Build query parameters
You often want to pass some data in the URL query string, for example: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2. Requests lets you supply these parameters as a dictionary of strings via the params keyword argument.
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)
You can also pass a list as a value:
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)
# response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
Post request data
If the server expects form-encoded data, pass it with the data keyword argument.
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)
If you want to send a JSON-encoded body instead, use the json keyword argument and pass it a dictionary.
obj = {
    "article_title": "The Death of a Civil Servant 2"
}
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert', json=obj)
Response content
Requests automatically decodes the content from the server. Most Unicode character sets can be decoded seamlessly. Once the request is issued, Requests makes an educated guess about the encoding of the response based on the HTTP header.
# Response content
# str response content
# print(response.text)
# JSON response content
print(response.json())
# binary response content
# print(response.content)
# raw response content; the original request must set stream=True
# response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', stream=True)
# print(response.raw)
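If the guessed encoding turns out to be wrong, you can inspect and override it before reading response.text; a minimal sketch (response.encoding is part of the Requests API, the utf-8 value here is just an assumption):
print(response.encoding)      # encoding guessed from the HTTP headers
response.encoding = 'utf-8'   # override it if the guess is wrong (utf-8 assumed here)
print(response.text)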
Timeout
If timeout is not explicitly specified, Requests never times out on its own; if the server does not respond, the whole application blocks and cannot handle other requests.
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)  # timeout in seconds
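When the timeout is exceeded, Requests raises an exception that you can catch; a minimal sketch assuming the same local endpoint:
try:
    response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)
except requests.exceptions.Timeout:
    # raised when the server does not answer within 5 seconds
    print('request timed out')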
Proxy settings
If you visit a site too frequently, the server can easily block your requests; Requests has excellent proxy support.
# proxies
proxies = {
    'http': 'http://127.0.0.1:1024',
    'https': 'http://127.0.0.1:4000',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', proxies=proxies)
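Requests also offers Session objects (mentioned in the feature list above), which persist cookies and reuse the underlying connection across requests; a minimal sketch against the same local API, reusing the header from earlier:
# a Session keeps cookies and reuses the TCP connection across requests
session = requests.Session()
session.headers.update({'Application-Id': '19869a66c6'})
response = session.get('http://127.0.0.1:1024/developer/api/v1.0/all')
print(response.status_code)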
BeautifulSoup
BeautifulSoup is a Python HTML parsing library, roughly the equivalent of Java's Jsoup.
Installation
BeautifulSoup 3 has been discontinued, so use BeautifulSoup 4 directly.
Mac:
pip3 install beautifulsoup4
Windows:
pip install beautifulsoup4
Installing the parser
I use html5lib here, a pure-Python parser.
Mac:
pip3 install html5lib
Windows:
pip install html5lib
Basic usage
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object.
Parsing
from bs4 import BeautifulSoup

def get_html_data():
    html_doc = """
    <html>
    <head><title>WuXiaolong</title></head>
    <body>
    <p>Share Android technology and also look at popular technologies like Python.</p>
    <p>The original intention of writing this blog: to sum up experience and record my own growth.</p>
    <p>You have to work hard enough to look effortless! Focus! Exquisite!</p>
    <p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
    <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat official account</a></p>
    <p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html_doc, "html5lib")
Tag
tag = soup.head
print(tag)        # <head><title>WuXiaolong</title></head>
print(tag.name)   # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)     # <p>Share Android technology and also look at popular technologies like Python.</p>
print(soup.a['href'])  # href attribute of the first a tag: http://wuxiaolong.me/
Note: if multiple tags match, only the first one is returned, as with the p tag here.
Find
print(soup.find('p'))  # <p>Share Android technology and also look at popular technologies like Python.</p>
find also returns the first matching tag by default, or None if no matching node is found. To narrow the search, for example to the WeChat official account link here, specify the tag's class attribute value:
# since class is a Python keyword, it is written as class_ here
print(soup.find('p', class_="WeChat"))
# <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat official account</a></p>
Find all p tags:
for p in soup.find_all('p'):
    print(p.string)
A practical example
Some time ago, users reported that my personal app had stopped working. Although I no longer maintain the app, I have to at least keep it running. Most people know that its data is crawled (see: "Hand to Hand Teach You to Make a Personal App"). One benefit of crawling data is that you do not have to maintain it yourself; the drawback is that when the source website goes down or its HTML nodes change, parsing fails and there is no data. My original plan was to crawl with Python, insert the data into a local MySQL database, write the interface myself with Flask, call it from Android with Retrofit, and insert into Bmob with the Bmob SDK… Later I learned that Bmob provides a RESTful API, which solves the big problem: I can insert the crawled data directly from Python. What I demonstrate here is inserting into a local database; with Bmob you would call the RESTful API that Bmob provides instead.
Choosing a site
I chose meiriyiwen.com/random as the demo site. Each request returns a different article, and that is exactly what we exploit: as long as I request the page periodically and parse the data myself, I can insert it into the database.
Creating a database
I created the database directly with Navicat Premium; you can also create it from the command line.
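A minimal sketch of creating the database from Python instead, assuming a local MySQL server with user root/root (the same credentials used below); the database name python3learn matches the later code:
import pymysql

# connect without selecting a database, then create it if it does not exist
db = pymysql.connect(host='localhost', user='root', password='root')
cursor = db.cursor()
cursor.execute('create database if not exists python3learn default character set utf8')
cursor.close()
db.close()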
Create a table
Create a table named article with three fields: article_title, article_author, and article_content.
import pymysql

def create_table():
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article (
                 id int NOT NULL AUTO_INCREMENT,
                 article_title text,
                 article_author text,
                 article_content text,
                 PRIMARY KEY (`id`)
             )'''
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        # execute the SQL statement
        cursor.execute(sql)
        # commit the transaction
        db.commit()
        print('create table success')
    except BaseException as e:
        # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()

if __name__ == '__main__':
    create_table()
Parsing the website
First use Requests to fetch the site, then use BeautifulSoup to parse the nodes we need.
import requests
from bs4 import BeautifulSoup

def get_html_data():
    # get request
    response = requests.get('https://meiriyiwen.com/random')
    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    print('article_content=%s' % article_content)
Inserting into the database
Assume that article titles on this site are unique; when inserting, an article whose title already exists in the table is skipped.
import pymysql

def insert_table(article_title, article_author, article_content):
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # query and insert statements
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # check whether an article with the same title already exists
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # commit the transaction
            db.commit()
            print('-------------- "%s" insert table success -------------' % article_title)
            return True
        else:
            print('-------------- "%s" already exists -------------' % article_title)
            return False
    except BaseException as e:
        # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()
Scheduling
Set up a simple schedule so the crawler runs again after a fixed interval.
import sched
import time

# initialize the scheduler from the sched module
# the first argument is a function that returns a timestamp; the second can block until that time arrives
schedule = sched.scheduler(time.time, time.sleep)

# function triggered by the periodic schedule
def print_time(inc):
    # to do something
    print('to do something')
    schedule.enter(inc, 0, print_time, (inc,))

# the default interval is 60 s
def start(inc=60):
    # enter: delay, priority (used to order two events scheduled for the same time),
    # the function to trigger, and its arguments (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()

if __name__ == '__main__':
    # run every 5 s
    start(5)
The complete code
import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article (
                 id int NOT NULL AUTO_INCREMENT,
                 article_title text,
                 article_author text,
                 article_content text,
                 PRIMARY KEY (`id`)
             )'''
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        # execute the SQL statement
        cursor.execute(sql)
        # commit the transaction
        db.commit()
        print('create table success')
    except BaseException as e:
        # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # query and insert statements
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # check whether an article with the same title already exists
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # commit the transaction
            db.commit()
            print('-------------- "%s" insert table success -------------' % article_title)
            return True
        else:
            print('-------------- "%s" already exists -------------' % article_title)
            return False
    except BaseException as e:
        # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


def get_html_data():
    # get request
    response = requests.get('https://meiriyiwen.com/random')
    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    print('article_content=%s' % article_content)
    # insert into the database
    insert_table(article_title, article_author, article_content)


# initialize the scheduler from the sched module
# the first argument is a function that returns a timestamp; the second can block until that time arrives
schedule = sched.scheduler(time.time, time.sleep)


# function triggered by the periodic schedule
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))


# the default interval is 60 s
def start(inc=60):
    # enter: delay, priority (used to order two events scheduled for the same time),
    # the function to trigger, and its arguments (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # crawl every 5 minutes
    start(60 * 5)
One open question: this crawls a single article at a time. What about a site with an article list, where you click through to each article's details? Presumably you first fetch the list, then loop over each article's detail page, parse it, and insert it into the database. I have not worked out a better approach yet, so I will leave that for a future post; a rough sketch of the idea follows below.
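A rough sketch of that list-then-detail approach, purely illustrative: the URL parameter, the list selector, and the detail selectors below are all hypothetical and would need to be adapted to a real site.
def crawl_article_list(list_url):
    # hypothetical list page: every <a class="article-link"> points to a detail page
    response = requests.get(list_url)
    soup = BeautifulSoup(response.content, "html5lib")
    for link in soup.find_all('a', class_="article-link"):
        detail_url = link['href']
        # fetch and parse each detail page, then insert it
        detail = BeautifulSoup(requests.get(detail_url).content, "html5lib")
        title = detail.h1.string
        author = detail.find('p', class_="article_author").string
        content = str(detail.find('div', class_="article_text"))
        insert_table(title, author, content)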
Finally
Although I am learning Python as a hobby, I need to put it into practice, otherwise I will soon forget it. Look forward to my next Python article.
References
Get started — Requests 2.18.1 documentation
Crawler Entry series ii: Elegant HTTP library Requests
Beautiful Soup 4.2.0 documentation
BeautifulSoup: HTML text parsing library