Python basics

I wrote a Python 3 Minimalist Tutorial (PDF), which is suitable for a quick start if you already have some programming basics. After working through it you will be able to write interfaces on your own.

requests

Requests is a Python HTTP library, roughly the equivalent of Retrofit on Android. It includes keep-alive and connection pooling, cookie persistence, automatic content decoding, HTTP proxies, SSL verification, connection timeouts, sessions, and many other features, and it is compatible with both Python 2 and Python 3. GitHub: github.com/requests/re… .

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Send the request

HTTP request methods include GET, POST, PUT, and DELETE.

import requests

# get request
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')

# post request
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert')

# put request
response = requests.put('http://127.0.0.1:1024/developer/api/v1.0/update')

# delete request
response = requests.delete('http://127.0.0.1:1024/developer/api/v1.0/delete')
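The Session feature mentioned above is what gives you keep-alive, connection pooling, and cookie persistence across requests. A minimal sketch, assuming the same local demo API used throughout this article (the header value is only illustrative):

import requests

# a Session reuses the underlying connection (keep-alive / connection pooling)
# and persists cookies and default headers across requests
session = requests.Session()
session.headers.update({'Application-Id': '19869a66c6'})

response = session.get('http://127.0.0.1:1024/developer/api/v1.0/all')
print(response.status_code)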

The request returns a Response object, which encapsulates the data the server sends back in the HTTP response. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# status code
print(response.status_code)

# response URL
print(response.url)

# reason phrase
print(response.reason)

# response content
print(response.json())

Custom request headers

To add HTTP headers to a request, simply pass a dict to the headers keyword argument.

header = {'Application-Id': '19869a66c6',
          'Content-Type': 'application/json'
          }
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all/', headers=header)

Build query parameters

If you want to pass data in the URL query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2, Requests lets you supply these parameters as a dictionary of strings via the params keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

You can also pass a list as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3

Post request data

If the server expects form-encoded data, you can pass it with the data keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If you want to send a JSON-encoded body instead, use the json keyword argument and pass it a dictionary.

obj = {
    "article_title": "The Death of a Civil Servant 2"
}
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert', json=obj)

Response content

Requests automatically decodes the content from the server. Most Unicode character sets can be decoded seamlessly. Once the request is issued, Requests makes an educated guess about the encoding of the response based on the HTTP header.

# Response content
# returns the content as a str
# print(response.text)
# returns the JSON response content
print(response.json())
# returns the binary response content
# print(response.content)
# raw response content; the original request must set stream=True
# response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', stream=True)
# print(response.raw)
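As a side note (not shown in the snippet above), the guessed encoding is exposed as the writable response.encoding attribute, and you can override it before reading response.text if the guess is wrong:

# check the encoding Requests guessed from the HTTP headers
print(response.encoding)             # e.g. 'utf-8'
# print(response.apparent_encoding)  # encoding guessed from the body itself

# override it; response.text is then decoded with the new encoding
response.encoding = 'utf-8'
print(response.text)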

timeout

Requests will not time out automatically if timeout is not explicitly specified. If the server never responds, the entire application blocks and cannot handle other requests.

response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)  # seconds

Proxy settings

If you visit a site too frequently, the server can easily block your requests; Requests has solid proxy support.

# proxies
proxies = {
    'http': 'http://127.0.0.1:1024',
    'https': 'http://127.0.0.1:4000',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', proxies=proxies)

BeautifulSoup

BeautifulSoup is a Python HTML parsing library, roughly the equivalent of Jsoup in Java.

Installation

BeautifulSoup 3 has been discontinued, so use BeautifulSoup 4 directly.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing the parser

I'm using html5lib, which is pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

Basic usage

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object.

parsing

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>WuXiaolong</title></head>
<body>
<p>Share Android technology and also look at popular technologies like Python.</p>
<p>The original intention of writing blog: sum up experience, record own growth.</p>
<p>You have to work hard enough to look effortless! Focus! Exquisite!</p>
<p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
<p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat official account</a></p>
<p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, "html5lib")

tag

tag = soup.head
print(tag)  # <head><title>WuXiaolong</title></head>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>Share Android technology and also look at popular technologies like Python.</p>
print(soup.a['href'])  # href attribute of the first a tag: http://wuxiaolong.me/

Note: if multiple tags match, the first one is returned, as with the p tag here.

find

print(soup.find('p'))  # <p>Share Android technology and also look at popular technologies like Python.</p>

find also returns the first matching tag by default, or None if no matching node is found. If you want to narrow the search, for example to the WeChat official account link here, you can specify the tag's class attribute value:

# since class is a Python keyword, it is specified as class_
print(soup.find('p', class_="WeChat"))
# <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat official account</a></p>

Find all p tags:

for p in soup.find_all('p'):
    print(p.string)

Putting it into practice

Some time ago, some users reported that my personal APP was down. Although I no longer maintain this APP, I at least have to keep it running. Most people know that this APP's data is crawled (see: "Hand to Hand Teach You to Make a Personal APP"). One benefit of crawled data is that you do not have to manage the data yourself; the drawback is that when the other site goes down or its HTML nodes change, parsing fails and there is no data. My original plan was to crawl with Python, insert into a local MySQL database, write the interface with Flask, and call it from Android with Retrofit, or insert into Bmob with the Bmob SDK… Later I learned that Bmob provides a RESTful API, which solves the big problem: the Python crawler can insert the data directly. What I demonstrate here is inserting into a local database; if you use Bmob, you would call the RESTful API that Bmob provides instead.

Choosing a site

I chose meiriyiwen.com/random as the demo site. You will find that each request returns a different article. Taking advantage of this, I only need to request it periodically, parse the data myself, and insert it into the database.

Creating a database

I created it directly with Navicat Premium; you can also do it on the command line.
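If you would rather do it from Python, a minimal sketch with pymysql (the host, root/root credentials, and python3learn database name are the same assumptions used in the rest of this article):

import pymysql

# connect without selecting a database, then create it if it does not exist
conn = pymysql.connect(host='localhost', user='root', password='root')
try:
    with conn.cursor() as cursor:
        cursor.execute('CREATE DATABASE IF NOT EXISTS python3learn DEFAULT CHARACTER SET utf8')
    conn.commit()
finally:
    conn.close()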

Create a table

Create table article_title, article_author, article_content; create table article_title, article_author, article_content;

import pymysql


def create_table():
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates a table named article
    sql = '''create table if not exists article (
                 id int NOT NULL AUTO_INCREMENT,
                 article_title text,
                 article_author text,
                 article_content text,
                 PRIMARY KEY (`id`)
             )'''
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        # execute the SQL statement
        cursor.execute(sql)
        # commit the transaction
        db.commit()
        print('create table success')
    except BaseException as e:  # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


if __name__ == '__main__':
    create_table()

Parsing the site

First use Requests to fetch the page, then use BeautifulSoup to parse out the nodes we need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # get request
    response = requests.get('https://meiriyiwen.com/random')

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
        print('article_content=%s' % article_content)

Inserting into the database

Article titles on this site are assumed to be unique: before inserting, we check whether a record with the same title already exists and skip the insert if it does.

import pymysql


def insert_table(article_title, article_author, article_content):
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # query and insert statements
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # execute the query
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # commit the transaction
            db.commit()
            print('-------------- "%s" insert table success --------------' % article_title)
            return True
        else:
            print('-------------- "%s" already exists --------------' % article_title)
            return False
    except BaseException as e:  # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()

Timing settings

Set up a simple schedule so that the crawler runs again after a fixed interval.

import sched
import time


# initialize the scheduler from the sched module:
# the first argument is a function that returns a timestamp,
# the second is a function that can block until a time arrives
schedule = sched.scheduler(time.time, time.sleep)


# function triggered by the periodic schedule
def print_time(inc):
    # do something
    print('to do something')
    schedule.enter(inc, 0, print_time, (inc,))


# the default interval is 60 s
def start(inc=60):
    # enter: delay, priority (used to order two events scheduled for the same time),
    # the function to trigger, and its arguments (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # output once every 5 s
    start(5)

The complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates a table named article
    sql = '''create table if not exists article (
                 id int NOT NULL AUTO_INCREMENT,
                 article_title text,
                 article_author text,
                 article_content text,
                 PRIMARY KEY (`id`)
             )'''
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        # execute the SQL statement
        cursor.execute(sql)
        # commit the transaction
        db.commit()
        print('create table success')
    except BaseException as e:  # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # establish a connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # query and insert statements
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # create a cursor object using the cursor() method
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # execute the query
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # commit the transaction
            db.commit()
            print('-------------- "%s" insert table success --------------' % article_title)
            return True
        else:
            print('-------------- "%s" already exists --------------' % article_title)
            return False
    except BaseException as e:  # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


def get_html_data():
    # get request
    response = requests.get('https://meiriyiwen.com/random')

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
        print('article_content=%s' % article_content)

    # insert into the database
    insert_table(article_title, article_author, article_content)


# initialize the scheduler from the sched module:
# the first argument is a function that returns a timestamp,
# the second is a function that can block until a time arrives
schedule = sched.scheduler(time.time, time.sleep)


# function triggered by the periodic schedule
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))


# the default interval is 60 s
def start(inc=60):
    # enter: delay, priority (used to order two events scheduled for the same time),
    # the function to trigger, and its arguments (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    start(60 * 5)

Question: this only crawls a single article. What if the site has a list of articles and you have to click through to each article's detail page? Presumably you first fetch the list, then loop through each article's details and insert them into the database. I haven't worked out the best way to do this yet, so I'll leave it for a future topic; a rough sketch of the idea is shown below.
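For what it's worth, a minimal sketch of that list-then-detail approach. The list URL and the class names (article_link, article, author) are hypothetical; the real site's structure would need to be inspected, and insert_table() is the function from the complete code above.

import requests
from bs4 import BeautifulSoup

# hypothetical list page, for illustration only
LIST_URL = 'https://example.com/articles'


def crawl_article_list():
    response = requests.get(LIST_URL, timeout=5)
    soup = BeautifulSoup(response.content, "html5lib")

    # collect the detail-page links from the list page and crawl each one
    for link in soup.find_all('a', class_='article_link'):
        crawl_article_detail(link['href'])


def crawl_article_detail(detail_url):
    response = requests.get(detail_url, timeout=5)
    soup = BeautifulSoup(response.content, "html5lib")

    # parse the fields the same way as get_html_data(), then reuse insert_table()
    article = soup.find('div', class_='article')
    title = article.h1.string
    author = article.find('p', class_='author').string
    content = ''.join(str(p) for p in article.find_all('p'))
    insert_table(title, author, content)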

Finally

Although I am learning Python only as a hobby, I need to put it into practice, otherwise I will soon forget it. Look forward to my next Python article.

References

Get started — Requests 2.18.1 documentation

Crawler Introduction Series (2): Requests, an elegant HTTP library

Beautiful Soup 4.2.0 documentation

BeautifulSoup: HTML text parsing library