1. Introduction to Python

Python is an easy-to-learn, powerful programming language. It provides efficient high-level data structures, as well as simple and effective object-oriented programming features. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting in many areas.

At the same time, the Python interpreter is easy to extend: new functions and data types can be implemented in C or C++ (or any other language that can be called from C), and Python can be embedded as an extension language in customizable software. In addition, the Python interpreter for most platforms, along with the rich standard library, is freely available from the Python website in both source and binary form.

Like other scripting languages, Python evolved from and borrowed ideas from many earlier languages, including ABC, Modula-3, C, C++, Algol-68, Smalltalk, and the Unix shell. Compared to other scripting languages, Python has the following characteristics:

  • Easy to learn and read: Python has relatively few keywords, a simple structure, and a well-defined syntax, so Python code is simple and easy to read.
  • Easy to maintain: Python succeeds in part because its source code is fairly easy to maintain.
  • Extensive standard library support: One of Python's biggest advantages is its rich cross-platform standard library, which works well on UNIX, Windows, and Macintosh.
  • Great extensibility: Algorithms or modules that you don't want to expose can be developed in C or C++ and called from your Python program.
  • Portability: Thanks to its open-source nature, Python is portable across multiple platforms.
  • Database support: Python provides interfaces to all major commercial databases.
  • GUI programming: Python supports GUI programming, and the resulting programs can be ported to many systems.
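To make the "dynamic typing" and "high-level data structures" points above concrete, here is a tiny hypothetical sketch; all names in it are made up for illustration.

```python
# Dynamic typing: a list can hold values of mixed types, and type() inspects
# the runtime type of each element. Dicts and sum() are built in.
items = [1, "two", 3.0]             # a heterogeneous list
counts = {"apples": 3, "pears": 5}  # a dict literal

total = sum(counts.values())        # built-in aggregation over a container
print(total)  # 8

for thing in items:
    print(type(thing).__name__, thing)
```

Running this prints `8` followed by one line per element (`int 1`, `str two`, `float 3.0`).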

2. Install Python on macOS

Python 2.7 ships with Mac OS X 10.8 and later, but Python 2.7 is old and many APIs no longer support it. You are advised to install Python 3.7 or later from the Python website (www.python.org).

After the installation is complete:

  • There will be a Python 3.9 folder in your Applications folder. Here you'll find IDLE, the development environment that is a standard part of the official Python distribution, and Python Launcher, which handles double-clicking a Python script in the Finder.

  • A framework at /Library/Frameworks/Python.framework, containing the Python executable and libraries. The installer adds this location to the shell path and places a symbolic link to the Python executable in /usr/local/bin/.

At the same time, Apple ships its own Python installation in /System/Library/Frameworks/Python.framework and /usr/bin/python. Never modify or delete this content: it is controlled by Apple and used by Apple and third-party software.

After the Python installation completes, register Python in your shell profile: run `open -e ~/.bash_profile` in the terminal to open the ~/.bash_profile file, and add the following line:

    export PATH="/Library/Frameworks/Python.framework/Versions/3.9/bin:${PATH}"

Then, run the following commands in the terminal:

    alias python="/Library/Frameworks/Python.framework/Versions/3.9/bin/python3.9"
    source ~/.bash_profile

After that, run `python -V` to confirm that the version has been updated.

3. Development tools

Tools that support Python development include Jupyter Notebook, PyCharm, and Sublime Text / VS Code / Atom with the Kite plugin, among others.

3.1 Jupyter Notebook

Jupyter Notebook opens as a web page on which you can write and run code directly, with the results displayed beneath each code block. After installing it with pip, typing `jupyter notebook` on the command line opens it in the default browser. For some Python developers, Jupyter Notebook is the best IDE because it takes Python's interactive nature to the extreme. Its major advantages:

  • Shareable notebooks
  • Support for more than 40 programming languages
  • Lightweight
  • Interactive
  • Excellent visualization support
  • Markdown support

Installation Reference: Introduction to Jupyter Notebook, installation and use tutorial

3.2 PyCharm

PyCharm is a Python IDE created by JetBrains, the company behind the ReSharper plugin for Visual Studio. Like other full-featured IDEs, it supports code completion, smart hints, syntax checking, version control, unit testing, and Git integration, and it can quickly scaffold Django, Flask, and other Python web frameworks. It is often used for large projects; its downsides are that it can be sluggish on startup and the Professional edition is paid, though a free Community edition is available for download.

3.3 Sublime Text / VS Code / Atom + Kite

Sublime Text is a lightweight, cross-platform code editor that supports dozens of programming languages, including Python, Java, and C/C++. It is compact and runs quickly, with support for code highlighting, auto-completion, syntax hints, and extensive plugin extensions, and it can run Python programs directly.

VS Code is a cross-platform code editor developed by Microsoft. It supports most common programming languages and a rich plugin ecosystem, offering not only intelligent completion, syntax checking, and code highlighting but also built-in Git support, and it runs smoothly. It is a very good code editor; after installing the relevant plugins you can run Python programs directly.

Atom is a code editor developed by GitHub for programmers. Its interface is simple and intuitive, it is very easy to use, and it provides auto-completion, code highlighting, syntax hints, and fast startup.

4. Run Python

Currently, there are three ways to run Python:

4.1 Interactive interpreter

You can enter Python from a command-line window and start writing code in the interactive interpreter. This works on Unix, DOS, or any other system that provides a command line or shell.

    $ python      # Unix/Linux

or

    C:>python     # Windows/DOS

Common arguments to the Python command line are:

  • -d: Display debugging information during parsing
  • -O: Generate optimized bytecode (.pyo files)
  • -S: Do not run the site module's path setup at startup
  • -V: Print the Python version number
  • -x: Skip the first line of the source file (allows non-Unix forms of `#!`)
  • -c cmd: Run the Python code passed in as the cmd string
  • file: Run the Python script in the given file
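As a quick sanity check of two of the flags above, the snippet below invokes the interpreter from Python itself via `subprocess`; `sys.executable` is the path of the interpreter running the script.

```python
import subprocess
import sys

# -c runs the given string as a program
result = subprocess.run([sys.executable, "-c", "print(6 * 7)"],
                        capture_output=True, text=True)
print(result.stdout.strip())  # 42

# -V prints the interpreter version
version = subprocess.run([sys.executable, "-V"],
                         capture_output=True, text=True)
print(version.stdout.strip())  # e.g. "Python 3.9.1"
```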

4.2 Command Line Script

Python scripts can be executed from the command line by invoking the interpreter on a script file, as follows:

    $ python script.py    # Unix/Linux

or

    C:>python script.py   # Windows/DOS

Note: When executing a script directly (rather than passing it to the interpreter), ensure that the script has execute permission.
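The execute-permission note can be sketched on a Unix-like system as follows: write a throwaway script with a shebang line, mark it executable, and run it directly. The file name and contents here are made up for illustration.

```python
import os
import stat
import subprocess
import tempfile

# Write a minimal script whose first line is a shebang
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("#!/usr/bin/env python3\nprint('hello from script')\n")
    path = f.name

# Add the user-execute bit; without it, running the path directly fails
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
result = subprocess.run([path], capture_output=True, text=True)
print(result.stdout.strip())  # hello from script
os.unlink(path)
```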

4.3 Integrated Development Environment

PyCharm is a Python IDE created by JetBrains for macOS, Windows, and Linux. You can run Python programs by clicking the Run button on the panel.

5. Use Requests for Web Crawling

5.1 Basic Principles of Web Crawlers

A crawler is a program or script that automatically captures information from the World Wide Web according to certain rules. The basic principle: the crawler sends an HTTP request to the target server, the server returns a response, and the crawler client receives that response, extracts data from it, and then cleans and stores the data.

A web crawl is therefore just an HTTP request cycle. Take a web address visited in a browser as an example: after the user enters the URL, the client resolves the target server's IP address via DNS and establishes a TCP connection with it. Once the connection succeeds, the browser constructs an HTTP request and sends it to the server; the server retrieves the corresponding data from its database and packages it into an HTTP response, which is returned to the browser. The browser then parses, extracts, and renders the response, and finally displays it to the user. The complete process is as follows:

Note that HTTP requests and responses must follow a fixed format. Only when requests follow a consistent format can the server correctly parse requests sent by different clients; likewise, only when servers comply with a unified response format can clients correctly parse responses from different websites.
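The whole request/response cycle described above can be demonstrated without touching a real website: in the self-contained sketch below, a throwaway local HTTP server stands in for the target site and `urllib` plays the crawler client.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello, crawler"
        self.send_response(200)                        # status line
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()                             # blank line ends headers
        self.wfile.write(body)                         # response body

    def log_message(self, *args):                      # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)         # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "crawler" side: open a TCP connection, send a GET, read the response
with urlopen(f"http://127.0.0.1:{server.server_port}/") as response:
    status = response.status
    content = response.read().decode()

server.shutdown()
print(status, content)  # 200 hello, crawler
```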

5.2 Web crawler Examples

Python's standard library provides tools for making HTTP requests, but third-party open-source libraries provide far more functionality, so developers don't need to start from socket-level communication.

Before launching a request with urllib, we first construct a Request object, specifying the URL, request method, and request headers. Here the request body is empty because we do not need to submit any data to the server, so it can be left unspecified. The urlopen function then establishes a connection with the target server and sends the HTTP request. Its return value is a response object containing the response headers, response body, status code, and other attributes.

However, Python's built-in module is fairly low-level and requires a lot of code. For simple crawlers, consider Requests, which has nearly 30K stars on GitHub and is a very Pythonic library.

Here is sample code for requesting a URL using the Python built-in module urllib.

    import ssl
    from urllib.request import Request, urlopen

    def print_hi():
        context = ssl._create_unverified_context()
        request = Request(url="https://foofish.net/pip.html",
                          method="GET",
                          headers={"Host": "foofish.net"},
                          data=None)
        response = urlopen(request, context=context)
        headers = response.info()    # response headers
        content = response.read()    # response body
        code = response.getcode()    # status code
        print(headers)
        print(content)
        print(code)

    if __name__ == '__main__':
        print_hi()

Execute the above code, and you can see that the captured information is printed on the Python console:

Next, let's get familiar with this Pythonic library and how to use it.

5.2.1 Installing Requests

Installing Requests is simple: use the pip install command.

pip install requests

5.2.2 Basic Requests

A basic GET request is simple, requiring only requests.get().

    import requests

    url = ''
    headers = {'User-Agent':''}
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    print(res.text)

A POST request is just as simple, using requests.post().

    ...
    data = {}
    res = requests.post(url, headers=headers, data=data)
    ...

5.2.3 Advanced Requests

Request parameters: in front-end development, a GET request with parameters requires concatenating them onto the end of the URL, whereas in Python you pass them to requests.get() via the params argument.

    ...
    params = {}
    res = requests.get(url, headers=headers, params=params)
    ...
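The params mechanism is essentially URL encoding; the standard-library equivalent looks like the sketch below (the URL and parameter values are made up for illustration).

```python
from urllib.parse import urlencode

# urlencode turns a dict into the query string Requests would append
params = {"type": "answer", "page": 2}
query = urlencode(params)
url = "https://example.com/api?" + query
print(url)  # https://example.com/api?type=answer&page=2
```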

Cookies: to log in by specifying a Cookie directly:

    ...
    headers = {
        'User-Agent': '',
        'Cookie': '',
    }
    res = requests.get(url, headers=headers)
    ...

Session: if you want to stay logged in with the server across requests without specifying cookies each time, use a Session, which provides the same API as the requests module.

import requests

s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a": "c"})
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'

r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'

Client authentication

Web clients typically authenticate with the auth field, as shown below.

    ...
    auth = ('username', 'password')
    res = requests.get(url, headers=headers, auth=auth)
    ...
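Under the hood, the auth=(user, password) tuple performs HTTP Basic authentication: "user:password" is base64-encoded into an Authorization header. A minimal sketch with made-up credentials:

```python
import base64

def basic_auth_header(username, password):
    # base64-encode "username:password", then prefix with the Basic scheme
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": "Basic " + token}

print(basic_auth_header("user", "passwd"))
# {'Authorization': 'Basic dXNlcjpwYXNzd2Q='}
```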

Sometimes we need to limit how long a request may take; the timeout argument does this, as shown below.

requests.get('https://google.com', timeout=5)

Proxies: sending too many requests in a short period makes it easy for the server to flag the client as a crawler, so we often use proxy IPs to disguise the client's real IP, for example:

    import requests

    proxies = {
        'http': 'http://127.0.0.1:1080',
        'https': 'http://127.0.0.1:1080',
    }
    r = requests.get('http://www.kuaidaili.com/free/', proxies=proxies, timeout=2)

5.2.4 A First Real Crawl

With the basics covered, let's use Requests to crawl a Zhihu column as an example. How do we find the right request? Open the browser's developer tools, switch to the Network panel, and click through the requests on the left one by one to see whether the data we want appears on the right; static resources ending in .jpg, .js, and .css can be ignored. We then copy the candidate request URL into the browser and find that it does return the corresponding data. Next, let's examine how the request is structured.

  • Request URL: www.zhihu.com/api/v4/memb…
  • Request method: GET
  • User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36
  • Query parameters:

        type: answer
        column_id: c_1413092853995851776

With this request data, we can use the Requests library to build a request and then grab the network data in Python code, as shown below.

    import requests

    class SimpleCrawler:
        def crawl(self, params=None):
            # The UA must be specified, otherwise the Zhihu server rejects
            # the request as illegal
            url = "https://www.zhihu.com/api/v4/members/6c58e1e8c4d898befed2fafeb98adafe/profile/creations/feed"
            # Query parameters
            params = {"type": "answer", "column_id": "c_1413092853995851776"}
            headers = {
                "authority": "www.zhihu.com",
                "user-agent": "Mozilla/5.0 (Linux; Android 6.0.1; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36",
            }
            response = requests.get(url, headers=headers, params=params)
            print(response.text)
            # Parse the returned JSON data
            for follower in response.json().get("data"):
                print(follower)

    if __name__ == '__main__':
        SimpleCrawler().crawl()

Then, run the code above, and the output looks like this:

The above is a simple single-threaded crawler based on Requests. Through this example we have seen how to use the library and its overall flow. Request headers, request parameters, and Cookie information can all be specified directly in the request methods, and the returned response object provides a json() method that parses the response body into a Python object.
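That json() step is roughly json.loads applied to the response text; the sketch below shows it on a made-up payload shaped like the column feed (the ids and fields are invented for illustration).

```python
import json

# A stand-in for response.text from the crawl above
text = '{"data": [{"id": 1, "type": "answer"}, {"id": 2, "type": "answer"}]}'
payload = json.loads(text)            # what response.json() does internally
ids = [item["id"] for item in payload.get("data", [])]
print(ids)  # [1, 2]
```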