urllib.urlopen([url[, data[, proxies]]]) : docs.python.org/2/library/u…
Python's default network request library is the urllib family, consisting of urllib, urllib2, and urllib3, which often work together.
Of course, there are many excellent third-party libraries for sending network requests, the best known being Requests, but this article starts with the basics: sending network requests with urllib.
Source code address: Github.com/snowdreams1…
Online address: snowdreams1006.github.io/learn-pytho…
A reminder from "Snow Dream Technology Station": articles age, so by the time you read this some links may no longer work. Please follow the article's instructions and verify everything yourself. Just as believing a book blindly is worse than having no book at all, don't copy and paste the code directly: type it out by hand!
Environment set up
The Python environment used in this article is a virtual environment based on the VirtualEnv implementation, just to facilitate isolation of different environments and better simulate real user environments.
In actual development, you can follow this article step by step to build a development environment, or directly use the system default environment.
Environmental demonstration
Information about the demo environment is as follows:
(.env) $ python --version
Python 2.7.16
(.env) $ pip --version
pip 19.3.1 from ~/python/src/url/urllib/.env/lib/python2.7/site-packages/pip (python 2.7)
The following code works fine in this environment, but there is no guarantee that the rest of the environment will match the results of the demo, so the actual results will prevail.
Environment set up
If you don't need a virtual environment, you can skip the installation section and use the default environment, as long as the Python version used for testing is Python 2 rather than Python 3!
- Step 1 Install the virtual environment
virtualenv
sudo pip install virtualenv
Installing a virtual environment makes it easier to isolate different Python environments. You can also use the system default environment, so this step is optional, as are the following steps.
- Step 2 Prepare the virtual environment directory
.env
virtualenv .env
The virtual environment directory is hidden to prevent misoperations, but it can also be displayed as a normal directory.
- Step 3 Activate the virtual environment
.env
source .env/bin/activate
Once the virtual environment directory is ready, you need to activate the virtual environment. This step can be repeated without error!
- Step 4 Check which python and pip are in use
(.env) $ which python
~/python/src/url/urllib/.env/bin/python
(.env) snowdreams1006s-MacBook-Pro:urllib snowdreams1006$ which pip
~/python/src/url/urllib/.env/bin/pip
Once the virtual environment is activated, python and pip resolve to the .env directory under the current project rather than the system defaults; if the virtual environment is not activated, the system paths are shown instead.
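Besides `which python`, you can check from inside Python itself whether a virtual environment is active. A minimal sketch, assuming only the standard library: virtualenv on Python 2 exposes `sys.real_prefix`, while the built-in venv on Python 3 makes `sys.base_prefix` differ from `sys.prefix`.

```python
import sys

def in_virtualenv():
    # virtualenv (Python 2) sets sys.real_prefix; venv (Python 3)
    # makes sys.base_prefix differ from sys.prefix.
    return hasattr(sys, 'real_prefix') or \
        getattr(sys, 'base_prefix', sys.prefix) != sys.prefix

print(in_virtualenv())
```

Run it with and without `source .env/bin/activate` to see the value flip.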
Network request URllib library
If you find at runtime that network requests fail, you can replace httpbin.snowdreams1006.cn/ with httpbin.org/ or set up your own local test environment.
Two ways of setting up a local test environment are described below; of course, you can also simply use the online environments at httpbin.snowdreams1006.cn/ or httpbin.org/.
Docker installation
docker run -p 8000:80 kennethreitz/httpbin
On the first run the image is downloaded to the local machine before the container starts; on subsequent runs the container starts directly. The access address is http://127.0.0.1:8000/
Python installation
pip install gunicorn httpbin && gunicorn httpbin:app
The default port is 8000; if a port conflict occurs, specify a different port, for example gunicorn httpbin:app -b :9898.
How to send the simplest network request
urllib2.urlopen(url): sends the simplest network request and directly returns the response body text.
Create a new Python file named urllib_demo.py. The core steps are importing urllib2, sending the simplest GET request with urllib2.urlopen(), and finally reading the entire response body at once with response.read().
The code content is as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.read()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
If the file name is urllib_demo.py, run python urllib_demo.py from the terminal command line to view the output.
How do I know what the properties and methods are
print dir(response): enumerates the object's methods and attributes, so you can guess what is available even without documentation.
urllib2.urlopen(url) already lets us send the simplest network request. Whether it is a GET or a POST request, the most important thing is getting the response body afterwards, but there are other methods and attributes that cannot be ignored in actual development.
Therefore, in addition to mastering response.read() to read all the contents of the response body at once, we also need to know what properties and methods response has.
Using type(response) to get the object type and dir(response) to enumerate its attribute names, you can roughly guess what attributes and methods the object offers, even without documentation.
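The dir()-based guessing can even be automated: callable() separates method-like members from plain attributes. A small sketch on a stand-in object (the FakeResponse class here is hypothetical, merely mimicking the shape of the real response object):

```python
class FakeResponse(object):
    """Hypothetical stand-in with the same shape as the urllib2 response."""
    code = 200
    msg = 'OK'

    def getcode(self):
        return self.code

def split_members(obj):
    # Partition public names into (attributes, methods) by callability.
    attrs, methods = [], []
    for name in dir(obj):
        if name.startswith('_'):
            continue
        target = methods if callable(getattr(obj, name)) else attrs
        target.append(name)
    return attrs, methods

attrs, methods = split_members(FakeResponse())
print(attrs)    # ['code', 'msg']
print(methods)  # ['getcode']
```

Applied to the real response object, the same helper tells you at a glance which names are attribute accesses and which are method calls.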
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print type(response)
    print dir(response)

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
Here is the output of print type(response) and print dir(response); next we will pick out the commonly used attributes and methods.
# print type(response)
<type 'instance'>

# print dir(response)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'msg', 'next', 'read', 'readline', 'readlines', 'url']
- The status code (property) of the response object
response.code: gets the status code of the response object. Normally 200 means the request succeeded, while 500 is a typical server error.
The attribute names come from dir(response); combined with type(response), it is not hard to see that response.code holds the response status code. Whether to access it as response.code or call it as response.code() can be roughly inferred from print type(response.code).
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print type(response.read)
    print type(response.code)

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
print type(response.read) outputs <type 'instancemethod'>, a method type, while print type(response.code) outputs <type 'int'>, a basic type.
In other words, the output of type(response.code) is <type 'int'>, not <type 'instancemethod'>, so the way to get the status code is an attribute access.
The detailed code is as follows:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.code

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Status code of the response object (method)
response.getcode(): gets the status code of the response object. Normally 200 means the request succeeded, while 500 is a typical server error.
Similarly, print dir(response) tells us that a getcode field is available for invocation, but not whether it is an attribute or a method. Running print type(response.getcode) outputs <type 'instancemethod'>, which identifies it as a method call.
The detailed code is as follows:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.getcode()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Status code information for the response object (properties)
response.msg: gets the status description of the response object, for example OK for status code 200 and INTERNAL SERVER ERROR for 500.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get the response status code and description."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/status/200')
    print response.code
    print response.msg

    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/status/500')
    print response.code
    print response.msg

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
A normal request returns 200 OK, while an abnormal one is likely to be 500 INTERNAL SERVER ERROR. Note that urllib2 raises an exception for such error responses, and if the exception is not handled, the program terminates.
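To keep the program alive on a 500 response, wrap the call in try/except; the HTTPError object urllib2 raises also behaves like a response, carrying code and msg. A sketch that constructs the error locally instead of touching the network (the URL is just a placeholder), with a try/except import so it runs on Python 2 and 3:

```python
try:  # Python 2
    from urllib2 import HTTPError
except ImportError:  # Python 3
    from urllib.error import HTTPError

def describe(error):
    # HTTPError carries the status code and reason just like a response.
    return '%d %s' % (error.code, error.msg)

try:
    # Simulate what urllib2.urlopen() raises for a 500 response.
    raise HTTPError('http://httpbin.snowdreams1006.cn/status/500',
                    500, 'INTERNAL SERVER ERROR', None, None)
except HTTPError as e:
    print(describe(e))  # 500 INTERNAL SERVER ERROR
```

In real code you would put the urllib2.urlopen() call inside the try block instead of the raise.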
- Access links (properties) for the response object
response.url: gets the request URL.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.url

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Access link for the response object (method)
response.geturl(): gets the request URL.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.geturl()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Header information for the response object (property)
response.headers.dict: gets the header information as a dictionary.
In some cases a request only succeeds when specific request headers are set, so it is important to know what is sent and received by default. As before, print type(response.headers) combined with print dir(response.headers) lets you explore the attributes and methods available for invocation.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.headers.dict

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Header information for the response object (method)
response.info(): gets the header information, displayed line by line.
This is similar to response.headers.dict above, except that response.info() is suited to human-readable display rather than programmatic use.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.info()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- The response body of the response object (read method)
response.read(): reads the entire response body at once. Suitable when the body is relatively small, since all the data is loaded into memory in one convenient operation.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.read()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
response.read() returns a string, so it is easy to capture it in a variable for subsequent processing, e.g. result = response.read():
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = response.read()
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- The response body of the response object (readline method)
response.readline(): reads the response body one line at a time. Suitable for large bodies; loop until no more data can be read.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    line = response.readline()
    while line:
        print line
        line = response.readline()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
response.readline() only reads one line at a time, so manual concatenation is needed to obtain the complete response body, for example:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = ''
    line = response.readline()
    result = result + str(line)
    while line:
        line = response.readline()
        result = result + str(line)
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
The str(line) wrapper makes sure each piece is a string, though strictly it is unnecessary: response.readline() already returns a string.
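The read-then-loop pattern above can be exercised without any network access by swapping in an in-memory file object, which offers the same readline() interface as the response. A sketch, with io.StringIO standing in for the real response:

```python
import io

def read_all_lines(resp):
    # Same shape as the urllib2 loop: read one line, keep going while truthy.
    result = ''
    line = resp.readline()
    while line:
        result += line
        line = resp.readline()
    return result

# A fake response body standing in for urllib2.urlopen(...).
fake_response = io.StringIO(u'{\n  "args": {}\n}\n')
print(read_all_lines(fake_response))
```

The loop stops naturally because readline() returns an empty (falsy) string at end of stream.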
- The response body of the response object (readlines method)
response.readlines(): reads the response body into a list of lines; suitable when line-by-line processing is needed.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    for line in response.readlines():
        print line

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
Similarly, if the complete response body needs to be assembled from response.readlines(), it can be concatenated as in the following example:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = ''
    for line in response.readlines():
        result = result + str(line)
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
A more concise equivalent is result = ''.join(response.readlines()).
How to send a normal GET request
- Send directly without parameters
urllib2.urlopen(url): only a destination URL is required to send a GET request.
The simplest kind of request is a GET without any other parameters; just fill in the request URL, e.g. urllib2.urlopen('http://httpbin.snowdreams1006.cn/get'). Sample code:
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    """Get the response header and response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use simple urllib2<<<'
    use_simple_urllib2()
If the above code file is named urllib_demo.py, run the python urllib_demo.py file from the terminal command line, and the output is as follows:
(.env) $ python urllib_demo.py
>>>Use simple urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 13:38:27 GMT
Content-Type: application/json
Content-Length: 263
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/get"
}
The response header Connection: close indicates the connection closes automatically, and the empty args dictionary in the body shows no query parameters were sent.
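Since httpbin returns JSON, assertions like "args is empty" can be checked programmatically rather than by eye. A sketch using a canned body string in place of response.read():

```python
import json

# A canned body, standing in for response.read() on /get.
body = ('{"args": {}, "origin": "218.205.55.192", '
        '"url": "http://httpbin.snowdreams1006.cn/get"}')

data = json.loads(body)           # str -> dict
print(data['args'] == {})         # True: no query parameters were sent
print(data['url'])
```

The same json.loads() step is used later in this article when reading the proxy pool's responses.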
- Sending with transcoded parameters
In actual development, few GET requests carry no parameters at all. Native urllib also supports GET requests with query parameters; the simplest way is to splice the query parameters onto the target URL to get a URL with a query string.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
Similarly, if the above code file is named urllib_demo.py, run the Python urllib_demo.py file from the terminal command line, and the output looks like this:
(.env) $ python urllib_demo.py
>>>Use params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 13:59:23 GMT
Content-Type: application/json
Content-Length: 338
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {
    "param1": "hello",
    "param2": "world"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world"
}
The response header Connection: close again shows the connection closes automatically, and args is no longer an empty dictionary but contains exactly the query parameters we passed, confirming that the server received them, so this approach works.
If there are many query parameters, splicing them into a new URL by hand becomes very tedious: the query string must follow a ?, with parameters joined by &, e.g. ?param1=hello&param2=world.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world&author=snowdreams1006&website=http://blog.snowdreams1006.cn&url=snowdreams1006.github.io/learn-python/url/urllib/teaching.html&wechat=snowdreams1006&[email protected]&github=https://github.com/snowdreams1006/')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
Building such a long URL by hand is not only error-prone, it is also awkward when query parameters must change dynamically, so the automatic query-string construction below comes just in time!
params = urllib.urlencode({
    'param1': 'hello',
    'param2': 'world',
    'author': 'snowdreams1006',
    'website': 'http://blog.snowdreams1006.cn',
    'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
    'wechat': 'snowdreams1006',
    'email': '[email protected]',
    'github': 'https://github.com/snowdreams1006/'
})
print params
urllib.urlencode() transcodes a dictionary of query parameters into an &-joined query string, which can then be manually concatenated to the request URL after a ? to get the final URL with parameters.
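Note that urlencode lives in urllib on Python 2 but in urllib.parse on Python 3. A version-tolerant sketch, round-tripping through parse_qs to show the encoding is lossless even for values containing special characters like `://`:

```python
try:  # Python 2
    from urllib import urlencode
    from urlparse import parse_qs
except ImportError:  # Python 3
    from urllib.parse import urlencode, parse_qs

query = {'param1': 'hello', 'param2': 'world',
         'website': 'http://blog.snowdreams1006.cn'}

params = urlencode(query)  # e.g. 'param1=hello&param2=world&website=http%3A%2F%2F...'
url = 'http://httpbin.snowdreams1006.cn/get?%s' % params

# Decoding recovers the original values, special characters included.
decoded = {k: v[0] for k, v in parse_qs(params).items()}
print(decoded == query)  # True
```

This is also why percent-escapes such as %3A%2F%2F appear in the echoed url field of the output below.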
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information."""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?%s' % params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
If the above code file is named urllib_demo.py, run the python urllib_demo.py file from the terminal command line, and the output is as follows:
(.env) $ python urllib_demo.py
>>>Use params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 14:27:21 GMT
Content-Type: application/json
Content-Length: 892
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {
    "author": "snowdreams1006",
    "email": "[email protected]",
    "github": "https://github.com/snowdreams1006/",
    "param1": "hello",
    "param2": "world",
    "url": "https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html",
    "website": "http://blog.snowdreams1006.cn"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/get?website=http%3A%2F%2Fblog.snowdreams1006.cn&github=https%3A%2F%2Fgithub.com%2Fsnowdreams1006%2F&wechat=snowdreams1006&param2=world&param1=hello&author=snowdreams1006&url=https%3A%2F%2Fsnowdreams1006.github.io%2Flearn-python%2Furl%2Furllib%2Fteaching.html&email=snowdreams1006%40163.com"
}
In short, there are two ways to send a GET request with urllib2.urlopen(url): splice the query string by hand, or build it with urllib.urlencode() and append it to the URL.
How to send a normal POST request
If the request URL only supports POST requests, the spliced-address approach above no longer meets the requirement. Interestingly, it takes only one step to turn the GET request into a POST request.
For a GET request, the call looks like this: urllib2.urlopen('http://httpbin.snowdreams1006.cn/post?%s' % params);
for a POST request, it looks like this: urllib2.urlopen('http://httpbin.snowdreams1006.cn/post', params).
# -*- coding: utf-8 -*-
import urllib
import urllib2

def post_params_urllib2():
    """Get the response header and response body information."""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/post', params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Post params urllib2<<<'
    post_params_urllib2()
A careful reader may wonder how urllib2.urlopen() knows this is a POST rather than a GET: passing the encoded parameters as the second (data) argument instead of splicing them into the URL is exactly what switches the request method.
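This GET-to-POST switch can be confirmed offline through the Request class, whose get_method() reports POST exactly when a data body is present. A sketch (the try/except import keeps it running on both Python 2 and 3):

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

url = 'http://httpbin.snowdreams1006.cn/post'

# No data argument -> urlopen would issue a GET.
print(Request(url).get_method())                        # GET

# With a body attached -> urlopen would issue a POST.
print(Request(url, data=b'param1=hello').get_method())  # POST
```
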
However, a more intuitive method is to send a request for direct validation, as shown in the following example:
(.env) $ python urllib_demo.py
>>>Post params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 14:45:43 GMT
Content-Type: application/json
Content-Length: 758
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "author": "snowdreams1006",
    "email": "[email protected]",
    "github": "https://github.com/snowdreams1006/",
    "param1": "hello",
    "param2": "world",
    "url": "https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html",
    "website": "http://blog.snowdreams1006.cn",
    "wechat": "snowdreams1006"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "285",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "json": null,
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/post"
}
It is worth noting that the parameters submitted by the POST request above are stored in the form property, rather than in the args property as with a GET request.
How do I set up a proxy to make network requests
Environment set up
If proxyip.snowdreams1006.cn/ cannot be accessed, you can visit github.com/jhao104/pro… and build your own proxy pool from that project.
{
  "delete?proxy=127.0.0.1:8080": "delete an unable proxy",
  "get": "get an useful proxy",
  "get_all": "get all proxy from proxy pool",
  "get_status": "proxy number"
}
Please do not hammer the demo service; malicious access will get you blocked. Everyone is encouraged to build their own local environment. Thank you for your support.
The proxy pool is based on the jhao104/proxy_pool project, which provides two installation methods: Docker installation and source installation.
Docker installation
docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password='' -p 5010:5010 jhao104/proxy_pool
You can also pull the image in advance with docker pull jhao104/proxy_pool, then start the container with the command above.
Source code installation
- Step 1: Download the source code
git clone https://github.com/jhao104/proxy_pool.git
You can also download the release package from github.com/jhao104/pro…
- Step 2: Install dependencies
pip install -r requirements.txt
Installing the project dependencies requires switching to the project root first, e.g. cd proxy_pool; packages are then downloaded from the default index, or installation can be sped up with pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt.
- Step 3: Configure
Config/setting.py

# Config/setting.py is the project configuration file

# Configure the database
DATABASES = {
    "default": {
        "TYPE": "REDIS",      # currently supports SSDB or REDIS
        "HOST": "127.0.0.1",  # db host
        "PORT": 6379,         # db port, e.g. usually 8888 for SSDB, 6379 by default for REDIS
        "NAME": "proxy",      # default database name
        "PASSWORD": ""        # db password
    }
}

# Configure the API service
SERVER_API = {
    "HOST": "0.0.0.0",  # 0.0.0.0 listens on all IP addresses
    "PORT": 5010        # listening port
}

# The proxy pool access address is then http://127.0.0.1:5010
- Step 4: Start the project
If the dependencies are installed and the configuration is done, you can start the project via proxyPool.py in the cli directory.
The program has two parts: the schedule scheduler and the webserver API service.

# start the scheduler first
python proxyPool.py schedule

# then start the web API service
python proxyPool.py webserver

These commands assume the current directory is cli; if you are in another directory, adjust the path to proxyPool.py accordingly (cd cli switches into the cli directory).
If the preceding steps all succeed, free public proxy IPs are captured automatically once the project starts, and you can visit http://127.0.0.1:5010 to view them.
Making requests through a proxy
It is advisable to first visit proxyip.snowdreams1006.cn/get/ in a browser to check whether a random proxy IP can be obtained, and only then use it from the Python program; making sure the code runs correctly simplifies subsequent development and testing.
{
  "proxy": "183.91.33.41:8086",
  "fail_count": 0,
  "region": "",
  "type": "",
  "source": "freeProxy09",
  "check_count": 59,
  "last_status": 1,
  "last_time": "2020-01-18 13:14:32"
}

This is an example response from /get/ returning a random proxy IP; the full request address is http://proxyip.snowdreams1006.cn/get/
Obtain a random proxy IP address
# -*- coding: utf-8 -*-
import urllib2
import json

def get_proxy():
    """Get a random proxy."""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    print get_proxy()
In a browser you can visit proxyip.snowdreams1006.cn/get/ directly to verify that a random proxy IP is returned, or run curl http://proxyip.snowdreams1006.cn/get/ on the terminal command line to view the result.
Set the proxy IP address for access
urllib.FancyURLopener(proxy): sets a proxy IP for indirect access.
urllib.FancyURLopener(proxy) lets you set a proxy so the client's real information is hidden from the server; whether the server can distinguish the proxied request from an ordinary one depends on the proxy IP itself.
A high-anonymity (elite) proxy is the ideal case, truly fulfilling the role of a proxy.
A transparent proxy, by contrast, is the most useless kind: the server not only knows you are using a proxy but also knows your real IP address.
To verify whether the proxy IP you set can be identified by the server, visit httpbin.snowdreams1006.cn/ip to see the client IP the server reads.
$ curl http://httpbin.snowdreams1006.cn/ip
{
"origin": "115.217.104.191"
}
If the curl command is not available on your terminal, search for how to install it, or simply open httpbin.snowdreams1006.cn/ip in a browser.
If the IP the server reads equals the proxy IP you set, congratulations: the proxy was set successfully and it is high-anonymity; otherwise it did not take effect.
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy."""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    """Send a request through a proxy."""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()
The preceding example only shows how to set the proxy IP address to send requests. It does not verify whether the proxy IP address is set successfully, that is, whether the server reads the request IP address and the newly set proxy IP address, and does not consider exceptions such as unavailable proxy IP address or connection timeout.
The following provides a simple example to check whether the proxy IP address is set successfully:
{
  "proxy": "121.225.199.78:3128",
  "fail_count": 0,
  "region": "",
  "type": "",
  "source": "freeProxy09",
  "check_count": 15,
  "last_status": 1,
  "last_time": "2020-01-17 12:03:29"
}

This is the general format of a random proxy IP response; the value to extract is 121.225.199.78:3128.
The random proxy comes in ip:port format, while httpbin.snowdreams1006.cn/ip returns the source IP without a port, so the simplest idea is to strip the port from the random proxy and then compare the two.
'121.225.199.78:3128'.split(':')[0]

First split the string on ':' into two parts, then take only the first part, i.e. the IP address without the port number: 121.225.199.78.
Next, since response.read() returns the response body as a string, it is inconvenient to extract the origin value directly; the body is clearly JSON, so json.loads(result) conveniently converts it into a Python dictionary.
result = response.read()
result = json.loads(result)
proxyip = result.get('origin')
For dictionaries, both result.get('origin') and result['origin'] work; they differ when the key name does not exist: result['origin'] raises a KeyError, while result.get('origin') returns None.
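The difference matters when the server returns an unexpected body; a small sketch:

```python
result = {'origin': '121.225.199.78'}

print(result.get('origin'))   # 121.225.199.78
print(result['origin'])       # same value, the key exists

# Missing key: get() degrades gracefully, [] raises.
print(result.get('missing'))           # None
print(result.get('missing', 'n/a'))    # fallback value instead of None
try:
    result['missing']
except KeyError:
    print('KeyError raised')
```

Using get() is the defensive choice when parsing responses whose shape you do not control.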
Now the simplest complete example to verify that the proxy IP was set successfully is as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy."""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    """Send a request through a proxy."""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()
As long as the randomly obtained proxy IP is usable, no exception is thrown, and the output reports whether the proxy was set successfully or not:
(.env) $ python urllib_demo.py
>>>Get proxy urllib<<<
>>>Get Proxy:
52.80.58.248:3128
Proxy Fail
(.env) $ python urllib_demo.py
>>>Get proxy urllib<<<
>>>Get Proxy:
117.88.176.152:3000
Proxy Success
The quality of free proxy IPs is mediocre, so don't expect too much; in actual development you should choose paid proxy IPs.
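Because free proxies fail often, a common pattern is to retry with a fresh proxy on each failure. Below is a minimal, library-agnostic sketch; the function names and structure are our own, not from the original code. The actual fetching function is injected so the retry logic itself can be demonstrated offline:

```python
def fetch_with_retry(fetch, get_proxy, max_tries=3):
    """Try up to max_tries proxies; return the first successful result.

    fetch(proxy) should return a result or raise on failure;
    get_proxy() should return a fresh proxy string. (Hypothetical helpers.)
    """
    last_error = None
    for _ in range(max_tries):
        proxy = get_proxy()
        try:
            return fetch(proxy)
        except Exception as error:  # a bad proxy just means "try the next one"
            last_error = error
    raise last_error

# Offline demonstration with fake helpers: the first two proxies "fail"
proxies = iter(['1.1.1.1:80', '2.2.2.2:80', '3.3.3.3:80'])

def fake_fetch(proxy):
    if proxy != '3.3.3.3:80':
        raise IOError('proxy unreachable: %s' % proxy)
    return 'ok via %s' % proxy

result = fetch_with_retry(fake_fetch, lambda: next(proxies))
print(result)  # -> ok via 3.3.3.3:80
```

In real use, `fetch` would wrap the `urllib.FancyURLopener` call from the example above and `get_proxy` would hit the proxy pool.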
Clear the proxy and connect directly
`urllib.FancyURLopener({})`: clears the proxy settings so requests are sent directly.
When setting a proxy, a proxy dictionary is passed to `urllib.FancyURLopener(proxy)`; to clear the proxy information, simply replace that dictionary with an empty one.
The main code is the same as setting the proxy IP address, so please refer to the following code:
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy"""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def clear_proxy_urllib():
    """Send a request after clearing the proxy"""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    # First send a request through the proxy
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    result = response.read()
    print(result)
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Set proxy success'
    else:
        print 'Set proxy fail'
    # Then clear the proxy with an empty dictionary and request again
    opener = urllib.FancyURLopener({})
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    result = response.read()
    print(result)
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Clear proxy fail'
    else:
        print 'Clear proxy success'

if __name__ == '__main__':
    print '>>>Clear proxy urllib<<<'
    clear_proxy_urllib()
In addition to using `urllib.FancyURLopener()` to set or clear the proxy, `urllib.urlopen()` can achieve the same with its `proxies` parameter.
# Use http://www.someproxy.com:3128 for HTTP proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)
Examples of setting environment variables are as follows:
% http_proxy="http://www.someproxy.com:3128"
% export http_proxy
% python
...
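To check which proxies the environment actually provides, the standard library exposes `getproxies()`, which lives in `urllib` on Python 2 and in `urllib.request` on Python 3. A small compat sketch; setting `os.environ` here only simulates the shell `export` above:

```python
import os

try:
    from urllib import getproxies          # Python 2
except ImportError:
    from urllib.request import getproxies  # Python 3

# Simulate the shell export shown above, for demonstration purposes
os.environ['http_proxy'] = 'http://www.someproxy.com:3128'

proxies = getproxies()
print(proxies.get('http'))  # -> http://www.someproxy.com:3128
```

This is the same lookup `urllib.urlopen(some_url)` performs internally when no `proxies` argument is given.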
Summary
This article mainly introduced how to send network requests with Python's native urllib, along with the basic environment setup, and comes with plenty of directly runnable code; the documentation and source code are open source, so interested readers are welcome to browse.
Here is a brief review of the key points covered, so that later readers can quickly look things up while learning.
Virtual environment `virtualenv`
After the virtual environment is installed and activated, the Python and pip versions are as follows:
(.env) $ python --version
Python 2.7.16
(.env) $ pip --version
pip 19.3.1 from ~/python/src/url/urllib/.env/lib/python2.7/site-packages/pip (python 2.7)
To set up the virtual environment by yourself, perform the following steps to enable the virtual environment:
- Step 1: Install `virtualenv`
sudo pip install virtualenv
Installing a virtual environment makes it easier to isolate different Python environments. You can also use the system default environment, so this step is optional, as are the following steps.
- Step 2: Prepare the virtual environment directory `.env`
virtualenv .env
The virtual environment directory is hidden to prevent misoperations, but it can also be displayed as a normal directory.
- Step 3: Activate the virtual environment `.env`
source .env/bin/activate
After the virtual environment is activated, run `pip --version` to check the current version information and verify that the virtual environment started successfully.
Test server `httpbin`
Default local address: `http://127.0.0.1:8000/`; online addresses: `httpbin.snowdreams1006.cn/` or `httpbin.org/`.
If httpbin is installed successfully with docker, visiting the interface address shows the httpbin debugging page.
If you start the httpbin library with Python instead, the effect differs slightly from docker.
You are advised to install the httpbin service yourself; either of the following two methods enables it.
- docker installation
docker run -p 8000:80 kennethreitz/httpbin
On the first run the image is downloaded to the local machine before the container starts; on subsequent runs the container starts directly. The access address is `http://127.0.0.1:8000/`.
- python installation
pip install gunicorn httpbin && gunicorn httpbin:app
By default gunicorn listens on port 8000; if a port conflict occurs, specify another port with `gunicorn httpbin:app -b :9898`.
Free IP proxy pool `proxyip`
Default local address: `http://127.0.0.1:5010/`; online addresses: `proxyip.snowdreams1006.cn/` or `http://118.24.52.95/`.
{
  "delete?proxy=127.0.0.1:8080": "delete an unable proxy",
  "get": "get an useful proxy",
  "get_all": "get all proxy from proxy pool",
  "get_status": "proxy number"
}
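Based on the endpoint listing above, the full request URLs can be assembled with a small helper; the function below is our own sketch, not part of the proxy-pool project. Note how the proxy address is URL-encoded for the delete endpoint:

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

BASE = 'http://127.0.0.1:5010'

def proxy_pool_url(action, proxy=None):
    """Build a proxy-pool API URL such as /get/ or /delete/?proxy=... (sketch)."""
    if action == 'delete' and proxy:
        # quote() percent-encodes the ':' in 'host:port'
        return '%s/delete/?proxy=%s' % (BASE, quote(proxy))
    return '%s/%s/' % (BASE, action)

print(proxy_pool_url('get'))                       # -> http://127.0.0.1:5010/get/
print(proxy_pool_url('delete', '127.0.0.1:8080'))  # -> http://127.0.0.1:5010/delete/?proxy=127.0.0.1%3A8080
```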
If you need to set up local services by yourself, please decide the installation method based on your own needs. The following two methods are used to enable the ProxyIP service.
- docker installation
docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password='' -p 5010:5010 jhao104/proxy_pool
You can also pull the image in advance with `docker pull jhao104/proxy_pool`, and then run the command above to start the container.
- Source code installation
- Step 1: Download the source code
git clone https://github.com/jhao104/proxy_pool.git
Of course, you can also directly download the installation package: github.com/jhao104/pro…
- Step 2: Install dependencies
pip install -r requirements.txt
Note: switch to the project root directory first (`cd proxy_pool`) before installing dependencies. If the download is too slow, use the Tsinghua University mirror to speed it up: `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt`.
- Step 3: Configure `Config/setting.py`
# Config/setting.py is the project configuration file

# Configure the database
DATABASES = {
    "default": {
        "TYPE": "REDIS",      # currently supports SSDB or REDIS
        "HOST": "127.0.0.1",  # db host
        "PORT": 6379,         # db port; SSDB is usually 8888, REDIS defaults to 6379
        "NAME": "proxy",      # default name
        "PASSWORD": ""        # db password
    }
}

# Configure the API service
SERVER_API = {
    "HOST": "0.0.0.0",  # 0.0.0.0 listens on all IP addresses
    "PORT": 5010        # listening port
}

# The proxy pool access address is then http://127.0.0.1:5010
For more details about configuration, please refer to the official introduction of the project directly. The above configuration information is basically enough.
- Step 5: Start the project
If the dependencies are installed and the configuration is done, you can start the project from the `cli` directory via `proxyPool.py`. The program is split into a schedule scheduler and a webserver API service:
# Start the scheduler first
python proxyPool.py schedule
# Then start the web API service
python proxyPool.py webserver
These commands require that the current directory is `cli`; if you are in another directory, switch into it first (`cd cli`) or adjust the path to `proxyPool.py` accordingly.
Native network requests `urllib`
`urllib.urlopen([url, data[, proxies]])`: docs.python.org/2/library/u…
GET request
If the query parameters are simple, you can build the request URL directly, serializing a query-parameter dictionary with `urllib.urlencode(dict)`.
When the query parameters are not complicated, and especially when none are needed, you can send the request directly with `urllib2.urlopen(url)` as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    """Get the response header and response body information"""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use simple urllib2<<<'
    use_simple_urllib2()
When many query parameters are required or the URL must be built dynamically, `urllib.urlencode(dict)` is recommended for serializing the query parameters, which are then concatenated onto the request URL to form the complete request URL.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information"""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?%s' % params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
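A note on `urlencode`: passing a plain dict works, but the parameter order then follows dictionary order; passing a list of `(key, value)` pairs makes the resulting query string deterministic. A minimal compat sketch (the import fallback covers Python 3, where the function lives in `urllib.parse`):

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# A list of pairs keeps the parameter order predictable,
# and reserved characters such as ':' and '/' are percent-encoded
params = urlencode([
    ('param1', 'hello'),
    ('param2', 'world'),
    ('website', 'http://blog.snowdreams1006.cn'),
])
print(params)  # -> param1=hello&param2=world&website=http%3A%2F%2Fblog.snowdreams1006.cn
```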
POST request
Compared with the default GET request, the only difference is that the query parameters are no longer concatenated onto the request URL but are passed via the optional `data` parameter; `urllib2.urlopen(url, data)` then sends a POST request.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def post_params_urllib2():
    """Get the response header and response body information"""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/post', params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Post params urllib2<<<'
    post_params_urllib2()
- Set the proxy
`urllib.FancyURLopener(proxy)` sends requests through a proxy when the proxy dictionary is valid, and clears the proxy settings when the dictionary is empty.
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy"""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    """Send a request through a proxy"""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()
In addition to setting up proxy requests with `urllib.FancyURLopener(proxy)`, the `proxies` parameter of `urllib.urlopen(url, data, proxies)` can be used to send GET or POST requests through a proxy.
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy"""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def post_proxy_urllib():
    """Get the response header and response body information through a proxy"""
    data = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxies = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    # Note: only urllib.urlopen() accepts a proxies parameter; urllib2.urlopen() does not
    response = urllib.urlopen('http://httpbin.snowdreams1006.cn/post', data=data, proxies=proxies)
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Post proxy urllib<<<'
    post_proxy_urllib()
The above demonstrates Python 2's `urllib.urlopen(url[, data[, proxies]])`.
Next section preview:
Visit `api.github.com/` to request interfaces of interest and test publicly available data.
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
  "commit_search_url": "https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}",
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
  "followers_url": "https://api.github.com/user/followers",
  "following_url": "https://api.github.com/user/following{/target}",
  "gists_url": "https://api.github.com/gists{/gist_id}",
  "hub_url": "https://api.github.com/hub",
  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
  "issues_url": "https://api.github.com/issues",
  "keys_url": "https://api.github.com/user/keys",
  "label_search_url": "https://api.github.com/search/labels?q={query}&repository_id={repository_id}{&page,per_page}",
  "notifications_url": "https://api.github.com/notifications",
  "organization_url": "https://api.github.com/orgs/{org}",
  "organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
  "organization_teams_url": "https://api.github.com/orgs/{org}/teams",
  "public_gists_url": "https://api.github.com/gists/public",
  "rate_limit_url": "https://api.github.com/rate_limit",
  "repository_url": "https://api.github.com/repos/{owner}/{repo}",
  "repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
  "current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
  "starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
  "starred_gists_url": "https://api.github.com/gists/starred",
  "user_url": "https://api.github.com/users/{user}",
  "user_organizations_url": "https://api.github.com/user/orgs",
  "user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
  "user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}
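The values above are URI templates (RFC 6570); fully expanding them needs a template library, but the simplest ones, plain `{user}`-style placeholders, can be filled with `str.format` for quick experiments. A small sketch using one sample entry from the listing:

```python
import json

# A fragment of the api.github.com root response shown above
root = json.loads('{"user_url": "https://api.github.com/users/{user}"}')

# Simple {name} placeholders expand with str.format;
# templates like {&page,per_page} would need a real URI-template library
user_api = root['user_url'].format(user='snowdreams1006')
print(user_api)  # -> https://api.github.com/users/snowdreams1006
```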
Reference documentation
- The difference between and usage of read(), readline(), and readlines() in Python
- Python core module — urllib module
- Gunicorn running with configuration
- Gunicorn common configuration
If you feel this article has helped you, feel free to like it and leave a message; your encouragement is my motivation to keep creating. You may also follow the personal public account "Snow Dream Technology Station" for regularly updated quality articles!