urllib.urlopen([url[, data[, proxies]]]) : docs.python.org/2/library/u…
Python's default network request library is the urllib family, consisting of urllib, urllib2, and urllib3, which often work together.
Of course, there are many excellent third-party libraries for sending network requests, the best known being Requests, but this article starts with the basics: sending network requests with urllib.
Source code address: Github.com/snowdreams1…
Online address: snowdreams1006.github.io/learn-pytho…
A reminder from "Snow Dream Technology Station": articles age, so by the time you read this some links may no longer work. Please follow the article's instructions and verify everything yourself. Just as believing a book blindly is worse than having no book at all, don't copy and paste the code directly: type it out by hand!
Environment set up
The Python environment used in this article is a virtual environment based on the VirtualEnv implementation, just to facilitate isolation of different environments and better simulate real user environments.
In actual development, you can follow this article step by step to build a development environment, or directly use the system default environment.
Environmental demonstration
Information about the demo environment is as follows:
(.env) $ python --version
Python 2.7.16
(.env) $ pip --version
pip 19.3.1 from ~/python/src/url/urllib/.env/lib/python2.7/site-packages/pip (python 2.7)
The following code works fine in this environment, but there is no guarantee that the rest of the environment will match the results of the demo, so the actual results will prevail.
Environment set up
If you don't need a virtual environment, you can skip the installation section and use the default environment, as long as the Python version used for testing is Python 2 rather than Python 3!
- Step 1 Install the virtual environment
virtualenv
sudo pip install virtualenv
Installing a virtual environment makes it easier to isolate different Python environments. You can also use the system default environment, so this step is optional, as are the following steps.
- Step 2 Prepare the virtual environment directory
.env
virtualenv .env
The virtual environment directory is hidden to prevent misoperations, but it can also be displayed as a normal directory.
- Step 3 Activate the virtual environment
.env
source .env/bin/activate
Once the virtual environment directory is ready, you need to activate the virtual environment. This step can be repeated without error!
- Step 4 Check which python and pip are in use
(.env) $ which python
~/python/src/url/urllib/.env/bin/python
(.env) snowdreams1006s-MacBook-Pro:urllib snowdreams1006$ which pip
~/python/src/url/urllib/.env/bin/pip
Once the virtual environment is activated, python and pip resolve to the .env directory under the current project rather than the system defaults; if the virtual environment is not activated, the system paths are shown instead.
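Besides `which python`, you can check from inside Python itself whether a virtual environment is active. A minimal sketch, assuming only the standard library: virtualenv on Python 2 exposes `sys.real_prefix`, while the built-in venv on Python 3 makes `sys.base_prefix` differ from `sys.prefix`.

```python
import sys

def in_virtualenv():
    # virtualenv (Python 2) sets sys.real_prefix; venv (Python 3)
    # makes sys.base_prefix differ from sys.prefix.
    return hasattr(sys, 'real_prefix') or \
        getattr(sys, 'base_prefix', sys.prefix) != sys.prefix

print(in_virtualenv())
```

Run it with and without `source .env/bin/activate` to see the value flip.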
Network request URllib library
If you find at runtime that network requests fail, you can replace httpbin.snowdreams1006.cn/ with httpbin.org/ or set up your own local test environment.
Two ways of setting up a local test environment are described below; of course, you can also simply use the online environments at httpbin.snowdreams1006.cn/ or httpbin.org/.
Docker installation
docker run -p 8000:80 kennethreitz/httpbin
On the first run the image is downloaded to the local machine before the container starts; on subsequent runs the container starts directly. The access address is http://127.0.0.1:8000/
Python installation
pip install gunicorn httpbin && gunicorn httpbin:app
The default port is 8000; if a port conflict occurs, specify a different port, for example gunicorn httpbin:app -b :9898.
How to send the simplest network request
urllib2.urlopen(url): sends the simplest network request and directly returns the response body text.
Create a new Python file named urllib_demo.py. The core steps are importing urllib2, sending the simplest GET request with urllib2.urlopen(), and finally reading the entire response body at once with response.read().
The code content is as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.read()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
If the file name is urllib_demo.py, run python urllib_demo.py from the terminal command line to view the output.
How do I know what the properties and methods are
print dir(response): enumerates the object's methods and attributes, so you can guess what is available even without documentation.
urllib2.urlopen(url) already lets us send the simplest network request. Whether it is a GET or a POST request, the most important thing is getting the response body afterwards, but there are other methods and attributes that cannot be ignored in actual development.
Therefore, in addition to mastering response.read() to read all the contents of the response body at once, we also need to know what properties and methods response has.
Using type(response) to get the object type and dir(response) to enumerate its attribute names, you can roughly guess what attributes and methods the object offers, even without documentation.
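The dir()-based guessing can even be automated: callable() separates method-like members from plain attributes. A small sketch on a stand-in object (the FakeResponse class here is hypothetical, merely mimicking the shape of the real response object):

```python
class FakeResponse(object):
    """Hypothetical stand-in with the same shape as the urllib2 response."""
    code = 200
    msg = 'OK'

    def getcode(self):
        return self.code

def split_members(obj):
    # Partition public names into (attributes, methods) by callability.
    attrs, methods = [], []
    for name in dir(obj):
        if name.startswith('_'):
            continue
        target = methods if callable(getattr(obj, name)) else attrs
        target.append(name)
    return attrs, methods

attrs, methods = split_members(FakeResponse())
print(attrs)    # ['code', 'msg']
print(methods)  # ['getcode']
```

Applied to the real response object, the same helper tells you at a glance which names are attribute accesses and which are method calls.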
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print type(response)
    print dir(response)

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
Here is the output of print type(response) and print dir(response); next we will pick out the commonly used attributes and methods.
# print type(response)
<type 'instance'>

# print dir(response)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'msg', 'next', 'read', 'readline', 'readlines', 'url']
- The status code (property) of the response object
response.code: gets the status code of the response object. Normally 200 means the request succeeded, while 500 is a typical server error.
The attribute names come from dir(response); combined with type(response), it is not hard to see that response.code holds the response status code. Whether to access it as response.code or call it as response.code() can be roughly inferred from print type(response.code).
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print type(response.read)
    print type(response.code)

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
print type(response.read) outputs <type 'instancemethod'>, a method type, while print type(response.code) outputs <type 'int'>, a basic type.
In other words, the output of type(response.code) is <type 'int'>, not <type 'instancemethod'>, so the way to get the status code is an attribute access.
The detailed code is as follows:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.code

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Status code of the response object (method)
response.getcode(): gets the status code of the response object. Normally 200 means the request succeeded, while 500 is a typical server error.
Similarly, print dir(response) tells us that a getcode field is available for invocation, but not whether it is an attribute or a method. Running print type(response.getcode) outputs <type 'instancemethod'>, which identifies it as a method call.
The detailed code is as follows:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.getcode()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Status code information for the response object (properties)
response.msg: gets the status description of the response object, for example OK for status code 200 and INTERNAL SERVER ERROR for 500.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get the response status code and description."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/status/200')
    print response.code
    print response.msg

    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/status/500')
    print response.code
    print response.msg

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
A normal request returns 200 OK, while an abnormal one is likely to be 500 INTERNAL SERVER ERROR. Note that urllib2 raises an exception for such error responses, and if the exception is not handled, the program terminates.
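To keep the program alive on a 500 response, wrap the call in try/except; the HTTPError object urllib2 raises also behaves like a response, carrying code and msg. A sketch that constructs the error locally instead of touching the network (the URL is just a placeholder), with a try/except import so it runs on Python 2 and 3:

```python
try:  # Python 2
    from urllib2 import HTTPError
except ImportError:  # Python 3
    from urllib.error import HTTPError

def describe(error):
    # HTTPError carries the status code and reason just like a response.
    return '%d %s' % (error.code, error.msg)

try:
    # Simulate what urllib2.urlopen() raises for a 500 response.
    raise HTTPError('http://httpbin.snowdreams1006.cn/status/500',
                    500, 'INTERNAL SERVER ERROR', None, None)
except HTTPError as e:
    print(describe(e))  # 500 INTERNAL SERVER ERROR
```

In real code you would put the urllib2.urlopen() call inside the try block instead of the raise.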
- Access links (properties) for the response object
response.url: gets the request URL.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.url

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Access link for the response object (method)
response.geturl(): gets the request URL.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.geturl()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Header information for the response object (property)
response.headers.dict: gets the header information as a dictionary.
In some cases a request only succeeds when specific request headers are set, so it is important to know what is sent and received by default. As before, print type(response.headers) combined with print dir(response.headers) lets you explore the attributes and methods available for invocation.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.headers.dict

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- Header information for the response object (method)
response.info(): gets the header information, displayed line by line.
This is similar to response.headers.dict above, except that response.info() is suited to human-readable display rather than programmatic use.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get requester information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.info()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- The response body of the response object (read method)
response.read(): reads the entire response body at once. Suitable when the body is relatively small, since all the data is loaded into memory in one convenient operation.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.read()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
response.read() returns a string, so it is easy to capture it in a variable for subsequent processing, e.g. result = response.read():
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = response.read()
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
- The response body of the response object (readline method)
response.readline(): reads the response body one line at a time. Suitable for large bodies; loop until no more data can be read.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    line = response.readline()
    while line:
        print line
        line = response.readline()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
response.readline() only reads one line at a time, so manual concatenation is needed to obtain the complete response body, for example:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = ''
    line = response.readline()
    result = result + str(line)
    while line:
        line = response.readline()
        result = result + str(line)
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
The str(line) wrapper makes sure each piece is a string, though strictly it is unnecessary: response.readline() already returns a string.
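The read-then-loop pattern above can be exercised without any network access by swapping in an in-memory file object, which offers the same readline() interface as the response. A sketch, with io.StringIO standing in for the real response:

```python
import io

def read_all_lines(resp):
    # Same shape as the urllib2 loop: read one line, keep going while truthy.
    result = ''
    line = resp.readline()
    while line:
        result += line
        line = resp.readline()
    return result

# A fake response body standing in for urllib2.urlopen(...).
fake_response = io.StringIO(u'{\n  "args": {}\n}\n')
print(read_all_lines(fake_response))
```

The loop stops naturally because readline() returns an empty (falsy) string at end of stream.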
- The response body of the response object (readlines method)
response.readlines(): reads the response body into a list of lines; suitable when line-by-line processing is needed.
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    for line in response.readlines():
        print line

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
Similarly, if the complete response body needs to be assembled from response.readlines(), it can be concatenated as in the following example:
# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    """Get response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = ''
    for line in response.readlines():
        result = result + str(line)
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()
A more concise equivalent is result = ''.join(response.readlines()).
How to send a normal GET request
- Send directly without parameters
urllib2.urlopen(url): only a destination URL is required to send a GET request.
The simplest kind of request is a GET without any other parameters; just fill in the request URL, e.g. urllib2.urlopen('http://httpbin.snowdreams1006.cn/get'). Sample code:
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    """Get the response header and response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use simple urllib2<<<'
    use_simple_urllib2()
If the above code file is named urllib_demo.py, run the python urllib_demo.py file from the terminal command line, and the output is as follows:
(.env) $ python urllib_demo.py
>>>Use simple urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 13:38:27 GMT
Content-Type: application/json
Content-Length: 263
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/get"
}
The response header Connection: close indicates the connection closes automatically, and the empty args dictionary in the body shows no query parameters were sent.
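Since httpbin returns JSON, assertions like "args is empty" can be checked programmatically rather than by eye. A sketch using a canned body string in place of response.read():

```python
import json

# A canned body, standing in for response.read() on /get.
body = ('{"args": {}, "origin": "218.205.55.192", '
        '"url": "http://httpbin.snowdreams1006.cn/get"}')

data = json.loads(body)           # str -> dict
print(data['args'] == {})         # True: no query parameters were sent
print(data['url'])
```

The same json.loads() step is used later in this article when reading the proxy pool's responses.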
- Sending with transcoded parameters
In actual development, few GET requests carry no parameters at all. Native urllib also supports GET requests with query parameters; the simplest way is to splice the query parameters onto the target URL to get a URL with a query string.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
Similarly, if the above code file is named urllib_demo.py, run the Python urllib_demo.py file from the terminal command line, and the output looks like this:
(.env) $ python urllib_demo.py
>>>Use params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 13:59:23 GMT
Content-Type: application/json
Content-Length: 338
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {
    "param1": "hello",
    "param2": "world"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world"
}
The response header Connection: close again shows the connection closes automatically, and args is no longer an empty dictionary but contains exactly the query parameters we passed, confirming that the server received them, so this approach works.
If there are many query parameters, splicing them into a new URL by hand becomes very tedious: the query string must follow a ?, with parameters joined by &, e.g. ?param1=hello&param2=world.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information."""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world&author=snowdreams1006&website=http://blog.snowdreams1006.cn&url=snowdreams1006.github.io/learn-python/url/urllib/teaching.html&wechat=snowdreams1006&[email protected]&github=https://github.com/snowdreams1006/')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
Building such a long URL by hand is not only error-prone, it is also awkward when query parameters must change dynamically, so the automatic query-string construction below comes just in time!
params = urllib.urlencode({
    'param1': 'hello',
    'param2': 'world',
    'author': 'snowdreams1006',
    'website': 'http://blog.snowdreams1006.cn',
    'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
    'wechat': 'snowdreams1006',
    'email': '[email protected]',
    'github': 'https://github.com/snowdreams1006/'
})
print params
urllib.urlencode() transcodes a dictionary of query parameters into an &-joined query string, which can then be manually concatenated to the request URL after a ? to get the final URL with parameters.
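Note that urlencode lives in urllib on Python 2 but in urllib.parse on Python 3. A version-tolerant sketch, round-tripping through parse_qs to show the encoding is lossless even for values containing special characters like `://`:

```python
try:  # Python 2
    from urllib import urlencode
    from urlparse import parse_qs
except ImportError:  # Python 3
    from urllib.parse import urlencode, parse_qs

query = {'param1': 'hello', 'param2': 'world',
         'website': 'http://blog.snowdreams1006.cn'}

params = urlencode(query)  # e.g. 'param1=hello&param2=world&website=http%3A%2F%2F...'
url = 'http://httpbin.snowdreams1006.cn/get?%s' % params

# Decoding recovers the original values, special characters included.
decoded = {k: v[0] for k, v in parse_qs(params).items()}
print(decoded == query)  # True
```

This is also why percent-escapes such as %3A%2F%2F appear in the echoed url field of the output below.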
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information."""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?%s' % params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
If the above code file is named urllib_demo.py, run the python urllib_demo.py file from the terminal command line, and the output is as follows:
(.env) $ python urllib_demo.py
>>>Use params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 14:27:21 GMT
Content-Type: application/json
Content-Length: 892
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {
    "author": "snowdreams1006",
    "email": "[email protected]",
    "github": "https://github.com/snowdreams1006/",
    "param1": "hello",
    "param2": "world",
    "url": "https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html",
    "website": "http://blog.snowdreams1006.cn"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/get?website=http%3A%2F%2Fblog.snowdreams1006.cn&github=https%3A%2F%2Fgithub.com%2Fsnowdreams1006%2F&wechat=snowdreams1006&param2=world&param1=hello&author=snowdreams1006&url=https%3A%2F%2Fsnowdreams1006.github.io%2Flearn-python%2Furl%2Furllib%2Fteaching.html&email=snowdreams1006%40163.com"
}
In short, there are two ways to send a GET request with urllib2.urlopen(url): splice the query string by hand, or build it with urllib.urlencode() and append it to the URL.
How to send a normal POST request
If the request URL only supports POST requests, the spliced-address approach above no longer meets the requirement. Interestingly, it takes only one step to turn the GET request into a POST request.
For a GET request, the call looks like this: urllib2.urlopen('http://httpbin.snowdreams1006.cn/post?%s' % params);
for a POST request, it looks like this: urllib2.urlopen('http://httpbin.snowdreams1006.cn/post', params).
# -*- coding: utf-8 -*-
import urllib
import urllib2

def post_params_urllib2():
    """Get the response header and response body information."""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/post', params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Post params urllib2<<<'
    post_params_urllib2()
A careful reader may wonder how urllib2.urlopen() knows this is a POST rather than a GET: passing the encoded parameters as the second (data) argument instead of splicing them into the URL is exactly what switches the request method.
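This GET-to-POST switch can be confirmed offline through the Request class, whose get_method() reports POST exactly when a data body is present. A sketch (the try/except import keeps it running on both Python 2 and 3):

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

url = 'http://httpbin.snowdreams1006.cn/post'

# No data argument -> urlopen would issue a GET.
print(Request(url).get_method())                        # GET

# With a body attached -> urlopen would issue a POST.
print(Request(url, data=b'param1=hello').get_method())  # POST
```
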
However, a more intuitive method is to send a request for direct validation, as shown in the following example:
(.env) $ python urllib_demo.py
>>>Post params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 14:45:43 GMT
Content-Type: application/json
Content-Length: 758
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
>>>Response Body:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "author": "snowdreams1006",
    "email": "[email protected]",
    "github": "https://github.com/snowdreams1006/",
    "param1": "hello",
    "param2": "world",
    "url": "https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html",
    "website": "http://blog.snowdreams1006.cn",
    "wechat": "snowdreams1006"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "285",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.snowdreams1006.cn",
    "User-Agent": "Python-urllib/2.7"
  },
  "json": null,
  "origin": "218.205.55.192",
  "url": "http://httpbin.snowdreams1006.cn/post"
}
It is worth noting that the parameters submitted by the POST request above are stored in the form property, rather than in the args property as with a GET request.
How do I set up a proxy to make network requests
Environment set up
If proxyip.snowdreams1006.cn/ cannot be accessed, you can visit github.com/jhao104/pro… and build your own proxy pool from that project.
{
  "delete?proxy=127.0.0.1:8080": "delete an unable proxy",
  "get": "get an useful proxy",
  "get_all": "get all proxy from proxy pool",
  "get_status": "proxy number"
}
Please do not hammer the demo service; malicious access will get you blocked. Everyone is encouraged to build their own local environment. Thank you for your support.
The proxy pool is based on the jhao104/proxy_pool project, which provides two installation methods: Docker installation and source installation.
Docker installation
docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password='' -p 5010:5010 jhao104/proxy_pool
You can also pull the image in advance with docker pull jhao104/proxy_pool, then start the container with the command above.
Source code installation
- Step 1: Download the source code
git clone https://github.com/jhao104/proxy_pool.git
You can also download the release package from github.com/jhao104/pro…
- Step 2: Install dependencies
pip install -r requirements.txt
Installing the project dependencies requires switching to the project root first, e.g. cd proxy_pool; packages are then downloaded from the default index, or installation can be sped up with pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt.
- Step 3: Configure
Config/setting.py

# Config/setting.py is the project configuration file

# Configure the database
DATABASES = {
    "default": {
        "TYPE": "REDIS",      # currently supports SSDB or REDIS
        "HOST": "127.0.0.1",  # db host
        "PORT": 6379,         # db port, e.g. usually 8888 for SSDB, 6379 by default for REDIS
        "NAME": "proxy",      # default database name
        "PASSWORD": ""        # db password
    }
}

# Configure the API service
SERVER_API = {
    "HOST": "0.0.0.0",  # 0.0.0.0 listens on all IP addresses
    "PORT": 5010        # listening port
}

# The proxy pool access address is then http://127.0.0.1:5010
- Step 4: Start the project
If the dependencies are installed and the configuration is done, you can start the project via proxyPool.py in the cli directory.
The program has two parts: the schedule scheduler and the webserver API service.

# start the scheduler first
python proxyPool.py schedule

# then start the web API service
python proxyPool.py webserver

These commands assume the current directory is cli; if you are in another directory, adjust the path to proxyPool.py accordingly (cd cli switches into the cli directory).
If the preceding steps all succeed, free public proxy IPs are captured automatically once the project starts, and you can visit http://127.0.0.1:5010 to view them.
Making requests through a proxy
It is advisable to first visit proxyip.snowdreams1006.cn/get/ in a browser to check whether a random proxy IP can be obtained, and only then use it from the Python program; making sure the code runs correctly simplifies subsequent development and testing.
{
  "proxy": "183.91.33.41:8086",
  "fail_count": 0,
  "region": "",
  "type": "",
  "source": "freeProxy09",
  "check_count": 59,
  "last_status": 1,
  "last_time": "2020-01-18 13:14:32"
}

This is an example response from /get/ returning a random proxy IP; the full request address is http://proxyip.snowdreams1006.cn/get/
Obtain a random proxy IP address
# -*- coding: utf-8 -*-
import urllib2
import json

def get_proxy():
    """Get a random proxy."""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    print get_proxy()
In a browser you can visit proxyip.snowdreams1006.cn/get/ directly to verify that a random proxy IP is returned, or run curl http://proxyip.snowdreams1006.cn/get/ on the terminal command line to view the result.
Set the proxy IP address for access
urllib.FancyURLopener(proxy): sets a proxy IP for indirect access.
urllib.FancyURLopener(proxy) lets you set a proxy so the client's real information is hidden from the server; whether the server can distinguish the proxied request from an ordinary one depends on the proxy IP itself.
A high-anonymity (elite) proxy is the ideal case, truly fulfilling the role of a proxy.
A transparent proxy, by contrast, is the most useless kind: the server not only knows you are using a proxy but also knows your real IP address.
To verify whether the proxy IP you set can be identified by the server, visit httpbin.snowdreams1006.cn/ip to see the client IP the server reads.
$ curl http://httpbin.snowdreams1006.cn/ip
{
"origin": "115.217.104.191"
}
If the curl command is not available on your terminal, search for how to install it, or simply open httpbin.snowdreams1006.cn/ip in a browser.
If the IP the server reads equals the proxy IP you set, congratulations: the proxy was set successfully and it is high-anonymity; otherwise it did not take effect.
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy."""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    """Send a request through a proxy."""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()
The preceding example only shows how to set the proxy IP address to send requests. It does not verify whether the proxy IP address is set successfully, that is, whether the server reads the request IP address and the newly set proxy IP address, and does not consider exceptions such as unavailable proxy IP address or connection timeout.
The following provides a simple example to check whether the proxy IP address is set successfully:
{
  "proxy": "121.225.199.78:3128",
  "fail_count": 0,
  "region": "",
  "type": "",
  "source": "freeProxy09",
  "check_count": 15,
  "last_status": 1,
  "last_time": "2020-01-17 12:03:29"
}

This is the general format of a random proxy IP response; the value to extract is 121.225.199.78:3128.
The random proxy comes in ip:port format, while httpbin.snowdreams1006.cn/ip returns the source IP without a port, so the simplest idea is to strip the port from the random proxy and then compare the two.
'121.225.199.78:3128'.split(':')[0]

First split the string on ':' into two parts, then take only the first part, i.e. the IP address without the port number: 121.225.199.78.
Next, since response.read() returns the response body as a string, it is inconvenient to extract the origin value directly; the body is clearly JSON, so json.loads(result) conveniently converts it into a Python dictionary.
result = response.read()
result = json.loads(result)
proxyip = result.get('origin')
For dictionaries, both result.get('origin') and result['origin'] work; they differ when the key name does not exist: result['origin'] raises a KeyError, while result.get('origin') returns None.
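The difference matters when the server returns an unexpected body; a small sketch:

```python
result = {'origin': '121.225.199.78'}

print(result.get('origin'))   # 121.225.199.78
print(result['origin'])       # same value, the key exists

# Missing key: get() degrades gracefully, [] raises.
print(result.get('missing'))           # None
print(result.get('missing', 'n/a'))    # fallback value instead of None
try:
    result['missing']
except KeyError:
    print('KeyError raised')
```

Using get() is the defensive choice when parsing responses whose shape you do not control.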
Now the simplest complete example to verify that the proxy IP was set successfully is as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy."""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    """Send a request through a proxy."""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()
As long as the randomly obtained proxy IP is usable, no exception is thrown, and the output reports whether the proxy was set successfully or not:
(.env) $ python urllib_demo.py
>>>Get proxy urllib<<<
>>>Get Proxy:
52.80.58.248:3128
Proxy Fail
(.env) $ python urllib_demo.py
>>>Get proxy urllib<<<
>>>Get Proxy:
117.88.176.152:3000
Proxy Success
The quality of free proxy IPs is mediocre, so don't expect too much; in actual development you should choose paid proxy IPs.
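Because free proxies fail often, a common pattern is to retry with a fresh proxy on each failure. Below is a minimal, library-agnostic sketch; the function names and structure are our own, not from the original code. The actual fetching function is injected so the retry logic itself can be demonstrated offline:

```python
def fetch_with_retry(fetch, get_proxy, max_tries=3):
    """Try up to max_tries proxies; return the first successful result.

    fetch(proxy) should return a result or raise on failure;
    get_proxy() should return a fresh proxy string. (Hypothetical helpers.)
    """
    last_error = None
    for _ in range(max_tries):
        proxy = get_proxy()
        try:
            return fetch(proxy)
        except Exception as error:  # a bad proxy just means "try the next one"
            last_error = error
    raise last_error

# Offline demonstration with fake helpers: the first two proxies "fail"
proxies = iter(['1.1.1.1:80', '2.2.2.2:80', '3.3.3.3:80'])

def fake_fetch(proxy):
    if proxy != '3.3.3.3:80':
        raise IOError('proxy unreachable: %s' % proxy)
    return 'ok via %s' % proxy

result = fetch_with_retry(fake_fetch, lambda: next(proxies))
print(result)  # -> ok via 3.3.3.3:80
```

In real use, `fetch` would wrap the `urllib.FancyURLopener` call from the example above and `get_proxy` would hit the proxy pool.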
Clear the proxy and connect directly
`urllib.FancyURLopener({})`: clears the proxy settings so requests are sent directly.
When setting a proxy, a proxy dictionary is passed to `urllib.FancyURLopener(proxy)`; to clear the proxy information, simply replace that dictionary with an empty one.
The main code is the same as setting the proxy IP address, so please refer to the following code:
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy"""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def clear_proxy_urllib():
    """Send a request after clearing the proxy"""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    # First send a request through the proxy
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    result = response.read()
    print(result)
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Set proxy success'
    else:
        print 'Set proxy fail'
    # Then clear the proxy with an empty dictionary and request again
    opener = urllib.FancyURLopener({})
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    result = response.read()
    print(result)
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Clear proxy fail'
    else:
        print 'Clear proxy success'

if __name__ == '__main__':
    print '>>>Clear proxy urllib<<<'
    clear_proxy_urllib()
In addition to using `urllib.FancyURLopener()` to set or clear the proxy, `urllib.urlopen()` can achieve the same with its `proxies` parameter.
# Use http://www.someproxy.com:3128 for HTTP proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)
Examples of setting environment variables are as follows:
% http_proxy="http://www.someproxy.com:3128"
% export http_proxy
% python
...
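To check which proxies the environment actually provides, the standard library exposes `getproxies()`, which lives in `urllib` on Python 2 and in `urllib.request` on Python 3. A small compat sketch; setting `os.environ` here only simulates the shell `export` above:

```python
import os

try:
    from urllib import getproxies          # Python 2
except ImportError:
    from urllib.request import getproxies  # Python 3

# Simulate the shell export shown above, for demonstration purposes
os.environ['http_proxy'] = 'http://www.someproxy.com:3128'

proxies = getproxies()
print(proxies.get('http'))  # -> http://www.someproxy.com:3128
```

This is the same lookup `urllib.urlopen(some_url)` performs internally when no `proxies` argument is given.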
Summary
This article mainly introduced how to send network requests with Python's native urllib, along with the basic environment setup, and comes with plenty of directly runnable code; the documentation and source code are open source, so interested readers are welcome to browse.
Here is a brief review of the key points covered, so that later readers can quickly look things up while learning.
Virtual environment `virtualenv`
After the virtual environment is installed and activated, the Python and pip versions are as follows:
(.env) $ python --version
Python 2.7.16
(.env) $ pip --version
pip 19.3.1 from ~/python/src/url/urllib/.env/lib/python2.7/site-packages/pip (python 2.7)
To set up the virtual environment by yourself, perform the following steps to enable the virtual environment:
- Step 1: Install `virtualenv`
sudo pip install virtualenv
Installing a virtual environment makes it easier to isolate different Python environments. You can also use the system default environment, so this step is optional, as are the following steps.
- Step 2: Prepare the virtual environment directory `.env`
virtualenv .env
The virtual environment directory is hidden to prevent misoperations, but it can also be displayed as a normal directory.
- Step 3: Activate the virtual environment `.env`
source .env/bin/activate
After the virtual environment is activated, run `pip --version` to check the current version information and verify that the virtual environment started successfully.
Test server `httpbin`
Default local address: `http://127.0.0.1:8000/`; online addresses: `httpbin.snowdreams1006.cn/` or `httpbin.org/`.
If httpbin is installed successfully with docker, visiting the interface address shows the httpbin debugging page.
If you start the httpbin library with Python instead, the effect differs slightly from docker.
You are advised to install the httpbin service yourself; either of the following two methods enables it.
- docker installation
docker run -p 8000:80 kennethreitz/httpbin
On the first run the image is downloaded to the local machine before the container starts; on subsequent runs the container starts directly. The access address is `http://127.0.0.1:8000/`.
- python installation
pip install gunicorn httpbin && gunicorn httpbin:app
By default gunicorn listens on port 8000; if a port conflict occurs, specify another port with `gunicorn httpbin:app -b :9898`.
Free IP proxy pool `proxyip`
Default local address: `http://127.0.0.1:5010/`; online addresses: `proxyip.snowdreams1006.cn/` or `http://118.24.52.95/`.
{
  "delete?proxy=127.0.0.1:8080": "delete an unable proxy",
  "get": "get an useful proxy",
  "get_all": "get all proxy from proxy pool",
  "get_status": "proxy number"
}
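Based on the endpoint listing above, the full request URLs can be assembled with a small helper; the function below is our own sketch, not part of the proxy-pool project. Note how the proxy address is URL-encoded for the delete endpoint:

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

BASE = 'http://127.0.0.1:5010'

def proxy_pool_url(action, proxy=None):
    """Build a proxy-pool API URL such as /get/ or /delete/?proxy=... (sketch)."""
    if action == 'delete' and proxy:
        # quote() percent-encodes the ':' in 'host:port'
        return '%s/delete/?proxy=%s' % (BASE, quote(proxy))
    return '%s/%s/' % (BASE, action)

print(proxy_pool_url('get'))                       # -> http://127.0.0.1:5010/get/
print(proxy_pool_url('delete', '127.0.0.1:8080'))  # -> http://127.0.0.1:5010/delete/?proxy=127.0.0.1%3A8080
```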
If you need to set up local services by yourself, please decide the installation method based on your own needs. The following two methods are used to enable the ProxyIP service.
- docker installation
docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password='' -p 5010:5010 jhao104/proxy_pool
You can also pull the image in advance with `docker pull jhao104/proxy_pool`, and then run the command above to start the container.
- Source code installation
- Step 1: Download the source code
git clone https://github.com/jhao104/proxy_pool.git
Of course, you can also directly download the installation package: github.com/jhao104/pro…
- Step 2: Install dependencies
pip install -r requirements.txt
Note: switch to the project root directory first (`cd proxy_pool`) before installing dependencies. If the download is too slow, use the Tsinghua University mirror to speed it up: `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt`.
- Step 3: Configure `Config/setting.py`
# Config/setting.py is the project configuration file

# Configure the database
DATABASES = {
    "default": {
        "TYPE": "REDIS",      # currently supports SSDB or REDIS
        "HOST": "127.0.0.1",  # db host
        "PORT": 6379,         # db port; SSDB is usually 8888, REDIS defaults to 6379
        "NAME": "proxy",      # default name
        "PASSWORD": ""        # db password
    }
}

# Configure the API service
SERVER_API = {
    "HOST": "0.0.0.0",  # 0.0.0.0 listens on all IP addresses
    "PORT": 5010        # listening port
}

# The proxy pool access address is then http://127.0.0.1:5010
For more details about configuration, please refer to the official introduction of the project directly. The above configuration information is basically enough.
- Step 5: Start the project
If the dependencies are installed and the configuration is done, you can start the project from the `cli` directory via `proxyPool.py`. The program is split into a schedule scheduler and a webserver API service:
# Start the scheduler first
python proxyPool.py schedule
# Then start the web API service
python proxyPool.py webserver
These commands require that the current directory is `cli`; if you are in another directory, switch into it first (`cd cli`) or adjust the path to `proxyPool.py` accordingly.
Native network requests `urllib`
`urllib.urlopen([url, data[, proxies]])`: docs.python.org/2/library/u…
GET request
If the query parameters are simple, you can build the request URL directly, serializing a query-parameter dictionary with `urllib.urlencode(dict)`.
When the query parameters are not complicated, and especially when none are needed, you can send the request directly with `urllib2.urlopen(url)` as follows:
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    """Get the response header and response body information"""
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use simple urllib2<<<'
    use_simple_urllib2()
When many query parameters are required or the URL must be built dynamically, `urllib.urlencode(dict)` is recommended for serializing the query parameters, which are then concatenated onto the request URL to form the complete request URL.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    """Get the response header and response body information"""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?%s' % params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()
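A note on `urlencode`: passing a plain dict works, but the parameter order then follows dictionary order; passing a list of `(key, value)` pairs makes the resulting query string deterministic. A minimal compat sketch (the import fallback covers Python 3, where the function lives in `urllib.parse`):

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# A list of pairs keeps the parameter order predictable,
# and reserved characters such as ':' and '/' are percent-encoded
params = urlencode([
    ('param1', 'hello'),
    ('param2', 'world'),
    ('website', 'http://blog.snowdreams1006.cn'),
])
print(params)  # -> param1=hello&param2=world&website=http%3A%2F%2Fblog.snowdreams1006.cn
```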
POST request
Compared with the default GET request, the only difference is that the query parameters are no longer concatenated onto the request URL but are passed via the optional `data` parameter; `urllib2.urlopen(url, data)` then sends a POST request.
# -*- coding: utf-8 -*-
import urllib
import urllib2

def post_params_urllib2():
    """Get the response header and response body information"""
    params = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/post', params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Post params urllib2<<<'
    post_params_urllib2()
- Set the proxy
`urllib.FancyURLopener(proxy)` sends requests through a proxy when the proxy dictionary is valid, and clears the proxy settings when the dictionary is empty.
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy"""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    """Send a request through a proxy"""
    # Random proxy IP
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()
In addition to setting up proxy requests with `urllib.FancyURLopener(proxy)`, the `proxies` parameter of `urllib.urlopen(url, data, proxies)` can be used to send GET or POST requests through a proxy.
# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    """Get a random proxy"""
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def post_proxy_urllib():
    """Get the response header and response body information through a proxy"""
    data = urllib.urlencode({
        'param1': 'hello',
        'param2': 'world',
        'author': 'snowdreams1006',
        'website': 'http://blog.snowdreams1006.cn',
        'url': 'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat': 'snowdreams1006',
        'email': '[email protected]',
        'github': 'https://github.com/snowdreams1006/'
    })
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxies = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    # Note: only urllib.urlopen() accepts a proxies parameter; urllib2.urlopen() does not
    response = urllib.urlopen('http://httpbin.snowdreams1006.cn/post', data=data, proxies=proxies)
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Post proxy urllib<<<'
    post_proxy_urllib()
The above demonstrates Python 2's `urllib.urlopen(url[, data[, proxies]])`.
Next section preview:
Visit `api.github.com/` to request interfaces of interest and test publicly available data.
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
  "commit_search_url": "https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}",
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
  "followers_url": "https://api.github.com/user/followers",
  "following_url": "https://api.github.com/user/following{/target}",
  "gists_url": "https://api.github.com/gists{/gist_id}",
  "hub_url": "https://api.github.com/hub",
  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
  "issues_url": "https://api.github.com/issues",
  "keys_url": "https://api.github.com/user/keys",
  "label_search_url": "https://api.github.com/search/labels?q={query}&repository_id={repository_id}{&page,per_page}",
  "notifications_url": "https://api.github.com/notifications",
  "organization_url": "https://api.github.com/orgs/{org}",
  "organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
  "organization_teams_url": "https://api.github.com/orgs/{org}/teams",
  "public_gists_url": "https://api.github.com/gists/public",
  "rate_limit_url": "https://api.github.com/rate_limit",
  "repository_url": "https://api.github.com/repos/{owner}/{repo}",
  "repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
  "current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
  "starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
  "starred_gists_url": "https://api.github.com/gists/starred",
  "user_url": "https://api.github.com/users/{user}",
  "user_organizations_url": "https://api.github.com/user/orgs",
  "user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
  "user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}
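The values above are URI templates (RFC 6570); fully expanding them needs a template library, but the simplest ones, plain `{user}`-style placeholders, can be filled with `str.format` for quick experiments. A small sketch using one sample entry from the listing:

```python
import json

# A fragment of the api.github.com root response shown above
root = json.loads('{"user_url": "https://api.github.com/users/{user}"}')

# Simple {name} placeholders expand with str.format;
# templates like {&page,per_page} would need a real URI-template library
user_api = root['user_url'].format(user='snowdreams1006')
print(user_api)  # -> https://api.github.com/users/snowdreams1006
```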
Reference documentation
- The difference between and usage of read(), readline(), and readlines() in Python
- Python core module — urllib module
- Gunicorn running with configuration
- Gunicorn common configuration
If you feel this article has helped you, feel free to like it and leave a message; your encouragement is my motivation to keep creating. You may also follow the personal public account "Snow Dream Technology Station" for regularly updated quality articles!