Introduction
When developing crawlers, Fiddler is an essential tool. Large sites are increasingly difficult to deal with, and it often takes a lot of time to figure out which request parameters are actually required. If we can quickly convert the parameters in a request captured by Fiddler into a format Python can use, our development efficiency improves considerably.
So I wrote a small tool that quickly converts the headers, data, and cookies captured in Fiddler into the dict format accepted by requests, scrapy, and so on.
Parsing data parameters
Note: only parameters copied from Fiddler's WebForms view are supported.
Data arguments copied from WebForms look like this:
_token
address
channel 6
cityId
gpsLat 23.135075
gpsLng 113.357076
shopId 0
source shoplist
The first column is the key and the second column is the value; a key with nothing after it maps to an empty value.
Since the format is quite regular, we can use the re module to extract the keys and values separately:
import re

def re_data(data):
    key_rule = r'(.*)\t'
    key = re.findall(key_rule, data)
    value_rule = r'\t(.*)'
    value = re.findall(value_rule, data)
    result = {}
    if len(key) == len(value):
        for i in range(len(key)):
            result[key[i]] = value[i]
    print(result)
Running the sample data through this function gives:
{'_token': '', 'address': '', 'channel': '6', 'cityId': '', 'gpsLat': '23.135075', 'gpsLng': '113.357076', 'shopId': '0', 'source': 'shoplist'}
The result is printed to the console, and since it is a standard dict, we can simply copy it into our Python code.
No matter how many parameters there are, the whole process takes about ten seconds. Compared with copying and pasting each parameter by hand, this is several times faster.
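To make the data step concrete, here is a self-contained run of the same regex logic. Note that this sketch makes `re_data` return the dict rather than print it, and the sample string is simply the WebForms paste from above written out with explicit tabs and newlines:

```python
import re

def re_data(data):
    # Keys sit before each tab, values after it, one pair per line.
    key = re.findall(r'(.*)\t', data)
    value = re.findall(r'\t(.*)', data)
    result = {}
    if len(key) == len(value):
        for i in range(len(key)):
            result[key[i]] = value[i]
    return result

# The WebForms paste from above, as a "key<TAB>value" per-line string.
data = ("_token\t\naddress\t\nchannel\t6\ncityId\t\n"
        "gpsLat\t23.135075\ngpsLng\t113.357076\nshopId\t0\nsource\tshoplist")
print(re_data(data))
```

Because `.` does not match a newline, each `findall` effectively works line by line, which is why one regex pass recovers every pair.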
Parsing the header and cookie parameters
Similarly, header and cookie values can be parsed with regular expressions:
Parsing the headers
def re_header(header):
    key = re.findall(r'[\t|\n]([\w*|\.|-]*):', header)
    val = re.findall(r':[\n\t]*(.*)\n', header)
    headers = {}
    for i in range(len(key)):
        headers[key[i]] = val[i].lstrip(' ')
    print(headers)
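As a quick check of the header regexes, here is a self-contained sketch that returns the dict instead of printing it. The sample headers are made up, not from the post; note that the leading and trailing newlines in the snippet matter, because the key pattern anchors on a preceding tab or newline and the value pattern on a trailing newline:

```python
import re

def re_header(header):
    # Keys follow a tab/newline and end at ':'; values run from ':' to end of line.
    key = re.findall(r'[\t|\n]([\w*|\.|-]*):', header)
    val = re.findall(r':[\n\t]*(.*)\n', header)
    headers = {}
    for i in range(len(key)):
        headers[key[i]] = val[i].lstrip(' ')
    return headers

# Hypothetical snippet as copied from Fiddler's Raw view.
raw = "\nHost: example.com\nUser-Agent: Mozilla/5.0\nAccept-Encoding: gzip, deflate\n"
print(re_header(raw))
```

One caveat: a value that itself contains a colon (for example a full URL in a Referer header) would confuse the value regex, so this sketch only holds for colon-free values.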
Parsing the cookie
def re_cookie(cookieStr):
    key = re.findall(r'[\t|\n]([\w*|\.]*)=', cookieStr)
    val = re.findall(r'=[\n\t]*(.*)\n', cookieStr)
    cookies = {}
    for i in range(len(key)):
        cookies[key[i]] = val[i]
    print(cookies)
Cookies are used the same way as the data parser, so I won't repeat the demonstration here.
Note:
- Header parsing only supports values copied from the Raw view.
- Data parsing only supports values copied from WebForms.
- Cookie parsing only supports values copied from the Cookies view.
The source code for the three functions has been uploaded to my GayHub, and the project will probably collect some common crawler snippets in the future, such as a latitude-and-longitude method.
Maybe not 🙂