Introduction
When developing crawlers, Fiddler is an essential tool. Large sites are increasingly difficult to deal with, and it often takes a lot of time to figure out which request parameters are actually required. If we can quickly convert the parameters in a request captured by Fiddler into a format Python can use, our development efficiency improves considerably.
So I wrote a small tool that quickly converts the headers, data, and cookies captured in Fiddler into the dict format accepted by requests, scrapy, and so on.
Parsing data parameters
Note: only parameters copied from Fiddler's WebForms view are supported.
Data arguments copied from WebForms look like this:
_token
address
channel 6
cityId
gpsLat 23.135075
gpsLng 113.357076
shopId 0
source shoplist
The first column is the key and the second column is the value; a key with nothing after it maps to an empty value.
Since the format is quite regular, we can use the re module to extract the keys and values separately:
import re

def re_data(data):
    key_rule = r'(.*)\t'
    key = re.findall(key_rule, data)
    value_rule = r'\t(.*)'
    value = re.findall(value_rule, data)
    result = {}
    if len(key) == len(value):
        for i in range(len(key)):
            result[key[i]] = value[i]
    print(result)
Running the sample data through this function gives:
{'_token': '', 'address': '', 'channel': '6', 'cityId': '', 'gpsLat': '23.135075', 'gpsLng': '113.357076', 'shopId': '0', 'source': 'shoplist'}
The result is printed to the console, and since it is a standard dict, we can simply copy it into our Python code.
No matter how many parameters there are, the whole process takes about ten seconds. Compared with copying and pasting each parameter by hand, this is several times faster.
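To make the data step concrete, here is a self-contained run of the same regex logic. Note that this sketch makes `re_data` return the dict rather than print it, and the sample string is simply the WebForms paste from above written out with explicit tabs and newlines:

```python
import re

def re_data(data):
    # Keys sit before each tab, values after it, one pair per line.
    key = re.findall(r'(.*)\t', data)
    value = re.findall(r'\t(.*)', data)
    result = {}
    if len(key) == len(value):
        for i in range(len(key)):
            result[key[i]] = value[i]
    return result

# The WebForms paste from above, as a "key<TAB>value" per-line string.
data = ("_token\t\naddress\t\nchannel\t6\ncityId\t\n"
        "gpsLat\t23.135075\ngpsLng\t113.357076\nshopId\t0\nsource\tshoplist")
print(re_data(data))
```

Because `.` does not match a newline, each `findall` effectively works line by line, which is why one regex pass recovers every pair.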
Parsing the header and cookie parameters
Similarly, header and cookie values can be parsed with regular expressions:
Parsing the headers
def re_header(header):
    key = re.findall(r'[\t|\n]([\w*|\.|-]*):', header)
    val = re.findall(r':[\n\t]*(.*)\n', header)
    headers = {}
    for i in range(len(key)):
        headers[key[i]] = val[i].lstrip(' ')
    print(headers)
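As a quick check of the header regexes, here is a self-contained sketch that returns the dict instead of printing it. The sample headers are made up, not from the post; note that the leading and trailing newlines in the snippet matter, because the key pattern anchors on a preceding tab or newline and the value pattern on a trailing newline:

```python
import re

def re_header(header):
    # Keys follow a tab/newline and end at ':'; values run from ':' to end of line.
    key = re.findall(r'[\t|\n]([\w*|\.|-]*):', header)
    val = re.findall(r':[\n\t]*(.*)\n', header)
    headers = {}
    for i in range(len(key)):
        headers[key[i]] = val[i].lstrip(' ')
    return headers

# Hypothetical snippet as copied from Fiddler's Raw view.
raw = "\nHost: example.com\nUser-Agent: Mozilla/5.0\nAccept-Encoding: gzip, deflate\n"
print(re_header(raw))
```

One caveat: a value that itself contains a colon (for example a full URL in a Referer header) would confuse the value regex, so this sketch only holds for colon-free values.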
Parsing the cookie
def re_cookie(cookieStr):
    key = re.findall(r'[\t|\n]([\w*|\.]*)=', cookieStr)
    val = re.findall(r'=[\n\t]*(.*)\n', cookieStr)
    cookies = {}
    for i in range(len(key)):
        cookies[key[i]] = val[i]
    print(cookies)
Cookies are used the same way as the data parser, so I won't repeat the demonstration here.
Note:
- Header parsing only supports values copied from the Raw view.
- Data parsing only supports values copied from WebForms.
- Cookie parsing only supports values copied from the Cookies view.
The source code for the three functions has been uploaded to my GayHub, and the project will probably collect some common crawler snippets in the future, such as a latitude-and-longitude method.
Maybe not 🙂