The company I work for is an Internet finance company whose biggest business is loans. When a user applies for a loan from us, we need the user's authorization to obtain their credit data, which we then pass to the risk-control rule engine to generate a data report used to decide whether the loan can be granted.
Scheme 1: crawler based on interface
Technology stack:
- nodejs
- request
- cheerio
Advantages:
- The process is simple
- Parallel crawls are supported
- Fast response
- There are no special requirements for the environment
Disadvantages:
- User cookies need to be maintained manually
- The scheme is no longer usable (see the supplement below)
Supplement:
This was actually the best scheme. However, the central bank's credit-reporting site was later revised: the password box on the login page now uses an ActiveX control to encrypt the user's password. A pure interface approach can no longer pass the login verification, so the scheme is now unavailable.
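For reference, here is a minimal sketch of what an interface-based crawl with request and cheerio might look like. The URL, form fields, and selector are placeholders rather than the real credit-reporting endpoints, and the cookie jar illustrates the manual cookie maintenance mentioned above.

```js
const request = require('request');
const cheerio = require('cheerio');

const jar = request.jar(); // each user's cookies are kept in their own jar

// Hypothetical login endpoint and form fields, for illustration only.
request.post({
  url: 'https://example.com/login',
  form: { loginname: 'user', password: 'password' },
  jar,
}, (loginErr) => {
  if (loginErr) throw loginErr;
  // Reuse the same cookie jar to fetch the report page, then parse it with cheerio.
  request.get({ url: 'https://example.com/report', jar }, (err, res, html) => {
    if (err) throw err;
    const $ = cheerio.load(html);
    console.log($('title').text()); // extract report fields from the HTML here
  });
});
```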
Scheme 2: crawler based on browser
Technology stack:
- nodejs
- selenium
- winio
- jquery
Advantages:
- Can pass the login verification of the central bank credit-reporting site
- No need to maintain user cookies manually
Disadvantages:
- Depends on the Internet Explorer browser environment
- Slow response
- Parallel crawls are not supported
- Driver-level keyboard input is unstable
Supplement:
Because ActiveX controls can only be loaded in Internet Explorer, the crawler must be deployed on a Windows machine, which is the Worker machine in the diagram. In addition, the password in the ActiveX control cannot be filled in directly by code; the user's password must be typed with driver-level keyboard input.
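To make the flow concrete, below is a rough sketch of a browser-driven login with selenium-webdriver on Internet Explorer. The URL, element names, and the winio helper are placeholders; the driver-level keystroke injection itself is stubbed out because it is Windows-specific.

```js
const { Builder, By, until } = require('selenium-webdriver');

async function typePasswordWithWinio() {
  // Placeholder: the real Worker sends driver-level keystrokes through winio,
  // because the ActiveX password box ignores DOM-level input such as sendKeys.
}

async function loginWithBrowser(username) {
  // Requires IEDriverServer on a Windows machine.
  const driver = await new Builder().forBrowser('internet explorer').build();
  try {
    await driver.get('https://example.com/login');            // hypothetical login page
    await driver.findElement(By.name('loginname')).sendKeys(username);
    await typePasswordWithWinio();                             // password entered via the keyboard driver
    await driver.findElement(By.id('loginBtn')).click();
    await driver.wait(until.titleContains('Report'), 10000);  // wait for the post-login page
    return driver.getPageSource();
  } finally {
    await driver.quit();
  }
}
```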
Scheme 3: crawler based on browser + interface
Technology stack:
- nodejs
- selenium
- winio
- request
- cheerio
Advantages:
- Can pass the login verification of the central bank credit-reporting site
- Fast response
- Parallel crawls are supported
Disadvantages:
- Depends on the Internet Explorer browser environment
- Driver-level keyboard input is unstable
- Multiple Worker machines need to be deployed
- The process is complicated
- User cookies need to be maintained manually
Supplement:
This scheme combines the advantages of schemes 1 and 2. The Worker machine is used to load the ActiveX control, enter the user's password, obtain the encrypted password, and return it to the crawler server. The rest of the process is the same as in scheme 1.
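A sketch of how the two halves could fit together is shown below. The Worker endpoint, URLs, and form fields are assumptions for illustration; the idea is simply that the Windows Worker returns the ActiveX-encrypted password, after which the crawler server proceeds exactly as in scheme 1.

```js
const request = require('request');

// Ask a Windows Worker to load the ActiveX control, type the password via
// driver-level input, and return the encrypted value (hypothetical endpoint).
function getEncryptedPassword(username, password, cb) {
  request.post({
    url: 'http://worker-01:3000/encrypt',
    json: { username, password },
  }, (err, res, body) => cb(err, body && body.encryptedPassword));
}

function login(username, password) {
  getEncryptedPassword(username, password, (err, encrypted) => {
    if (err) throw err;
    const jar = request.jar();
    // From here on the flow matches scheme 1: plain HTTP requests with a cookie jar.
    request.post({
      url: 'https://example.com/login',              // hypothetical login endpoint
      form: { loginname: username, password: encrypted },
      jar,
    }, (e) => {
      if (e) throw e;
      // continue crawling the report pages with the same jar
    });
  });
}
```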
Proxy optimization
The central bank credit-reporting site has an anti-crawling mechanism: if many users log in from the same IP, that IP may be blocked. The crawler therefore needs proxy IPs to improve stability and success rate. This is easy enough to solve by paying for a proxy IP service.
However, the quality of proxy IP providers varies widely: good services are expensive, and cheap ones are unstable. If, like the author, you can only afford a low-quality provider such as the Sun proxy service, here is an idea for squeezing more reliability out of a mediocre proxy service.
Write a scheduled task that fetches N proxy IPs at a fixed interval and uses each one to request the login page of the central bank credit-reporting site. If the request succeeds within 1 second, the IP is added to the IP pool; otherwise it is discarded.
By maintaining a high-quality IP pool with this scheduled task, the stability and success rate of the crawler can be greatly improved.
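A minimal version of that scheduled task might look like the sketch below. The provider call, login URL, and interval are placeholders; the key points are the 1-second timeout and keeping only the proxies that pass it.

```js
const request = require('request');

const ipPool = new Set(); // the crawler picks proxies from this pool

function fetchProxiesFromProvider() {
  // Placeholder: call the proxy provider's API and return a list like
  // ['http://1.2.3.4:8080', ...].
  return [];
}

function checkProxy(proxy) {
  request.get({
    url: 'https://example.com/login', // hypothetical credit-reporting login page
    proxy,
    timeout: 1000,                    // keep only proxies that respond within 1 second
  }, (err, res) => {
    if (!err && res.statusCode === 200) {
      ipPool.add(proxy);
    } else {
      ipPool.delete(proxy);
    }
  });
}

// Re-check a fresh batch of proxies every few minutes.
setInterval(() => fetchProxiesFromProvider().forEach(checkProxy), 5 * 60 * 1000);
```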
Other
All three schemes and the proxy optimization described above have been used in the production environment of the author's company. The stack actually used differs from the one listed in this article; the stack described here is what I consider, in hindsight, the best technical approach. In the future I will find time to refactor the code of scheme 3 and open-source it on GitHub.