The company I work for is an Internet finance company whose biggest business is loans. When a user applies for a loan from us, we need the user's authorization to obtain their credit data, which we then pass to the risk-control rule engine to generate a data report used to decide whether the loan can be granted.
Scheme 1: crawler based on interface
Technology stack:
- nodejs
- request
- cheerio
Advantages:
- The process is simple
- Parallel crawls are supported
- Fast response
- There are no special requirements for the environment
Disadvantages:
- User cookies need to be maintained manually
- The scheme is no longer usable (see the supplement below)
Supplement:
This was actually the best scheme. However, the central bank's credit-reporting site was later revised: the password box on the login page now uses an ActiveX control to encrypt the user's password. A pure interface approach can no longer pass the login verification, so the scheme is now unavailable.
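For reference, here is a minimal sketch of what an interface-based crawl with request and cheerio might look like. The URL, form fields, and selector are placeholders rather than the real credit-reporting endpoints, and the cookie jar illustrates the manual cookie maintenance mentioned above.

```js
const request = require('request');
const cheerio = require('cheerio');

const jar = request.jar(); // each user's cookies are kept in their own jar

// Hypothetical login endpoint and form fields, for illustration only.
request.post({
  url: 'https://example.com/login',
  form: { loginname: 'user', password: 'password' },
  jar,
}, (loginErr) => {
  if (loginErr) throw loginErr;
  // Reuse the same cookie jar to fetch the report page, then parse it with cheerio.
  request.get({ url: 'https://example.com/report', jar }, (err, res, html) => {
    if (err) throw err;
    const $ = cheerio.load(html);
    console.log($('title').text()); // extract report fields from the HTML here
  });
});
```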
Scheme 2: crawler based on browser
Technology stack:
- nodejs
- selenium
- winio
- jquery
Advantages:
- Can pass the login verification of the central bank credit-reporting site
- No need to maintain user cookies manually
Disadvantages:
- Depends on the Internet Explorer browser environment
- Slow response
- Parallel crawls are not supported
- Driver-level keyboard input is unstable
Supplement:
Because ActiveX controls can only be loaded in Internet Explorer, the crawler must be deployed on a Windows machine, which is the Worker machine in the diagram. In addition, the password in the ActiveX control cannot be filled in directly by code; the user's password must be typed with driver-level keyboard input.
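To make the flow concrete, below is a rough sketch of a browser-driven login with selenium-webdriver on Internet Explorer. The URL, element names, and the winio helper are placeholders; the driver-level keystroke injection itself is stubbed out because it is Windows-specific.

```js
const { Builder, By, until } = require('selenium-webdriver');

async function typePasswordWithWinio() {
  // Placeholder: the real Worker sends driver-level keystrokes through winio,
  // because the ActiveX password box ignores DOM-level input such as sendKeys.
}

async function loginWithBrowser(username) {
  // Requires IEDriverServer on a Windows machine.
  const driver = await new Builder().forBrowser('internet explorer').build();
  try {
    await driver.get('https://example.com/login');            // hypothetical login page
    await driver.findElement(By.name('loginname')).sendKeys(username);
    await typePasswordWithWinio();                             // password entered via the keyboard driver
    await driver.findElement(By.id('loginBtn')).click();
    await driver.wait(until.titleContains('Report'), 10000);  // wait for the post-login page
    return driver.getPageSource();
  } finally {
    await driver.quit();
  }
}
```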
Scheme 3: crawler based on browser + interface
Technology stack:
- nodejs
- selenium
- winio
- request
- cheerio
Advantages:
- Can pass the login verification of the central bank credit-reporting site
- Fast response
- Parallel crawls are supported
Disadvantages:
- Depends on the Internet Explorer browser environment
- Driver-level keyboard input is unstable
- Multiple Worker machines need to be deployed
- The process is complicated
- User cookies need to be maintained manually
Supplement:
This scheme combines the advantages of schemes 1 and 2. The Worker machine is used to load the ActiveX control, enter the user's password, obtain the encrypted password, and return it to the crawler server. The rest of the process is the same as in scheme 1.
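A sketch of how the two halves could fit together is shown below. The Worker endpoint, URLs, and form fields are assumptions for illustration; the idea is simply that the Windows Worker returns the ActiveX-encrypted password, after which the crawler server proceeds exactly as in scheme 1.

```js
const request = require('request');

// Ask a Windows Worker to load the ActiveX control, type the password via
// driver-level input, and return the encrypted value (hypothetical endpoint).
function getEncryptedPassword(username, password, cb) {
  request.post({
    url: 'http://worker-01:3000/encrypt',
    json: { username, password },
  }, (err, res, body) => cb(err, body && body.encryptedPassword));
}

function login(username, password) {
  getEncryptedPassword(username, password, (err, encrypted) => {
    if (err) throw err;
    const jar = request.jar();
    // From here on the flow matches scheme 1: plain HTTP requests with a cookie jar.
    request.post({
      url: 'https://example.com/login',              // hypothetical login endpoint
      form: { loginname: username, password: encrypted },
      jar,
    }, (e) => {
      if (e) throw e;
      // continue crawling the report pages with the same jar
    });
  });
}
```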
Proxy optimization
The central bank credit-reporting site has an anti-crawling mechanism: if many users log in from the same IP, that IP may be blocked. The crawler therefore needs proxy IPs to improve stability and success rate. This is easy enough to solve by paying for a proxy IP service.
However, the quality of proxy IP providers varies widely: good services are expensive, and cheap ones are unstable. If, like the author, you can only afford a low-quality provider such as the Sun proxy service, here is an idea for squeezing more reliability out of a mediocre proxy service.
Write a scheduled task that fetches N proxy IPs at a fixed interval and uses each one to request the login page of the central bank credit-reporting site. If the request succeeds within 1 second, the IP is added to the IP pool; otherwise it is discarded.
By maintaining a high-quality IP pool with this scheduled task, the stability and success rate of the crawler can be greatly improved.
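A minimal version of that scheduled task might look like the sketch below. The provider call, login URL, and interval are placeholders; the key points are the 1-second timeout and keeping only the proxies that pass it.

```js
const request = require('request');

const ipPool = new Set(); // the crawler picks proxies from this pool

function fetchProxiesFromProvider() {
  // Placeholder: call the proxy provider's API and return a list like
  // ['http://1.2.3.4:8080', ...].
  return [];
}

function checkProxy(proxy) {
  request.get({
    url: 'https://example.com/login', // hypothetical credit-reporting login page
    proxy,
    timeout: 1000,                    // keep only proxies that respond within 1 second
  }, (err, res) => {
    if (!err && res.statusCode === 200) {
      ipPool.add(proxy);
    } else {
      ipPool.delete(proxy);
    }
  });
}

// Re-check a fresh batch of proxies every few minutes.
setInterval(() => fetchProxiesFromProvider().forEach(checkProxy), 5 * 60 * 1000);
```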
Other
All three schemes and the proxy optimization described above have been used in the production environment of the author's company. The stack actually used differs from the one listed in this article; the stack described here is what I consider, in hindsight, the best technical approach. In the future I will find time to refactor the code of scheme 3 and open-source it on GitHub.