Ctrip Information Security Department · 2016/03/10 10:30

0x00 Preface


When a hacker turns up inside the intranet, it is tempting to assume a carefully planned APT. In reality, it may just be that a careless employee leaked VPN credentials on GitHub and a curious script kiddie stumbled across them.

0x01 Lack of security awareness


First, a picture:

A reporter phoned the director of the Liyang health bureau for an interview, and the director answered in a fluster: "You can see our Weibo? Heh, how did you see it? Everyone can see all of it?! That can't be right, can it? You can even see what the two of us post to each other? That's impossible!" …

Employees at Internet companies are similarly mobile, and their awareness of security (and privacy) varies widely. In a large organization it is inevitable that someone, out of ignorance or laziness, will push code containing sensitive information (e.g. database connection strings, email accounts, VPN credentials) to GitHub. If this information falls into the wrong hands, it lets an attacker achieve APT-level results at minimal cost.

Attached is a Vulbox report on corporate confidential information being disclosed on GitHub: www.vulbox.com/news/detail…

0x02 The anti-crawling mechanism


We therefore set out to monitor code on GitHub with a crawler, driven by a set of custom keywords such as password, mysql, account, and email. Whenever a suspected leak turns up, the program notifies the responsible person by email, and that person performs a second, manual audit. This way, leaks of sensitive code are discovered as early as possible, and the submitters can be contacted promptly to clean them up. A minimal sketch of such a keyword monitor follows.
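The article does not include the crawler's code, so the sketch below is only illustrative: it assumes GitHub's v3 code-search API, a personal access token in the GITHUB_TOKEN environment variable, a local SMTP relay, and placeholder email addresses; the "ctrip" query term is likewise an assumption, not a detail from the article.

```python
# Illustrative sketch: poll GitHub code search for sensitive keywords and
# mail suspected hits to the responsible person for a second, manual audit.
import os
import smtplib
from email.mime.text import MIMEText

import requests

KEYWORDS = ["password", "mysql", "account", "email"]  # keywords named in the article
API = "https://api.github.com/search/code"            # assumed v3 search endpoint


def search_github(keyword):
    """Return raw code-search hits for one keyword ("ctrip" is a placeholder)."""
    resp = requests.get(
        API,
        params={"q": f"ctrip {keyword}"},
        headers={"Authorization": "token " + os.environ["GITHUB_TOKEN"]},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


def notify(owner, hits):
    """Email the hit URLs to the person in charge (addresses are placeholders)."""
    msg = MIMEText("\n".join(h["html_url"] for h in hits))
    msg["Subject"] = "[github-monitor] suspected sensitive-code leak"
    msg["From"] = "monitor@example.com"
    msg["To"] = owner
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    for kw in KEYWORDS:
        hits = search_github(kw)
        if hits:
            notify("secteam@example.com", hits)
```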

The plan sounded good, but after a few rounds of frequent requests to github.com in quick succession:

We had triggered GitHub's anti-crawling mechanism. Was the project dead in the water?

0x03 The bypass


Why not use Alibaba Cloud machines to build a set of proxies, spreading requests across multiple IPs to bypass the anti-crawling mechanism?

And so we arrived at the architecture below.

The flow of a sensitive code crawl:

  1. The GitHub crawler engine initiates a crawl request.
  2. The request reaches the load-balancing Nginx, which forwards it, with equal weight, to one of the traffic-forwarding Nginx servers.

    Note: the load-balancing Nginx is deployed as a pair to prevent a single point of failure.

  3. Having received the request from the load-balancing Nginx, the traffic-forwarding Nginx forwards it on to github.com.

  4. The content returned by github.com travels back through the same Nginx chain to the GitHub crawler.

To github.com, then, the traffic appears to come from three different machines, each visiting at a third of the original frequency, which stays below the anti-crawling threshold. The GitHub crawler engine can now run continuously, and efficiency improves greatly.

The solution also scales well: if github.com's anti-crawling mechanism kicks in again, we can scale horizontally simply by adding more traffic-forwarding Nginx machines in parallel.

Load balancing Nginx core configuration:
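The original configuration is not reproduced in this text, so here is only a minimal sketch of what the load-balancing side could look like; the listen port and the forwarders' addresses are placeholders, not values from the article:

```nginx
# Load-balancing Nginx: spread crawler requests evenly across the
# traffic-forwarding Nginx servers (addresses and port are placeholders).
upstream github_forwarders {
    server 10.0.0.11:8080 weight=1;
    server 10.0.0.12:8080 weight=1;
    server 10.0.0.13:8080 weight=1;
}

server {
    listen 8080;

    location / {
        proxy_pass http://github_forwarders;
    }
}
```

Under this layout, the horizontal scaling mentioned above amounts to adding one more server line to the upstream block.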

Traffic-forwarding Nginx core configuration:
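Likewise, a minimal sketch for the traffic-forwarding side; the port and the specific directives are illustrative assumptions:

```nginx
# Traffic-forwarding Nginx: relay whatever the load balancer sends
# straight to github.com over HTTPS (port is a placeholder).
server {
    listen 8080;

    location / {
        proxy_pass https://github.com;
        proxy_set_header Host github.com;
        # Send the server name during the TLS handshake so that
        # github.com's SNI-based routing accepts the connection.
        proxy_ssl_server_name on;
    }
}
```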