Editor's note: this article was shared by Cui Guangyu, an R&D manager in Ctrip's Hotel R&D department, in the third session of Ctrip Tech Micro Share. What follows is a summary of the talk.

Have you ever been plagued by crawlers? When you see the word "crawler," does your blood boil? Be patient: with a little effort, you can hand them a nominal victory while actually making them pay for it.

I. Why anti-crawler

1. Crawlers account for a high proportion of total PV, which wastes money (especially the March crawlers).

What do I mean by "March crawlers"? Every year in March, we get a surge of crawler traffic.

At first we were puzzled. Then one April, we removed a URL, and a crawler kept hammering that URL, generating a flood of errors, and the alerts started to trouble us. We ended up doing a release specifically for that crawler, putting the deleted URL back.

One of our team members found this humiliating: "We can't kill off a crawler, and instead we have to do a release for it? Shameful." So he came up with an idea: fine, the URL can come back, but it will never serve real data.

So we published a static file. The errors stopped; the crawler didn't, which meant it had no idea the data was fake. That taught us a big lesson, one that goes straight to the heart of anti-crawler work: change.

Later, a student applied for an internship with us. Her resume showed she had crawled Ctrip, and in the interview I confirmed she was the one who caused that April release. But she was a girl and her skills were good, so we hired her. The offer is almost official by now.

When we discussed it later, she mentioned that a large number of master's students choose to crawl OTA data for sentiment analysis in their theses. The thesis is due in May, and everyone has spent the earlier months on DotA and LOL, you know how it is; by March it's almost too late, so they rush to grab the data, analyze it in April, and submit in May.

That’s the rhythm.

2. The resources the site lets anyone query for free get scraped in bulk, which erodes our competitiveness and costs us money.

OTA prices can be queried without logging in; that is the baseline of the business. If we forced login, we could ban accounts and make the other side pay a price, which is what many sites do. But we cannot force users to log in. So without anti-crawler measures, the other side can copy our information in bulk, and our competitiveness drops sharply.

If competitors can mirror our prices, users will eventually learn they only need to visit the competitor, not Ctrip. That is bad for us.

3. Is crawling illegal? If so, can we sue for damages and make money off it?

I consulted our legal department about this and found that it is still a gray area in China: a lawsuit might succeed, or it might go nowhere. So technical means remain the last line of defense.

II. What kinds of crawlers

1. Very low-level fresh graduates

The March crawlers mentioned at the beginning are an obvious example. Fresh graduates' crawlers tend to be simple and crude: they ignore server load, and since you cannot predict how many of them there are, they can easily take a site down.

By the way, crawling Ctrip.com to land an offer no longer works. As we all know, the first person to compare a beautiful woman to a flower was a genius. And the second... you know what I mean?

2. Very low-level small startups

There are more and more startups now, and who knows who talked them into it. They set out to build a business without knowing what to build; big data is hot, so they do big data.

The analysis program is almost finished when they discover they have no data.

What to do? Write a crawler. So there are countless crawlers out there, scraping data for the sake of survival.

3. Out-of-control crawlers that were written wrong and that nobody ever stopped

Crawlers sometimes account for as much as 60% of visits to Ctrip's review pages. We opted for a hard block, and they kept crawling anyway.

What does that mean? They cannot get any data at all: apart from the HTTP status being 200, everything else is wrong. Yet the crawler keeps running. It is probably an unclaimed crawler hosted on some server somewhere, still working hard after its owner forgot about it.

4. Established business rivals

These are the biggest rivals: they have the technology and the money, and whatever they want to throw at you, you have no choice but to fight back.

5. Search engines

Don't assume search engines are the good guys. They also go haywire sometimes, and a haywire search engine drives request volumes that degrade server performance no differently from a network attack.

III. What are crawlers and anti-crawlers

Since anti-crawling is still a fairly new field, we had to write some definitions ourselves. Our internal definitions:

  • Crawler: obtaining a site's information in bulk, by any technical means. The key word is bulk.

  • Anti-crawler: preventing others from obtaining your site's information in bulk, by any technical means. The key word is again bulk.

  • Friendly fire: mistakenly identifying an ordinary user as a crawler during anti-crawling. A strategy with a high friendly-fire rate cannot be used, no matter how effective it is.

  • Interception: successfully blocking crawler access, measured as an interception rate. Generally, the higher a strategy's interception rate, the higher its friendly-fire rate, so there is a trade-off.

  • Resources: the sum of machine costs and labor costs.

Bear in mind that labor cost is also a resource, and a more important one than machines. By Moore's law, machines keep getting cheaper; by the trend of the IT industry, programmers keep getting more expensive. So making the other side's people work overtime is the real win; burning their machine budget is not worth much.

IV. Know yourself: how to write a simple crawler

To fight crawlers, we first need to know how to write a simple one.

The crawlers you find on the web today are very limited, usually just a snippet of Python code. Python is a great language, but against a site with anti-crawler measures it is not really the best choice.

Ironically, Python crawler code often ships with a Lynx User-Agent. I surely don't need to tell you what to do about that User-Agent.

Writing a crawler usually involves a few steps:

  • Analyze the page request format

  • Create the appropriate HTTP request

  • Send HTTP requests in batches to obtain data

For example, open a Ctrip production URL, click through to the details page, and watch the price load. Assuming the price is what you want, once you have captured the network requests, which one holds the result?

The answer is surprisingly simple: sort the requests in descending order by the amount of data transferred. The decoy URLs are the more complicated ones, and developers will not bother attaching much data to them.
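To make the three steps concrete, here is a minimal sketch of such a crawler in TypeScript (Node 18+, which ships a global fetch). The URL, the query parameter, and the response shape are all hypothetical, not Ctrip's real interface:

```ts
// Minimal crawler sketch: my illustration, not Ctrip's code.
const HOTEL_IDS = [101, 102, 103];

async function fetchPrice(hotelId: number): Promise<unknown> {
  // Step 2: construct a request that mimics what the page itself sends.
  const res = await fetch(`https://example.com/api/hotel/price?id=${hotelId}`, {
    headers: {
      // A real browser UA, not a library default that gives you away.
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
      Accept: "application/json",
    },
  });
  return res.json();
}

// Step 3: send the requests in bulk (sequentially here, to be polite).
async function main(): Promise<void> {
  for (const id of HOTEL_IDS) {
    console.log(id, await fetchPrice(id));
  }
}

main().catch(console.error);
```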

V. Know your enemy: how to write an advanced crawler

So what makes a crawler "advanced"? The usual claims are the following:

Distributed

Textbooks often tell you that a crawler must be distributed across multiple machines to be efficient. That is an outright lie. The only thing distribution uniquely buys you is IP addresses, to dodge IP bans. Banning IPs is the ultimate weapon: the effect is excellent, and of course the friendly fire on real users is spectacular too.

Simulating JavaScript

Some tutorials call emulating JavaScript to crawl dynamic pages an advanced technique. It is really a very basic capability: if the site has no anti-crawler measures, you can simply hit the AJAX endpoints and ignore what the JS does; if it does have them, the JavaScript is bound to be deliberately complex, and the real work is analyzing it, not naively simulating it.

In other words: this should count as table stakes.

PhantomJS

This is the extreme case. PhantomJS was originally built for automated testing, but it works so well that many people use it as a crawler. The catch is efficiency. And PhantomJS can still be caught, for a number of reasons I won't go into here.
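The talk leaves those reasons out, but as one hedged illustration of my own: PhantomJS historically injected telltale globals into every page it loaded, so a single probe in the served JavaScript could flag it.

```ts
// Runs in the page, not in Node. PhantomJS exposed window.callPhantom
// and window._phantom on the pages it drove; a quick probe suffices.
function looksLikePhantom(w: any): boolean {
  return Boolean(w.callPhantom || w._phantom);
}
```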

VI. Advantages and disadvantages of different levels of crawlers

Low-level crawlers are easy to block, but they perform well and cost little. The more advanced the crawler, the harder it is to block, but also the worse its performance and the higher its cost.

When the crawler's cost is pushed high enough, we can stop blocking it. Economics calls this the marginal effect: past a certain point, extra effort buys very little extra benefit.

Compare the resources both sides are spending, and you will find that fighting unconditionally is not cost-effective. There is a break-even point, and beyond it the crawler is welcome to crawl. After all, we fight crawlers for business reasons, not for face.

VII. How to design an anti-crawler system (the conventional architecture)

A friend once handed me this architecture:

1. Preprocess the request so it is easier to identify;
2. Identify whether the request comes from a crawler;
3. Handle it appropriately based on the result.

At the time I thought: sounds reasonable, very much an architecture, just a different philosophy from ours. Then we actually tried it, and it fell apart. Because:

If you can already identify a crawler, what do you need the rest for? You can do whatever you like with it. And if you cannot identify it, whom exactly are you "handling appropriately"?

Two of the three sentences are filler, only one says anything, and no concrete implementation is given. So what is this architecture good for?

The industry has an architect-worship problem, so many small startups hire under the architect title. One posting read "junior architect". Architect is itself a senior position, so why is there a junior one? That is like saying "junior general" or "junior commander-in-chief".

You join such a company and find ten people: one CTO and nine architects, and odds are you are the junior architect while everyone else is senior. And junior architect is not even the bottom; some small startups hire CTOs to do the coding.

Traditional anti-crawler methods

1. Collect access statistics on the backend and block any single IP address whose traffic exceeds a threshold.

The effect is decent, but there are two flaws: it very easily hits ordinary users by mistake, and IP addresses are cheap; a few dozen yuan can buy hundreds of thousands of them. So overall it falls short, although it works very well against the March crawlers.

2. Collect access statistics on the backend and block any single session whose traffic exceeds a threshold.

This looks more advanced but is actually worse, because sessions are worthless: the crawler can simply request a new one.

3. Collect access statistics on the backend and block any single User-Agent whose traffic exceeds a threshold.

This is a blunt weapon, something like a broad-spectrum antibiotic: the effect is surprisingly good, but the collateral damage is severe, so use it with great care. So far we have only briefly blocked Firefox on Mac.

4. A combination of the above

Combining them increases coverage while lowering the friendly-fire rate, and against low-level crawlers the result is quite usable. A sketch of the idea follows.
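A minimal sketch of such combined threshold blocking, assuming a simple in-memory counter per signal; the window size and limits are made-up numbers, not Ctrip's:

```ts
// Per-signal sliding-window counters. A request is blocked only when
// at least two signals agree, which cuts down on friendly fire.
type Counters = Map<string, { count: number; windowStart: number }>;

const WINDOW_MS = 60_000;
const LIMITS: Record<string, number> = { ip: 300, session: 100, ua: 5_000 };
const counters: Record<string, Counters> = {
  ip: new Map(),
  session: new Map(),
  ua: new Map(),
};

function over(kind: "ip" | "session" | "ua", key: string): boolean {
  const now = Date.now();
  const m = counters[kind];
  const c = m.get(key);
  if (!c || now - c.windowStart > WINDOW_MS) {
    // New key, or the window expired: start counting afresh.
    m.set(key, { count: 1, windowStart: now });
    return false;
  }
  c.count++;
  return c.count > LIMITS[kind];
}

function shouldBlock(ip: string, session: string, ua: string): boolean {
  const hits = [over("ip", ip), over("session", session), over("ua", ua)];
  return hits.filter(Boolean).length >= 2;
}
```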

As all of this shows, crawling versus anti-crawling is a game, and the pay-to-win player is the strongest. Since the methods above all give mediocre results, something more reliable is needed: JavaScript.

Some people will ask: if you do it in JavaScript, can't the crawler skip the front-end logic and call the service directly? How could that be? Well, I am a bit of a clickbait artist: JavaScript is not only a front-end technology, and skipping the front end does not mean skipping JavaScript. Translation: our server side runs NodeJS.

Question: what kind of code do we fear most when we write code? What kind of code is hardest to debug?

eval

eval is notorious for being inefficient and unreadable. Which is exactly what we need.
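A toy illustration (mine, not the talk's demo) of the eval family at work: the logic that derives the price field's key never appears as plain source, so neither grep nor a static reader will find it.

```ts
// new Function is eval's sibling: the key derivation only exists as a
// string assembled from fragments at runtime. The scheme is made up.
const fragments = ["return 'k' + ", "(d.getUTCHours()", " % 7)"];
const keyOf = new Function("d", fragments.join("")) as (d: Date) => string;

console.log(keyOf(new Date())); // e.g. "k3" -- and it drifts with the clock
```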

goto

JavaScript has no real goto, so you have to implement goto yourself.
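A sketch of the usual workaround, a switch-in-a-loop state machine; as obfuscation this amounts to control-flow flattening, and the arithmetic here is a made-up example:

```ts
// Labeled break/continue stand in for goto: the execution order lives
// in the `state` data, not in the source layout a reader would follow.
function decode(input: number): number {
  let state = 0;
  let acc = input;
  loop: while (true) {
    switch (state) {
      case 0: acc *= 3; state = 2; continue loop; // "goto 2"
      case 2: acc += 7; state = 1; continue loop; // "goto 1"
      case 1: acc -= 1; state = 3; continue loop; // "goto 3"
      case 3: break loop;                         // "goto end"
    }
  }
  return acc;
}

console.log(decode(5)); // 21
```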

Today's minify tools usually just rename identifiers to things like a, b, c, d, which does not meet our needs. We can minify into something more useful, such as Arabic. Why? Because Arabic runs right to left while the digits inside it run left to right, so the displayed order never matches the logical order. Unless they hire an Arab programmer, they are in for a headache.

Unstable code

Which bugs are hard to fix? The ones that are hard to reproduce. So our code is full of nondeterminism: it is different every time.
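For instance, a sketch (with made-up field names) of a server that shapes its payload differently on every response, so a parser that worked yesterday silently breaks today:

```ts
// Every response uses a freshly randomized field name plus decoy noise.
function randomKey(): string {
  return "k" + Math.random().toString(36).slice(2, 8);
}

function buildPayload(price: number): Record<string, number> {
  const key = randomKey();
  // The page's script learns the same key from a separately served,
  // equally randomized JS file, so only a real browser can pair them.
  return { [key]: price, [randomKey()]: Math.round(Math.random() * 999) };
}

console.log(buildPayload(428));
```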

Code demo

Downloading the demo code makes this easier to follow. Briefly, the ideas are:

  1. Pure JavaScript anti-crawler demo: change the request address, so the other side grabs a wrong price. Simple, but easy to detect once the other side targets you specifically.

  2. Pure JavaScript anti-crawler demo: change the key under which the price is stored. Simple and not easy to detect, and a crawler still reading the old key quietly scrapes a wrong price.

  3. Pure JavaScript anti-crawler demo: make the key dynamic. Changing the key now costs nothing, which makes the defense cheaper still (see the sketch after this list).

  4. Pure JavaScript anti-crawler demo: a highly complex, changing key. This makes analysis much harder, and crawling harder still once you add the browser detection described later.
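As promised, a hedged sketch of the dynamic-key idea from demo 3; the derivation and all names here are my inventions, not the actual demo code:

```ts
// Server and page JS both derive the field name from the current
// five-minute window, so the key rotates by itself and never has to
// be redeployed. The scramble (Knuth's constant) is an arbitrary pick.
function dynamicKey(now: number = Date.now()): string {
  const window5m = Math.floor(now / (5 * 60 * 1000));
  return "p" + (Math.imul(window5m, 2654435761) >>> 0).toString(36);
}

// Server: res.json({ [dynamicKey()]: 428, pDecoyA: 888, pDecoyB: 666 })
// Page:   price = data[dynamicKey()]
// A crawler that hard-codes yesterday's key scrapes a decoy instead.
console.log(dynamicKey());
```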

That’s it.

We mentioned marginal effects earlier, and this is where we stop: investing more manpower beyond this point would not pay off, unless a dedicated opponent comes after you specifically. But that is a fight for dignity, not for business.

Browser detection

We detect each browser differently; a sketch follows the list.

  • IE: detect its known bugs;

  • Firefox: detect how strictly it follows the standards;

  • Chrome: detect its distinctive features.
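In that spirit, a few classic feature probes (my illustration, not Ctrip's actual checks); they run in the page, and each keys on a quirk the engine cannot easily hide:

```ts
// Classic engine fingerprints; a spoofed User-Agent rarely fakes these.
function guessEngine(w: any): string {
  // IE: the IE-only document.documentMode property.
  if (typeof w.document?.documentMode === "number") return "IE";
  // Firefox: the legacy InstallTrigger global, long a telltale.
  if (typeof w.InstallTrigger !== "undefined") return "Firefox";
  // Chrome: the window.chrome object with Chrome-only members.
  if (w.chrome && typeof w.chrome === "object") return "Chrome";
  return "unknown";
}

// Usage in a page: guessEngine(window)
```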

VIII. Caught you: now what?

If it causes no production incidents: intercept directly.

If interception might cause production incidents: serve fake data (also known as poisoning).

There are also more divergent ideas. For example, could we put a SQL injection in the response? After all, the other side struck first. Legal never gave a concrete reply to this one, and it was not an easy question to explain to her anyway, so for now it stays a hypothesis.

1. Technical suppression

As everyone knows, DotA AI maps have a de command that grants the AI extra experience whenever it is killed. Kill the AI too often early on, and it comes back in god-tier gear, impossible to kill.

The correct play is to suppress the opponent's level without killing them. Anti-crawling works the same way: do not push too hard at the start, or you force the other side into a fight to the death.

2. Psychological warfare

Provocation, pity, mockery, trash talk.

I will skip the details of these; you get the spirit.

3. Going easy

This is probably the highest level.

It’s not easy for programmers, especially for crawlers. Have pity on them and give them a little bite to eat. Maybe in a few days you’ll be a reptile because you did a good job of anti-reptile.

For example, a while back someone came to me and asked whether I would write crawlers for them... I am such a kind person; how could I possibly say no?