Crawling and anti-crawling make for a decidedly unsunny industry.
This article is from the WeChat public account “Ctrip Technology Center”.
Cui Guangyu is a development manager in the R&D department of Ctrip Hotel and good buddies with his crawler-writing counterparts at Qunar. He is the “not famous” joke-teller of the Ctrip Technology Center.
Contents
- Preface
- The current state of crawler and anti-crawler operations
- The real-world share of crawlers
- How the decisions get made
- The current state of crawler and anti-crawler technology
- Rehabilitating Python
- The friendly fire rate you can’t get around
- The front-end engineers strike back
- Friendly fire, again friendly fire
- The current state of crawler and anti-crawler tactics
- No response
- Evolution
- Legal channels
- Stirring things up and raising a flag
- The future of crawling and anti-crawling
0
Preface
Crawling and anti-crawling make for a decidedly unsunny industry.
Unsunny here has two meanings.
The first is that the industry lives underground and is rarely exposed. Many companies do not announce that they have a crawler team, and some even hide the fact that they have an anti-crawler team. That is probably a matter of corporate strategy rather than technology.
The second is that it is not a very upbeat industry. Many people have worked in it for years and accumulated a great deal of experience, which sadly does not translate into a stellar resume. In interviews, the two sides often have different ideas about crawling or anti-crawling and fail to win each other over, which hurts the candidate’s job hunt. Programmers, like scholars, tend to look down on one another to begin with, and here the ideas really do differ.
But that is the programmer’s lot. However unsunny the industry is, company demand still pushes large numbers of people into it.
So what company needs make crawlers and anti-crawlers genuinely necessary?
Anti-crawling is easy to understand: where there are crawlers, we naturally want to stop them. For a programmer, the motive may be nothing more than “I just want to prove that I’m better than you.” For the company the stakes are higher: at the very least it reduces server load, and that alone justifies the existence of anti-crawling.
What about crawlers?
The earliest crawlers came from search engines. Search-engine crawlers are well-meaning: they retrieve your information and make it findable for other users. The robots.txt file was defined specifically for them as a gentleman’s agreement, and the arrangement is a win-win.
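To make the gentleman’s agreement concrete, here is a minimal sketch of how a polite crawler might consult robots.txt before fetching a page, using Python’s standard urllib.robotparser. The robots.txt content, the bot name ExampleBot and the example.com URLs are all made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (an assumption, not real data).
ROBOTS_TXT = """\
User-agent: *
Disallow: /hotel/price/
Crawl-delay: 5
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# A well-behaved crawler asks before every fetch and honors the answer.
allowed_home = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_price = parser.can_fetch("ExampleBot", "https://example.com/hotel/price/123")
delay_seconds = parser.crawl_delay("ExampleBot")

print(allowed_home, allowed_price, delay_seconds)
```

Nothing enforces these answers: the file only tells a crawler what the site would like, which is why it works only as long as the crawler chooses to be a gentleman.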
However, some people soon spoiled things. Crawlers quickly stopped being gentlemen.
Then came “big data”. Countless media outlets trumpeted big data as the trend of the future, attracting one batch of cannon fodder after another to found big data companies. These people had no big data on hand; their data fit on a USB flash drive, so how could they dare call it big data? You cannot fool investors with that little data. So they started writing crawlers and crawling the data of company after company. Pretty soon their data no longer fit on a flash drive. Finally they could breathe a sigh of relief and go out to brag and raise money.
Sadly, however, ever larger USB flash drives keep coming out, so their data is forever racing to outgrow them.
That, roughly, is the history of crawlers and anti-crawlers.
1
The current state of crawler and anti-crawler operations
Crawling and anti-crawling are especially interesting in e-commerce. The original demand for crawlers there came from price comparison.
Price comparison is the core business of some e-commerce sites. If you are a price-sensitive shopper, you have probably used an online price comparison service (it works really well). Unsurprisingly, these sites use crawlers to collect prices from all the relevant e-commerce sites. Their crawlers are still relatively gentle and do not put much pressure on anyone’s servers.
That does not mean everyone likes being crawled by them, though. After all, it hurts the other e-commerce sites. So anti-crawling by technical means becomes necessary.
As technical people see it, if the other side comes at us with technology, we answer with technology and do not back down. It is a fine idea, but in practice things do not work that way at all.
Sure, technology matters, but in practice tactics matter even more. Whoever has the deeper playbook gets to toy with the other side. If your tactics are poor, then no matter how good your technology is, you will just be led around in circles. This is a bit of a blow to techies’ egos, but it is hardly the first such blow. You should be used to it by now.
1. The real-world share of crawlers
You have probably heard the claim that more than 50% of all Internet traffic actually comes from crawlers. When I first heard it, I did not really believe it. I thought it was wildly exaggerated. How could there be more crawlers than people? A crawler is, after all, only an auxiliary tool.
Now that I have been doing anti-crawling for a long time, I still think that claim is off. 50%? Are you kidding me? That low?
Take one page-level API at one company, which receives about 12,000 requests per minute. How many of these come from normal users?
50%? 60%? Or what?
The correct answer: less than 500.
That is, of the 12,000 requests per minute to that one page, 500 come from normal users and the rest are crawlers.
Note that you can never identify every crawler when counting, so some crawlers are still hiding among those 500 “users”. The crawler share is therefore at least about:
(12000-500) / 12000 = 95.8%
Did you guess that number?
So many crawlers and so few users: what are we all doing? Why would a business that serves a few hundred users need thousands of crawlers to go with it? More than 95%, nineteen crawlers escorting one user?
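The arithmetic above is easy to reproduce. The short sketch below recomputes the crawler share from the two figures given in the text (12,000 requests and 500 normal users per minute) and converts a share into a crawlers-per-user ratio; the function names are my own and purely illustrative.

```python
def crawler_share(total_requests: int, human_requests: int) -> float:
    """Fraction of requests that are not attributed to normal users."""
    return (total_requests - human_requests) / total_requests


def crawlers_per_human(share: float) -> float:
    """Crawler requests that accompany each human request at a given share."""
    return share / (1.0 - share)


share = crawler_share(12_000, 500)
print(f"crawler share: {share:.1%}")
print(f"crawler requests per user request at 95%: {crawlers_per_human(0.95):.0f}")
print(f"crawler requests per user request as measured: {crawlers_per_human(share):.0f}")
```

A 95% share corresponds to the 19-to-1 figure quoted above, and the measured share works out to roughly 23 crawler requests for every user request.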
The answer may be rather absurd. Most of these crawlers are the result of bad decisions.
2. How the decisions get made
Suppose there are three e-commerce companies in the world selling the same products. Call them A, B and C.
A customer goes to company A to check the price of a certain product, finds the price unattractive, and decides not to buy. His contribution to the industry as a whole is zero.
However, company A’s back office detects that a customer has been lost: he came to check a product and left because the price was not good. No problem, it thinks, let me crawl the others and see.
So it crawls company B and company C.
Company B’s back office then detects that someone came to check prices but placed no order in the end. It thinks: we have lost a customer. What to do?
I can crawl the others and see what their prices are. So it crawls A and C.
Company C’s back office detects that someone came to check prices…
After a while, the servers of all three companies raise alarms: traffic is too high. The three CTOs also wonder why traffic is so high when it generates no orders. It must be those two other beasts writing crawlers with no rate limit. Damn it, I want revenge. So each company builds anti-crawling to keep the others from taking its data, and then beefs up its own crawler team to take everyone else’s. The motto: better that I crawl the whole world than let the world crawl me.
Then the anti-crawler engineers work overtime every day figuring out how to block crawlers, and the crawler engineers, once blocked, work every day on cracking the anti-crawler strategy. Everyone wastes resources on useless things. Only when the companies merge do they calm down, sit down to talk, and agree to crawl a bit less.
There have been many mergers among domestic companies recently. I suspect a lot of “calm” has broken out.
2
The current state of crawler and anti-crawler technology
Now let’s talk about how crawlers and anti-crawlers actually work.
1. Rehabilitating Python
First, crawlers. You can find crawler tutorials everywhere, mostly written in Python. I once said in an article that crawlers written in Python are the weakest, because they are inherently unsuited to cracking anti-crawler logic, which is handled in JavaScript. Over time, however, I found this view somewhat flawed (and of course, if I told you I was bashing Python back then for work reasons… would you believe me?).
It is true that Python is not a good fit for dealing with anti-crawler logic directly, but Python is a glue language that works with almost any framework. And because anti-crawler strategies often change drastically, crawler code needs drastic refactoring or even rewriting. For that, Python is a fairly good fit.
Say you are using Selenium to crawl a site and you find yourself blocked by some subtle method you cannot identify. What do you do? Will you trace through Selenium’s source code to find out where you were detected?
You won’t. You just switch frameworks and crawl a different way. You end up dabbling in both frameworks without going deep in either, because before you finish studying one, the other side may well change things again, and you will have to find yet another framework. After all, the boss is waiting for the data for tomorrow morning’s meeting. The boss usually meets at 8 or 9 a.m., so you need to be done by 7. When you finally get fed up and decide to change jobs, your resume says “familiar with n frameworks” and that is all.
This is the fate of the crawler engineer, and it is worse than outsourcing. Outsourced work does not build up skills either, but at least it comes with normal hours, which the crawler engineer does not even get.
So are anti-crawler engineers better off? Not really. Anti-crawling has a built-in fatal weakness: the friendly fire rate.
2. The friendly fire rate you can’t get around
Let’s start with a question: what is your first reaction when you find someone crawling you?
If answers are time-limited, most people tell me: block the IP.
The problem, however, is that an IP address is not one person. Large companies share egress IPs; ISPs sometimes hijack traffic and route you through a proxy; some people simply like using proxies; some keep a VPN on around the clock to get over the firewall. Worst of all, this is the mobile Internet era. You blocked an IP? Sorry, this is China Unicom’s 4G network: five minutes ago the address belonged to someone else, and five minutes from now it will belong to someone else again!
So blocking IPs has the highest friendly fire rate, and it is also the least effective measure, because even a complete novice now knows to use a proxy pool. Go look on Taobao and see how little hundreds of thousands of proxies cost. And that is without counting the free proxies that are everywhere.
Some people say: I can scan the other side’s ports, and if a proxy port is open, the address is a proxy and I can block it.
Reality is brutal. I once blocked an IP because it had a proxy port open, and a rather obscure proxy port at that. Within a day someone reported that one of our own branch offices had been blocked. I looked up the IP, and it was indeed one of ours. Quite depressed, I asked their IT why this port was open. He said it was for the mail server. I asked why he would use such a weird port. He said: isn’t that so nobody can guess it? I just picked one at random.
A more advanced version of port scanning is to check the order database: if the IP has never placed an order, it is considered safe to block; if it has, blocking it is not safe. Many websites use this method. But it is only a way of fooling yourself. One order permanently whitewashes an IP: is there any cheaper way to buy immunity?
So IP blocking, along with its advanced variants such as port scanning, is useless. Do not even think about starting from the IP, because your opponents will put plenty of effort into evading IP blocks. It makes no sense.
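The two failure modes of IP blocking can be seen in a toy example. The sketch below uses an invented access log (documentation-range IP addresses, made-up client names, and an is_crawler label that a real system would not have) to show a single blocked address both hitting real users and missing the crawler.

```python
# Hypothetical access-log records: (ip, client, is_crawler).
# In reality the is_crawler label is exactly what you do not know.
LOG = [
    ("203.0.113.7", "crawler-1", True),
    ("203.0.113.7", "office-user-a", False),   # shared corporate egress IP
    ("203.0.113.7", "office-user-b", False),
    ("198.51.100.9", "crawler-1", True),       # same crawler, next proxy in its pool
    ("192.0.2.44", "mobile-user-c", False),
]


def block_by_ip(log, blocked_ips):
    """Return (crawlers still getting through, real users blocked by mistake)."""
    surviving_crawlers = set()
    blocked_users = set()
    for ip, client, is_crawler in log:
        if ip in blocked_ips:
            if not is_crawler:
                blocked_users.add(client)      # friendly fire
        elif is_crawler:
            surviving_crawlers.add(client)     # the block was evaded
    return surviving_crawlers, blocked_users


surviving, collateral = block_by_ip(LOG, blocked_ips={"203.0.113.7"})
print("crawlers still active:", sorted(surviving))
print("real users blocked:", sorted(collateral))
```

In this made-up log, blocking one address cuts off two office users while the crawler simply continues from the next proxy.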
So what do you think of next?
Many site engineers think: if I can’t block it, I will make it unreadable. I render key information, such as prices, as images. The human eye can read it, but a machine can’t.
That used to be true. However, technology marched on and gave us a nasty thing called machine learning, which in turn drove rapid progress in OCR. Before long, recognizing images was no problem at all. Some OCR can even handle CAPTCHAs that are hard for the human eye, with a better recognition rate than mine. What’s more, there are now paid CAPTCHA-solving platforms, so money alone gets it done and no technology is needed.
So what do you consider next?
At this point, there’s not much left for the backend engineer to do.
But when the back end is out of moves, it is usually the front end’s turn; the front end has always been the one to take the blame when the back end fails anyway. We have been doing this for years. This is the time for the front-end engineer to step up:
“Stop showing off; let’s see whose front-end knowledge is stronger. If you are good enough, I will let you crawl.”
I do not know how many front-end engineers are reading this article, but let me mention in passing that you are about to be even more in demand.
3. The front-end engineers strike back
We know that for data to be displayed, the back end outputting it is not enough; the front end has to do a lot of work too. It takes the JSON and at the very least has to turn it into HTML with a template, right? That is the bare minimum. Then it has to be styled with CSS, right? None of this is hard.
Wait, do you remember the first time you did this? Was it really not that hard?
Did you ever misspell an HTML tag, or forget to close one, and break the whole page? Did a CSS mistake ever send the entire layout drifting off to who knows where?
Would you like to make someone else go through all that again?
This goes to show: have a veteran front-end engineer make things a little more complicated, and even if the other side assigns an equally veteran front-end engineer to crack it, it will take them more than three times as long. After all, reading other people’s code works like this: for every minute they spend writing, you spend two minutes reading and then one minute cursing. And that is a conservative estimate. And if they do not have a front-end engineer… then over time their crawler engineers grow into front-end engineers.
After that, because front-end engineers are paid somewhat better than crawler engineers, they soon leave to become front-end engineers. This not only eases the front-end talent shortage but also leaves the other side short-handed and hiring again. They usually hire back-end engineers to write crawlers, and those people have to be tormented all over again until they too grow into front-end engineers. A very good thing.
So if your crawler engineers have a high turnover rate, think hard about whether you are hiring in the wrong direction.
So which front-end technology is the nastiest? The nastiest, and the most powerful, is of course our JavaScript.
JavaScript has a huge number of oddities. It is no exaggeration to say that you can learn a new feature (bug) from it every week. At this point you are the interviewer, and the other side’s crawler is the candidate who has to pass your interview.
For example: does your Array.prototype have map? Since which version? You say you are browser xx, so should you have it or not? You say you have it? But you actually don’t. Can you index a string with [] to get a character? Which browsers allow it and which don’t? Why do you support the webkit prefix? Wait, a minute ago you supported it and now you don’t? Your story does not add up.
All of this is basic knowledge for front-end people, taken for granted. For back-end people it is a nightmare.
The front-end folks, however, dug their own grave by coming up with something called NodeJS. Built on V8, it runs any JS in no time.
However, NodeJS implements a number of features that do not exist in browsers. Probe for one of them casually (say, why do you support process.exit?) and a NodeJS-based crawler is exposed immediately. Besides… you pull JS meant for the browser into your back end and run it with NodeJS; have you considered what kind of security hole that is? Isn’t this what they call mixing code and data? What if the other side puts some nasty code in the JS that browsers do not support but NodeJS does?
Luckily, crawler engineers also have PhantomJS. But, uh, why don’t you have location? Ha, you finally simulated location, but wait: under my current security policy you should not be able to read location right now. How did you get it? And when PhantomJS’s own authors can no longer keep it maintained, are you sure you want to keep using it?
Of course, in the end every anti-crawler strategy will be cracked. But cracking takes time, and all the anti-crawler side has to do is release frequently and wear the other side out. If they need two days to crack your system and you release every day, you are safe. The system could even be renamed “One Anti-Crawler Puzzle a Day: Learn Front End the Easy Way”.
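The release-frequency argument can be put into a tiny model. This is only a sketch under a strong simplifying assumption of my own: every release fully invalidates the previous crack and the other side starts over from scratch. The function name and parameters are illustrative.

```python
def crawler_uptime(release_interval_days: float, crack_time_days: float) -> float:
    """Fraction of time the crawler holds a working crack.

    Assumes each release invalidates the previous crack completely, so the
    crawler only works for whatever is left of the interval after cracking.
    """
    if release_interval_days <= 0:
        raise ValueError("release interval must be positive")
    working_days = max(0.0, release_interval_days - crack_time_days)
    return working_days / release_interval_days


# The case from the text: two days to crack, one release per day.
print(crawler_uptime(release_interval_days=1, crack_time_days=2))
# A slower release cadence leaves the crawler working part of the time.
print(crawler_uptime(release_interval_days=4, crack_time_days=2))
```

Under this model, releasing faster than the other side can crack keeps the crawler’s working time at zero, which is the whole point of the strategy; the cost is the release risk discussed next.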
4. Friendly fire, again friendly fire
Which brings us back to the friendly fire rate we started with. We know that the more often you release, the higher the chance that something goes wrong. So how do you release this frequently and still keep problems rare?
Another problem: we write a lot of deliberately unreadable code for the other side, which does put them under great pressure, but we have to maintain it ourselves. Suppose one day someone says, “Nobody is crawling us anymore, take the code offline.” The person who wrote it has left, and nobody else can read it. How would you know how to take it offline?
I cannot reveal how we handle these two problems, but you are all smart people and can probably work out your own solutions. The software industry has only ever been busy with two things anyway: how to split code apart, and how to put code together.
As for the friendly fire rate, here is just one small tip: you can turn the anti-crawler on without actually intercepting anything, and just have it report statistics back to you. It is like a dry run. Once the statistics look about right and you are confident that switching it on for real will not cause problems, enable interception or start serving fake data.
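The dry-run tip might look roughly like the sketch below. Everything in it is hypothetical: the AntiCrawlerGate class, the enforce switch, and the placeholder detection rule (a passed_js_check field on the request) are stand-ins for whatever detection logic a real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class AntiCrawlerGate:
    # False means dry run: detection runs and is counted, nothing is blocked.
    enforce: bool = False
    stats: dict = field(
        default_factory=lambda: {"seen": 0, "flagged": 0, "blocked": 0}
    )

    def looks_like_crawler(self, request: dict) -> bool:
        # Placeholder rule for illustration only.
        return request.get("passed_js_check") is not True

    def handle(self, request: dict) -> str:
        self.stats["seen"] += 1
        if self.looks_like_crawler(request):
            self.stats["flagged"] += 1
            if self.enforce:
                self.stats["blocked"] += 1
                return "blocked"
        return "served"

    def flagged_rate(self) -> float:
        seen = self.stats["seen"]
        return self.stats["flagged"] / seen if seen else 0.0


gate = AntiCrawlerGate(enforce=False)
for req in [{"passed_js_check": True}, {"passed_js_check": False}]:
    print(gate.handle(req))
print(gate.stats, gate.flagged_rate())
```

In dry-run mode every request is still served, so a bad rule costs nothing; the flagged requests can be reviewed for real users before enforce is ever switched on.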
One consequence: the various channels of a single company are often not equally hard to crawl. The reason is that friendly fire detection is tied to the business, so a company’s infrastructure department can hardly build a generic version; each business department has to do its own. Some departments do it and some don’t. This has produced a bizarre but standard practice in the crawler world: if the PC pages can’t be crawled, try H5. If H5 is too much hassle, try your luck on PC.
3
The current state of crawler and anti-crawler tactics
So what do crawlers do once they discover that the other side is feeding them fake data?
In the early days, people spot-checked data to detect whether the other side was faking it. That means manual checking, and the cost is very high. But that was prehistory. If your company still checks this way, its technology is behind the times.
Here is what our competitors used to do: they crawled everything twice. Once they crawled us properly, with the key they had cracked, and got result A. Once they crawled blindly, without the key, and got result B. By construction, B must be wrong. So if A equals B, they have been tricked. At that point the crawler is stopped and the cracking starts over.
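The double-crawl check described above can be sketched in a few lines. The fetch callable and the two toy servers below are invented for illustration, and the sketch assumes the fake data is deterministic; if the fake values were randomized per request, a simple equality test would not catch them.

```python
from typing import Callable


def looks_poisoned(fetch: Callable[[str, bool], str], item_id: str) -> bool:
    """Fetch the same item with and without the cracked key and compare.

    The keyless result B is known to be fake, so a keyed result A that
    equals B means the keyed path is being served fake data too.
    """
    result_a = fetch(item_id, True)    # proper request, using the cracked key
    result_b = fetch(item_id, False)   # blind request, no key
    return result_a == result_b


# Two hypothetical servers standing in for the site being crawled.
def server_not_yet_aware(item_id: str, with_key: bool) -> str:
    return "price=420" if with_key else "price=999"   # fake only without key


def server_serving_fakes(item_id: str, with_key: bool) -> str:
    return "price=999"                                # fake either way


print(looks_poisoned(server_not_yet_aware, "hotel-1"))
print(looks_poisoned(server_serving_fakes, "hotel-1"))
```

The appeal of the trick is that it needs no manual spot checks: the crawler generates its own known-bad reference on every run.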
1. No response
Someone once published an article about crawlers describing how to crack ours. I keep being asked to respond. I have always felt there was nothing to respond to.
First, it is only natural for anti-crawling to be cracked. There is an all-powerful kind of crawler called the human crawler. Suppose we simply had the money to open a branch office in India and hire cheap labor to click through pages with a mouse all day. What could you do to us? Second, what we really care about are the tactics that come after detection. I read the article and found that the author had just called Selenium, got some results, and concluded that he had succeeded.
I am sure you understand why I do not want to reply. Our priority is doing our job, not slapping each other in the face. If you spend enough time in tech circles, you will find that the people who love to slap others in the face every day are generally not very skilled.
That does not mean we think we are the best in the world. We face a lot of crawlers every day and have met many strong opponents. Just as in wuxia novels, the real masters keep a low profile. They take data quietly and are hard to detect, and their request frequency is so low that it does not affect our metrics. These are masters in both IQ and EQ.
We also ran into a highly efficient crawler that pulled our JS, trimmed away the useless parts, and solved for the key directly, with no wasted requests at all. (Compare that with some crawler tutorials, which always teach you to request a few useless URLs so as not to be detected. I really do not know what they are thinking. It does you no good at all, apart from triggering alerts on the other side and making their engineers work overtime.)
And we only found out about him because he wrote a low-key blog post that was pure technology with nothing superfluous.
I am only grumbling a little here, in the hope that I will stop being asked to respond to articles about crawlers. Offline I know plenty of crawler engineers who are really good and really low-key (how else do you think I know how to deal with crawlers…). We all hang out together, and nobody feels a need to slap anyone in the face.
By the way, if you are interested in this industry, consider contacting HR to join us. Anti-crawler engineers can join Ctrip, and crawler engineers can join Qunar.
2. Evolution
In the early days of our fights with competitors, both sides were rather primitive. Gradually, crawlers upgraded and anti-crawlers upgraded too. This is what we call evolution. We used to go easy on the other side on purpose to slow down their evolution. The results were not particularly satisfying, though: whether a crawler evolves depends on the crawler engineer’s KPI, not on how fast the anti-crawler evolves.
Later, as the fight heated up, the techniques grew more and more exotic. For example, many people mention using canvas fingerprinting in anti-crawling and regard it as the highest level. In fact it is only an auxiliary tool. Canvas fingerprinting relies on the fact that different hardware renders canvas slightly differently: draw a sufficiently complex canvas and the resulting image will differ at the pixel level from machine to machine. Since a crawler’s code is uniform, and even Selenium usually runs on machines cloned from the same Ghost image, its fingerprints are generally all identical, so the chance of slipping through is low.
But! This technique has two inherent flaws. First, the fingerprint’s legitimacy cannot be verified. Sure, you could use asymmetric encryption to vouch for it, but that is far too costly in practice. Second, the collision probability of canvas fingerprints is very high, nowhere near as low as its proponents claim. Perhaps collisions are rarer abroad, where more languages are in use. But domestic companies usually have IT install everything uniformly, so hardware and software are astonishingly consistent. When we tested canvas fingerprinting, we picked more than 20 machines at random inside Ctrip, and every fingerprint was identical, without the slightest difference. So some “advanced techniques” are not practical at all.
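The collision problem can be illustrated without a browser. In the sketch below the byte strings stand in for rendered canvas pixels (a real fingerprint would be computed in the browser from actual canvas output), and the machine counts are invented to echo the 20-machine test described above; the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Sequence


def fingerprint(canvas_pixels: bytes) -> str:
    """Hash rendered canvas bytes into a short fingerprint."""
    return hashlib.sha256(canvas_pixels).hexdigest()[:16]


def collision_rate(fingerprints: Sequence[str]) -> float:
    """Fraction of machines sharing their fingerprint with another machine."""
    if not fingerprints:
        return 0.0
    counts = Counter(fingerprints)
    shared = sum(n for n in counts.values() if n > 1)
    return shared / len(fingerprints)


# Stand-in data: 20 identically provisioned office machines render
# byte-identical output, 3 personal machines differ slightly.
office = [fingerprint(b"render-from-standard-image")] * 20
personal = [fingerprint(b"render-variant-%d" % i) for i in range(3)]

print(len(set(office)), collision_rate(office))
print(collision_rate(office + personal))
```

When every legitimate office machine produces the same fingerprint as its neighbors, “identical fingerprints” stops being evidence of a crawler farm, which is why the technique only works as a supporting signal.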
3. Legal channels
Then there is the question you have probably wondered about: is crawling illegal? Can you sue the other side to make them stop? Our legal department’s answer was straightforward: yes, provided you have evidence. Unfortunately, most crawled data in the world is never published on the crawler’s own site but is used for internal analysis. So even though there are a few crawler lawsuits as precedents, and they were won, they do not help us at all. If the other side is low-key enough, anti-crawling is destined to remain a technical contest.
4. Stirring things up and raising a flag
Toward the end, we were no longer limited to technical sparring. In the anti-crawler code we often planted little Easter eggs for the other side, such as comments addressed to them. Through these exchanges, with both sides releasing frequently, we actually had quite a lively chat.
For example, we would ask: aren’t housing prices in Beijing high? They would reply: “Oppa, I stand on my own two feet.” Then we would go on: did you get your number yet? And so on. This kind of back-and-forth easily shakes the other side’s morale and is very effective. Imagine your crawler engineer working overtime on New Year’s Eve and coming across a message saying that the other side has just received a year-end bonus worth several months’ salary. How far do you think he is from quitting?
Finally, we prepared a big move that we thought would fool the other side for a long time. We even went to a small hot pot restaurant to celebrate ahead of the launch the next day. As you know, raising a flag like this usually ends badly. Five minutes into a two-hour hot pot buffet, we got the news: our company was investing in the competitor. For the next hour or so the atmosphere was awkward and nobody knew what to say. An intern in my group finally worked up the courage to ask me a question:
“Can I stay?”
After all, most of the time technology is at the mercy of capital.
4
The future of crawling and anti-crawling
After making peace with our competitor, we went to visit them and everyone sat down together. It turned out that everyone who had claimed to be a girl online was actually a rough guy, which left us in despair. The only girl present was the one we had brought ourselves (the intern mentioned above). We had been playing tricks on them for so long, and in the end they were the ones who tricked us.
Fortunately, the food and drink were very good, and we had a fairly good time. Then came peacetime. With no war on, the anti-crawler logic stayed in place as a defense, and a whitelist was opened so that partners could crawl. The group chat often sounded like this: XXX, why is your frequency so high? XXX, why didn’t you open this API to me? Why is the data I crawled wrong? Damn, are you blocking me? And so on.
Anti-crawling is actually harder in peacetime than in wartime. In wartime, as long as the friendly fire rate is not too high, the company can accept it. In peacetime we cannot stir things up: if the friendly fire rate rises even a little, someone will shout, “Why aren’t you making money instead of messing around?” Besides, in wartime only intercepting users counts as friendly fire. In peacetime we also have to honor the whitelist, and intercepting a partner counts as friendly fire too. So we are somewhat more conservative. But overall, peacetime is happier. After all, who likes working overtime?
Peace did not last long, however. New competitors soon chose crawlers as their way of fighting us. After all, this is a profit-driven world: as long as there is enough profit, capitalists will stop at nothing, and that is not something we technical people get to decide. We would love a world without crawlers, but what say do we have?
Fortunately, this creates more jobs and raises everyone’s market value, which is a good thing.