This paper shares the Tencent Waterproof Wall team's thinking on machine confrontation from a dynamic perspective, in the hope of offering a starting point to teams currently working on human-machine confrontation, and of helping more small and medium-sized companies free their businesses from the pain of bots and crawlers.
0x00 Preface
The browser, as a major traffic entry point on today's Internet, is becoming more and more powerful. To deliver a better Web experience, new standards are constantly being formulated and implemented, and the appearance of PWA has pushed the mobile H5 experience to another extreme. But the growing use of H5 as a main entry point brings another problem with it: rampant machine behavior. Wherever there is profit there will be abuse: login, registration, and voting pages easily become disaster areas for automated "brushing", and writing a generic voting script today is about as difficult as writing "Hello World!". The Web front end has long been a very weak link in the battle against machines. When the browser pulls all front-end code down locally and executes it completely in the open, what can be done to preserve front-end code security?
0x01 Terminology
Code security
When this article refers to code security, it means the security of front-end JavaScript code. In general, a piece of JavaScript code can be considered safe if it only runs correctly in a normal browser, cannot (or has not) been executed in a non-browser runtime environment, and cannot be translated into an equivalent program in another language.
When an important piece of JavaScript logic can be lifted into another environment and run several orders of magnitude more efficiently than in a normal browser while still producing correct results, it is almost a disaster for the server and the business behind it.
Data protection
In this article, data protection refers to protecting the content carried over HTTP/HTTPS (such as a POST body). HTTP is a text-based protocol: from the client's (browser's) point of view, everything transmitted is visible and semantically rich. If that content is not protected, a malicious user only needs to understand the meaning of the parameters to forge the corresponding request, without ever reading or reversing the front-end JavaScript logic. Note that protection here does not mean transport-level protection such as TLS, but protection of the specific data content carried above the HTTP protocol.
For example, given a query URL such as https://example.com/query?from=shenzhen&destination=beijing, a crawler developer can see how to construct the parameters without reading any JavaScript. But if the request takes a form such as https://example.com/query?params=ZnJvbT1zaGVuemhlbiZkZXN0aW5hdGlvbj1iZWlqaW5n, a malicious user cannot construct the parameters just by observing them; they have to read or reverse the JavaScript code to learn how. In this way, the purpose of data protection is achieved.
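For concreteness, the opaque `params` value in the second URL is simply the Base64 encoding of the readable query string. This is a minimal illustration of "not immediately readable" only; Base64 is trivially reversible and provides no real protection by itself.

```javascript
// The opaque `params` value above is just Base64 over the readable
// query string. Base64 here only illustrates "not obvious at a
// glance"; it is trivially reversible and is not real protection.
const plain = 'from=shenzhen&destination=beijing';
const params = btoa(plain);
console.log(params); // ZnJvbT1zaGVuemhlbiZkZXN0aW5hdGlvbj1iZWlqaW5n

// Decoding recovers the original parameters:
const decoded = atob(params);
console.log(decoded); // from=shenzhen&destination=beijing
```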
Obfuscation
Through string substitution rules or abstract syntax tree transformations, a piece of code is turned into an equivalent but unreadable piece of code, thereby protecting the original code. This process is usually irreversible.
For example:
```javascript
function foo() {
  console.log('hello world!');
}
foo();
```
is transformed into:

```javascript
var a = 'console', b = 'log', c = 'hello', d = ' world!';
function e() {
  window[a][b](c + d);
}
e();
```
This reduces the readability of the code and is one of the simplest ways to obfuscate.
Some articles found through search engines confuse code compression with obfuscation. Tools like UglifyJS can compress code into a less readable form,
but after the browser's powerful formatter pretty-prints it, the logic is still plainly visible.
Compression tools do little to protect code: they just shorten variable names, strip whitespace, and remove unused code. These tools are designed for optimization, not protection. To further protect front-end code, you need a code obfuscation tool.
0x02 Conventional Schemes and Their Defects
1. Protecting data with reversible transformations
The conventional approach to data protection is to design a reversible transformation function f. The data d submitted by the browser to the server is processed by f to obtain the transformed data d':

d' = f(d)

After d' reaches the server, the original data d is recovered with the inverse function f⁻¹:

d = f⁻¹(d')
As long as malicious users do not know the steps of f, they cannot construct a valid d'. Here f can be a data-processing algorithm or an encryption algorithm. However, since the steps of f are fixed and the algorithm ultimately executes in the browser, even a proprietary transform will eventually be reversed. Node.js is now mature enough that, if f is not protected by any obfuscation tool, a malicious user can lift the core logic straight out of the front-end JavaScript code and, at very little cost, write a cracking tool that runs on Node.js.
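To make the lifting problem concrete, here is a hypothetical sketch (the transform and its parameter are invented for illustration): a toy f that shifts character codes by p and Base64-encodes the result contains nothing browser-specific, so once extracted it runs unchanged under Node.js, and the inverse is trivial to write.

```javascript
// Toy transform f(d, p): shift each char code by p, then Base64-encode.
// Nothing here is browser-specific, so lifted code runs as-is in Node.
function f(d, p) {
  const shifted = d.split('').map(c => String.fromCharCode(c.charCodeAt(0) + p)).join('');
  return btoa(shifted);
}

// The inverse an attacker writes once f and p are lifted out of the page.
function fInverse(e, p) {
  return atob(e).split('').map(c => String.fromCharCode(c.charCodeAt(0) - p)).join('');
}

const d = 'from=shenzhen';
const dPrime = f(d, 7);           // what the browser would submit
console.log(fInverse(dPrime, 7)); // from=shenzhen
```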
In particular, even if f takes an extra parameter p, so that the steps look like d' = f(d, p), changing p does not improve the security of f: a malicious user can lift the transformation function f and obtain the parameter p, and in unobfuscated code p is easy to extract with tooling.
Therefore, data transformation alone can hardly protect data well. Does additionally obfuscating the code protect it?
2. Protecting code with obfuscation
At present there are not many publicly available obfuscation tools; the common ones are:
- Jscrambler (commercial)
- javascript-obfuscator (open source)
Jscrambler, the commercial product, obfuscates well, but its payment plans are expensive. javascript-obfuscator is a popular open-source obfuscator, but its obfuscation is less satisfactory. To verify javascript-obfuscator's effect, taking obfuscated strings as an example, I wrote a simple script that automatically restores strings obfuscated by javascript-obfuscator; the source is at: https://github.com/conanliu/de-js-obfuscator
With this tool, the cost of reversing obfuscated strings is almost zero. Strings are the easy part, but they are a starting point, and reversing the logic is only a matter of time. Plain obfuscation protects business logic only for a while; after that the code is no longer secure. At javascript-obfuscator's obfuscation strength, "a while" is usually no more than a week. If a page carries a highly profitable business that attracts abuse, then even with its JavaScript obfuscated by javascript-obfuscator, most of the key logic has probably been reversed within a week of launch. Once the key logic is reversed, a brushing tool follows quickly, and the business is at risk of being brushed.
For a normal business, changing the data-protection logic in JavaScript once a month is already quite frequent. If the logic has to change weekly to resist cracking well, the cost of that resistance is too high. Is there a long-term mechanism that secures front-end code without excessive cost? This paper tries to explore a new mode of human-machine confrontation from a dynamic point of view.
0x03 Dynamic scheme introduction
Suppose we have 5 data transformation functions f1, f2, f3, f4, and f5. For each request we randomly select two of them, fx and fy, and randomly select a separator s. The real data d is randomly split into d1 and d2, and the final submitted data is

d' = combine(fx(d1), s, fy(d2))

After d' reaches the server, the server splits it back into a pair:

(d'1, d'2) = split(d', s)

d'1 and d'2 are then processed with the inverse functions of fx and fy, finally recovering the original data:

d1 = f⁻¹x(d'1)
d2 = f⁻¹y(d'2)
d = d1 + d2
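A minimal sketch of this combine/split scheme (the transform pool, function bodies, and separator handling are invented for illustration; in a real deployment a different randomly chosen pair would be compiled into each served JS file):

```javascript
// Pool of reversible transforms and their inverses (toy examples only).
const transforms = [
  { f: d => btoa(d),                              inv: e => atob(e) },
  { f: d => d.split('').reverse().join(''),       inv: e => e.split('').reverse().join('') },
  { f: d => btoa(d).split('').reverse().join(''), inv: e => atob(e.split('').reverse().join('')) },
];

// Client side: split d at a random cut point, apply fx and fy,
// and join the pieces with the separator s: combine(fx(d1), s, fy(d2)).
function protect(d, x, y, s) {
  const cut = 1 + Math.floor(Math.random() * (d.length - 1));
  const d1 = d.slice(0, cut);
  const d2 = d.slice(cut);
  return transforms[x].f(d1) + s + transforms[y].f(d2);
}

// Server side: split on s, then apply the inverses of fx and fy.
function recover(dPrime, x, y, s) {
  const [e1, e2] = dPrime.split(s);
  return transforms[x].inv(e1) + transforms[y].inv(e2);
}

const d = 'from=shenzhen&destination=beijing';
const dPrime = protect(d, 0, 1, '|');
console.log(recover(dPrime, 0, 1, '|') === d); // true
```

Note the separator must be chosen so it cannot appear in the transformed fragments; here `|` is safe because neither Base64 output nor the sample data contains it.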
Although the difficulty of a single crack is still roughly T ≈ 1 week, the algorithm combination differs for every request, so a single crack does not transfer to subsequent requests. The time cost of reversing the logic and scripting it therefore theoretically grows exponentially, and malicious users will eventually switch to emulators (automated browsers), which are easier to use given the reversing cost. Emulator confrontation is not discussed in this article. What is clear, however, is that emulators are easier to combat than bespoke automated scripts; and since running an emulator consumes far more resources than running a script, the attacker's costs rise until the economics of abuse no longer work out.
Although this dynamic scheme sounds feasible, it runs into many problems in practical engineering:
- How to identify which combination of functions a request used?
- How to balance page performance?
- How to deal with slow JS compilation?
- Is obfuscation still necessary?
Next, we explore how to solve these problems one by one in engineering.
0x04 Exploration of engineering problems
1. How to identify the function combination of a request?
After random combination, a user may receive different JS each time, so an identifier is needed to tell the server which functions fx and fy were used. A naive solution is to encode identifiers x and y in plaintext directly into the JS alongside the transformation functions fx and fy, and submit x and y together with d'. But such identifiers are easy to extract from the JS file with a few regular rules: a malicious user can enumerate all transformation functions and their corresponding logic, then combine them according to the matched identifiers.
A more rigorous approach is to generate a signature string when compiling the JS file and compile that signature into the file as a variable. The signature is submitted to the server along with the generated data d'; the server applies f⁻¹sig(signature) to recover the key parameters needed to decrypt d', and then decrypts it. The signature generation algorithm can be expressed as follows:
signature = fsig(x, y, s, random, timestamp)
where x identifies the first transformation function, y identifies the second, s is the separator, random is a random number, and timestamp is when the signature was generated. The random number makes every generated signature different, while the timestamp lets the server judge the freshness of the JS file behind a signature and, to some extent, intercept replay attacks.
2. How to weigh page performance?
Front-end page performance is an unavoidable concern for Web applications, and a common, effective optimization is caching the page's resource files. Typically, for a well-modularized project packaged with mature tooling, the cache policy for the entry HTML is Cache-Control: no-cache, while resource files such as JS/CSS/images are cached for a long time. But once the JS file responsible for data protection contains dynamically generated logic, it can no longer be cached; otherwise, a mishandled cache lifetime leads to all kinds of data decryption failures.
Normally, in a human-machine confrontation scenario, the page does not perform verification on every request; in other words, the JavaScript responsible for human-machine verification is not fetched many times by a normal user. Cache-based optimizations can therefore be skipped for this code. Ideally, a user touches the verification logic only once per visit, so the priority is the first-load experience, and repeat-load performance can be sacrificed.
The suggested solution is to separate the data-protection logic from the rest of the project's JavaScript and either inline it directly into the HTML page at compile time, or compile it into a separate JS file served with a Cache-Control: no-cache response header. This JS can communicate with the rest of the code via global variables, postMessage, and so on.
3. How to solve the problem of slow JS compilation?
There are many front-end build tools, such as gulp, webpack, and Rollup. Each has its strengths and many compilation optimizations, but none of them can currently finish packaging within a millisecond-level response window, so compilation and packaging must happen asynchronously. How, then, do we generate enough JS variants to satisfy both normal access and confrontation? A relatively simple scheme is to run the compilation script in a loop, replacing the served file after each compile. Users may receive the same JS within a short window, but as old JS is continually replaced by freshly compiled JS, the JS users receive over time can be considered random, with the rotation interval depending on compilation speed.
Beyond the simple scheme, here is a more flexible one: compile into a cache pool and serve randomly. First, the security-related JS files are separated from the static assets and served by a back-end Web server, which maintains an array of fixed length. Each time the build tool finishes compiling a JS file, it sends the content to the Web server, which fills it into the array in sequence. When a user's page loads, the browser requests the JS from the Web server, which randomly picks one element of the array and returns it to the browser.
Beyond guaranteeing the randomness of the secure JS, signature generation can also move into the Web server. The build tool sends the compiled file's meta-information to the Web server and generates no signature itself. When a user requests the JS, the server generates a signature in real time from that meta-information and fills it into the JS content. Signatures generated this way are independent of one another, making it easy to identify and intercept replay requests by counting how many times a signature has been used.
4. Is obfuscation still necessary?
Now that everything is randomly dynamic, is obfuscation still needed? The answer is yes. Although fx and fy differ every time, two different transformation functions inevitably carry their own signatures. Suppose, for example, that fx and fy are implemented in JavaScript as follows:
```javascript
function foo(x) {
  x = String.fromCharCode.apply(null, x.split('').map(i => i.charCodeAt(0) + 23));
  return btoa(x);
}

function bar(y) {
  y = String.fromCharCode.apply(null, y.split('').reverse().map(i => i.charCodeAt(0) + 13));
  return btoa(y);
}
```
In this example, 23, reverse, and 13 are signatures of foo and bar. If the JS served for a request contains reverse and 13, it most likely used the bar transform; if it contains 23, most likely foo. Through such feature detection it is easy to discover which combination of transforms a requested JS uses, and the detection needs nothing more complicated than a few regular expressions.
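That feature detection can be sketched as a few regular expressions over the served JS source (the patterns below match the toy foo/bar pair above; real fingerprints would be derived from each transform's own constants and idioms):

```javascript
// Fingerprint each transform by constants and idioms unique to it.
const fingerprints = [
  { name: 'foo', pattern: /charCodeAt\(0\)\s*\+\s*23/ },
  { name: 'bar', pattern: /reverse\(\)[\s\S]*charCodeAt\(0\)\s*\+\s*13/ },
];

// Return the names of all transforms whose fingerprint appears.
function detectTransforms(jsSource) {
  return fingerprints.filter(fp => fp.pattern.test(jsSource)).map(fp => fp.name);
}

const served = "y.split('').reverse().map(i => i.charCodeAt(0) + 13)";
console.log(detectTransforms(served)); // ['bar']
```

This is exactly why the dynamically combined code still needs obfuscation: without it, a one-time scan like this defeats the randomness.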
0x05 Summary
This paper analyzed the shortcomings of conventional data protection and obfuscation, presented a dynamic approach to countering machine behavior, and shared some thoughts on the engineering involved. The road of human-machine confrontation is difficult and long, and it is a business security problem that will be with us for a long time. Hopefully this dynamic thinking offers some inspiration to teams currently working on human-machine confrontation, and helps more small and medium-sized companies get their businesses out of the pain of bots and crawlers. In addition, the Tencent Waterproof Wall team has deep experience in machine confrontation; to try the results directly, visit: https://007.qq.com