Abstract: With the arrival of the cloud-native era, microservices have become the mainstream application architecture. Nacos has become the preferred registry and configuration center for microservices in China thanks to its core strengths: ease of use, stability and reliability, and excellent performance. Nacos 2.0 pushes performance further, so that users whose business is growing rapidly do not have to worry about performance issues. At the same time, Alibaba Cloud MSE provides a hosted Nacos 2.0 service that can be enabled with one click, giving users access to the microservice capabilities Alibaba has built up over ten years.
The author | wind qing
Preface
MSE released version 1.1.3 of the Nacos engine in January 2020, which supports using Nacos as a registry in a fully hosted public cloud environment. Nacos 1.2.1, released in July 2020, supports metadata and configuration data management and enables microservice applications to modify configuration information and routing rules dynamically at runtime. As users continued to use the Nacos 1.x versions, performance problems became increasingly apparent. By reworking the 1.x kernel, the Nacos 2.0 Professional Edition improves performance by about 10 times, which basically meets users' performance requirements in microservice scenarios.
In addition to improved performance, the Professional Edition offers stronger SLA guarantees and greater security for configuration data, and connects to the Istio ecosystem through the MCP protocol, serving as the registry for Istio.
MSE Nacos 1.x base architecture
The 1.x architecture can be roughly divided into five layers: the access layer, the communication layer, the function layer, the synchronization layer, and the persistence layer.
• Users access Nacos through the access layer, for example via the SDK, SCA (Spring Cloud Alibaba), Dubbo, or the console. Nacos also provides an Open API over HTTP.
• The communication layer includes HTTP and UDP. Nacos communicates mainly over HTTP; UDP is used for a small part of the service push function.
• The function layer currently has two parts, Naming and Config, which provide service discovery and configuration management respectively.
• The synchronization layer consists of the Distro protocol in AP mode (service registration), the Raft protocol in CP mode (service meta-information), and Notify synchronization for configuration notifications.
• The persistence layer uses MySQL, Derby, and local files. Configuration data, user information, and permission data are stored in MySQL or Derby, while persistent service data is stored in local files.
Problems with the MSE Nacos 1.x architecture
The current 1.x architecture has several problems:
• Each service instance is renewed through its own heartbeat. In the Dubbo scenario, each interface corresponds to a service, so when an application exposes many Dubbo interfaces, the TPS required for heartbeat renewal is high.
• Heartbeat renewal is slow to detect failures: an instance can be removed only after the renewal timeout is reached, which generally takes 15 s, so timeliness is poor.
• Change data pushed over UDP is unreliable, so clients must reconcile data periodically to ensure correctness. The many resulting redundant queries drive up the overall QPS of the service.
• Communication is based on short-lived HTTP connections, and connections released on the Nacos side enter the TIME_WAIT state. When QPS is high, there is a risk of running out of connections and reporting errors. Introducing an HTTP connection pool in the SDK alleviates the problem but does not eliminate it.
• The long-polling mode used by the configuration module causes data to be promoted to the JVM old generation, where memory is repeatedly allocated and freed, causing frequent CMS GC.
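To make the heartbeat cost concrete, here is a back-of-the-envelope sketch of the renewal load. The 5-second heartbeat interval and the pod and interface counts are illustrative assumptions, not figures measured in this article; the per-connection variant is included only for comparison with the long-connection model that the 2.0 architecture adopts.

```java
class HeartbeatLoadSketch {
    // 1.x model: every registered instance sends its own heartbeat.
    // With Dubbo, each interface is a separate service, so one pod
    // contributes one instance per interface.
    static long instanceLevelTps(long pods, long interfacesPerPod, long intervalSeconds) {
        return pods * interfacesPerPod / intervalSeconds;
    }

    // For comparison: renewal once per connection (one connection per pod),
    // so the number of interfaces no longer matters.
    static long connectionLevelTps(long pods, long intervalSeconds) {
        return pods / intervalSeconds;
    }

    public static void main(String[] args) {
        long pods = 1_000;      // assumed number of application pods
        long interfaces = 100;  // assumed Dubbo interfaces per pod
        long interval = 5;      // assumed heartbeat interval in seconds
        System.out.println("instance-level heartbeats/s:   "
                + instanceLevelTps(pods, interfaces, interval));
        System.out.println("connection-level heartbeats/s: "
                + connectionLevelTps(pods, interval));
    }
}
```

Under these assumptions the instance-level model needs 20,000 heartbeat requests per second, while a per-connection renewal needs 200.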
MSE Nacos 2.0 Professional Edition architecture and new model
The core problem of the 1.x architecture lies in its connection model. The 2.0 architecture moves to a long-connection model: in the communication layer, gRPC and RSocket are used to implement long-connection data transmission and push.
Issues addressed by the 2.0 architecture:
• An application pod renews its heartbeat at the long-connection level rather than at the instance level, which greatly reduces duplicate requests.
• Instances can be removed quickly when a long connection breaks, without waiting for a renewal timeout.
• The NIO streaming push mechanism is more reliable than UDP and reduces how often applications need to reconcile data.
• There is no overhead from repeatedly creating connections, which significantly reduces TIME_WAIT problems.
• Long connections also solve the CMS GC problem caused by long polling in the configuration module.
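The idea of tying instances to a long connection can be illustrated with a small in-memory sketch. This is not the Nacos server implementation; the class name, method names, and instance strings are hypothetical and only show why a broken connection allows immediate removal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConnectionRegistrySketch {
    // connection id -> instances registered over that connection
    private final Map<String, List<String>> instancesByConnection = new HashMap<>();

    void register(String connectionId, String instance) {
        instancesByConnection
                .computeIfAbsent(connectionId, id -> new ArrayList<>())
                .add(instance);
    }

    // Called when a long connection breaks: all instances owned by that
    // connection are dropped at once, with no renewal timeout to wait for.
    // Returns the number of instances removed.
    int onDisconnect(String connectionId) {
        List<String> removed = instancesByConnection.remove(connectionId);
        return removed == null ? 0 : removed.size();
    }

    int instanceCount() {
        return instancesByConnection.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        ConnectionRegistrySketch registry = new ConnectionRegistrySketch();
        registry.register("pod-a", "com.example.OrderService@10.0.0.1:20880");
        registry.register("pod-a", "com.example.UserService@10.0.0.1:20880");
        registry.register("pod-b", "com.example.OrderService@10.0.0.2:20880");
        registry.onDisconnect("pod-a");
        System.out.println(registry.instanceCount()); // prints 1
    }
}
```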
Problems with the 2.0 architecture:
• Long connections over gRPC are based on HTTP/2 streams, which are less observable and less easy to use than the HTTP Open API.
Overall, the 2.0 architecture reduces resource overhead and improves system throughput, yielding significant performance gains at the cost of increased complexity.
MSE Nacos 2.0 Professional Edition performance
Nacos is divided into a service discovery module and a configuration management module. The performance test of the service discovery scenario is described first.
The test uses 200 load generators. Each load generator simulates 500 clients, and each client registers 5 services and subscribes to 5 services, which yields up to 100,000 long connections, 500,000 service instances, and the corresponding subscription pressure.
The service discovery stress test covers two states, a changing state and a stable state:
• Changing state: During the ramp-up phase, a large number of connections register and subscribe to services on Nacos. Pressure on the server is relatively high in this phase, and the key question is whether all registrations and subscriptions eventually succeed.
• Stable state: Once the load generators' requests have succeeded, the system enters a stable state in which the client and server only need to maintain the long-connection heartbeat. Pressure on the server is relatively low at this stage. If the pressure in the changing state is too high, requests time out and connections drop, and the server cannot reach the stable state.
For service discovery, a lower-version instance on MSE is also upgraded so that the performance curves before and after the upgrade can be compared directly.
In real use, the configuration management module is a read-heavy, write-light scenario. The main bottleneck is single-node performance, so the stress test focuses on the read performance and the number of connections a single node can support.
The test uses 200 load generators. Each load generator simulates 200 clients, and each client subscribes to 200 configurations, issuing configuration subscription and configuration read requests.
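The scale of both stress-test scenarios follows from simple multiplication of the figures given above. The sketch below only restates that arithmetic; the class and method names are made up for illustration, and it assumes one long connection per simulated client.

```java
class LoadTestScaleSketch {
    static long clients(long generators, long clientsPerGenerator) {
        return generators * clientsPerGenerator;
    }

    static long total(long clients, long itemsPerClient) {
        return clients * itemsPerClient;
    }

    public static void main(String[] args) {
        // Service discovery: 200 generators x 500 clients, 5 services each
        long discoveryClients = clients(200, 500);
        System.out.println("long connections:  " + discoveryClients);
        System.out.println("service instances: " + total(discoveryClients, 5));

        // Configuration management: 200 generators x 200 clients, 200 configs each
        long configClients = clients(200, 200);
        System.out.println("config clients:       " + configClients);
        System.out.println("config subscriptions: " + total(configClients, 200));
    }
}
```

The service discovery scenario therefore reaches 100,000 connections and 500,000 instances, and the configuration scenario reaches 40,000 clients holding 8,000,000 configuration subscriptions in total.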
The service discovery comparison covers the Basic Edition and the Professional Edition at the 2C4G, 4C8G, and 8C16G specifications.
The stable TPS and instance counts in this comparison are the levels at which the service can run reliably with high availability, roughly half to two-thirds of the maximum; the maximum is the level a single node can reach without going down.
The scale supported in stable operation increases by about 7 times, and the maximum supported scale increases by 7 to 10 times.
Another scenario compares a 3-node 2C4G MSE Nacos cluster before and after the upgrade, in three stages:
• In the first stage, the clients use version 1.x and MSE Nacos runs the Basic Edition. The number of instances rises from 0 to 6,000, then to 10,000, and finally to 14,000, beyond which it cannot be increased. Server CPU reaches 80-90% and the clients keep reporting errors, after which the number of instances is reduced to 6,000.
• In the second stage, MSE Nacos is upgraded from the Basic Edition to the Professional Edition. The number of instances again cannot be increased beyond 14,000, and the stress-test performance curve shows little difference.
• In the third stage, with the number of instances held at 14,000, the clients are upgraded to version 2.0 in batches. The CPU curve falls steadily to about 20%, and the whole system remains stable without errors.
The performance curves before and after the upgrade show that MSE Nacos 2.0 Professional Edition improves performance greatly. In the final overall test, compared with the Basic Edition, the Professional Edition showed a 10-fold improvement in service discovery performance and a 7-fold improvement in configuration management.
Smooth upgrade to MSE Nacos Professional Edition
New users can create a Professional Edition instance directly. Existing users can upgrade by clicking the "Instance Change" button in MSE, and MSE upgrades the pods in the background. Because the V1 and V2 data structures differ, Nacos double-writes data by default at first: during the upgrade, data is synchronized from V1 to V2, and after the upgrade, data is synchronized from V2 to V1.
The SLB also exposes the additional gRPC port 9848. At this point the application SDK can be upgraded from version 1.x to version 2.0, so that both client and server run on the 2.0 architecture.
The overall compatibility principle is that a higher-version server is compatible with lower-version clients, but a higher-version client may not be able to access a lower-version server:
• The 1.x client can access both the Basic Edition and the Professional Edition.
• The 2.0 client can access the Professional Edition, but not the Basic Edition.
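The compatibility rules above can be written down as a small lookup, which may be handy as a pre-upgrade check. This is a sketch of the rule as stated in this article only; the enum and method names are hypothetical and not part of any Nacos or MSE API.

```java
class CompatibilitySketch {
    enum ServerEdition { BASIC, PROFESSIONAL }

    // Returns true if a client of the given major version can access the server edition.
    static boolean canAccess(int clientMajorVersion, ServerEdition server) {
        if (clientMajorVersion <= 1) {
            return true; // 1.x clients work with both editions
        }
        return server == ServerEdition.PROFESSIONAL; // 2.0 clients need the Professional Edition
    }

    public static void main(String[] args) {
        for (int client = 1; client <= 2; client++) {
            for (ServerEdition edition : ServerEdition.values()) {
                System.out.println("client " + client + ".x -> " + edition
                        + ": " + canAccess(client, edition));
            }
        }
    }
}
```

In practice this means the server should be upgraded to the Professional Edition before any client moves to the 2.0 SDK.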
Nacos configuration security management
In the previous installment, Daofeng explained configuration permission control: MSE Nacos implements permission control through the Alibaba Cloud RAM primary account and sub-account system. This installment focuses on the configuration encryption feature of Nacos.
When using configuration data, users may store sensitive information such as user information and database passwords in Nacos. However, Nacos transmits and stores configuration data in plaintext, so sensitive configuration items can leak if the database is compromised or if packets are captured at the transport layer, and the overall security risk is very high.
The commonly used HTTPS protocol secures transmission but not storage. Here, data is encrypted directly on the client, so it stays encrypted both in transit and at rest.
A third-party encryption system (such as Alibaba Cloud KMS) is used to strengthen encryption security. Symmetric encryption (the AES algorithm) is used so that encryption is fast. Because the key is transmitted together with the ciphertext, the key itself is also encrypted, giving two levels of encryption overall.
When the SDK publishes data, it first obtains a plaintext key and the corresponding encrypted key from KMS, encrypts the data with the plaintext key, and then sends the encrypted data and the encrypted key to Nacos for storage. When the SDK reads data, it obtains the encrypted data and the encrypted key from Nacos, uses the encrypted key to obtain the plaintext key from KMS, and then decrypts the encrypted data with the plaintext key to recover the plaintext. This addresses data security in both transmission and storage.
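The publish and read flow described above is a form of two-level (envelope) encryption, and it can be sketched with the JDK's own crypto classes. This is a minimal illustration under stated assumptions, not the MSE plug-in: a locally generated master key stands in for the key held by KMS, AES-GCM is chosen here as the AES mode (the article only says AES), and all class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class EnvelopeEncryptionSketch {
    private static final int IV_BYTES = 12;
    private static final SecureRandom RANDOM = new SecureRandom();

    static SecretKey newAesKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts with AES-GCM and returns the random IV followed by the ciphertext.
    static byte[] encrypt(SecretKey key, byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(plaintext);
            byte[] out = Arrays.copyOf(iv, IV_BYTES + ciphertext.length);
            System.arraycopy(ciphertext, 0, out, IV_BYTES, ciphertext.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(): reads the IV from the first 12 bytes, then decrypts the rest.
    static byte[] decrypt(SecretKey key, byte[] blob) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, blob, 0, IV_BYTES));
            return cipher.doFinal(blob, IV_BYTES, blob.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static String publishThenRead(String config) {
        SecretKey masterKey = newAesKey(); // stand-in for the key that never leaves KMS

        // Publish: get a data key, encrypt the config with it, and wrap the data key.
        SecretKey dataKey = newAesKey();
        byte[] encryptedDataKey = encrypt(masterKey, dataKey.getEncoded());
        byte[] encryptedConfig = encrypt(dataKey, config.getBytes(StandardCharsets.UTF_8));
        // Only encryptedConfig and encryptedDataKey would be sent to and stored in Nacos.

        // Read: unwrap the data key (the KMS step), then decrypt the config.
        SecretKey recovered = new SecretKeySpec(decrypt(masterKey, encryptedDataKey), "AES");
        return new String(decrypt(recovered, encryptedConfig), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(publishThenRead("db.password=example"));
    }
}
```

The point of the second level is that the stored data key is useless without access to the master key, so a database leak alone does not expose the configuration.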
To remain compatible with the old logic, and because only sensitive data needs to be encrypted, Nacos encrypts only data whose dataId carries a fixed prefix. On the open-source side this is implemented through an SPI plug-in, so users can extend it themselves.
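A prefix-based switch of this kind might look like the following sketch. The "cipher-" prefix is an assumption inferred from the example dataId used later in this article (cipher-kms-aes-256-dataid), and the pluggable cipher is modeled as a plain function rather than the real Nacos SPI interface; the placeholder cipher in main only marks the content and performs no real encryption.

```java
import java.util.function.UnaryOperator;

class CipherPrefixSketch {
    // Assumed prefix, inferred from the example dataId in this article.
    static final String CIPHER_PREFIX = "cipher-";

    static boolean needsEncryption(String dataId) {
        return dataId != null && dataId.startsWith(CIPHER_PREFIX);
    }

    // Applies the plug-in cipher only to prefixed dataIds; other content passes
    // through unchanged, which keeps existing plaintext configurations working.
    static String prepareForPublish(String dataId, String content, UnaryOperator<String> cipher) {
        return needsEncryption(dataId) ? cipher.apply(content) : content;
    }

    public static void main(String[] args) {
        UnaryOperator<String> placeholderCipher = s -> "ENC(" + s + ")"; // not real encryption
        System.out.println(prepareForPublish("cipher-kms-aes-256-dataid", "pw", placeholderCipher));
        System.out.println(prepareForPublish("application.properties", "pw", placeholderCipher));
    }
}
```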
Users can encrypt and decrypt sensitive data through the SDK and the MSE console. Both the SDK and the MSE console first access KMS and then store the configuration data encrypted; when reading, they decrypt the data and display the plaintext, so the workflow is the same as it was with plaintext storage.
To use encryption and decryption through the SDK, users need SDK version 1.4.2 or later and must add the nacos-client-mse-extension encryption and decryption plug-in implemented by MSE.
<dependency>
    <groupId>com.alibaba.nacos</groupId>
    <artifactId>nacos-client-mse-extension</artifactId>
    <version>1.0.1</version>
</dependency>
When initializing the SDK, fill in the sub-account AK/SK and grant that account KMS encryption and decryption permissions. For details, refer to the documentation on creating and using configuration encryption.
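import java.util.Properties;
import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.config.ConfigService;

// Connection and KMS settings; replace the placeholder values with your own.
Properties properties = new Properties();
properties.put("serverAddr", "mse-xxxxxx-p.nacos-ans.mse.aliyuncs.com");
properties.put("accessKey", "xxxxxxxxxxxxxx");
properties.put("secretKey", "xxxxxxxxxxxxxx");
properties.put("keyId", "alias/acs/mse");
properties.put("regionId", "cn-hangzhou");
ConfigService configService = NacosFactory.createConfigService(properties);
// Read an encrypted configuration from group "group" with a 6000 ms timeout.
String content = configService.getConfig("cipher-kms-aes-256-dataid", "group", 6000);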
Conclusion
MSE Nacos 2.0 Professional Edition offers significant performance, usability, and security improvements over the Basic Edition. The Basic Edition is recommended for testing and the Professional Edition for production. For sensitive configuration such as user identities and passwords, it is recommended to enable permission control and store the data encrypted to strengthen data security.
This article is original content from Aliyun (Alibaba Cloud) and may not be reproduced without permission.