Questions to keep in mind before reading the source

  1. How does the server start? When the server starts, it binds the corresponding RequestHandlers; knowing how the server starts makes it easy to trace how a request is processed after it is received
  2. How does the client start? Presumably it establishes a connection and starts some background tasks
  3. How do the client and server communicate? How many connections does each client open? Long connections or short connections? What happens if the server changes during operation (restart, outage, etc.)?
  4. What is the configuration push process? What happens if it fails? Does the client wait while the server retries? If a server fails, does the client switch to another server?
  5. What is the service registration process?
  6. How is the data stored?

Server-side startup

  1. Functionality: Nacos supports both a configuration center and a registry, and the registry decides whether to use AP mode or CP mode based on whether the registered node is ephemeral. In this respect it is quite powerful;
  2. Operations: Nacos supports deploying the configuration center and the registry independently, or running both in the same process, depending on your needs. In this respect deployment is quite flexible;

Both the configuration center and the registry in Nacos are Spring Boot projects, which makes them easy to deploy. From my perspective, however, there are some drawbacks, such as a rather scattered startup process. Reading the code, we find that the core functionality of Nacos is triggered by @PostConstruct annotations, which are spread across different modules and packages. There is nothing wrong with using Spring's features to trigger the startup of Nacos's core functionality, but it is inconvenient for anyone who wants a quick view of the startup process. Because these annotations are scattered, understanding the startup process requires knowing which @PostConstruct annotations exist and then understanding the logic behind each one in order to form an overall picture.
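
As a minimal illustration (not Nacos source code; the class and method names here are made up), a Spring @PostConstruct method kicks off a module's startup logic as soon as the bean is constructed, which is exactly why the overall startup flow is hard to follow when such methods are scattered across modules:

```java
import javax.annotation.PostConstruct;
import org.springframework.stereotype.Component;

@Component
public class ExampleRpcServer {

    // Runs automatically once the Spring bean has been constructed; in Nacos,
    // methods like this start servers or register handlers for their module.
    @PostConstruct
    public void start() {
        System.out.println("binding request handlers and starting the RPC server...");
    }
}
```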

Below is a list of @PostConstruct annotations for different modules in 2.0.0-alpha.2

1. console      com.alibaba.nacos.console.config.ConsoleConfig 

2. cmdb         com.alibaba.nacos.cmdb.memory.CmdbProvider  

3. config       com.alibaba.nacos.config.server.auth.ExternalPermissionPersistServiceImpl
4. config       com.alibaba.nacos.config.server.auth.ExternalRolePersistServiceImpl
5. config       com.alibaba.nacos.config.server.auth.ExternalUserPersistServiceImpl
6. config       com.alibaba.nacos.config.server.controller.HealthController
7. config       com.alibaba.nacos.config.server.filter.CurcuitFilter
8. config       com.alibaba.nacos.config.server.service.capacity.CapacityService
9.  config       com.alibaba.nacos.config.server.service.capacity.GroupCapacityPersistService
10. config      com.alibaba.nacos.config.server.service.capacity.TenantCapacityPersistService
11. config      com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl  
12. config      com.alibaba.nacos.config.server.service.dump.EmbeddedDumpService   
13. config      com.alibaba.nacos.config.server.service.dump.ExternalDumpService
14. config      com.alibaba.nacos.config.server.service.repository.embedded.EmbeddedStoragePersistServiceImpl
15. config      com.alibaba.nacos.config.server.service.repository.embedded.StandaloneDatabaseOperateImpl
16. config      com.alibaba.nacos.config.server.service.repository.extrnal.ExternalStoragePersistServiceImpl

17. core        com.alibaba.nacos.core.cluster.remote.ClusterRpcClientProxy
18. core        com.alibaba.nacos.core.remote.AbstractRequestFilter
19. core        com.alibaba.nacos.core.remote.BaseRpcServer
20. core        com.alibaba.nacos.core.remote.ClientConnectionEventListener
21. core        com.alibaba.nacos.core.remote.ConnectionManager

22. istio       com.alibaba.nacos.istio.mcp.NacosMcpServer
23. istio       com.alibaba.nacos.istio.mcp.NacosMcpService

24. naming      com.alibaba.nacos.naming.cluster.ServerListManager
25. naming      com.alibaba.nacos.naming.cluster.ServerStatusManager
26. naming      com.alibaba.nacos.naming.consistency.ephemeral.distro.DistroConsistencyServiceImpl
27. naming      com.alibaba.nacos.naming.consistency.ephemeral.distro.DistroHttpRegistry
28. naming      com.alibaba.nacos.naming.consistency.ephemeral.distro.v2.DistroClientComponentRegistry
29. naming      com.alibaba.nacos.naming.consistency.persistent.raft.RaftCore
30. naming      com.alibaba.nacos.naming.consistency.persistent.raft.RaftPeerSet
31. naming      com.alibaba.nacos.naming.consistency.persistent.raft.RaftConsistencyServiceImpl
32. naming      com.alibaba.nacos.naming.core.DistroMapper
33. naming      com.alibaba.nacos.naming.core.ServiceManager
34. naming      com.alibaba.nacos.naming.misc.GlobalConfig
35. naming      com.alibaba.nacos.naming.misc.SwitchManager
36. naming      com.alibaba.nacos.naming.monitor.PerformanceLoggerThread
Putting these together, the rough server startup flow is:
  1. Load the configuration file
  2. Start the log handler
  3. Start the ServerMemberManager
  4. Start the MemberLookup
  5. Start the gRPC server
  6. Start the Distro protocol, which is associated with the AP mode of the registry
  7. Start the Raft protocol, which is associated with the CP mode of the registry

Cluster Node Management

Nacos server startup can be in standalone mode or cluster mode. Standalone mode is mainly for debugging; you can start Nacos in standalone mode by adding the startup parameter -Dnacos.standalone=true. Cluster mode has several variants. In cluster mode, whether AP or CP, each server generally needs to know the list of servers so that the servers can communicate with each other. How does Nacos discover its server nodes? Nacos provides the related APIs LookupFactory and MemberLookup. LookupFactory is a higher-level API that makes it easy to get or switch a MemberLookup, so the focus is on MemberLookup. MemberLookup has three implementations in Nacos:

  1. StandaloneMemberLookup: corresponds to Nacos standalone mode; its core logic is to obtain the local IP and port
  2. FileConfigMemberLookup: gets the server list from ${user.home}/nacos/conf/cluster.conf
  3. AddressServerMemberLookup: gets the server list from a separate address server

1. For a simple test, you can start in standalone mode, which corresponds to StandaloneMemberLookup

-Dnacos.standalone=true

2. Start in cluster mode, with each server node on its own machine

1. Add the startup parameter -Dnacos.member.list=IP1:8848,IP2:8848,IP3:8848
2. Or list the members in the configuration file `/Users/luoxy/nacos/conf/cluster.conf`, one ip:port per line

3. Start in cluster mode with three server nodes deployed on the same machine, distinguished by different ports. Note that this mode has some limitations

-Dserver.port=8848 -Dnacos.home=/Users/luoxy/nacos8848 -Ddebug=true -Dnacos.member.list=172.16.120.249:8848,172.16.120.249:8858,172.16.120.249:8868

You must specify the `nacos.home` parameter, because it defaults to the user home directory; if all three server nodes are started on the same machine, their JRaft data directories will conflict and the latter two nodes will fail to start

Client-side startup

The main thing is to establish a connection to the server

Configuration center

NacosFactory#createConfigService => new NacosConfigService => new ClientWorker => new ServerListManager => ConfigRpcTransportClient
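
As a reference point for this chain, here is a minimal client-side sketch (the server address, dataId and group are made-up examples; exception handling omitted):

```java
Properties properties = new Properties();
properties.put(PropertyKeyConst.SERVER_ADDR, "localhost:8848");

// NacosFactory#createConfigService builds a NacosConfigService, which in turn
// creates the ClientWorker / ServerListManager / ConfigRpcTransportClient above
ConfigService configService = NacosFactory.createConfigService(properties);

// read a configuration, waiting at most 3000 ms
String content = configService.getConfig("nacos-api.properties", "DEFAULT_GROUP", 3000);
System.out.println(content);
```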

The registry

NacosFactory#createNamingService => new NacosNamingService => new NamingClientProxyDelegate => new NamingGrpcClientProxy => RpcClientFactory#createClient => GrpcSdkClient#start

Connection management

1. Client types

  1. ConnectionBasedClient: long connections, for 2.x clients
  2. IpPortBasedClient: for 1.x clients

2. Client managers

  1. ConnectionBasedClientManager: manages ConnectionBasedClient
  2. EphemeralIpPortClientManager: manages IpPortBasedClient, for ephemeral nodes
  3. PersistentIpPortClientManager: manages IpPortBasedClient, for persistent nodes

The dependencies are as follows

  1. When a connection is created or disconnected, the ClientConnectionEventListenerRegistry#notifyClientxxx methods are executed, which in turn call the ConnectionBasedClientManager#clientxx methods to update the corresponding client cache
  2. During initialization, ConnectionBasedClientManager and EphemeralIpPortClientManager each start a scheduled task to check whether clients have expired; the default expiration times are 30s and 5s respectively. Why doesn't PersistentIpPortClientManager start such a task? Because the first two manage ephemeral nodes, while PersistentIpPortClientManager manages persistent nodes.

1. Heartbeat sending: when an HTTP client registers a service, a heartbeat task is created that sends a heartbeat to the server every 5s. gRPC clients do not create heartbeat tasks; clients are refreshed based on the TCP connection status, e.g. invalid clients are removed

  1. Heartbeat receiving: since gRPC clients do not send heartbeats, the server does not receive heartbeats from them; heartbeats are only sent by HTTP clients

Configuration updates

To quickly understand the overall process of publishing a configuration, create a configuration directly from the Control Console

  1. The console requests the back-end HTTP interface /v1/cs/configs; the corresponding controller is ConfigController
  2. The configuration is updated through the external storage service ExternalStoragePersistServiceImpl, an implementation of PersistService, which ultimately executes SQL to write to the database
  3. A ConfigDataChangeEvent is published: the event is pushed to a BlockingQueue and all subscribers are then notified in turn. Note that the subscribers here are not listening clients but server-side subscribers, i.e. the server's own follow-up processing after the ConfigDataChangeEvent; this is essentially the observer pattern
  4. Notify is executed for each subscriber
  5. Log and return the result
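
For reference, the same flow is triggered when a configuration is published through the client SDK instead of the console; a minimal sketch (dataId, group and content are made-up examples; exception handling omitted):

```java
Properties properties = new Properties();
properties.put(PropertyKeyConst.SERVER_ADDR, "localhost:8848");
ConfigService configService = NacosFactory.createConfigService(properties);

// Publishing a config goes through the same server-side path as the console:
// write the database, publish ConfigDataChangeEvent, notify the subscribers
boolean published = configService.publishConfig("nacos-api.properties", "DEFAULT_GROUP", "timeout=3000");
System.out.println("published: " + published);
```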

So, what are the subscribers on the server side? By debugging the source code, we find that subscribers are registered through the static method NotifyCenter#registerSubscriber. The core subscribers are as follows:

  1. RpcConfigChangeNotifier: corresponds to gRPC clients; it processes the ConfigDataChangeEvent and notifies the clients of the change
  2. LongPollingService: corresponds to HTTP clients; it processes the LocalDataChangeEvent and notifies the clients of the change
  3. AsyncNotifyService: processes the ConfigDataChangeEvent; it notifies the other server nodes of the configuration change and updates this server's dump cache
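
A rough sketch (the class shape and signatures are approximate, not copied from the source) of how such a server-side subscriber is wired up via NotifyCenter#registerSubscriber:

```java
// Approximate shape of a server-side subscriber; the real subscribers listed
// above follow this pattern.
public class LoggingConfigChangeSubscriber extends Subscriber<ConfigDataChangeEvent> {

    public LoggingConfigChangeSubscriber() {
        // the real subscribers register themselves in a similar way during construction
        NotifyCenter.registerSubscriber(this);
    }

    @Override
    public void onEvent(ConfigDataChangeEvent event) {
        // server-side reaction to the configuration change
        System.out.println("config changed: " + event);
    }

    @Override
    public Class<? extends Event> subscribeType() {
        return ConfigDataChangeEvent.class;
    }
}
```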

Service registration

There are two modes: AP and CP

AP mode

AP mode has one characteristic: the registered node type is ephemeral, and the data is not persisted. The Distro protocol, a consistency protocol for ephemeral data, is used

Properties properties = new Properties();
// Address of Nacos's service center
properties.put(PropertyKeyConst.SERVER_ADDR, "localhost:8848");
NamingService nameService = NacosFactory.createNamingService(properties);


Instance instance = new Instance();
instance.setIp("127.0.0.1");
instance.setPort(8009);
instance.setMetadata(new HashMap<>());
instance.setEphemeral(true);
nameService.registerInstance("nacos-api", instance);

Thread.sleep(Integer.MAX_VALUE);
The registration flow is roughly as follows:
  1. The gRPC client connects to the server
  2. The client sends a request to the server (RpcClient#request)
  3. The server receives the request. Before looking at request handling, one question needs clarifying: how does the server start? We need to know this because normally, when the server starts, it binds the corresponding RequestHandlers; knowing this makes it easy to set breakpoints and quickly grasp the whole process. So how is the server started in Nacos? The core is this: BaseRpcServer is annotated with @PostConstruct, and when the server starts it binds handlers such as GrpcRequestAcceptor -> InstanceRequestHandler
  4. First enter the GrpcRequestAcceptor#request method; then the RequestHandler#handleRequest method, which handles permissions first; then the InstanceRequestHandler#handle method, which decides whether to register or deregister the service (take registration as the example); then the EphemeralClientOperationServiceImpl#registerInstance method. EphemeralClientOperationServiceImpl is an implementation of ClientOperationService; PersistentClientOperationServiceImpl is another implementation, and the two correspond to AP mode and CP mode respectively
  5. Execute the AbstractClient#addServiceInstance method, update the Map cache, and publish a ClientEvent.ClientChangedEvent
  6. Publish a ClientOperationEvent.ClientRegisterServiceEvent
  7. Publish a MetadataEvent.InstanceMetadataEvent, and return the response to the client

The main flow may seem simple, but core logic is hidden in the handling of these events. For example, when the cache on one server node is updated, how is the data synchronized to the other server nodes? Looking closer, many such questions come up.

During service registration, 3 events are published, but there are 4 events in total, because the listener logic of some events publishes further events. Let's see what each of these events involves

  1. ClientEvent.ClientChangedEvent: the corresponding listener is DistroClientDataProcessor. When is DistroClientDataProcessor registered? DistroClientComponentRegistry#doRegister is annotated with @PostConstruct, and DistroClientDataProcessor is registered inside that method. After DistroClientDataProcessor receives the ClientEvent.ClientChangedEvent, it synchronizes the data change to the other server nodes through DistroProtocol, which answers the first question. How the synchronization works will be analyzed in detail next
  2. ClientOperationEvent.ClientRegisterServiceEvent: the corresponding listener is ClientServiceIndexesManager, which is a Spring bean. After ClientServiceIndexesManager receives the ClientOperationEvent.ClientRegisterServiceEvent, it first updates the cache ConcurrentMap<Service, Set<String>> publisherIndexes, and then publishes another ServiceEvent.ServiceChangedEvent. What exactly does publisherIndexes do? We'll analyze it later
  3. MetadataEvent.InstanceMetadataEvent: the corresponding listener is NamingMetadataManager, which is a Spring bean. After NamingMetadataManager receives the MetadataEvent.InstanceMetadataEvent, its logic likewise updates the information in two caches: expiredMetadataInfos and serviceMetadataMap
  4. ServiceEvent.ServiceChangedEvent: the corresponding listener is NamingSubscriberServiceV2Impl, which is a Spring bean. After NamingSubscriberServiceV2Impl receives the ServiceEvent.ServiceChangedEvent, it submits a task to the PushDelayTaskExecuteEngine; the intent of this is not yet clear

There are still some things I don’t understand, such as the purpose of these caches

Distro protocol

The Distro protocol is positioned as a consistency protocol for ephemeral data; that is, there is no need to store the data to disk or a database. Ephemeral data usually maintains a session with the server, and the data is not lost as long as the session exists.

CP mode

CP mode has one characteristic: the registered node type is persistent (non-ephemeral), and the data is persisted

Properties properties = new Properties();
// Address of Nacos's service center
properties.put(PropertyKeyConst.SERVER_ADDR, "localhost:8848");
NamingService nameService = NacosFactory.createNamingService(properties);


Instance instance = new Instance();
instance.setIp("127.0.0.1");
instance.setPort(8009);
instance.setMetadata(new HashMap<>());
instance.setEphemeral(false);
nameService.registerInstance("nacos-api", instance);

Thread.sleep(Integer.MAX_VALUE);

It turns out that an ephemeral node is registered anyway. This is the result of local debugging, and it was confirmed with the Nacos maintainers: CP mode cannot be used over the long-connection (gRPC) mode.

However, persistent nodes can be registered via HTTP, as follows:

curl -X POST 'http://127.0.0.1:8848/nacos/v1/ns/instance?port=8848&healthy=true&ip=11.11.11.11&weight=1.0&serviceName=nacos.test.3&encoding=GBK&namespaceId=n1&ephemeral=false'

However, an error message is returned. (The current deployment is a single started node, with startup parameters -Dnacos.standalone=true -Ddebug=true)

caused: java.util.concurrent.ExecutionException: com.alibaba.nacos.consistency.exception.ConsistencyException: com.alibaba.nacos.core.distributed.raft.exception.NoLeaderException: The Raft Group [naming_persistent_service_v2] did not find the Leader node;
caused: com.alibaba.nacos.consistency.exception.ConsistencyException: com.alibaba.nacos.core.distributed.raft.exception.NoLeaderException: The Raft Group [naming_persistent_service_v2] did not find the Leader node;
caused: com.alibaba.nacos.core.distributed.raft.exception.NoLeaderException: The Raft Group [naming_persistent_service_v2] did not find the Leader node

It seems to be related to leader election; the preliminary judgment is that it happens because the server is deployed as a single node

The core of the CP mode implementation is based on the open-source JRaft: www.sofastack.tech/projects/so…

// Send a heartbeat
curl -v -X PUT "http://localhost:8848/nacos/v1/ns/instance/beat?serviceName=xxx"

Classes involved in the server-side heartbeat check (notes):

  1. IpPortBasedClient, ClientBeatCheckTaskV2
  2. InstanceBeatCheckTaskInterceptorChain
  3. HealthCheckEnableInterceptor, HealthCheckResponsibleInterceptor, InstanceEnableBeatCheckInterceptor, ServiceEnableBeatCheckInterceptor
  4. InstanceBeatCheckTask, UnhealthyInstanceChecker, ExpiredInstanceChecker

Configuration management

If there are 3 server nodes in a cluster, client 1 is connected to server1 and client 2 is connected to server2. When client 1 writes a configuration, how does client 2 learn about the configuration change? In other words, when a configuration is changed, how is it synchronized between the different server nodes?

The flow is roughly as follows:

  1. Server1 receives the request, updates the database, and then updates the local dump file
  2. After receiving the ConfigDataChangeEvent, AsyncNotifyService obtains all server nodes in the cluster
  3. A NotifySingleRpcTask is then created for each server node based on the ConfigDataChangeEvent; finally the NotifySingleRpcTasks are merged into an AsyncRpcTask and submitted to a thread pool for execution (for HTTP requests, …)
  4. The execution logic of AsyncRpcTask: take each NotifySingleRpcTask and check whether its target server node is the current node; if so, dump the new local file; otherwise, send a gRPC request to that server node
  5. What if an error occurs while notifying another server? Retries continue, and the retry delay increases with the number of retries
  6. GrpcRequestAcceptor -> ConfigChangeClusterSyncRequestHandler: the receiving server performs the local dump operation, querying the information from the database and updating it into the cache file
  7. The dump operation: the latest value is read from the database, the local cache file is updated, and a LocalDataChangeEvent is published. Both RpcConfigChangeNotifier and LongPollingService listen for this event. After receiving it, they find all listeners (clients) that are listening to the key and send notifications to those clients; after receiving the notifications, the clients find the corresponding listeners based on the key and execute them in sequence. The local dump cache exists mainly to reduce database read pressure

(As mentioned earlier, a @PostConstruct-annotated method in the BaseRpcServer class is what starts the gRPC server.)
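
To complete the picture from the client side, this is roughly how client 2 would listen for the change (a minimal sketch; the server address, dataId and group are made-up examples, exception handling omitted):

```java
Properties properties = new Properties();
properties.put(PropertyKeyConst.SERVER_ADDR, "localhost:8848");
ConfigService configService = NacosFactory.createConfigService(properties);

// The callback fires after the server pushes a change notification for this dataId/group,
// i.e. at the end of the server-to-server sync flow described above
configService.addListener("nacos-api.properties", "DEFAULT_GROUP", new AbstractListener() {
    @Override
    public void receiveConfigInfo(String configInfo) {
        System.out.println("config changed: " + configInfo);
    }
});
```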

Service discovery

There are two modes of service discovery: AP and CP

1. In AP mode, ephemeral nodes are registered; no persistence is involved, the data is stored in a Map, and data synchronization between servers is done by broadcasting; if a broadcast fails, it is retried
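
For reference, this is roughly how a client queries and subscribes to a service on the discovery side (a minimal sketch; the service name reuses the earlier registration example and the server address is a made-up example, exception handling omitted):

```java
Properties properties = new Properties();
properties.put(PropertyKeyConst.SERVER_ADDR, "localhost:8848");
NamingService namingService = NacosFactory.createNamingService(properties);

// one-off query of the instances currently registered under the service
List<Instance> instances = namingService.getAllInstances("nacos-api");
System.out.println(instances);

// subscribe to changes; the callback fires whenever the instance list changes
namingService.subscribe("nacos-api", event -> {
    if (event instanceof NamingEvent) {
        System.out.println("instances changed: " + ((NamingEvent) event).getInstances());
    }
});
```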

Configuration center

About connections

  1. There are two ways: HTTP and gRPC. gRPC is not available until version 2.0; we only discuss gRPC here
  2. Connections are only created when the client actually needs them, for example when the ConfigService is initialized or when configuration is pushed
  3. In the current version (2.0.0-alpha.2), only one gRPC connection is established per client, but judging from the intent of the code, later versions may allow one client to hold multiple connections and assign taskIds to different ones (ClientWorker#ensureRpcClient)
  4. Before establishing a connection, the client first obtains the list of all servers and selects the first one to connect to. If the connection succeeds, it returns; if it fails, it retries with the next server, up to 3 times (RpcClient#start)
  5. What if the connection still fails after the three attempts in step 4? In step 4, besides establishing the connection, two background threads are started: one handles connection failures, the other handles the callback after a successful connection

About the request

  1. The client sends a push request to the specified server and retries if the request fails, up to 3 times, as long as it has not timed out. If the long connection between the client and the server is abnormally disconnected during the request (for example, when the server node goes offline), the client re-establishes a connection with an available server through the background tasks mentioned above, so that a usable connection is maintained
  2. The server receives the request, updates the database, then updates the local cache file, and then sends notifications. The notifications cover two parts: clients and servers. The clients are those listening to the configuration; the servers are the other server nodes, which are told that the configuration has changed
  3. When a client receives a configuration change notification, it executes its listener logic. When another server receives the configuration change notification, it updates its local dump file and triggers a notification to the clients listening on it; those clients then receive the notification and execute their listener logic

About storage

  1. Database: for persistence
  2. Local cache, which is a Map, to reduce the load of database reads (ConfigCacheService#dump)

References

  1. Distro protocol: cloud.tencent.com/developer/a…