ZooKeeper from the hardware (computer organization) perspective
- cpu
ZooKeeper provides latency-sensitive functionality. If ZooKeeper has to compete with other processes for CPU, or if the host serves multiple purposes, consider giving ZooKeeper a dedicated CPU core so that it is not affected by context switching.
- memory
ZooKeeper is sensitive to swapping; swap should be avoided (or disabled) on any host running a ZooKeeper server.
- disk
Disk performance is key to the healthy running of a ZooKeeper cluster. Solid-state drives (SSDs) are recommended because ZooKeeper needs low-latency disk writes for optimal performance. Each request to ZooKeeper must be committed to disk on a quorum of servers before the result can be read.
ZooKeeper transaction logs must be on a dedicated device (a dedicated partition is not enough). Sharing the log device with other processes can cause seeks and contention, which in turn can cause multi-second delays. A minimal configuration example appears after this list.
- net
Network bandwidth footprint – because ZooKeeper keeps track of state, it is sensitive to timeouts caused by network latency. If network bandwidth is saturated, you may see unexplained client session timeouts.
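As a concrete illustration of the disk guidance above, a minimal zoo.cfg can keep snapshots and transaction logs on separate devices. dataDir and dataLogDir are standard ZooKeeper configuration keys; the mount points below are placeholders for this example.

```
# Snapshots on one device; transaction logs on a dedicated device.
# /data and /datalog are placeholder mount points.
dataDir=/data/zookeeper
dataLogDir=/datalog/zookeeper

# Client and quorum timing settings, shown for completeness.
clientPort=2181
tickTime=2000
initLimit=10
syncLimit=5
```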
ZooKeeper from the operating system perspective
- process management
A ZooKeeper server runs as a single JVM process with many internal threads; processes on different machines communicate with each other over sockets.
Number of open file handles – check the limit both system-wide and for the user running the ZooKeeper process, and make sure it accounts for the maximum number of file handles the process may open. ZooKeeper frequently opens and closes connections and needs a pool of available file handles to draw from.
- memory management
Heap usage is high because all data is kept in memory.
- filesystem
The ZooKeeper data directory is created on the local file system.
- I/O protocol stack
The number of servers in a ZooKeeper cluster is fixed, so the reliable blocking-I/O (BIO) long-connection model is used for communication between the servers in the cluster.
By contrast, the number of ZooKeeper clients interacting with the cluster is unknown, so the NIO model is used for client-server communication in order to improve ZooKeeper's concurrent performance.
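To make the BIO/NIO distinction concrete, here is a generic, simplified sketch of a non-blocking selector loop of the kind used on the client-facing side; it is not ZooKeeper's actual SelectorThread code, and the port is only illustrative. A blocking (BIO) design would instead dedicate one thread to each accepted socket.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Schematic NIO reactor: one thread multiplexes many client connections.
public class NioServerSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(2181)); // illustrative port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select(); // block until some channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) { // new client connection
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) { // data on an existing connection
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (client.read(buf) < 0) {
                        client.close(); // peer closed the connection
                    }
                    // ... decode the request and hand it to a worker thread
                }
            }
        }
    }
}
```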
ZooKeeper from the application perspective
1. data structures
- DataTree
The tree maintains two parallel data structures: a hashtable that maps from full paths to DataNodes, and a tree of DataNodes. All accesses to a path go through the hashtable; the tree is traversed only when serializing to disk. (A simplified sketch of this layout follows this list.)
- DataNode
This class contains the data for a node in the data tree.
A data node contains a reference to its parent, a byte array as its data, an array of ACLs, a stat object, and a set of its children’s paths.
- stat
The stat for this node that is persisted to disk. Its fields are czxid, mzxid, ctime, mtime, version, cversion, aversion, ephemeralOwner, and pzxid.
- ZKDatabase
This class maintains the in-memory database of ZooKeeper server state, which includes the sessions, the DataTree, and the committed logs. It is booted up by reading the logs and snapshots from disk.
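A minimal sketch of the "hashtable plus tree" layout described above. It is deliberately simplified; the class and field names are illustrative rather than ZooKeeper's actual ones. Lookups go through the path map, while each node keeps its children so the tree can be walked when serializing.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of DataTree/DataNode: a path-indexed hashtable plus a
// parent/children tree over the same nodes.
class MiniDataTree {

    static class Node {
        final String path;
        volatile byte[] data; // znode payload
        final Set<String> children = ConcurrentHashMap.newKeySet();
        Node(String path, byte[] data) { this.path = path; this.data = data; }
    }

    // All path lookups go through this map.
    private final ConcurrentHashMap<String, Node> nodes = new ConcurrentHashMap<>();

    MiniDataTree() {
        nodes.put("/", new Node("/", new byte[0])); // root node
    }

    void create(String path, byte[] data) {
        String parentPath = path.substring(0, Math.max(path.lastIndexOf('/'), 1));
        Node parent = nodes.get(parentPath);
        if (parent == null) {
            throw new IllegalStateException("parent does not exist: " + parentPath);
        }
        nodes.put(path, new Node(path, data)); // hashtable entry
        parent.children.add(path);             // tree linkage
    }

    byte[] getData(String path) {
        Node n = nodes.get(path); // O(1) lookup, no tree walk
        return n == null ? null : n.data;
    }
}
```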
2. algorithm
zab
The core of ZooKeeper is the Zab protocol (ZooKeeper Atomic Broadcast), which keeps the servers in sync. Zab has two modes: recovery mode (leader election and data synchronization) and broadcast mode (propagating transactions from the leader).
3. design patterns
- observer pattern
- chain of responsibility: ZooKeeperServer.setupRequestProcessors() wires the request-processor chain (see the sketch below)
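The request-processor chain is a textbook chain of responsibility. The sketch below imitates the pattern: the interface shape mirrors ZooKeeper's RequestProcessor, but the concrete processor classes here are invented for illustration. The actual per-role chains are listed further down in the core-class section.

```java
// Simplified chain of responsibility in the style of ZooKeeper's request
// processors; the concrete processors below are illustrative only.
interface RequestProcessor {
    void processRequest(Request request);
}

class Request {
    final String path;
    Request(String path) { this.path = path; }
}

// Validates the request, then hands it to the next processor in the chain.
class PrepProcessor implements RequestProcessor {
    private final RequestProcessor next;
    PrepProcessor(RequestProcessor next) { this.next = next; }
    public void processRequest(Request r) {
        System.out.println("prep: " + r.path);
        next.processRequest(r);
    }
}

// Last link in the chain: applies the request to the in-memory database.
class FinalProcessor implements RequestProcessor {
    public void processRequest(Request r) {
        System.out.println("apply: " + r.path);
    }
}

class ChainDemo {
    public static void main(String[] args) {
        // Wire the chain back to front, as setupRequestProcessors() does.
        RequestProcessor chain = new PrepProcessor(new FinalProcessor());
        chain.processRequest(new Request("/demo"));
    }
}
```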
4. serialization
jute
Used both for on-disk snapshot/transaction-log (SnapLog) serialization and for network packet serialization.
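Jute records know how to write and read their own fields in a fixed order to an archive. The sketch below imitates that pattern with plain DataOutput/DataInput rather than Jute's actual archive classes, so the ExampleStat record and its methods are illustrative, not the real API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Jute-style record: the object serializes its own fields in a fixed order,
// the same idea used for snapshot/txn-log files and network packets.
class ExampleStat {
    long czxid;
    long mzxid;
    int version;

    void serialize(DataOutputStream out) throws IOException {
        out.writeLong(czxid);
        out.writeLong(mzxid);
        out.writeInt(version);
    }

    void deserialize(DataInputStream in) throws IOException {
        czxid = in.readLong();
        mzxid = in.readLong();
        version = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        ExampleStat s = new ExampleStat();
        s.czxid = 1L; s.mzxid = 2L; s.version = 3;

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        s.serialize(new DataOutputStream(bos));

        ExampleStat copy = new ExampleStat();
        copy.deserialize(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(copy.czxid + " " + copy.mzxid + " " + copy.version);
    }
}
```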
5. core classes
- QuorumPeerMain
The main() entry point; runs on the main thread.
- QuorumPeer
QuorumPeer is a thread: loadDataBase() recovers data from disk, startServerCnxnFactory() opens the server socket for client communication, and startLeaderElection() kicks off leader election. Key fields: ZKDatabase and ServerCnxnFactory (whose selectorThreads and acceptThread form an NIO reactor).
```java
@Override
public synchronized void start() {
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    loadDataBase();
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    startLeaderElection();
    startJvmPauseMonitor();
    super.start(); // calls the run() method
}
```
Main loop (QuorumPeer.run()):
```java
while (running) {
    switch (getPeerState()) {
    case LOOKING:
        LOG.info("LOOKING");
        ServerMetrics.getMetrics().LOOKING_COUNT.add(1);
        // ... (leader election handling elided)
        break;
    case OBSERVING:
        try {
            LOG.info("OBSERVING");
            setObserver(makeObserver(logFactory));
            observer.observeLeader();
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
        } finally {
            observer.shutdown();
            setObserver(null);
            updateServerState();
            // Add delay jitter before we switch to LOOKING
            // state to reduce the load of ObserverMaster
            if (isRunning()) {
                Observer.waitForObserverElectionDelay();
            }
        }
        break;
    case FOLLOWING:
        try {
            LOG.info("FOLLOWING");
            setFollower(makeFollower(logFactory));
            follower.followLeader();
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
        } finally {
            follower.shutdown();
            setFollower(null);
            updateServerState();
        }
        break;
    case LEADING:
        LOG.info("LEADING");
        try {
            setLeader(makeLeader(logFactory));
            leader.lead();
            setLeader(null);
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
        } finally {
            if (leader != null) {
                leader.shutdown("Forcing shutdown");
                setLeader(null);
            }
            updateServerState();
        }
        break;
    }
}
```
* Observer (port 2888)
  Observers are peers that do not take part in the atomic broadcast protocol. Instead, they are informed of successful proposals by the Leader. Observers therefore naturally act as a relay point for publishing the proposal stream and can relieve Followers of some of the connection load. Observers may submit proposals, but do not vote in their acceptance.
  field: ObserverZooKeeperServer – a ZooKeeperServer for the Observer node type. Not much is different, but we anticipate specializing the request processors in the future.
* Follower
  This class has the control logic for the Follower.
  field: FollowerZooKeeperServer – just like the standard ZooKeeperServer. We just replace the request processors: FollowerRequestProcessor -> CommitProcessor -> FinalRequestProcessor. A SyncRequestProcessor is also spawned off to log proposals from the leader.
* Leader
  This class has the control logic for the Leader.
  field: LeaderZooKeeperServer – just like the standard ZooKeeperServer. We just replace the request processors: PrepRequestProcessor -> ProposalRequestProcessor -> CommitProcessor -> Leader.ToBeAppliedRequestProcessor -> FinalRequestProcessor.
* ServerCnxnFactory (port 2181)
  * AcceptThread – there is a single AcceptThread which accepts new connections and assigns them to a SelectorThread using a simple round-robin scheme to spread them across the SelectorThreads.
  * SelectorThread – the SelectorThread receives newly accepted connections from the AcceptThread and is responsible for selecting for I/O readiness across the connections. This thread is the only thread that performs any non-threadsafe or potentially blocking calls on the selector (registering new connections and reading/writing interest ops).
* FastLeaderElection
  Implementation of leader election using TCP. It uses an object of the class QuorumCnxManager to manage connections.
* QuorumCnxManager (port 3888; unlike ServerCnxnFactory, it uses blocking I/O)
  This class implements a connection manager for leader election using TCP. It maintains one connection for every pair of servers. The tricky part is to guarantee that there is exactly one connection for every pair of servers that are operating correctly and that can communicate over the network.
  * SendWorker – thread to send messages. An instance waits on a queue and sends a message as soon as one is available. If the connection breaks, it opens a new one.
  * RecvWorker – thread to receive messages. An instance waits on a socket read. If the channel breaks, it removes itself from the pool of receivers.
  * Messenger – multi-threaded implementation of the message handler. Messenger implements two sub-classes: WorkReceiver and WorkSender. The functionality of each is obvious from the name. Each of these spawns a new thread.
Design
Features
Design goals
- ZooKeeper is simple.
ZooKeeper allows distributed processes to coordinate through a shared hierarchical namespace, organized similarly to a standard file system. The namespace consists of data registers – called znodes, in ZooKeeper parlance – which are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper keeps its data in memory, which means ZooKeeper can achieve high throughput and low latency.
The ZooKeeper implementation emphasizes high performance, high availability, and strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large distributed systems; the reliability aspects keep it from becoming a single point of failure; and the strict ordering means that sophisticated synchronization primitives can be implemented at the client.
- ZooKeeper is replicated.
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.
- ZooKeeper is ordered.
ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
- ZooKeeper is fast.
It is especially fast in “read-dominant” workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
Since it is intended as a basis for building more complex services, such as synchronization, it provides a set of guarantees:
- Sequential Consistency – Updates from a client will be applied in the order that they were sent.
- Atomicity – Updates either succeed or fail. No partial results.
- Single System Image – A client will see the same view of the service regardless of the server that it connects to. i.e., a client will never see an older view of the system even if the client fails over to a different server with the same session.
- Reliability – Once an update has been applied, it will persist from that time forward until a client overwrites the update.
- Timeliness – The client's view of the system is guaranteed to be up-to-date within a certain time bound.
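These guarantees are what make client-side synchronization primitives practical. Below is a minimal sketch of the well-known leader-election recipe using the standard Java client; the connection string and the /election parent path are assumptions for this example, and the predecessor-watch step of the full recipe is only noted in a comment.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Simplified leader election built on ephemeral sequential znodes.
// Assumes a persistent /election parent node already exists.
public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // Each candidate creates an ephemeral sequential node; ZooKeeper's
        // ordering guarantee assigns globally ordered suffixes.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate that owns the smallest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean leader = me.endsWith(children.get(0));
        System.out.println(leader ? "I am the leader" : "I am a follower");

        // In the full recipe, a non-leader would watch its predecessor node
        // and re-evaluate when that node disappears.
        zk.close();
    }
}
```

Because the candidate node is ephemeral, a crashed leader's node is removed automatically when its session expires, so the election can proceed without manual cleanup.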
The data model
ZooKeeper's data model is a tree of znodes. The contents of the entire tree are kept in memory, much like an in-memory database, and are periodically written to disk.
A znode is the smallest unit of data in ZooKeeper. By default, the maximum data size per znode is 1 MB.
- Persistent node: once created, the node remains on the ZooKeeper server until it is explicitly deleted.
- Ephemeral (temporary) node: its life cycle is tied to the client session; when the session ends, the node is automatically removed.
- Sequential node: ZooKeeper appends a numeric suffix to the node name to form the final, complete node name. The upper limit of the suffix is the maximum value of a signed integer.
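A short example of creating each node type with the standard Java client; the connection string and paths are placeholders, and error handling is omitted. CreateMode selects persistent, ephemeral, or sequential behavior.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NodeTypesSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // Persistent node: survives until explicitly deleted.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral node: removed automatically when this session ends.
        zk.create("/app-config/worker", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential node: ZooKeeper appends a monotonically increasing suffix,
        // e.g. /app-config/task-0000000001.
        String actual = zk.create("/app-config/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + actual);

        zk.close();
    }
}
```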
- Memory data
The data model of ZooKeeper is a tree structure. The in-memory database stores the contents of the entire tree, including all node paths, node data, and ACL information, and ZooKeeper periodically persists this data to disk.
1. DataTree is the core of in-memory data storage. It is a tree structure representing a complete copy of the data in memory. The DataTree contains no business logic related to networking, client connections, or request processing, and is a stand-alone component.
2. DataNode is the smallest unit of data storage. In addition to the node's data content, ACL list, and node status, a DataNode also records a reference to its parent and the list of its children, and provides interfaces for operating on that child list.
3. ZKDatabase is ZooKeeper's in-memory database, which manages all sessions, the DataTree store, and the transaction logs. ZKDatabase periodically dumps snapshot data to disk, and when ZooKeeper starts it restores a complete in-memory database from the transaction logs and snapshot files on disk.
- Disk data
Disk data is divided into snapshot files and transaction log files.
1. Snapshot: the full in-memory data at a certain point in time. A new snapshot file is generally generated after roughly 100,000 transaction log records.
2. Transaction log: after a snapshot is taken, subsequent transaction operations are appended to the transaction log.
API
One of the design goals of ZooKeeper is providing a very simple programming interface. As a result, it supports only these operations:
- create : creates a node at a location in the tree
- delete : deletes a node
- exists : tests if a node exists at a location
- get data : reads the data from a node
- set data : writes data to a node
- get children : retrieves a list of children of a node
- sync : waits for data to be propagated
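Below is a minimal end-to-end example exercising these operations with the standard Java client; the connection string and paths are placeholders, and error handling is omitted for brevity.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // create: add a node with initial data.
        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: returns the node's Stat (czxid, mzxid, version, ...) or null.
        Stat stat = zk.exists("/demo", false);

        // get data / set data: version-checked read-modify-write.
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data));
        zk.setData("/demo", "world".getBytes(), stat.getVersion());

        // get children: list the node's direct children.
        System.out.println(zk.getChildren("/demo", false));

        // delete: remove the node, again guarded by the expected version.
        Stat latest = zk.exists("/demo", false);
        zk.delete("/demo", latest.getVersion());

        zk.close();
    }
}
```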