
Origins

Consistent hashing is a distributed hash table (DHT) algorithm proposed at MIT in 1997, designed to solve hot-spot problems on the Internet. Its goals are similar to those of the Cache Array Routing Protocol (CARP), and it fixes the problems caused by the simple modulo hashing that CARP uses.

The consistent hashing paper proposes four properties for judging the quality of a hash algorithm in a dynamically changing cache environment:

  • Balance: the results of hashing should be distributed as evenly as possible across all caches, so that cache space is fully utilized and none of it is wasted.

  • Monotonicity: if some content has already been assigned to buffers by hashing, and a new buffer is then added to the system, the hash result should guarantee that previously assigned content maps either to its original buffer or to the new one, and is never remapped to a different buffer from the old set.

  • Spread: in a distributed environment, a terminal may not see all of the buffers, only a subset of them. When terminals map content onto buffers by hashing, the buffer sets they see may differ, so the hash results are inconsistent: the same content ends up mapped to different buffers by different terminals. This should clearly be avoided, because storing the same content in multiple buffers reduces the storage efficiency of the system. Spread is defined as the severity of this inconsistency; a good hash algorithm should keep spread as low as possible.

  • Load: load looks at the same issue as spread, but from the other direction. Since different terminals may map the same content to different buffers, a particular buffer may likewise be mapped to different content by different users. Like spread, this should be avoided, so a good hash algorithm should keep each buffer's load as low as possible.

Implementation

In a distributed cluster, adding and removing machines, and having machines leave the cluster automatically when they fail, are the most basic functions of distributed management. If the common hash(object) % N algorithm is used, much of the existing data can no longer be found after machines are added or removed, which badly violates monotonicity. For example, suppose we use Redis as a distributed cache with 6 machines. To locate the data for a key str, we compute hash(str) % 6, which tells us which server holds the data without searching every server. But if we add a seventh machine, data is now located with hash(str) % 7, so the locations of almost all keys change; the same thing happens if a node in the cluster fails while the service is running.
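The churn is easy to quantify. The sketch below is illustrative only: the class and key names are made up, and it uses Java's built-in String.hashCode rather than a production hash. It counts how many of 10,000 keys change servers when a 6-node cluster grows to 7 under hash(key) % N:

```java
public class ModuloRehashDemo {

    // Fraction of keys whose target server changes when the cluster
    // grows from 'before' nodes to 'after' nodes under hash(key) % N.
    public static double movedFraction(int keys, int before, int after) {
        int moved = 0;
        for (int i = 0; i < keys; i++) {
            int h = ("key-" + i).hashCode() & 0x7fffffff; // keep the hash non-negative
            if (h % before != h % after) moved++;
        }
        return (double) moved / keys;
    }

    public static void main(String[] args) {
        // Only keys with h % 6 == h % 7 stay put (about 1/7), so ~86% move.
        System.out.printf("moved: %.1f%%%n", movedFraction(10_000, 6, 7) * 100);
    }
}
```

A key keeps its server only when its hash gives the same remainder mod 6 and mod 7, which happens for roughly 1 key in 7; the other ~6/7 must be migrated or become cache misses.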

Hash ring space

We treat the entire hash space as a ring: consistent hashing takes hash values modulo 2^32, so the ring covers the values 0 to 2^32 - 1. We hash three servers onto the ring, for example by their IP addresses, as shown in the following figure.

When a piece of data arrives, we hash it onto the ring and walk clockwise until we reach a node; that is the node the data falls into.

If service node NodeC goes down at this point, only ObjC is affected: its data is recalculated and, moving clockwise, lands on NodeA. Neither ObjA nor ObjB is affected.

Conversely, if a node NodeD is added, only ObjA is remapped, to NodeD; ObjB and ObjC are unaffected.

Consistent hashing therefore makes a service very fault tolerant and easy to scale: every time a node changes, only one arc of the data is affected, and the rest stays where it is.
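The clockwise walk maps naturally onto a sorted map: a key's owner is the first node position at or after the key's hash, wrapping around to the smallest position at the end of the ring. Here is a minimal sketch; the node names and ring positions are made up for illustration, and the full implementation later in this article uses MurmurHash instead:

```java
import java.util.Map;
import java.util.TreeMap;

public class RingLookup {

    // Ring position -> node name
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long position, String node) {
        ring.put(position, node);
    }

    public void removeNode(long position) {
        ring.remove(position);
    }

    // Clockwise search: first node at or after keyHash, else wrap to the first node.
    public String nodeFor(long keyHash) {
        Map.Entry<Long, String> e = ring.ceilingEntry(keyHash);
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

With NodeA at 100, NodeB at 200 and NodeC at 300, a key hashing to 250 belongs to NodeC. Remove NodeC and only that arc shifts to the next node clockwise (NodeA, by wrap-around); keys on the other arcs keep their owners, which is the fault-tolerance property described above.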

Data skew problem

With the scheme above, a problem can still arise: if the nodes are unevenly distributed on the ring, a large amount of data is hashed to node A and only a small amount to node B, which defeats the purpose of deploying multiple nodes.

Solution: virtual nodes. Instead of hashing each server onto the ring once, hash it several times, for example as IP + index, producing multiple virtual nodes spread evenly around the ring. Data is distributed across the virtual nodes, and each virtual node is then associated with its real node, so the data no longer skews toward one section of the ring.
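To see why virtual nodes help, the sketch below places each physical node on the ring several times under the name ip + "#" + i and counts where 10,000 keys land. Everything here is illustrative: it uses String.hashCode instead of MurmurHash, and the IPs are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class VirtualNodeDemo {

    // Distribute 'keys' keys over 'nodeCount' physical nodes, each placed on
    // the ring 'replicas' times; returns how many keys each node received.
    public static Map<String, Integer> distribute(int nodeCount, int replicas, int keys) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 0; n < nodeCount; n++) {
            String ip = "192.168.50." + n;
            counts.put(ip, 0);
            for (int i = 0; i < replicas; i++) {
                ring.put((ip + "#" + i).hashCode(), ip); // one virtual node
            }
        }
        for (int k = 0; k < keys; k++) {
            int h = ("key-" + k).hashCode();
            Map.Entry<Integer, String> e = ring.ceilingEntry(h);
            String owner = (e != null ? e : ring.firstEntry()).getValue();
            counts.merge(owner, 1, Integer::sum); // clockwise owner gets the key
        }
        return counts;
    }
}
```

With replicas = 1 the counts are typically very lopsided; raising it to a few hundred virtual nodes per machine evens them out, which is why the test at the end of this article uses 500 replicas.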

Code implementation

A simulated node:

import lombok.Data;

@Data
public class Node<T> {

    private String ip;
    private String name;

    public Node(String ip, String name) {
        this.ip = ip;
        this.name = name;
    }

    @Override
    public String toString() {
        return ip;
    }
}

Hash algorithm: MurmurHash is used as the hash function because of its good performance and distribution

// MurmurHash2 (64-bit variant) over the key's bytes; this method belongs to the HashService used below
public Long hash(String key) {
    ByteBuffer buf = ByteBuffer.wrap(key.getBytes());
    int seed = 0x1234ABCD;

    ByteOrder byteOrder = buf.order();
    buf.order(ByteOrder.LITTLE_ENDIAN);

    long m = 0xc6a4a7935bd1e995L;
    int r = 47;

    long h = seed ^ (buf.remaining() * m);

    long k;
    while (buf.remaining() >= 8) {
        k = buf.getLong();

        k *= m;
        k ^= k >>> r;
        k *= m;

        h ^= k;
        h *= m;
    }

    if (buf.remaining() > 0) {
        ByteBuffer byteBuf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        byteBuf.put(buf).rewind();
        h ^= byteBuf.getLong();
        h *= m;
    }

    h ^= h >>> r;
    h *= m;
    h ^= h >>> r;

    buf.order(byteOrder);
    return h;

}


Simulated Hash storage

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

    private final HashService hashService;
    private final int numberOfReplicas;
    /** The circular hash ring */
    private final SortedMap<Long, T> nodes = new TreeMap<Long, T>();

    public ConsistentHash(HashService hashService, int numberOfReplicas, Collection<T> nodes) {
        this.hashService = hashService;
        this.numberOfReplicas = numberOfReplicas;
        // Scatter the nodes onto the ring
        for (T node : nodes) {
            add(node);
        }
    }

    /**
     * Add a machine node
     *
     * @param node the node to add
     */
    public void add(T node) {
        // Place numberOfReplicas virtual nodes on the hash ring
        for (int i = 0; i < this.numberOfReplicas; i++) {
            nodes.put(this.hashService.hash(node.toString() + i), node);
        }
    }

    public T get(String key) {
        if (nodes.isEmpty()) return null;

        long hash = hashService.hash(key);

        // Walk clockwise to find the first node at or after the hash
        if (!nodes.containsKey(hash)) {
            SortedMap<Long, T> tailMap = nodes.tailMap(hash);
            hash = tailMap.isEmpty() ? nodes.firstKey() : tailMap.firstKey();
        }
        return nodes.get(hash);
    }
}

Test results:

final String IP_PRE = "192.168.50.";

// Per-node data counts
Map<String, Integer> map = new HashMap<String, Integer>();
// The list of nodes
List<Node<String>> nodes = new ArrayList<Node<String>>();

for (int i = 1; i <= 10; i++) {
    String ip = IP_PRE + i;
    // Initialize the node and its counter
    map.put(ip, 0);
    nodes.add(new Node<String>(ip, "Node:" + i));
}

HashService iHashService = new HashService();

ConsistentHash<Node<String>> consistentHash = new ConsistentHash<Node<String>>(iHashService, 500, nodes);

for (int i = 0; i < 5000; i++) {
    // Random data
    String data = UUID.randomUUID().toString() + i;
    // Get the real node for this key
    Node<String> node = consistentHash.get(data);

    map.put(node.getIp(), map.get(node.getIp()) + 1);
}

for (int i = 1; i <= 10; i++) {
    String ip = IP_PRE + i;
    System.out.println(ip + " node data: " + map.get(ip));
}