Tip: below the source code address, please take


preface

The most commonly used relational database in Internet projects is MySQL. With the growth of users and businesses, the traditional single database and single table mode is difficult to satisfy the storage and query of a large number of business data. The large amount of data in a single database and single table will make the writing and query efficiency very slow, so the strategy of dividing databases and tables should be adopted to solve this problem.


Tip: The following is the text of this article, case for reference only

1. Service Scenario Description

Suppose there is an e-commerce system currently using MySQL, and you want to design a scheme with large data storage, high concurrency, high performance and scalability, and there are user tables in the database. There will be a lot of users, and to achieve high scalability, how do you design? OK, let’s look at the traditional way of sub-table

Of course, there are small partners who know how to split the database by province/region or certain business relationshipsOK, so the question is, how do you make sure that the data is stored in different tables in different libraries? Let libraries reduce concurrency stress? How should we make the rules of the sub-table? Don’t worry, it’s coming

Two, the level of library table method

1.RANGE

In the first approach, you can specify a range of data to be tabulated, such as 1 to 1000000,1000001 to 2000000, using a table per million, as shown in the figure below

Of course, this method needs to maintain the table ID, especially in the distributed environment, such a distributed ID, without the use of third-party table tools, it is recommended to use Redis, Redis INCR operation can easily maintain the distributed table ID.

Advantages of the RANGE method: It is easy to expand the capacity. You need to build libraries and tables in advance

Disadvantages of the RANGE method: Most reads and writes require new data, resulting in I/O bottlenecks. As a result, the new library is under heavy pressure. Therefore, the RANGE method is not recommended.

2. A HASH modulus

In view of the IO bottleneck in the above RANGE mode, we can adopt the mode of mod selection based on user ID HASG to divide the library and table, as shown in the figure:

In this way, data can be spread across different libraries and tables, avoiding IO bottlenecks.

Advantages of the HASH module selection method: Data can be evenly distributed in different databases and tables, reducing database pressure

Disadvantages of the HASH modulo method: It is difficult to expand the capacity. During data migration, the HASH value needs to be recalculated and assigned to different libraries and tables

3. Consistency HASH

It’s not the perfect way to HASH modulus, but what is?

Using the consistent HASH algorithm can solve the problem perfectly

Common HASH algorithm:

Ordinary hashing algorithms map binary values of arbitrary length to shorter binary values of fixed length. This small binary value is called the hash value. A hash is a unique and extremely compact numerical representation of a piece of data.

Disadvantages of common hash algorithm in distributed application: In distributed storage system to store the data to a specific node, if we adopt common hash algorithm for routing, map the data to a specific node, such as key % n, the key is the key data, number of nodes n is machine, if you have a machine to join or quit the cluster, then all data mapping is invalid, If the storage is persistent, data migration is required. If the storage is distributed, other caches are invalid.

Consistent HASH algorithm: According to the commonly used HASH algorithm, the corresponding key is hashed into a space with 2^32 nodes, that is, the number space from 0 to (2^32)-1. Now we can think of these numbers as a closed ring, as shown in the following image.

Let’s say we have three database server nodes: Node1, Node2, and Node3. Each node is responsible for storing its own part of the user data. Let’s say we have user1, user2, and user3. User1 falls on Node1, user2 falls on Node2, and user3 falls on User3

OK, now let’s assume node3 failsUser3 will fall on Node1, leaving the previous node1 and Node2 data unchanged, assuming node4 is added

You will find user3 will land on node4, you will find that through the analysis of the node adding and deleting, consistent hashing algorithm while maintaining the monotonicity, or data migration to the minimum, this algorithm is very suitable for distributed cluster, avoid a large amount of data migration, reduce the pressure of the server.

There is, of course, one more problem to solve, and that is balance. As can be seen from the figure, when there are few server nodes, there will be a problem, that is, a large amount of data will inevitably be concentrated on one node, and a small number of data will be concentrated on other nodes.

In order to solve this data skewness problem, consistent hashing algorithm introduces virtual node mechanism, that is, compute multiple hashes for each service node and place one node in each calculated result position, which is called virtual node. To do this, first determine the number of virtual nodes associated with each physical node, and then add a number after the IP or host name. For example, in the case above, you can compute three virtual nodes per server, Thus, the hash values of “node 1-1”, “node 1-2”, “node 1-3”, “node 2-1”, “node 2-2”, “node 2-3”, “Node 3-1”, “node 3-2” and “Node 3-3” can be calculated respectively. This creates nine virtual nodes

For example, when user1 is located on node 1-1, node 1-2, and node 1-3, it is actually located on node1, which can solve the problem of data skewness when there are few service nodes. Of course, the number of virtual nodes is not fixed to three or at most, at least three, here is just an example. The number of virtual nodes depends on the actual service conditions.

Advantages of the consistency HASH method: Using virtual nodes, data is evenly distributed in different databases and tables. Adding or deleting nodes does not affect data on other nodes. This method has high availability and high disaster recovery.

Disadvantages of consistent modulus method: well, compared with the above two methods, it can be considered that there is no.

Unit testing

OK, so let’s do a unit test where we have three nodes, and each node has three virtual nodes

package com.hyh.core.test; import com.hyh.utils.common.StringUtils; import org.junit.Test; import java.util.LinkedList; import java.util.List; import java.util.SortedMap; import java.util.TreeMap; /** * ConsistentHashTest ** @author heyuhua * @create 2021/1/31 19:50 */ public class ConsistentHashTest {// list of servers to be added to the HASH ring Private static String[] Servers = {"192.168.5.1", "192.168.5.2", "192.168.5.3"}; Private static List<String> realNodes = new LinkedList<>(); static List<String> realNodes = new LinkedList<>(); // virtualNodes. Key indicates the hash value of the virtual node and value indicates the name of the virtual node. Private static SortedMap<Integer, String> virtualNodes = new TreeMap<>(); Private static final int VIRTUAL_NODES = 3; private static final int VIRTUAL_NODES = 3; /** * Testing consistency with virtual nodes HASH */ @test public void testConsistentHash() {initNodes(); String[] users = {"user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9"}; for (int i = 0; i < users.length; I++) System. The out. Println (" [] "+ users [I]," the hash value of "+ getHash (users [I]) +", be routed to the node [" + getServer (users [I]) + "] "); } public void initNodes() {for (int I = 0; i < servers.length; i++) realNodes.add(servers[i]); for (String str : realNodes) { for (int i = 0; i < VIRTUAL_NODES; I ++) {String virtualNodeName = STR + "- virtual node "+ string.valueof (I); int hash = getHash(virtualNodeName); Println (" virtual node [" + virtualNodeName + "] is added, hash value is "+ hash); virtualNodes.put(hash, virtualNodeName); } } System.out.println(); } // Use the FNV1_32_HASH algorithm to compute the Hash value of the server. Private static int getHash(String STR) {final int p = 16777619; int hash = (int) 2166136261L; for (int i = 0; i < str.length(); i++) hash = (hash ^ str.charAt(i)) * p; hash += hash << 13; hash ^= hash >> 7; hash += hash << 3; hash ^= hash >> 17; hash += hash << 5; If (hash < 0) hash = math.abs (hash); return hash; } private static String getServer(String key) {// Get the hash value of the key int hash = getHash(key); SortedMap<Integer, String> subMap = virtualNodes.tailMap(Hash); String virtualNode; If (submap.isEmpty ()) {// Start with the first node if there is none larger than the hash value of the key. Integer I = virtualNodes.firstKey(); VirtualNode = virtualNodes.get(I); } else {// firstKey is the nearest node clockwise; Integer I = submap.firstkey (); VirtualNode = submap.get (I); If (stringutils.isNotBlank (virtualNode)) {return virtualNode.substring(0, virtualNode.indexOf("-")); } return null; }}Copy the code

Here we simulate the hash and routing of nine user objects and see the result

conclusion

In the distributed microservice architecture environment, it is recommended to strongly use consistent HASH algorithm to divide database and table. Of course, in the distributed environment, there will also be problems of data consistency and distributed transaction of business data. Next time, we will discuss the solutions of data consistency and distributed transaction

The author remarks

I see a long way to go, but I see no ending. Thanks for your likes, bookmarks and comments. See you next time.

Watch me take you on the path to becoming an architect.

Source address: click here to view the source code.