Continuing to answer questions from friends in the community: how should friend and fan relationship chains be designed for large data volumes and high concurrency?
What is a relationship chain business?
Relationship chains fall into two main categories, weak friend relationships and strong friend relationships, and both have typical Internet product applications.
A weak friend relationship does not require mutual consent:
- User A follows user B without needing B’s consent. A and B are then weak friends; from A’s point of view, this is a “follow” relationship.
- User B follows user A without needing A’s consent. A and B are then also weak friends; from A’s point of view, this is a “fan” relationship.
The follower chain on a microblog, such as that between an idol and fans, is a typical weak friend relationship application.
A strong friend relationship requires both sides to agree:
- User A requests to add user B as a friend, and user B agrees. A and B are then strong friends of each other: A is a friend of B, and B is a friend of A.
The QQ friend relationship chain is a typical strong friend relationship application.
The friend center is a typical many-to-many business:
- A user can add multiple friends
- A user can also be added as a friend by multiple users
Its typical architecture is:
- friend-service: the friend center service, which provides a friendly RPC interface to callers
- db: stores the friend data
Weak friend relationships: how should the storage layer be implemented?
Analysis of the weak friend relationship business shows that its core metadata is:
- guanzhu(uid, guanzhu_uid);
- fensi(uid, fensi_uid);
Where:
- The guanzhu (follow) table records every user guanzhu_uid that uid follows
- The fensi (fan) table records every fan fensi_uid of uid
It should be emphasized that creating one weak relationship produces two records: one follow record and one fan record.
For example, if user A (uid=1) follows user B (uid=2), then A follows one more user and B has one more fan, so:
- The guanzhu table gets the record {1, 2}: 1 follows 2
- The fensi table gets the record {2, 1}: 2 has fan 1
How do you query who a user follows?
Answer: build an index on uid in the guanzhu table:
select * from guanzhu where uid=1;
This returns the result: 1 follows 2.
How do you query a user’s fans?
Answer: build an index on uid in the fensi table:
select * from fensi where uid=2;
This returns the result: the fan of 2 is 1.
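The weak relationship storage described above can be sketched as follows. This is a minimal illustration, assuming a single in-memory SQLite database; the helper names (open_db, follow, following_of, fans_of) are hypothetical and not part of the original design. The composite primary keys start with uid, so they also serve as the uid index that both queries rely on.

```python
import sqlite3

def open_db():
    """Create an in-memory database holding the two tables from the text."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE guanzhu (uid INTEGER, guanzhu_uid INTEGER,
                              PRIMARY KEY (uid, guanzhu_uid));
        CREATE TABLE fensi   (uid INTEGER, fensi_uid INTEGER,
                              PRIMARY KEY (uid, fensi_uid));
    """)
    return db

def follow(db, follower, followee):
    """One follow action produces two records: a follow record and a fan record."""
    with db:  # in this single-database sketch both inserts commit together
        db.execute("INSERT INTO guanzhu VALUES (?, ?)", (follower, followee))
        db.execute("INSERT INTO fensi VALUES (?, ?)", (followee, follower))

def following_of(db, uid):
    """Everyone that uid follows, looked up by uid in guanzhu."""
    rows = db.execute("SELECT guanzhu_uid FROM guanzhu WHERE uid = ?", (uid,))
    return [r[0] for r in rows]

def fans_of(db, uid):
    """Every fan of uid, looked up by uid in fensi."""
    rows = db.execute("SELECT fensi_uid FROM fensi WHERE uid = ?", (uid,))
    return [r[0] for r in rows]
```

Calling follow(db, 1, 2) and then following_of(db, 1) and fans_of(db, 2) reproduces the two queries from the text.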
Strong friend relationships: how should the storage layer be implemented?
Scheme 1
Analysis of the strong friend relationship business shows that its core metadata is:
- friend(uid1, uid2);
Where:
- uid1 is the uid of one party in the strong friend relationship
- uid2 is the uid of the other party in the strong friend relationship
When uid=1 and uid=2 become friends, should the record inserted into the database be {1, 2} or {2, 1}?
Answer: either works. To avoid ambiguity, you can require that uid1 be less than uid2 when inserting a record.
For example, if three users with uid=1, 2, and 3 are all strong friends of each other, the database holds three records:
{1, 2}
{2, 3}
{1, 3}
How do you query a user’s friends?
Answer: to query all friends of uid=2, build an index on uid1 and another on uid2, then run:
select * from friend where uid1=2
union
select * from friend where uid2=2
This returns the result.
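Scheme 1 can be sketched like this, assuming an in-memory SQLite database. The helper names (open_db, add_friend, friends_of) and the CHECK constraint are illustrative assumptions; the text only states the uid1 < uid2 convention and the union query.

```python
import sqlite3

def open_db():
    """friend(uid1, uid2) with the uid1 < uid2 convention enforced by a CHECK."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE friend (uid1 INTEGER, uid2 INTEGER,
                             PRIMARY KEY (uid1, uid2),
                             CHECK (uid1 < uid2));
        CREATE INDEX idx_friend_uid2 ON friend (uid2);
    """)
    return db

def add_friend(db, a, b):
    """Store the pair once, smaller uid first, whichever side initiated."""
    uid1, uid2 = min(a, b), max(a, b)
    with db:
        db.execute("INSERT OR IGNORE INTO friend VALUES (?, ?)", (uid1, uid2))

def friends_of(db, uid):
    """A user can appear in either column, so both columns are queried."""
    rows = db.execute(
        "SELECT uid2 FROM friend WHERE uid1 = ? "
        "UNION "
        "SELECT uid1 FROM friend WHERE uid2 = ?", (uid, uid))
    return sorted(r[0] for r in rows)
```

The primary key covers lookups on uid1, and the extra index covers lookups on uid2, which matches the two halves of the union.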
Scheme 2
A strong friend relationship is a special case of the weak friend relationship in which A and B must follow each other (in other words, be fans of each other), so it can also be implemented with the follow table and the fan table:
- guanzhu(uid, guanzhu_uid);
- fensi(uid, fensi_uid);
For example, users A (uid=1) and B (uid=2) are strong friends, that is, they follow each other.
User A (uid=1) follows user B (uid=2), so A follows one more user and B has one more fan:
- The guanzhu table gets the record {1, 2}
- The fensi table gets the record {2, 1}
At the same time, user B (uid=2) follows user A (uid=1), so B follows one more user and A has one more fan:
- The guanzhu table gets the record {2, 1}
- The fensi table gets the record {1, 2}
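Under Scheme 2, a user’s strong friends are the users who appear both in that user’s follow list and in that user’s fan list. A minimal sketch, assuming an in-memory SQLite database and hypothetical helper names (open_db, follow, friends_of); using INTERSECT is one possible way to express this and is not prescribed by the text:

```python
import sqlite3

def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE guanzhu (uid INTEGER, guanzhu_uid INTEGER,
                              PRIMARY KEY (uid, guanzhu_uid));
        CREATE TABLE fensi   (uid INTEGER, fensi_uid INTEGER,
                              PRIMARY KEY (uid, fensi_uid));
    """)
    return db

def follow(db, follower, followee):
    """Each follow writes one follow record and one fan record."""
    with db:
        db.execute("INSERT INTO guanzhu VALUES (?, ?)", (follower, followee))
        db.execute("INSERT INTO fensi VALUES (?, ?)", (followee, follower))

def friends_of(db, uid):
    """Strong friends: users that uid follows and that also follow uid.
    Both lookups are keyed on uid, so neither needs to search by the other column."""
    rows = db.execute(
        "SELECT guanzhu_uid FROM guanzhu WHERE uid = ? "
        "INTERSECT "
        "SELECT fensi_uid FROM fensi WHERE uid = ?", (uid, uid))
    return sorted(r[0] for r in rows)
```

If the product only ever creates mutual relationships, the intersection is not strictly needed, because every follow record already has a matching fan record.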
What are the advantages and disadvantages of each implementation?
There are two implementations for strong friend relationships:
- The friend(uid1, uid2) table
- The data-redundant guanzhu and fensi tables (hereafter referred to as forward table T1 and reverse table T2)
With a small amount of data there seems to be no difference, but with a large amount of data the advantage of data redundancy shows:
- With the friend table, if the data volume is large and the database is sharded by uid1, a query on uid2 has to traverse multiple shards
- Forward table T1 and reverse table T2 implement the friend relationship through data redundancy. The records {1, 2} and {2, 1} are stored in the two tables, so both tables can be sharded by uid, and the follows and fans of a user can each be found with a single query, with no scan across shards
Aside: with a billion relationship records, the data has to be sharded horizontally.
Data redundancy is a common practice for horizontally sharding the data of a many-to-many business when the data volume is large.
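The difference in the number of shards touched can be illustrated with a small in-memory simulation. Everything here is an assumption made for illustration: the shard count, the modulo routing function, and the class names are hypothetical, and Python sets stand in for database shards.

```python
NUM_SHARDS = 4  # hypothetical shard count

def shard_of(uid):
    """Hypothetical routing rule: shard by uid modulo the shard count."""
    return uid % NUM_SHARDS

class FriendTableStore:
    """friend(uid1, uid2), sharded by uid1."""
    def __init__(self):
        self.shards = [set() for _ in range(NUM_SHARDS)]

    def add(self, a, b):
        uid1, uid2 = min(a, b), max(a, b)
        self.shards[shard_of(uid1)].add((uid1, uid2))

    def friends_of(self, uid):
        found, touched = set(), 0
        for shard in self.shards:  # uid2 is not the shard key, so scan every shard
            touched += 1
            for uid1, uid2 in shard:
                if uid1 == uid:
                    found.add(uid2)
                elif uid2 == uid:
                    found.add(uid1)
        return sorted(found), touched

class RedundantStore:
    """T1(uid, guanzhu_uid) and T2(uid, fensi_uid), both sharded by uid."""
    def __init__(self):
        self.t1 = [set() for _ in range(NUM_SHARDS)]
        self.t2 = [set() for _ in range(NUM_SHARDS)]

    def follow(self, follower, followee):
        self.t1[shard_of(follower)].add((follower, followee))
        self.t2[shard_of(followee)].add((followee, follower))

    def following_and_fans(self, uid):
        s = shard_of(uid)  # a single shard holds both answers for this uid
        following = sorted(v for u, v in self.t1[s] if u == uid)
        fans = sorted(v for u, v in self.t2[s] if u == uid)
        return following, fans, 1
```

In this sketch the friend table lookup touches all four shards, while the redundant tables answer both questions from the one shard that owns the uid.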
How is the data redundancy performed?
The next question is how the friend center service performs the data redundancy. There are three common methods.
Method 1: Synchronous redundancy in the service
As the name implies, the friend center service writes the redundant data synchronously, as shown in steps 1-4 of the figure:
- The business caller invokes the service to add data
- The service first inserts the T1 data
- The service then inserts the T2 data
- The service returns success to the business caller
Advantages:
- Not complicated: the service layer goes from one write to two writes
- Relatively high data consistency (success is returned only after both writes succeed)
Disadvantages:
- Request processing time increases (two inserts instead of one, so the time roughly doubles)
- Data may still be inconsistent. For example, if the service restarts after step 2 (the write to T1) completes, the data is never written to T2
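The synchronous double write and its failure window can be sketched as follows. This is an illustration under assumptions: T1 and T2 are placed in two separate in-memory SQLite databases to mimic tables that live on different shards, and the FriendService class, the crash_after_t1 flag, and count_rows are hypothetical names used only to simulate a restart between the two writes.

```python
import sqlite3

def open_db(ddl):
    db = sqlite3.connect(":memory:")
    db.execute(ddl)
    return db

def count_rows(db, table):
    # table is a fixed, trusted name in this sketch
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

class FriendService:
    def __init__(self):
        # Separate databases: no single local transaction can span both writes.
        self.t1 = open_db("CREATE TABLE guanzhu (uid INTEGER, guanzhu_uid INTEGER)")
        self.t2 = open_db("CREATE TABLE fensi (uid INTEGER, fensi_uid INTEGER)")

    def follow(self, follower, followee, crash_after_t1=False):
        with self.t1:  # first write: forward table T1
            self.t1.execute("INSERT INTO guanzhu VALUES (?, ?)",
                            (follower, followee))
        if crash_after_t1:
            raise RuntimeError("simulated restart between the two writes")
        with self.t2:  # second write: reverse table T2
            self.t2.execute("INSERT INTO fensi VALUES (?, ?)",
                            (followee, follower))
        return True  # success is reported only after both writes
```

When the simulated restart fires, T1 keeps the new row and T2 never receives its counterpart, which is exactly the inconsistency described in the second disadvantage.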
If the system is sensitive to processing time, the second method is commonly used.
Method 2: Asynchronous redundancy in the service
The double write is no longer done by the friend center service itself. The service layer asynchronously sends a message over the message bus to a dedicated data replication service, which writes the redundant data, as shown in steps 1-6 of the figure:
- The business caller invokes the service to add data
- The service first inserts the T1 data
- The service sends an asynchronous message to the message bus (it just sends and does not wait for a reply, so this usually completes quickly)
- The service returns success to the business caller
- The message bus delivers the message to the data synchronization center
- The data synchronization center inserts the T2 data
Advantages:
- Short request processing time (only 1 insert)
Disadvantages:
- System complexity increases, with one more component (the message bus) and one more service (the dedicated data replication service)
- When success is returned to the business caller, the data has not necessarily been inserted into T2 yet, so there is a window of inconsistency (the window is short, and the data is eventually consistent)
- If the message bus loses a message, the redundant table data becomes inconsistent
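The six steps can be sketched in a single process, with a queue standing in for the message bus and a background thread standing in for the data synchronization center. This is only an illustration: the names (t1, t2, bus, follow, sync_center) are hypothetical, Python sets stand in for the tables, and a real message bus would be a separate component.

```python
import queue
import threading

t1, t2 = set(), set()   # stand-ins for forward table T1 and reverse table T2
bus = queue.Queue()     # stand-in for the message bus

def follow(follower, followee):
    t1.add((follower, followee))    # the service inserts the T1 data
    bus.put((follower, followee))   # send the message without waiting for a reply
    return True                     # return success before T2 is written

def sync_center():
    """Data synchronization center: consume messages and write T2."""
    while True:
        follower, followee = bus.get()   # the bus delivers the message
        t2.add((followee, follower))     # insert the T2 data
        bus.task_done()

threading.Thread(target=sync_center, daemon=True).start()
```

Between the return from follow and the moment sync_center handles the message, T1 and T2 disagree; calling bus.join() in a test waits until that window has closed.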
If you want to decouple “data redundancy” from the system entirely, the third method is commonly used.
Method 3: Offline asynchronous redundancy
The double write is no longer done by the friend center service but by an offline service or task, as shown in steps 1-6 of the figure:
- The business caller invokes the service to add data
- The service first inserts the T1 data
- The service returns success to the business caller
- The data is written to the database log
- An offline service or task reads the data from the database log
- The offline service or task inserts the T2 data
Advantages:
- The double write is decoupled from the business service
- Short request processing time (only 1 insert)
Disadvantages:
- When success is returned to the business caller, the data has not necessarily been inserted into T2 yet, so there is a window of inconsistency (the window is short, and the data is eventually consistent)
- Data consistency depends on the reliability of the offline service or task
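A minimal sketch of the offline approach, assuming the database log can be modelled as an append-only list that the offline task reads from a remembered offset. The Database and OfflineReplicator classes are hypothetical stand-ins; a real deployment would read the actual log of whatever database is in use.

```python
class Database:
    """Stand-in for the database holding T1 plus its change log."""
    def __init__(self):
        self.t1 = set()
        self.log = []  # append-only change log

    def insert_t1(self, follower, followee):
        self.t1.add((follower, followee))
        self.log.append(("guanzhu", follower, followee))

class OfflineReplicator:
    """Offline task: replays log entries it has not seen yet into T2."""
    def __init__(self, db):
        self.db = db
        self.t2 = set()
        self.offset = 0  # position of the next unread log entry

    def run_once(self):
        entries = self.db.log[self.offset:]
        for table, follower, followee in entries:
            if table == "guanzhu":
                # Adding to a set is idempotent, so replaying an entry is harmless.
                self.t2.add((followee, follower))
        self.offset += len(entries)
        return len(entries)
```

The online path only ever calls insert_t1; T2 catches up whenever run_once is scheduled, which is where the window of inconsistency comes from.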
Each of the three methods above has advantages and disadvantages; choose according to the actual situation.
Data redundancy solves the horizontal sharding problem for many-to-many relationships, but it also brings a new problem: how do you keep forward table T1 and reverse table T2 consistent?
As the discussion above shows, whichever method is used, the two-step operation cannot be made atomic, so data inconsistency is always possible. High-throughput distributed transactions remain an unsolved problem in the industry, so the direction for architecture optimization here is eventual consistency: the goal is not to keep the data consistent in real time, but to find inconsistencies early and fix them.
Eventual consistency is a common practice for consistency in high-throughput Internet businesses. More specifically, there are three common methods for ensuring eventual consistency of the data.
Method 1: Offline scan of all data in the forward and reverse redundant tables
As shown in the figure above, an offline scanning tool continuously compares forward table T1 with reverse table T2. If an inconsistency is found, it is compensated for and repaired.
Advantages:
- Relatively simple, with low development cost
- The online service does not need to be modified; the repair tool is decoupled from the online service
Disadvantages:
- Scanning efficiency is low: a large amount of data that is already consistent gets scanned
- Because so much data is scanned, each scan takes a long time, so when data is inconsistent, the window of inconsistency is long
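A full scan can be sketched as a set comparison. The sketch assumes that T1 is treated as the source of truth and that both tables fit in memory as Python sets; both are simplifications made for illustration, and full_scan_repair is a hypothetical name.

```python
def full_scan_repair(t1, t2):
    """Compare all of T1 against all of T2 and repair T2 in place.

    t1 holds (follower, followee) pairs, t2 holds (followee, follower) pairs.
    Assumption: T1 is the source of truth.
    """
    expected_t2 = {(followee, follower) for follower, followee in t1}
    missing = expected_t2 - t2   # rows T2 should have but does not
    extra = t2 - expected_t2     # rows T2 has with no counterpart in T1
    t2 |= missing
    t2 -= extra
    return missing, extra
```

Every row of both tables is visited on every run even when only a handful are wrong, which is the inefficiency the disadvantages point out.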
Is there an optimization that scans only the “potentially inconsistent” data each time instead of all the data, to improve efficiency?
Method 2: Offline scan of incremental data
Scanning only the incremental log data each time greatly improves efficiency and shortens the window of inconsistency, as shown in steps 1-4 of the figure:
- Write forward table T1
- After the write to T1 succeeds, write log1
- Write reverse table T2
- After the write to T2 succeeds, write log2
Of course, an offline scanning tool is still needed to continuously compare log1 with log2 and to compensate for and repair any inconsistency found.
Advantages:
- More complex than Method 1, but still relatively simple
- Scanning efficiency is high: only incremental data is scanned
Disadvantages:
- The online service needs a small modification (the cost is low: 2 more log writes)
- More timely than Method 1, but still not real-time: the window of inconsistency depends on the scan period
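Comparing the two logs can be sketched as follows. The log entry format (timestamp, follower, followee), the grace period that skips writes possibly still in flight, and the function name are all assumptions added for illustration; the text only says that log1 and log2 are compared and differences repaired.

```python
def incremental_repair(log1, log2, t2, now, grace=3.0):
    """Repair T2 for every log1 entry that has no matching log2 entry.

    log1, log2: lists of (timestamp, follower, followee), written after the
    T1 write and the T2 write respectively.
    grace: hypothetical number of seconds to wait before treating a missing
    log2 entry as a failure, since the T2 write may still be in progress.
    """
    done = {(follower, followee) for _, follower, followee in log2}
    repaired = []
    for ts, follower, followee in log1:
        if now - ts < grace:
            continue  # too recent: the second write may still be in flight
        if (follower, followee) not in done:
            t2.add((followee, follower))  # compensate the missing T2 write
            repaired.append((follower, followee))
    return repaired
```

Only entries written since the last run need to be passed in, so the work per run is proportional to the recent write volume rather than to the table size.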
Is there a way to detect inconsistencies in real time and repair them?
Method 3: Real-time online “message pair” detection
This time, instead of writing logs, the service sends messages to the message bus, as shown in steps 1-4 of the figure:
- Write forward table T1
- After the write to T1 succeeds, send message msg1
- Write reverse table T2
- After the write to T2 succeeds, send message msg2
This time no offline tool with a periodic scan is needed; instead, a real-time subscription service continuously receives the messages.
Suppose that under normal conditions msg1 and msg2 are received within 3s of each other. If the detection service has received msg1 but does not receive msg2 within that time, it checks the data for consistency and compensates for and repairs any inconsistency.
Advantages:
- High efficiency
- Highly real-time
Disadvantages:
- The solution is more complex: a message bus component is introduced online
- Offline, one more service is needed: a detection service that subscribes to the message bus
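The detection service can be sketched as a small state machine that remembers every msg1 until the matching msg2 arrives. The PairDetector class, its method names, and the use of explicit now arguments in place of a real clock are assumptions made so the example stays deterministic; sets again stand in for the tables, and the 3s timeout is the figure assumed in the text.

```python
class PairDetector:
    """Tracks msg1 arrivals and checks the data if msg2 does not follow in time."""
    def __init__(self, t1, t2, timeout=3.0):
        self.t1, self.t2, self.timeout = t1, t2, timeout
        self.pending = {}  # (follower, followee) -> time msg1 was received

    def on_msg1(self, key, now):
        self.pending[key] = now

    def on_msg2(self, key):
        self.pending.pop(key, None)  # the pair is complete, nothing to check

    def check(self, now):
        """Verify and repair every msg1 that has waited longer than the timeout."""
        repaired = []
        for key, seen in list(self.pending.items()):
            if now - seen < self.timeout:
                continue
            follower, followee = key
            if key in self.t1 and (followee, follower) not in self.t2:
                self.t2.add((followee, follower))  # compensate the missing write
                repaired.append(key)
            del self.pending[key]
        return repaired
```

Unlike the scanning methods, the work here is driven by message arrival, so an unmatched msg1 is examined a few seconds after the write rather than at the next scan.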
That said, a technical solution is itself a trade-off between cost and benefit, and the choice depends on the level of consistency the business requires.
Conclusion
- The relationship chain business is a typical many-to-many relationship and is divided into strong friend and weak friend relationships
- Data redundancy is a common practice for horizontally sharding many-to-many business data
- There are three common methods for data redundancy
(1) Synchronous redundancy in the service
(2) Asynchronous redundancy in the service
(3) Offline asynchronous redundancy
- Data redundancy brings consistency problems. Full transactional consistency is hard to guarantee in high-throughput Internet businesses, so eventual consistency is the common practice
- The common practice for eventual consistency is to find inconsistencies as early as possible and repair the data, and there are three common methods
(1) Offline full scan
(2) Offline incremental scan
(3) Online real-time detection
I hope this gives you some inspiration; the thinking matters more than the conclusions.
You are welcome to keep asking questions.