Microblog architecture: design, implementation, and landing in production

How is it designed and implemented

The business scenario

I believe you have all used Weibo, as shown below:

The most basic features of a microblog are following, being followed, and browsing: users post messages, browse feeds, like, comment, favorite, and so on.

The page above is referred to as the relation page. It displays a user's relationship information and contains two sub-pages:

  1. The followers page displays all the users who follow this user.
  2. The following (attention) page displays all the users this user follows.

The main operations

On this page, users can add or remove their own follows, i.e., follow or unfollow other users. They can also remove followers, i.e., force another user to unfollow them.

Business characteristics

  1. Massive user data: hundreds of millions of users, thousands of posts per user, and thousands of followers/followees per user on average.
  2. High traffic: an average of 100,000 page views per second and 10,000 new posts per second.
  3. Uneven distribution across users: the post and follower counts of some users exceed those of ordinary users by one or more orders of magnitude.
  4. Uneven distribution over time: a user may suddenly become a hot user at some moment, and their follower count may jump by several orders of magnitude.

These are the typical characteristics of a social system, and they can be summarized in three keywords: large data volume, high traffic, and non-uniform distribution.

Storage for the follow feature

The simplest approach is DB-based storage, which requires only two tables:

Table_relation and table_user_info

table_relation: primary key id, followerId (the follower), attentionId (the user being followed)

table_user_info: primary key userId, plus user information (avatar, name, registration time, big-V verification, phone number, etc.)

With these two tables, querying a user's followees, followers, and profile requires only:

select count(1) from table_relation where followerId = 'xx';   -- how many users 'xx' follows

select count(1) from table_relation where attentionId = 'xx';   -- how many followers 'xx' has

select * from table_user_info where userId = 'xx';

As the number of users grows, so do the row counts of table_relation and table_user_info. With tens or hundreds of millions of users, each with hundreds of related rows, horizontal splitting (sharding) becomes necessary.

To query a user's information, the shard is computed from the userId and the query goes to that shard's info table. The userId-to-shard mapping can be implemented in several ways, such as hashing, taking specific bits of the userId, modulo, or consistent hashing.
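To make the routing step concrete, here is a minimal sketch of userId-based shard routing, assuming a hash-modulo scheme; the shard count and helper names are illustrative, not part of the original design:

```python
import hashlib

SHARD_COUNT = 100  # assumed number of horizontal shards


def shard_for_user(user_id: str) -> int:
    """Map a userId to a shard by hashing; modulo on specific id bits or
    consistent hashing are equally valid mappings."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT


def info_table_for_user(user_id: str) -> str:
    # All queries about this user go to the same shard's info table.
    return f"table_user_info_{shard_for_user(user_id):02d}"
```

A query for user 'xx' is then sent only to the shard returned by shard_for_user('xx').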

Here is the problem. If table_relation is sharded by followerId, querying the people a user follows is easy, because all rows with the same followerId land on the same shard. But querying who follows a given user requires routing the query to every shard, because rows with the same attentionId are scattered across shards, which makes that query inefficient. Since follower queries and following queries are accessed about equally often, half of the scenarios will be inefficient no matter whether we shard by followerId or by attentionId.

To solve this, the relation table is split vertically into a follower table and an attention (following) table, each recording one direction of a user's relationships. Each of the two tables is then split horizontally by userId.

Follower table: primary key ID, userId, followerId

Attention table: primary key id, userId, attentionId

The count queries now become:

select count(1) from table_follower_xx where userId=’xx’;

select count(1) from table_attention_xx where userId=’xx’;

Query users:

select * from table_user_info where userId in (...);

The first two statements each hit a single shard (two DB operations in total), but the last one may still have to query multiple shards.

Also note that after the vertical split, every follow/unfollow must write two tables, doubling the DB writes. And the operations above still rely on count queries; even with an index on userId, problems remain:

  1. For users who are followed by very many people (such as celebrities with tens of millions of followers), a count query on the follower table still has to scan a huge number of rows even with the userId index. We call these users "hot users". Every display of a hot user's follower count is inefficient. Worse, because a hot user is followed by many users, their timeline page is visited more often, so the inefficient display operation is executed frequently, further increasing the performance risk.
  2. When a user has a large number of followers, the relation page cannot show them all at once, so paging is required, with a fixed number of users per page. But with DB paging, the number of rows scanned grows with the offset, so the last pages of a hot user's relation page become very slow to display.
  3. Displaying user details: every time the relation page is shown, the info table must be queried for each follower or followee, so the serving capacity of the info table cannot scale linearly simply by adding info shards.

Introducing a cache

To address the problems above, we introduce a cache.

At the DB level, we can also store each user's follower count and following count directly in their table_user_info row. Then the relation summary shown on the timeline and feed pages needs only a userInfo query. These counts can be cached as well, and for big-V accounts a server-side local cache can be added as a second tier.

After introducing the cache, the business operations are adjusted accordingly:

  1. The relation summary on a user's timeline/feed page is displayed as follows: first query the cache keyed by the userId; on a miss, fall back to the DB (a cache-aside sketch follows this list).

  2. The detail display on a user's relation page is split into two sub-pages, the follower list and the following list:

    1. Likewise, query the follower and following caches first. Frequently queried hot users are almost always in the cache, which shields the DB from exactly the users with the largest data volume and the highest access frequency.
    2. For user info, hot users are queried more often on the details page because they have more followers, so they can almost always be served from a local cache with a short expiration time.
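As mentioned in item 1 above, the summary lookup follows the classic cache-aside pattern. A minimal sketch, assuming a redis-py client and a hypothetical query_counts_from_db helper that runs the sharded count queries shown earlier:

```python
import json

import redis

r = redis.Redis()


def query_counts_from_db(user_id: str) -> dict:
    # Placeholder for the sharded count queries on table_follower_xx /
    # table_attention_xx (or the denormalized columns in table_user_info).
    raise NotImplementedError


def relation_summary(user_id: str) -> dict:
    """Follower/following counts: cache first, DB only on a miss."""
    key = f"relation:summary:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    counts = query_counts_from_db(user_id)   # fall back to the DB
    r.set(key, json.dumps(counts), ex=300)   # assumed short TTL of 5 minutes
    return counts
```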

For a hot user's follower details page, the cached follower list is very long, so every query transfers a large amount of data over the network; and because these users are hot spots, the query frequency is also high, compounding the network load. Meanwhile, the follower and following counts in the info query change constantly; to keep them fresh, the system must re-count at very short intervals. For hot users, second-level freshness would mean counting millions or even tens of millions of rows per second. Handling this large, constantly changing volume of data in real time becomes a challenge.

For the counts, we introduce a counting cache: counts are written to the cache, the counter subscribes to the follow/unfollow change events, each add or delete updates the cache, and the database is written asynchronously. By recording a version for each event in the incremental (change-event) module (it can also be derived from a monotonically increasing timestamp) and recording the version last applied to each counter key, we can deduplicate events and avoid losing updates.
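A sketch of the idea under stated assumptions: each follow/unfollow event carries a monotonically increasing version, and the counter applies an event only if its version is newer than the last one applied for that key, which deduplicates replays; the DB write happens asynchronously elsewhere. Key names are illustrative.

```python
import redis

r = redis.Redis()


def apply_relation_event(user_id: str, delta: int, version: int) -> None:
    """Apply one follow (+1) / unfollow (-1) event to the cached follower count."""
    count_key = f"counter:followers:{user_id}"
    version_key = f"counter:followers:ver:{user_id}"

    last = r.get(version_key)
    if last is not None and int(last) >= version:
        return                      # duplicate or out-of-date event: skip it
    r.incrby(count_key, delta)      # update the cached count
    r.set(version_key, version)     # remember the newest applied version
```

A production version would wrap the version check and the update in a Lua script or a MULTI/EXEC transaction so the pair is atomic; the sketch only shows the dedup-by-version idea.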

When a user's relation details page is viewed, the follower and following lists are queried with paging. A typical user has a limited number of followers, mostly under 1,000, and the first page is queried far more often than later pages, so high throughput is achievable whether the list lives in the DB or in the cache. The followers of hot users are more complicated: the follower count is unbounded; even a small probability of page turning still amounts to heavy traffic given so many followers; the cost of scanning each further page is high; and the value list of a single distributed cache entry cannot hold an overly long follower list.

To solve the query problem for hot users' follower lists, an implementation driven by the incremental change stream is adopted. First, the first N pages (say, five) of a follower list account for most of the total visits, and the number of followers on those pages is fixed. Second, the follower list is displayed sorted by follow time, with the newest followers first; in other words, the newest data from the incremental module is the data most likely to land on the first N pages. So each time the incremental consumer pulls the latest change events, it writes them directly into the hot user's follower cache.
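A sketch of keeping those first N pages warm, assuming a Redis sorted set per hot user scored by follow time and fed by the incremental consumer; the page size and key names are assumptions:

```python
import redis

r = redis.Redis()

PAGE_SIZE = 20
HOT_PAGES = 5   # keep roughly the first N pages in the cache


def on_new_follower(hot_user_id: str, follower_id: str, follow_ts: int) -> None:
    """Incremental consumer: write the change straight into the hot user's cache."""
    key = f"followers:recent:{hot_user_id}"
    r.zadd(key, {follower_id: follow_ts})
    # Trim so only the newest entries (about the first N pages) remain cached.
    r.zremrangebyrank(key, 0, -(PAGE_SIZE * HOT_PAGES) - 1)


def follower_page(hot_user_id: str, page: int) -> list:
    """Serve one of the first few pages, newest followers first, from the cache."""
    key = f"followers:recent:{hot_user_id}"
    start, stop = page * PAGE_SIZE, (page + 1) * PAGE_SIZE - 1
    return r.zrevrange(key, start, stop)
```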

Design and optimization of post storage

Posts are usually stored in MySQL. There are a few fields: postId, userId, postTime, content

Among them:

  • userId records who made the post.
  • postTime records the publication time; second precision is enough (the same user rarely posts more than once in a second).
  • content records the body of the post.

We can create a secondary index on userId+postTime to make it faster to view all posts by time for a particular user.

select * from post where postId = ?   -- query the details of one post

select * from post where userId = ? and postTime between ? and ?   -- query a user's posts within a time range

With the queries above there is a serious "back to table" problem (for every secondary-index entry hit, the full row must be looked up again in the clustered index), which reduces DB throughput. Therefore the userId and postTime information can be encoded redundantly into the postId itself, and the secondary index can be dropped to avoid the extra lookup.

A postId can be built from three parts. The first 6 characters encode the userId, each character being one of the 36 symbols 0-9/A-Z, so 6 characters can identify 36^6, roughly 2.1 billion, distinct users. The next 6 characters are a compressed timestamp (timeCompress, accurate to the second) that can cover any second within roughly a 70-year range. The last 2 characters are a per-second sequence number (seq), so a single user can publish up to 36^2 posts in one second. A 14-character postId is therefore enough for a system of this scale. timeCompress can be designed as follows:

  • Take the posting time, subtract the time of the SNS system's very first post to get an offset in seconds, and encode that offset in base 36.
  • Designed this way, timeCompress increases monotonically in lexicographic order as time passes (at second granularity), which lets us make full use of DB range scans. A sketch of the full encoding appears after this list.
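Here is the sketch of the 14-character postId encoding referred to above; the epoch value and helper names are assumptions for illustration:

```python
import string

ALPHABET = string.digits + string.ascii_uppercase   # the 36 symbols 0-9, A-Z
SNS_EPOCH = 1230768000   # assumed: unix time of the system's first post


def to_base36(n: int, width: int) -> str:
    chars = []
    while n:
        n, rem = divmod(n, 36)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars)).rjust(width, "0")


def make_post_id(user_code: str, post_ts: int, seq: int) -> str:
    """6-char user prefix + 6-char timeCompress + 2-char per-second sequence."""
    time_compress = to_base36(post_ts - SNS_EPOCH, 6)   # seconds since the first post
    return user_code + time_compress + to_base36(seq, 2)
```

Because timeCompress grows lexicographically with time, queries such as `postId between ? and ?` or `postId like 'prefix%'` remain tight range scans on the clustered index.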

So when we query a list of posts sent by a user, the SQL becomes:

select * from post where postId between postId1 and postId2

or

select * from post where postId like "xx%"

Since we are querying posts from the same user, every postId shares the same prefix; and if we are querying that user's posts within a certain time range, the leading characters of the 6-character timeCompress are also the same. This avoids the back-to-table problem entirely.

As the number of posts increases, the data volume and throughput of a single DB reach their limits, so horizontal splitting is introduced to make both scale linearly. We split horizontally by userId. Since the postId prefix fully contains the userId information, the postId alone can serve as the routing key.

However, some hot users attract heavy read traffic that is all routed to the same DB, which can become a read bottleneck. The common remedy is read/write separation: one write (master) library and N read replicas, with master/replica replication handled by the DB's synchronization mechanism, and each read randomly choosing one of the N read libraries.

The DB solution based on read/write separation has two problems:

  1. Replication lag: after read/write separation is adopted, reads may see stale data.
  2. Old data is rarely accessed, yet read/write separation means each record is stored in multiple copies. Posts are accessed heavily right after publication and then hardly at all, so 99 percent of the replica data is never read and storage efficiency is low.

So we introduce a cache here; the cache's expiration mechanism naturally keeps stale data from occupying space.

How do we design the Redis key and value? Note that the number of posts a single user publishes in a day is limited, usually no more than 10 and about 3 on average. A single user rarely produces more than 100KB of posts in a week, 1MB in extreme cases, far below the upper limit of a Redis value. So:

Key: userId+ timestamp (accurate to week)

Value: a Redis hash whose field is the postId and whose value is the post content

Expire is set to one week, so at most about two weeks of data are live at the same time (assuming 100 million users averaging 3 posts of about 0.1KB per day, the estimated footprint is roughly 400GB).

A query for a user's posts within a time range becomes an HSCAN over that user's key for the corresponding week. Posting and other write operations update the DB and the cache together, and the DB change records are used to guarantee consistency.
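A minimal sketch of that key/value layout with redis-py; the week-key format and helper names are assumptions (the bucket here is a simple 7-day window rather than a calendar week):

```python
import redis

r = redis.Redis()

WEEK_SECONDS = 7 * 24 * 3600


def week_key(user_id: str, ts: int) -> str:
    # Key: userId plus a timestamp rounded down to the week.
    return f"post:{user_id}:{ts // WEEK_SECONDS}"


def cache_post(user_id: str, post_id: str, content: str, ts: int) -> None:
    key = week_key(user_id, ts)
    r.hset(key, post_id, content)   # hash field = postId, value = post content
    r.expire(key, WEEK_SECONDS)     # so at most about two weeks of data stay live


def posts_of_week(user_id: str, ts: int) -> dict:
    # HSCAN over that week's key; the postId prefix allows further time filtering.
    return {field: value for field, value in r.hscan_iter(week_key(user_id, ts))}
```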

However, some hot users have enormous follower counts, which means the Redis node holding such a user's data may see on the order of 10 million queries per second.

So we go a step further and introduce a local (in-process) cache to relieve the pressure on the server-side cache: the data of frequently queried hot users is stored directly in the local cache.

How the "like" feature lands

The "like" service has two characteristics:

  1. High throughput
  2. Some data inconsistency is acceptable

If we only consider likes, what is the simplest design?

Use MySQL for persistent storage and Redis as the cache: reads and writes go to the cache, and an asynchronous thread periodically flushes the counts to the DB.

Table counter: ID, postId, count

Redis KV storage: key = postId, value = count

However, taking Weibo as an example, a post carries not only likes but also repost, comment, and read counts, so this design has extensibility and efficiency problems: assembling a post requires querying Redis and the DB many times.

Assuming the home page shows 100 posts, we must loop over each one, and assembling each post takes 4 Redis round trips.

To optimize, we can first add columns at the MySQL level:

Counter table: id, postId, readCount, forwardCount, commentCount, praiseCount

The downside is that adding a new kind of count requires a schema change. What if we do not want to change columns?

Counter table: id, postId, countKey (the count type name, such as readCount), countValue

On the Redis side, the counts of a post can be stored together in a hash.
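A sketch of that hash-per-post counter with redis-py (key format assumed); one HGETALL then returns every count for a post in a single round trip:

```python
import redis

r = redis.Redis()


def incr_count(post_id: str, kind: str, delta: int = 1) -> None:
    """kind is one of readCount / forwardCount / commentCount / praiseCount."""
    r.hincrby(f"counter:{post_id}", kind, delta)


def all_counts(post_id: str) -> dict:
    # All counts of one post in a single round trip instead of 4 separate gets.
    return {k.decode(): int(v) for k, v in r.hgetall(f"counter:{post_id}").items()}
```

For the 100 posts on a home page, these lookups can additionally be batched with a pipeline, so assembling the page no longer costs hundreds of round trips.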

Following users' posts: how to design and implement the timeline

On platforms such as Weibo, Tieba, and Juejin, we can follow the people we like; once we follow them, their new posts appear in our feed, sorted by time and paginated.

As a user's relations change, their timeline changes too, and adding or deleting one user's posts affects the timelines of many other users. The content of the timeline is thus shaped by the relations between users, and when those relations are complex, timeline performance becomes a challenge.

How to implement

There are two classic implementations, push and pull. We discuss the push pattern first.

Push model

The push mode works as follows: every time a user adds a post, the post is pushed to the DB shards where their followers reside, and each follower reads only the data stored on their own shard when browsing the timeline.

Therefore a timeline table is needed alongside the post table.

The timeline fields include ID, userId, postId, and postTime

The timeline table has a unique constraint on (userId, postId): the same post can appear at most once in the same user's timeline.

A user browses the posts in their timeline within a time range; the data lives on the DB shard that their userId maps to:

select postId from timeline where userId='xx' and postTime between ? and ? ;

select * from post where postId in (...) ;

When a user follows someone new, that person's postIds are fetched from the shard where they reside and inserted into the follower's timeline table. When the user unfollows user B, B's posts are removed from the timeline:

delete from timeline where userId='xx' and postId like 'B%';

The DB-based push scheme above faces a difficult scenario: when user B, whom user A follows, publishes or deletes a post, besides updating B's own post table, the timeline table of every one of B's followers must be inserted into or deleted from. If B is a hot user with hundreds of millions of followers, every post addition or deletion is copied hundreds of millions of times, which makes adding and deleting posts a bottleneck.

The pull model

Unlike push mode, in pull mode users do not synchronize anything to their followers when they add or delete posts, so the hot-user write bottleneck of push mode disappears. However, each time a user queries their timeline for a period, the recent post lists of everyone they follow must be queried.

The pull implementation is simple and needs no separate timeline table: each user's posts are kept in the post table and its cache. Pull mode adds no write overhead, but reads, which are far more frequent (a user viewing all posts of their followees within a period), require scanning all followees' posts by time. If each user follows 500 people on average, each timeline refresh triggers roughly 100 shard scans (assuming 100 DB shards). In a large social system with, say, 10 million users online at the same time and a refresh every 10 seconds on average, that is about 100 million DB queries per second; relying on the DB alone would take tens of thousands of DB instances.

There is one optimization available in pull mode. Most of the accounts in a follow list are big Vs, which means many users' timeline queries hit the same followee at the same time; these concurrent queries for the same followee can be merged so the DB is accessed only once. But the saving is actually quite small.

Combining push and pull

Pure push and pure pull both have flaws, so the two can be combined and optimized together.

When a user publishes a post, it only needs to be synchronized to the 100 database shards (assuming there are 100 shards). Call the replicated table post_rep; it needs at least three fields: posterId, postId, postTime. The number of pushed copies is bounded: no matter how many followers the poster has, only 100 copies are made. Reads are sharded by the user to whom the timeline belongs, so each user needs only one DB query and the query volume is bounded: with 10 million users online at the same time, each DB shard sees about 10,000 queries per second, back within the normal range a DB can bear.
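A sketch of the read path in this combined scheme, assuming every shard holds a full post_rep copy and a hypothetical db_query helper that runs SQL on one shard:

```python
def db_query(shard: int, sql: str, params: tuple) -> list:
    # Placeholder for executing SQL against one DB shard.
    raise NotImplementedError


def refresh_timeline(user_shard: int, followee_ids: set,
                     since_ts: str, until_ts: str) -> list:
    """One query against the user's own shard: scan post_rep for the time
    window, then keep only rows whose posterId is in the followee set."""
    rows = db_query(
        user_shard,
        "select posterId, postId, postTime from post_rep "
        "where postTime between %s and %s",
        (since_ts, until_ts),
    )
    return [row for row in rows if row[0] in followee_ids]
```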

Assume each user posts 3 times a day on average, all concentrated in 8 daytime hours, and the system has 100 million users in total: roughly 100,000 new posts arrive every 10 seconds. With 500 followees per user, a user can expect a new post in their timeline about every 20 seconds. If each timeline refresh queries the last 10 seconds, then in the combined push-pull scheme each refresh scans about 100,000 records but on average selects only 0.5 relevant new posts, which poses a certain performance risk.

Incremental queries in the combined push-pull scheme

For each user, the result of the last timeline query can be stored together with a "last query time" marker, so that a refresh only needs to query records newer than the last query time.

The timeline query results can be stored in a cache, called the timeline recent-query cache. The cache may contain postIds of posts that have since been deleted, but since it stores only postIds, those posts are filtered out when the post contents are fetched by postId, so the query result is unaffected. The recent-query cache is given an expiration time to keep its capacity controllable.
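A sketch of that recent-query cache, assuming a redis-py client and a hypothetical fetch_since(user_id, since_ts, until_ts) helper that pulls only the new postIds (for example via the refresh query sketched earlier):

```python
import json
import time

import redis

r = redis.Redis()

CACHE_TTL = 3600   # assumed expiry, keeps the cache capacity controllable


def incremental_timeline(user_id: str, fetch_since) -> list:
    """Return cached postIds plus only the records newer than the last query."""
    key = f"timeline:recent:{user_id}"
    cached = r.get(key)
    state = json.loads(cached) if cached else {"last_ts": 0, "post_ids": []}

    now = int(time.time())
    new_ids = fetch_since(user_id, state["last_ts"], now)   # pull only the delta
    state = {"last_ts": now, "post_ids": (new_ids + state["post_ids"])[:500]}
    r.set(key, json.dumps(state), ex=CACHE_TTL)
    # Deleted posts are filtered out later, when contents are fetched by postId.
    return state["post_ids"]
```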

Adding a cache

Although the combined push-pull scheme above already does a great deal of optimization, the DB access frequency per second is still more than the DB can bear, so a cache is needed here as well.

Multiple “delta lists” can be stored in the in-memory cache:

  • Each list element is a user ID
  • A new list is created every 10 seconds, containing the IDs of all users who posted during those 10 seconds
  • The user IDs within a list are organized in a B+ tree

As a cache, the incremental-query part does not need to retain much data: since each user keeps the result of their last query by default, we can assume incremental queries never reach far back, say no more than one hour. At 100,000 posts every 10 seconds there are at most 100,000 distinct user IDs per list; at roughly 32-50 bytes per ID, caching one hour of data takes about 1.6GB.
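A sketch of those delta lists using an in-process dict of sets (the original text keeps each list in a B+ tree; a plain set is used here for brevity, and the bucket length and retention follow the figures above):

```python
import time
from collections import defaultdict
from typing import Optional

BUCKET_SECONDS = 10
RETENTION_BUCKETS = 360            # roughly one hour of 10-second buckets

delta_lists = defaultdict(set)     # bucket index -> ids of users who posted


def record_post(poster_id: str, ts: Optional[int] = None) -> None:
    bucket = (ts or int(time.time())) // BUCKET_SECONDS
    delta_lists[bucket].add(poster_id)
    # Drop buckets older than the retention window.
    for old in [b for b in delta_lists if b < bucket - RETENTION_BUCKETS]:
        del delta_lists[old]


def followees_who_posted(followee_ids: set, since_ts: int) -> set:
    """Which of my followees posted since my last refresh? Intersect the recent
    buckets with my followee set instead of scanning 100,000 rows in the DB."""
    since_bucket = since_ts // BUCKET_SECONDS
    posted = set()
    for bucket, users in delta_lists.items():
        if bucket >= since_bucket:
            posted |= users & followee_ids
    return posted
```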

Social Feed caching architecture design and implementation

For small and medium Feed systems, feed data can be distributed in synchronous push mode: every time a user publishes a feed, the back end pushes it in full to the user's fan list, and fans read the latest feeds from their own inboxes.

But a large Feed system like Weibo works in a pull+push hybrid mode. A new feed is first written to a message queue and updated asynchronously by queue processors; instead of being pushed to every fan's inbox, it is stored in the publisher's own outbox. On the pull side, users fetch their followees' outboxes in real time. For performance, some personalized data is still pushed into the target users' inboxes first. When a user visits, the system aggregates their inbox with the outboxes of everyone they follow to produce the feed list.

A feed invocation process needs to do the following:

  1. Get the follow list based on the user's UID
  2. Get the latest Weibo ID list of each followee according to the follow list
  3. Get the list of Weibo IDs received in the user's own inbox
  4. Merge, sort, and paginate these ID lists to get the list of Weibo IDs to display
  5. Get the corresponding Weibo content for these IDs
  6. For reposted feeds, further retrieve the content of the source feed
  7. Get the filter criteria set by the user
  8. Get each feed author's user information for assembly
  9. Check whether the requester has favorited, liked, etc. each feed, for assembly
  10. Get the repost, comment, like, and other counts of these feeds for assembly
  11. Once assembled, return the result to the requester in a standard format

A feed request has to fetch and assemble this much back-end resource data while keeping the interface latency under 100ms for an acceptable user experience, so a Feed system must use caching heavily and its cache architecture must be well designed. The cache plays a central role in a Feed system; it is fair to say that cache design determines how good a Feed system is.

The cache design of a typical Feed system is mainly divided into six parts: INBOX, OUTBOX, SOCIAL GRAPH, CONTENT, EXISTENCE and COUNTER.

The Feed ids are stored in the INBOX Cache and OUTBOX Cache in the form of a vector, as shown in the figure below:

The INBOX cache layer stores feed IDs that cannot be aggregated efficiently. When a user publishes a feed that is shown only to specific fans or to member groups, the feed system first computes the list of target users and pushes the feed ID into each of those fans' inboxes. The inbox key is therefore built from the visitor's UID, and updates read the local copy first and then CAS the change to the remote cache.

The OUTBOX cache layer directly caches the feed IDs of ordinary feeds published by a user, and the whole cache is keyed by the publisher's UID. The outbox is divided into a vector cache and an archive data cache. The vector cache holds the most recently published feed IDs and comment IDs, pooled by service type; if the user has not published recently and the vector cache is empty, the feed IDs are fetched from the archive data cache.

The SOCIAL GRAPH cache layer holds users' follow relationships and user information. The follow relationships mainly include each user's follow list, fan list, bidirectional (mutual-follow) list, and so on.

The CONTENT cache layer mainly holds the content of hot feeds and the content of all feeds. A hot feed is the source feed of a trending event. Because hot feeds are accessed far more often than ordinary ones, for example the QPS of a single hot feed on Weibo may reach hundreds of thousands, they need to be cached independently and in multiple copies to improve cache access performance.

The EXISTENCE cache layer caches various existence checks, such as whether a feed has been liked or read.

The COUNTER cache is used to cache various counts. There are many counts in the Feed system, such as the number of users’ Feed posts, the number of followers, the number of single Feed comments, the number of likes, the number of readings, the number of topics related counts, etc.

Cache design for simple data types

Most data in a typical system is simple KV data. These simple types need only get and set, with no special counting operations, so Memcached is the best caching component for them. When choosing a middleware we need to do capacity planning: analyze the business data size, the number of cached items, and the read/write peaks to determine the cache capacity, node count, deployment mode, and so on.

An evaluation sheet covers: business name, purpose, unit (e.g., per user), average size, quantity, peak reads (QPS), peak writes (QPS), hit ratio, expiration time, and average load time on a cache miss. To guarantee availability, and to prevent large numbers of requests from penetrating to the DB layer (which puts heavy pressure on the DB and in extreme cases causes an avalanche), a Main-HA two-layer architecture can be introduced.

If the Main layer misses, the request goes on to the HA layer; the result is returned and also written back to the Main layer, so subsequent accesses to the same key hit the Main layer directly. Because the hot data retained in the Main and HA layers differs somewhat, the service achieves a higher overall hit ratio; and even if some Main nodes become unavailable, the HA layer keeps up the cache hit ratio and availability.

For Weibo, where super-hot events occur regularly, another layer, L1, can be added. An L1 cache is a small-capacity cache: each L1 group's capacity is one third to one sixth of the Main layer's. On access, the client picks a random L1 group; on a miss it falls through in the order Main -> HA. Because L1 holds far less than Main, slightly-cold data is evicted quickly, so L1 keeps only the hottest data; and because there are multiple L1 groups, heavy hot-data traffic is spread evenly across them.
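A sketch of the resulting read path, assuming cache clients that expose get(key) and set(key, value, ttl); the client construction and the loader are omitted as assumptions:

```python
import random


def layered_get(key, l1_groups, main, ha, loader, ttl=300):
    """Read through a random L1 group, then Main, then HA, then the loader
    (e.g. the DB), writing the value back into the layers that missed."""
    l1 = random.choice(l1_groups)        # spread hot keys across the L1 groups
    value = l1.get(key)
    if value is not None:
        return value
    value = main.get(key)
    if value is None:
        value = ha.get(key)
        if value is None:
            value = loader(key)          # last resort: load from the DB
            ha.set(key, value, ttl)
        main.set(key, value, ttl)        # write back to the Main layer
    l1.set(key, value, ttl)
    return value
```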

Cache design for collection data types

For collection data that needs computation, such as the follow list, groups, and latest-fans list in the Feed system's user relationships, we need paging and set operations to support features like "people we both follow" and "which of the people I follow also follow TA". Redis suits this well: it provides rich collection structures such as list, set, zset, and hash.
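For instance, "people we both follow" and "which of the people I follow also follow TA" map directly onto Redis set intersection; a sketch with redis-py, key names assumed:

```python
import redis

r = redis.Redis()


def common_follows(uid_a: str, uid_b: str) -> set:
    # Users followed by both A and B.
    return r.sinter(f"following:{uid_a}", f"following:{uid_b}")


def my_follows_who_follow(uid_me: str, uid_target: str) -> set:
    # Of the people I follow, who also follows the target user?
    return r.sinter(f"following:{uid_me}", f"followers:{uid_target}")
```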

Redis's architecture is mature and can be deployed as a highly available cluster. However, reading a large collection in full, for example fetching data with HGETALL, may return hundreds or thousands of elements in a single request; anyone who has used HGETALL knows it is a trap that can easily stall or bring down a node. So for fetch-all access we can put a Memcached layer in front of Redis, or store a separate full copy in Redis under a key suffixed with -all; full queries then go to Memcached or the -all key. This improves read performance and can also reduce the number of Redis slaves needed.

Cache design for counting and existence-check data types

Some business data in a Feed system fits neither the simple data types nor the collection types, for example determining whether a user has liked or read a certain feed; caching that directly consumes a great deal of memory. There is also the pure counting service: a count is an 8-byte key plus a 4-byte value, yet if it is stored in Redis, with n new records added every day the daily cost can reach hundreds of gigabytes, which is very expensive. Weibo therefore extended Redis with the CounterService and Phantom components.

The CounterService component

CounterService's main application scenario is counters, such as the repost, comment, and like counts on Weibo. Storing one count as a plain Redis KV takes at least about 65 bytes, while the key and value themselves add up to only 12 bytes; the other 40+ bytes are overhead. A single post usually carries several counts (reposts, comments, likes), and an enormous number of posts is added every day.

CounterService's storage puts memory on top and SSD underneath. Memory is pre-divided into N tables, each covering a range of Weibo IDs in increasing order. For any Weibo ID, first locate its table; if the ID is present, increment or decrement it in place; if not, insert a new key. When new Weibo IDs keep arriving and memory fills up, the table with the smallest IDs is dumped to SSD, freeing space on top for new IDs. What if a post is so hot that its repost, comment, or like count overflows the 4-byte value? Such outliers are stored in an auxiliary dict (Aux Dict). A table on SSD is accessed through a dedicated index and replicated in RDB style. Along the time dimension, recent hot data stays in memory and older cold data sits on disk, which reduces machine cost; if cold data becomes hot again, it is loaded into an LRU cache, and asynchronous I/O keeps the loading from affecting overall service performance.

The Phantom component

There is also the "existence check" scenario: whether a user has liked a certain post, whether a user has read a certain post, and so on. Its defining feature is that each record is tiny (the value can be a single bit) but the total volume is huge. Roughly 100 million new posts are published every day, so the "has this user read this post" data can reach hundreds of billions of records; how to store it is a major challenge. Worse, for most post/user pairs the answer is 0. Storing the zeros would mean saving hundreds of billions of records a day; not storing them would let huge numbers of requests pass through the cache layer to the DB, and no DB can withstand that much traffic.

Phantom uses the same table-based storage scheme as CounterService. The difference is that where each CounterService table stores KVs, each Phantom table is a complete Bloom filter storing the keys of a certain ID range. All tables form a list ordered by key range; when every table is full, the table with the smallest keys is cleared and reused for the newest keys, so the smallest table rolls around to become the largest.

When a read/write request for a key arrives, the system determines which table the key belongs to from its ID range, then uses that table's Bloom filter to decide whether the key exists.
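A minimal Bloom filter sketch to show the idea behind Phantom's tables; the sizes and hash scheme are assumptions, not Weibo's actual implementation:

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 8 * 1024 * 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Each Phantom table would hold one such filter for its key range, so the overwhelmingly common "not present" case is answered from memory without touching the DB.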

Count service extension

db + cache: some systems simply keep the counts in a cache with a long expiration and clean them up periodically, like the likes in QQ Zone (Qzone). You may notice that when you look back after a long time, the likes on your old posts are gone; only newer posts and their likes remain.

That approach is not friendly to services with high concurrency and large data volumes, which is where the CounterService described above comes in; but it is Weibo's internal component, not open source. In that case, we can make the following small optimizations ourselves instead of relying on stock Redis:

struct item {
    int64_t id;
    int star_num;      // like
    int repost_num;    // forward
    int comment_num;   // comment
};

Store numbers rather than strings, and there are no extra pointers to store.

A large chunk of memory is allocated when the program starts and used as a hash table. On insert:

h1 = hash1(item.id);
h2 = hash2(item.id);

On query, compute the slots the same way and check whether the stored item.id matches the queried id; a mismatch means the count is 0.

On delete, find the slot and set a special tombstone flag; the next insert can reuse that slot's memory.

In practice, repost and comment counts are rarely very large; hundreds of thousands is already a lot, and counts above 100,000 (for example, Lu Han's announcement of his relationship) are rare. So a further optimization is possible:

struct item {
    int64_t id;
    int star_num;                 // like
    unsigned short repost_num;    // forward
    unsigned short comment_num;   // comment
};

For posts whose repost or comment count exceeds 65535, we set a special flag and look the real count up in a separate dict.

This saves a lot of memory.

The above is just one more optimization point. Of course, not every company has this kind of volume, and not every company needs to modify middleware to solve the problem; treat it as food for thought, and remember that all of this can be optimized further.