This is why’s 89th original article
A reader sent me a picture the other day.
I ask: hair what kidney matter?
So this is the conversation:
He posted a screenshot of the wechat step list:
In fact, this is a common interview scenario: how to design a leaderboard?
This question, in fact, is to test the breadth of your interview preparation range, see will answer, have not seen… It’s hard to say.
Of course, if you’ve ever done leaderboards in a real business, don’t laugh. The interviewer will give you time to think.
So you do not open mouth, you just need to frown a little, say to the interviewer: I think about this.
And then organize your language a little bit, just say it.
This article, with you to analyze the “leaderboard” this scene problem, in the end how to do.
Database-based
This problem, if it is really not met before, may be the most easy to enter your field of vision is usually the most contact with the database.
Because when you think “leaderboard,” you think “Order by.”
When you think of Order by, you think of databases.
The thought of a database…
Brother, your path is narrow.
Although I have done leaderboards based on MySQL, it was a temporary service built for a tournament and Redis was never introduced. I assessed how long it took to set up Redis and how long it took to develop directly with MySQL.
So I chose MySQL.
The fundamental reason why I chose MySQL was that I already knew that there were only 10 teams in the final, which meant that my leaderboard had only 10 pieces of data from the beginning to the end.
After the player submits the code, the system calculates the score in real time, and then updates the leaderboard.
The interface then returns the following data to the front page, which is in a table:
- Teams are ranked by all-time high score
- The name of the team
- All-time high score
- The most recent submission score
- Last submission time
The front end calls my interface every one minute, the same score, the same place, so I use a more complex SQL in the interface to query the database, the above fields have all.
You see, leaderboards do work with MySQL.
Do not have to go to Redis, remember a word: out of business scenario design, are playing rogue.
But as with “everything is an object,” don’t say it to the interviewer. It’s not what the interviewer wants to hear.
Or, that’s just part of what you want to hear.
The reason why this answer works is that I gave a specific scenario, the number of users is very small, how to play can be.
We don’t even have to use MySQL to sort the data out, but we can sort it in memory.
But if it’s a leaderboard, and the number of players increases to tens of millions of users, it’s not going to work.
Of course, you may tell me about the slow query on the index, data volume on the database on the table scheme.
Well, that’s not wrong.
But once the data volume is large, this is not a particularly good solution.
This problem has to be tackled from the root.
Based on the Redis
This scenario tests your knowledge of the Redis sorted set data structure.
Sorted set is a sorted set.
In Redis it looks something like this:
Sport :ranking:20210227 is the key in Redis.
Value is a set, and you can see that the set is ordered. Each member in the collection has a Score and is sorted in descending order by that score.
It is important to note that the score/member in the picture is not just written by me. It is defined as follows on the official website:
https://redis.io/commands/zadd#sorted-sets-101
“Score/member pairs.”
So when I draw, SCORE comes first, member comes second. This is not a random drawing, although it doesn’t seem to matter who’s in front or who’s behind.
Another point to note is that although it is not shown in my diagram, in an ordered set, the element (member) cannot be repeated, but score can be repeated.
This is easy to understand, like the number of wechat steps on the day of 20210227. I can take 6666 steps, and you can also take 6666 steps. This is no conflict:
However, the problem arises: when the member score is the same, how to sort the member?
Take a look at the answers from the website:
When multiple elements have the same score, they are lexicographically ordered.
“Lexicically” is not an image.
Don’t panic, you know why brother also part-time English teaching:
When the scores are the same, they are sorted in lexicographical order, so in the diagram above, jay comes before why.
Next, look at the operations on ordered sets. There are 32 of them:
I do not do API teaching here one by one, the official website has been written very clear, if you are not familiar with the command, you can go to the official website to view, are sample code.
https://redis.io/commands/zadd#sorted-sets-101
Like this ZADD method:
In order to share the following smoothly, I will only talk about a few operations that need to be used:
- Zadd key score member [Score member…]
- The score command format is zincrby key Increment Member
- Command format: zrank/zrevrank key member
- Zrange /zrevrange key start end [withscores] Zrange /zrevrange key start end [withscores]
Let’s start with the first one: add member.
For example, to add data from a diagram to an ordered set, the syntax looks like this:
- zadd key score member [score member …]
You can add one or more pairs of score-members at a time, such as these two commands:
- zadd sport:ranking:20210227 10026 why
- zadd sport:ranking:20210227 10158 mx 30169 les 48858 skr 66079 jay
After execution, the number returned represents the number of members that were added successfully.
I used the Redis RDM visualization tool to view the inserted data, which is similar to the diagram I drew myself:
Move on to the second: increase the member score
The data of the wechat sports ranking is updated in real time.
At present, the number of steps of member for why is 10268. Suppose I go out for a run after dinner and run another 5000 steps.
To update my step count, use the zincrby command. The syntax looks like this:
- zincrby key increment member
The execution command for the above scenario is as follows:
- zincrby sport:ranking:20210227 5000 why
When the execution is complete, it returns the number of steps for why, and you can see that the number of steps changed from 10026 to 15026:
At the same time, as the number of my steps increased, the order was reversed according to the score, which also led to the change of the order:
So we only need to update the score. As for the change of the ranking, Redis will help ensure it.
Then look at the third command: get the member ranking
The syntax looks like this:
- Get member rank: zrank key member
- Get member ranking: ZRevRank Key member
First of all, the rankings start at 0.
Zrank returns member rank from lowest to highest.
Zrevrank returns member ranking in order of high to low score.
For example, to get Jay’s rank, use Zrank to return 4.
- zrank sport:ranking:20210227 jay
When using ZrevRank, Jay’s rank is 0:
- zrevrank sport:ranking:20210227 jay
Therefore, in the demand of wechat step rankings, the more steps, the higher the ranking, we should use ZrevRank.
The fourth command to master is to return members within the specified ranking range.
- Zrange/zRevrange key start end [Withscores] Returns members within the specified ranking range
This command is critical.
Zrange returns members in the specified ranking range from lowest to highest.
Zrevrange returns members in the specified ranking range from high to low by score.
Here, I’ll only demonstrate the command for Zrevrange.
For example, IF I want to get the top three members in step count:
- zrevrange sport:ranking:20210227 0 2
This command takes an optional argument: withscores
When this parameter is used, the score of the corresponding member is returned:
You think, isn’t this the top N scene?
Let’s say I want to get the ranking of all my users. How do I write that?
As follows:
- zrevrange sport:ranking:20210227 0 -1
This is the current wechat step list, Jay has the most steps, MX has the least.
Yi, how to return a responsibility, leaderboard came out for a long time?
If you think about it, after talking about a few API operations, it looks like the function is implemented.
Yes, that’s true, even if we only need these two apis to complete the leaderboard:
- zadd key score member [score member …] Add member
- Zrange/zRevrange key start end [Withscores] Returns members within the specified ranking range
All right, if you like it, thank you for three clicks. So much for this article…
That’s impossible.
Boring API articles are so boring.
Although in the previous part we can already based on Redis ordered collection plus a few simple commands, can achieve the requirements of the ranking.
But this is just the beginning.
Look at the leaderboards again
There is a problem with the wechat step list above, have you noticed?
In the case of the above scenario, everyone looks at it in this order:
The truth is that everyone sees the ranking data from their wechat friends, and wechat friends are different, so the rankings are different.
That’s a feature that we don’t have.
Our scenario above is more like a game leaderboard, where the full server leaderboard is the same for everyone.
So how do we ensure that we all see differently?
You think about, from what Angle to solve this problem?
If the key of the ordered collection is different, different sets of values are obtained.
Our current key is sport:ranking:20210227, which only has information for one day.
As long as we add the user’s attribute in the key, it is ok. Suppose my wechat id is why.
So the key can be designed for this sport: ranking: according to: 20210227.
Thus, each person’s key is different because of the user information in the key, like this:
The corresponding command is as follows:
- zadd sport:ranking:why:20210227 10026 why 10158 mx 30169 les 48858 skr 66079 jay
- Zadd sport: ranking: mx: 20210227 7688 9688 liu zhao four to 10026, according to 10158 mx 54367 feet
Why and MX were both looking at their friends’ rankings of wechat moves on a given day.
Once the key is designed, the problem is solved.
But when you think about it, does it really work?
I may have been blinded by the lard when I was writing the first edition.
There is a kind of “only edge body in this mountain” taste, thinking of Redis.
If every user has their own leaderboard in Redis, then every time a user’s score is updated, they have to update all their friends’ zset. What a cost, right?
When the latitude of the user to do leaderboards, there will be a huge number of leaderboards, resulting in higher maintenance costs.
Redis can do it, but it’s not the best solution.
So what’s the plan for doing that?
Here’s an idea:
Every user sees a different leaderboard, and we don’t have to maintain it all the time.
Good maintenance, the user is not necessarily to see, thankless pace.
So it’s better to wait until the user requests it.
When the user requests to see the leaderboard, then according to the user’s friend relationship, the loop to obtain the number of steps of friends, to generate the leaderboard.
Specific plan, everybody thinks by himself.
In addition to say a mouth, some time ago is not the micro channel to support the modification of micro signal, won a large cheers.
I really thought about how difficult it would be to implement this requirement technically.
I don’t know if there’s any historical technical debt in there.
But say the current scene, the key contains the wechat signal, note is wechat signal, not wechat nickname.
Because at the beginning of the design, the product guarantee said: rest assured, the micro signal is absolutely global and unique, once determined, can not be changed.
The result? Now it’s going to change.
Product pidianpidian said: how to achieve I don’t care, this demand users call very big, quickly online.
What do you think is the impact of these scenes?
In fact, the impact is not particularly large, a field change.
But, wechat has 1.4 billion users.
A simple requirement, when it comes to this volume, is the following:
Quantitative change causes qualitative change.
All right, all right, let’s get out of here. Said to come back.
When I put my eyes on the wechat leaderboard again, I found that, in fact, I just gave a castration version of the leaderboard.
Yes, we can now get why’s current number of steps is 1680, and the current ranking is 814.
For example, let’s use the above example. Let’s say I want to get the wechat move list of my friend Jay.
Get Jay’s ranking first:
- zrevrank sport:ranking:why:20210227 jay
The place is 0, and the program can add one to it. First place.
Then get the number of jay’s steps today:
- zscore sport:ranking:why:20210227 jay
66079, and you have your steps.
Now we know: Why’s friend Jay took 66,079 steps today, ranking first among Why’s wechat friends.
But if you look closely, I left out two fields:
- WeChat avatar
- Number of likes from friends
Where should I put the two fields?
Of course you can put it in the database, but let’s focus on the Redis solution.
In this case, we want to store the User object, which contains the following fields: nickname, profile picture link, likes, and steps.
What data structure is this stored in Redis?
You have to use the Hash structure.
Hash structures also involve keys and values, so what are they?
A key is the key of our ordered set followed by a friend’s nickname, like this:
The corresponding command looks like this:
- hmset sport:ranking:why:20210227:jay nickName jay headPhoto xxx likeNum 520 walkNum 66079
After execution, the RDM looks like this:
When there are more likes, you need to call the update command to update likeNum:
- hincrby sport:ranking:why:20210227:jay likeNum 500
When the execution is complete, the number of likes will be 1020:
In this way, all the fields on the leaderboard we can get, the wechat leaderboard is over.
Er…
How does it feel like API teaching?
Damn it. Get something else.
What about the last seven days?
So far we’ve talked about daily leaderboards.
What if the interviewer asks for a list of the last seven days, last week, last month, last quarter, last year, etc.?
This is a test of how well you know the Redis Ordered Collections API.
Which is this API:
- zinterstore/zunionstore destination numkeys key [key …] [weights weight [weight…]] [aggregate sum | min | Max] both/and were obtained
This API looks a little complicated, so don’t be afraid to go through it one by one:
- Zinterstore/ZunionStore is actually intersection/union
- Destination saves the result of the intersection/union to this key
- Numkeys Number of sets that need to be intersected/union
- key [key …] The set that specifically participates in the intersection/union
- weights weight [weight …] The weight of each set involved in the calculation. When doing intersection/union calculations, each member of the set will multiply its score by this weight, which defaults to 1.
- The aggregate sum | min | Max is the sum for each set of the same elements (sum), min (minimum) or Max (Max), the default value is the sum.
Take the last seven days, for example. Let’s just grab some data and you can just stick it in:
- zadd sport:ranking:why:20210222 43243 why 2341 mx 8764 les 42321 skr
- zadd sport:ranking:why:20210223 57632 why 24354 mx 4231 les 43512 skr 5341 jay
- zadd sport:ranking:why:20210224 10026 why 12344 mx 54312 les 34531 skr 43512 jay
- zadd sport:ranking:why:20210225 54312 why 32451 mx 23412 les 21341 skr 56321 jay
- zadd sport:ranking:why:20210226 3212 why 63421 mx 53652 les 45621 skr 5723 jay
- zadd sport:ranking:why:20210227 5462 why 10158 mx 30169 les 48858 skr 66079 jay
- zadd sport:ranking:why:20210228 43553 why 4451 mx 7431 les 9563 skr 8232 jay
You can see that we have 7 days of data:
And it should be noted that there is no jay data on 20210222.
Now we need to figure out the leaderboard of the last 7 days, using the following command. The command is a bit complicated, but it is clear when looking at the command format:
- zunionstore sport:ranking:why:last_seven_day 7 sport:ranking:why:20210222 sport:ranking:why:20210223 sport:ranking:why:20210224 sport:ranking:why:20210225 sport:ranking:why:20210226 sport:ranking:why:20210227 sport:ranking:why:20210228 weights 1 1 1 1 1 1 1 aggregate sum
You don’t have to write weights and aggregate, you have default values, but I’ve written them out so I don’t hide the data.
After the execution is complete, you can see that there is an extra key, which contains the data summary of the last 7 days:
The union is used above. If our requirement is to sort the people who have uploaded their exercise data every day in the last 7 days, we will use the intersection.
Change zUnionStore to ZinterStore.
In addition, for comparison, change the queue name after the merge as follows:
- zinterstore sport:ranking:why:last_seven_day_zinterstore 7 sport:ranking:why:20210222 sport:ranking:why:20210223 sport:ranking:why:20210224 sport:ranking:why:20210225 sport:ranking:why:20210226 sport:ranking:why:20210227 sport:ranking:why:20210228 weights 1 1 1 1 1 1 1 aggregate sum
It can be seen from the execution result that jay did not upload the movement data on the day of 20210222, so he was not available when the intersection was taken:
You know what we did for the last 7 days, we have data for every day, last week, last month, last quarter, this year’s ranking is not the same routine?
Er…
How does it feel like API teaching?
It’s not working. Let’s try something else.
100 million user rankings
King of glory, no problem 100 million users. For example, I want to see where I am among the 100 million users, so I open the game, and after 20 minutes (one game) I finally find my position on the leaderboard.
Results, not on the list:
Of course, I’m not on the list.
Even if there is a ranking, the ranking is tens of millions, 8 numbers, in the page is not easy to put ah.
But suppose the current demand is to query the user’s full service ranking, how to check?
I made up my mind about the first version of the redis-based solution I could think of, but it’s a wild card, it’s a very complicated solution.
What do I think?
I thought, the general interview encounter what tens of millions of data, a few G files, billions of data what, the first thought of the plan is divide and conquer.
The demand for this billion level user leaderboard must also be divided and conquer.
There are 8 stages of the king:
- 1. Stubborn bronze
- Order silver
- Glory gold
- 4, noble platinum
- 5. Eternal diamonds
- The supreme star shines
- The best king
- Honor the king
So we could have 8 buckets.
This bucket can be 8 different Redis keys, or even 8 Redis keys, depending on how much money the interviewer gives you, you can make more.
As shown in the figure below:
Explain where the score of 8588 came from in the picture above.
First of all, we use Redis’ ordered collection, so we have to give each member a score.
So, each user has an integral in the bucket that has been calculated by the formula.
For example, why brother’s current grade is starshine, assuming the calculated score is 8588.
Now it’s easy to figure out why brother is ranked in full server:
When I write the program, I can know my current segment is starshine, so go directly to starshine’s bucket, use ZrevRank to calculate the current bucket rank, let’s say n.
And then through zcard this O(1) command to get, in front of the bucket, which is the greatest king and king of glory of the set size of the two buckets, y and x respectively.
So why brother’s full service ranking is n+y+ X.
So get any user’s full service ranking, is to see him in his bucket inside the ranking plus the number of elements in the previous bucket can be.
And now it’s easy to count the top 100.
Just take the first bucket, the first 100 of the Kings of Glory.
Get things done.
Wait, did you really get it done?
The idea is right, but 8 barrels is too little for 100 million users, right?
Well, let’s go ahead and divide the buckets. Don’t forget, there are segments in each segment.
For example, star flare, there are star flare five to five small segments, bronze three to bronze one or three small segments.
That’s 27 barrels.
However, 27 barrels is not enough.
So it takes five more stars to get from star two to star one, and three stars to bronze two.
So that’s 160 barrels.
160 barrels is not enough?
Er…
Let’s go back to the drawing board, and convert the sections directly into integrals, plus various other conditions, and then divide them according to the integral:
This way, you can split the fractions as much as you want, as little as you want.
Perfect.
Wait, is that perfect?
If you look at my range, it’s pretty evenly divided.
According to the segment split, some vegetables chicken players, played two feel boring, swearing out of the game, has been left in the bronze segment.
So the bronzes must be bigger than the Kings of glory.
So, in fact, the user’s landing point is not uniform.
How to do?
At this time, it is necessary to carry out data analysis, through a series of high number, probability, discrete and other knowledge to do a bucket size estimation.
Oh, that’s a pretty neat thing.
Well, then, we’re done.
Non-technical considerations
Making a leaderboard seems like an easy thing to do.
But that’s not the case, especially for recommended leaderboards, which need to avoid the Matthew effect:
For example, the author recommended list, the author who is recommended to the top, the exposure is very high. Even if the output quality drops, it’s easy to get more attention.
The authors at the bottom of the list are very disengaged.
And that’s where the polarization comes in and the Matthew effect comes in.
So what do we do in this case?
There is a complex formula involved, such as the mining community’s mining power, for news flow recommendations and author lists:
https://juejin.cn/book/6844733795329900551/section/6844733795380232206
So don’t make the mistake of thinking that leaderboards are a very simple requirement, which involves some very complex algorithms.
One last word
Thank you for reading.
If you find something wrong, you can bring it up in the background and I will modify it.
This article is participating in the “Nuggets 2021 Spring Recruiting activity”, click to see the details of the activity