This is a blog post by antirez, the creator of Redis.
Original address: antirez.com/news/128
Redis 5 introduced a new data structure called “Streams”. It attracted a lot of attention from the community, so over time I plan to survey users, discuss its use cases, and document the results here on my blog. Today I want to address another issue: I suspect that many users think of Streams as having the same use cases as Kafka. This data structure was indeed also designed for producing and consuming messages, but to think Redis Streams are only good for that would be reductive. Streaming is a great pattern and “mental model” that can help us design systems well, but Redis Streams, like most Redis data structures, is a more general structure and can be used to model many different problems. So in this blog post we will focus on Redis Streams as a pure data structure, completely ignoring its blocking operations, consumer groups, and all the messaging-related parts.
Streams are a CSV file on steroids
If you want to record a series of structured data items and you have decided that a database would be overkill, you might say: let’s open a file in append-only mode and write each item as a line of CSV:
time=1553096724033,cpu_temp=23.4,load=2.3
time=1553096725029,cpu_temp=23.2,load=2.1
It looks simple, and people have been doing it for years and still do: it is a solid pattern if you know what you are doing. But what is the in-memory equivalent? Memory is far more powerful than an append-only file, and the following limitations of the CSV file are automatically removed:
- Range queries are difficult and inefficient.
- There is too much redundant information: the time is almost the same in every entry, and the field names are repeated. Yet removing the fields would make the format less flexible if I later want to add different fields.
- An entry’s only identifier is its byte offset in the file, and offsets become invalid if the file structure changes. So there is no true primary ID that uniquely identifies an entry.
- Entries cannot be deleted, only marked as invalid, and without rewriting the log there is no garbage collection. Log rewriting often goes wrong for various reasons, so it is best avoided.
However, such a log of CSV entries also has advantages: there is no fixed format, the fields can change, it is trivial to generate, and it is stored quite compactly. The idea behind Redis Streams was to keep the good parts and remove the limitations. The result is a hybrid data structure, much like Redis Sorted Sets: they feel like a fundamental data structure, but to achieve that effect they use multiple internal representations.
Streams 101
Redis Streams are represented as delta-compressed macro nodes linked together by a radix tree. This makes it possible to seek to a random entry very quickly, to fetch ranges, and to delete old entries so as to create a capped stream (a sketch of capping follows the example below). Yet for the programmer the interface is very similar to a CSV file:
> XADD mystream * cpu-temp 23.4 load 2.3
"1553097561402-0"
> XADD mystream * cpu-temp 23.2 load 2.1
"1553097568315-0"
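Since capped streams were just mentioned: XADD accepts a MAXLEN option that trims the oldest entries while appending, so the stream never grows past a bound. A minimal sketch, with illustrative values:
> XADD mystream MAXLEN ~ 1000000 * cpu-temp 23.9 load 2.6
The ~ modifier asks for approximate trimming: Redis only removes whole macro nodes, which is far more efficient than trimming to an exact count.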
As you can see from this example, the XADD command auto-generates and returns the entry ID. It is monotonically increasing and consists of two parts: <time>-<counter>. The time is in milliseconds, and the counter increments for entries generated within the same millisecond.
So the first abstraction on top of the “append-only CSV file” idea is that if you use an asterisk as the ID argument of XADD, the server generates an entry ID for you. This ID not only uniquely identifies the entry, it also encodes when the entry was added to the stream. With the XRANGE command you can fetch single items or ranges of items:
> XRANGE mystream 1553097561402-0 1553097561402-0
1) 1) "1553097561402-0"
2) 1) "cpu-temp"
2) "23.4"
3) "load"
4) "2.3"
In this example, to fetch a single item, I used the same ID as the start and end of the range. However, I can use any range and limit the number of results with the COUNT argument. I can also use plain millisecond timestamps as the start and end arguments to get the items within a given period of time:
> XRANGE mystream 1553097560000 1553097570000
1) 1) "1553097561402-0"
2) 1) "cpu-temp"
2) "23.4"
3) "load"
4) "2.3"
2) 1) "1553097568315-0"
2) 1) "cpu-temp"
2) "23.2"
3) "load"
4) "2.1"
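As an aside, here is what the COUNT argument just mentioned looks like in practice. The special IDs - and + stand for the smallest and largest possible IDs, so the following asks for the first two entries of the stream (a sketch; the output has the same shape as above):
> XRANGE mystream - + COUNT 2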
For space reasons we won’t show more of the Streams API here; the Redis documentation covers it for those who are interested. For now let’s just focus on the basics: XADD to add data, XRANGE (but also XREAD) to read it back. Let’s see why I claim Streams are such a powerful data structure.
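Since XREAD was mentioned: in its non-blocking form it reads one or more streams starting from a given ID. A minimal sketch, using 0 to mean “from the very beginning” (the stream name and COUNT value are illustrative):
> XREAD COUNT 2 STREAMS mystream 0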
Tennis players
The other day I was modeling an application with a friend of mine who is currently learning Redis: an app for tracking local tennis courts, players, and matches. The way to model a player is quite obvious: a player is a small object, so in Redis a single Hash with a key like player:<id> is all you need. As you model the application further, you realize you need a way to track matches played at a given tennis club. If player 1 and player 2 play a match and player 1 wins, we can record it like this:
> XADD club:1234.matches * player-a 1 player-b 2 winner 1
"1553254144387-0"
With this single, simple operation we obtain:
- A unique identifier for the match: its ID in the stream
- No need to create a separate object to represent the match
- Free range queries, to paginate through matches or to check which matches were played at a given moment in time (a sketch of pagination follows this list)
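Pagination is just XRANGE plus COUNT. A minimal sketch using the key from the example above; the assumption is that each next page starts from the last ID returned, with its sequence number incremented by one so the previous entry is not repeated:
> XRANGE club:1234.matches - + COUNT 10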
Before Streams existed, we would have had to create a Sorted Set scored by time, whose elements are the IDs of the matches, each stored in a separate Hash. This is not only more work, it also wastes far more memory than you might guess (see below).
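For comparison, the pre-Streams pattern would look roughly like the following sketch: one Hash per match, plus a Sorted Set scored by the match time (the key names match:10 and club:1234.matches-idx are illustrative):
> HSET match:10 player-a 1 player-b 2 winner 1
(integer) 3
> ZADD club:1234.matches-idx 1553254144387 match:10
(integer) 1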
Seen this way, Redis Streams are a kind of append-only Sorted Set, keyed by time, whose elements are small Hashes. And this, in its simplicity, is a revolution for data modeling with Redis.
Memory usage
The example above is not just a matter of a more solid pattern. Streams save so much memory compared to the old Sorted Set + Hash approach that use cases which were previously too costly to model in memory become perfectly viable, though this is not easy to spot at first glance.
Here are the numbers for storing one million matches in the two configurations described above:
- Sorted Set + Hash memory usage: 220 MB (242 MB RSS)
- Stream memory usage: 16.8 MB (18.11 MB RSS)
This is more than an order of magnitude difference (13 times, to be exact). It means that our old model was extremely wasteful, while the new one is perfectly viable. The magic is in the representation of Redis Streams: macro nodes can contain multiple entries encoded in a data structure called a listpack. Listpacks encode integers in binary form even when they are semantically strings, and so forth. On top of that we apply delta compression and same-fields compression. We can seek by ID or by time because the macro nodes are linked in a radix tree, which is itself designed to use very little memory. All of this together accounts for the low memory usage, and interestingly, the user never sees, semantically, any of the implementation details that make Streams efficient.
Now let’s do some simple math. If 1 million entries take about 18 MB of memory, then 10 million entries take 180 MB, 100 million take 1.8 GB, and 1 billion entries need only 18 GB of memory.
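If you want to verify such numbers yourself, Redis can report the memory footprint of a single key with the MEMORY USAGE command (available since Redis 4; the reply is in bytes and will vary with version and content):
> MEMORY USAGE club:1234.matches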
Time series
One important thing to note is that, in my opinion, the tennis match example above is quite different from using Redis Streams as a pure time series. Yes, logically we are still logging a kind of event, but the essential difference is between creating an entry that represents an object and pure logging: with a time series we are just recording an external event, with no real need to represent an object. You might think this distinction is unimportant, but it is not. It is important for Redis users to understand that Redis Streams can be used whenever they want to store a series of ordered objects and assign an ID to each.
That said, the plain time series is itself a simple and important use case, because before Streams came along, Redis was a bit hopeless for this kind of workload. A memory-efficient, flexible stream is an important tool in the hands of developers.
Conclusion
Streams are so flexible and have so many use cases that I wanted to keep this post as short as possible, so that the examples above and the memory analysis would be more accessible. Probably most readers had already figured this out. But in conversations I had over the last months, I felt that Streams were strongly associated with stream processing, as if this data structure were only good for that. It is not the case.