Preface
The articles in this system design practice series follow the template of the “System Design Interview Cure-all”, walking through dozens of common system design problems.
Pre-reading:
- “System Design Interview Cure-all”
- System Design Practice (01) – Short Link Service
- System Design Practice (02) – Text Storage Service
- System Design Practice (03) – Instagram Social Service
Design goals
Let’s design a site similar to Pastebin where users can store plain text. A user of the service enters a piece of text and gets a randomly generated URL through which to access it.
1. What is Pastebin?
Pastebin is a text storage site on which users can store (paste) plain text, such as code snippets, and get a web link that displays the corresponding text. Users can choose the type of the text (the programming language the code is written in), how long the text lives (1 day, 7 days, 30 days, burn after reading, and so on), the nickname of the person sharing the text, and so on. Because the first text sharing site was called Pastebin.com, text storage sites are often referred to generically as Pastebins.
2. Requirements and objectives of the system
The Pastebin service shall meet the following requirements:
Functional requirements
- Users should be able to upload or paste their text data and get a unique URL to access it.
- Users can only upload text.
- Data and link addresses expire automatically after a specified interval; the user can also specify an expiration time.
- Users can choose a custom alias for their text content.
Nonfunctional requirements
- The system should be highly reliable and any uploaded data should not be lost.
- The system should be highly available. This is necessary because if our service is down, users will not be able to access their pasted content.
- Users should be able to access their paste in real time with minimal delay.
- Pasted link addresses should not be predictable.
Extended requirements
- Analytics; for example, how many times a paste was accessed.
- Our service should also be accessible to other services through REST APIs.
3. System similarity
Pastebin has a lot in common with the earlier article in this series, System Design Practice (01) – Short Link Service, so I recommend re-reading that article before continuing. Below are some additional design considerations.
What is the limit on the amount of text a user can paste at one time?
We can limit the user’s paste to no more than 10MB to prevent abuse of the service.
Should we impose size limits on custom URLs?
Since our service supports custom URLs, users can pick the aliases they like, but providing a custom URL is not mandatory. However, it is reasonable (and often desirable) to impose a length limit on custom URLs so that we have a consistent URL database.
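For illustration, a minimal validation sketch for a user-supplied alias might look like the following; the 16-character limit and the allowed character set are assumptions made for this example (the limit matches the varchar(16) key column used in the database design later):

```python
import re

# Assumed limits for this sketch: a 16-character maximum (matching the
# varchar(16) URL hash column below) and a URL-safe character set.
MAX_ALIAS_LENGTH = 16
ALIAS_PATTERN = re.compile(r"^[A-Za-z0-9.\-]+$")

def is_valid_custom_alias(alias: str) -> bool:
    """Return True if a user-supplied alias satisfies the length and character-set limits."""
    return 0 < len(alias) <= MAX_ALIAS_LENGTH and ALIAS_PATTERN.match(alias) is not None
```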
4. Capacity estimation and constraints
Similar to the short link service, our service will be read-heavy: there will be far more read requests than requests to create new pastes. We can assume a read-to-write ratio of 5:1.
Traffic estimation
Let’s say the system generates 1 million new pastes per day, so we have 5 million reads per day.
New pastes per second:
1M / (24 hours * 3600 seconds) ~= 12 pastes/sec
Paste reads per second:
5M / (24 hours * 3600 seconds) ~= 58 reads/sec
Storage estimates
Users can upload up to 10MB of data; typically, services like Pastebin are used to share source code, configuration files, or logs. Such text is not large, so we assume that each paste contains 10KB on average.
At this rate, we’ll be storing 10GB of data a day.
1M * 10KB => 10 GB/day
If we want to store this data for 10 years, we need 36 terabytes of total storage capacity.
With a million pastes a day, we will have 3.6 billion pastes in 10 years. We need to generate and store keys that uniquely identify these pastes. If we use Base64 encoding ([A-Z, a-z, 0-9, '.', '-']), we will need six-character strings:
64^6 ~= 68.7 billion unique strings
If it takes one byte to store a character, the total size required to store 3.6B keys will be:
3.6B * 6 => 22 GB
Compared to 36TB, 22GB is negligible. To keep some margin, we will follow a 70% capacity model (that is, we never want to use more than 70% of our total storage capacity at any time), which raises the required storage to 51.4TB.
Bandwidth estimation
For write requests, we expect 12 new pastes per second, which amounts to 120KB of ingress per second.
12 * 10KB => 120 KB/s
As for read requests, we expect 58 per second, so the total egress (data sent to users) will be about 0.6MB/s.
58 * 10KB => 0.6 MB/s
Although the total ingress and egress are not very large, we should keep these numbers in mind when designing our service.
Memory estimates
We can cache some of the frequently accessed hot pastes. Following the 80-20 rule, which says that 20% of the pastes generate 80% of the traffic, we want to cache that hot 20% of pastes. Since we have 5M read requests per day, caching 20% of those requests would require:
0.2 * 5M * 10KB ~= 10 GB
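To make the arithmetic in this section easy to re-check or adjust, here is a small back-of-the-envelope script that reproduces the estimates above from the stated assumptions:

```python
# Back-of-the-envelope estimates for the Pastebin design (inputs are the assumptions above).
NEW_PASTES_PER_DAY = 1_000_000
READ_WRITE_RATIO = 5
AVG_PASTE_SIZE_KB = 10
RETENTION_YEARS = 10
KEY_LENGTH = 6          # six Base64 characters
CAPACITY_MODEL = 0.70   # never use more than 70% of total storage

SECONDS_PER_DAY = 24 * 3600

writes_per_sec = NEW_PASTES_PER_DAY / SECONDS_PER_DAY               # ~12 pastes/sec
reads_per_sec = writes_per_sec * READ_WRITE_RATIO                    # ~58 reads/sec

storage_per_day_gb = NEW_PASTES_PER_DAY * AVG_PASTE_SIZE_KB / 1e6    # ~10 GB/day
total_storage_tb = storage_per_day_gb * 365 * RETENTION_YEARS / 1e3  # ~36.5 TB
provisioned_tb = total_storage_tb / CAPACITY_MODEL                   # ~52 TB (the article rounds 36 TB / 0.7 to 51.4 TB)

key_space = 64 ** KEY_LENGTH                                          # ~68.7 billion possible keys
total_pastes = NEW_PASTES_PER_DAY * 365 * RETENTION_YEARS             # ~3.65 billion pastes
key_storage_gb = total_pastes * KEY_LENGTH / 1e9                      # ~22 GB

ingress_kb_per_sec = writes_per_sec * AVG_PASTE_SIZE_KB               # ~120 KB/s
egress_mb_per_sec = reads_per_sec * AVG_PASTE_SIZE_KB / 1e3           # ~0.6 MB/s
cache_gb = 0.2 * NEW_PASTES_PER_DAY * READ_WRITE_RATIO * AVG_PASTE_SIZE_KB / 1e6  # ~10 GB

print(f"writes/s={writes_per_sec:.0f}, reads/s={reads_per_sec:.0f}")
print(f"storage={total_storage_tb:.1f} TB, provisioned={provisioned_tb:.1f} TB, key storage={key_storage_gb:.1f} GB")
print(f"ingress={ingress_kb_per_sec:.0f} KB/s, egress={egress_mb_per_sec:.2f} MB/s, cache={cache_gb:.0f} GB")
```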
5. System API design
We can expose the functionality of our service through SOAP or REST APIs. The following could be the definitions of the APIs for creating/retrieving/deleting pastes:
addPaste(api_dev_key, paste_data, custom_url=None, user_name=None, paste_name=None, expire_date=None)
Parameters:
- api_dev_key (string): The API developer key of a registered account.
- paste_data (string): The text data of the paste.
- custom_url (string): Optional custom URL specified by the user.
- user_name (string): Optional user name to be used in generating the URL.
- paste_name (string): Optional name of the paste.
- expire_date (string): Optional expiration date for the paste.
Returns:
On success, returns the URL through which the paste can be accessed; otherwise, an error code is returned.
getPaste(api_dev_key, api_paste_key)
Here api_paste_key is a string representing the key of the paste to be retrieved. This API returns the text data of the paste.
deletePaste(api_dev_key, api_paste_key)
Returns true on success, false otherwise.
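A minimal sketch of how these three calls might look in application code, using an in-memory dict as a stand-in for the real data store; the domain name is a placeholder, and duplicate-key handling for randomly generated keys is discussed further in the component design section below:

```python
import secrets
import string
from datetime import datetime
from typing import Optional

# In-memory stand-in for the metadata database and object store used in this sketch.
_pastes: dict[str, dict] = {}
_ALPHABET = string.ascii_letters + string.digits + ".-"  # 64 characters, as in the key estimate

def addPaste(api_dev_key: str, paste_data: str, custom_url: Optional[str] = None,
             user_name: Optional[str] = None, paste_name: Optional[str] = None,
             expire_date: Optional[str] = None) -> str:
    """Store a paste and return the URL through which it can be accessed."""
    key = custom_url or "".join(secrets.choice(_ALPHABET) for _ in range(6))
    if key in _pastes:
        raise ValueError("duplicate key")  # custom alias already taken
    _pastes[key] = {
        "content": paste_data,
        "user_name": user_name,
        "paste_name": paste_name,
        "expire_date": expire_date,
        "creation_date": datetime.utcnow().isoformat(),
    }
    return f"https://paste.example.com/{key}"  # placeholder domain, not part of the design

def getPaste(api_dev_key: str, api_paste_key: str) -> Optional[str]:
    """Return the text of the paste identified by api_paste_key, or None if it does not exist."""
    record = _pastes.get(api_paste_key)
    return record["content"] if record else None

def deletePaste(api_dev_key: str, api_paste_key: str) -> bool:
    """Delete the paste; return True on success, False otherwise."""
    return _pastes.pop(api_paste_key, None) is not None
```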
6. Database design
Some observations about the nature of the data we are storing:
- We need to store billions of records.
- Each metadata object we store is small (less than 100 bytes).
- Each paste object we store can be of medium size (up to a few megabytes).
- There are no relationships between records, other than storing which user created which paste (if we choose to).
- Our service is read-heavy.
Database selection
We need two tables, one for storing information about Paste and one for storing user data.
Paste | User |
---|---|
[PK] URL Hash: varchar(16) | [PK] UserID: int |
ContentKey: varchar(512) | Name: varchar(20) |
CreationDate: datetime | Email: varchar(20) |
ExpirationDate: datetime | CreationDate: datetime |
 | LastLoginDate: datetime |
Here, URL Hash is the URL equivalent of the TinyURL short key, and ContentKey is a reference to the external object storing the paste content.
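For illustration only, a minimal sketch of this schema using SQLite; the exact column names and types here are assumptions, and the real choice of database is discussed below:

```python
import sqlite3

# Minimal sketch of the two tables above, using SQLite types for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paste (
    url_hash        VARCHAR(16) PRIMARY KEY,   -- the short key of the paste
    content_key     VARCHAR(512) NOT NULL,     -- reference to the object store entry
    creation_date   DATETIME NOT NULL,
    expiration_date DATETIME
);

CREATE TABLE user (
    user_id         INTEGER PRIMARY KEY,
    name            VARCHAR(20),
    email           VARCHAR(20),
    creation_date   DATETIME NOT NULL,
    last_login_date DATETIME
);
""")
```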
7. High-level design
At a high level, we need an application layer to serve all read and write requests. The application layer talks to a storage layer to store and retrieve data. We can segregate the storage layer: one database stores metadata related to each paste, user, and so on, while the paste content itself is kept in an object store (such as Amazon S3). This division of the data also allows us to scale each part individually.
8. Component design
The application layer
Our application layer will handle all incoming and outgoing requests. The application server communicates with the back-end data store component to process the request.
How are write requests handled?
Upon receiving a write request, our application server will generate a random six-character string to serve as the paste key (if the user has not provided a custom key). The application server will then store the paste content and the generated key in the database. After a successful insertion, the server can return the key to the user. One possible problem here is that the insert may fail because of a duplicate key: since we generate the key randomly, the newly generated key may collide with an existing one. In that case, we should regenerate a key and retry until the insert no longer fails due to a duplicate key. If the custom key provided by the user already exists in the database, an error should be returned to the user.
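A minimal sketch of this write path, again with an in-memory dict standing in for the database; the helper names are made up for this example, and the point is the retry loop on duplicate random keys:

```python
import secrets
import string
from typing import Optional

_ALPHABET = string.ascii_letters + string.digits + ".-"   # 64 possible characters
_db: dict[str, str] = {}                                   # stands in for the paste table

def _random_key(length: int = 6) -> str:
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))

def insert_paste(content: str, custom_key: Optional[str] = None) -> str:
    """Insert a paste and return its key, retrying when a randomly generated key collides."""
    if custom_key is not None:
        if custom_key in _db:
            raise ValueError("custom key already exists")   # report an error to the user
        _db[custom_key] = content
        return custom_key
    while True:
        key = _random_key()
        if key not in _db:            # only retry when the random key collides
            _db[key] = content
            return key
```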
Another solution to the above problem is to run a standalone Key Generation Service (KGS) that generates random six-character strings in advance and stores them in a database (call it key-DB). Whenever we want to store a new paste, we simply take one of the pre-generated keys and use it. This approach makes things simple and fast, because we do not need to worry about duplication or collisions; KGS ensures that all keys inserted into key-DB are unique. KGS can use two tables to store keys: one for keys that have not yet been used and one for keys that have already been used. Once KGS hands some keys to an application server, it moves those keys to the used-keys table. KGS can also keep some keys in memory so that it can provide them quickly whenever a server asks for them. As soon as KGS loads keys into memory, it should move them to the used-keys table, so that each application server is guaranteed to get unique keys. If KGS dies before all the keys loaded in memory are used, those keys are wasted, which is acceptable given the huge number (~68B) of six-character keys available.
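A rough sketch of such a KGS, with in-memory sets standing in for key-DB and its two tables (all class and variable names here are illustrative, not part of the original design):

```python
import secrets
import string

_ALPHABET = string.ascii_letters + string.digits + ".-"

class KeyGenerationService:
    """Pre-generates unique six-character keys and hands them out to application servers."""

    def __init__(self, pregenerate: int = 1000):
        self.unused_keys: set[str] = set()   # stands in for the "unused keys" table
        self.used_keys: set[str] = set()     # stands in for the "used keys" table
        while len(self.unused_keys) < pregenerate:
            self.unused_keys.add("".join(secrets.choice(_ALPHABET) for _ in range(6)))

    def get_keys(self, count: int) -> list[str]:
        """Hand out `count` keys; they are marked as used immediately, so no other
        server can ever receive the same keys (even if the caller later crashes)."""
        batch = [self.unused_keys.pop() for _ in range(count)]
        self.used_keys.update(batch)
        return batch

# An application server would fetch a batch up front and serve keys from memory:
kgs = KeyGenerationService()
local_key_cache = kgs.get_keys(64)
paste_key = local_key_cache.pop()
```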
Isn’t KGS a single point of failure?
Yes. To address this, we can have a standby replica of KGS that takes over generating and supplying keys whenever the primary server dies.
Can each application server cache some keys from key-DB?
Yes, this will certainly speed up responses. In that case, though, if an application server dies before using all of its cached keys, we will lose those keys. This is acceptable, because we have 68B unique six-character keys, far more than we need.
How are paste read requests handled?
Upon receiving a paste read request, the application service layer queries the data store. The data store looks up the key and, if found, returns the paste content; otherwise, an error code is returned.
The data layer
We can divide the data store into two layers.
- Metadata database: we can use a relational database such as MySQL, or a distributed key-value store such as Dynamo or Cassandra.
- Object store: we can store the paste content in an object store such as Amazon S3. Whenever we need more capacity for content storage, we can easily get it by adding more servers (a rough sketch of this split follows below).
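As a rough illustration of this split, the write and read paths might look something like the following sketch, using boto3 for the object store and a plain dict as a stand-in for the metadata database; the bucket name and key layout are assumptions for this example:

```python
from typing import Optional

import boto3

s3 = boto3.client("s3")
BUCKET = "pastebin-content"           # assumed bucket name for this sketch
metadata_db: dict[str, dict] = {}     # stands in for the metadata database (MySQL, Cassandra, ...)

def store_paste(key: str, content: str, expiration_date: Optional[str] = None) -> None:
    """Write the paste body to the object store and its metadata to the metadata DB."""
    content_key = f"pastes/{key}"
    s3.put_object(Bucket=BUCKET, Key=content_key, Body=content.encode("utf-8"))
    metadata_db[key] = {"content_key": content_key, "expiration_date": expiration_date}

def read_paste(key: str) -> Optional[str]:
    """Look up the metadata for a key, then fetch the paste body from the object store."""
    meta = metadata_db.get(key)
    if meta is None:
        return None                    # the caller translates this into an error code
    obj = s3.get_object(Bucket=BUCKET, Key=meta["content_key"])
    return obj["Body"].read().decode("utf-8")
```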