Real-time collaboration with multiple people is one of the most attractive features of graphite documents. With graphite, you can write a document or spreadsheet at the same time as a colleague or friend, and everyone’s changes are synchronized to everyone else in real time. You can see the movement of everyone’s cursor, every word you type. One operation document, one product brainstorming, with cups of tea and coffee, was born on graphite documents.
Everything beautiful is full of hardships. In terms of technical implementation, multi-person real-time writing will cause a lot of conflicts. Take the table for example, when user Bob is writing content in cell B2, his friend Jeff inserts another column in front of column B. If the two operations are sent to the server at the same time, there will be conflicts. At Graphite Documents, we maintain a data computing cluster with a set of algorithm-based analysis to help users resolve conflicts. As the example mentioned above, Bob’s operation of writing content in cell B2 will be transformed into operation in cell C2 and sent to Jeff after calculation by the server.
In order to minimize the latency of multi-person real-time writing, a great deal of effort has been made to make this algorithm as efficient as possible while semantically resolving writing conflicts. Statistics show that nearly 90% of the conflicting data calculations in graphite documents can be completed in milliseconds. One of the things that makes this instant time possible is a basic principle of our algorithm: it takes only about the operation itself, not about the size of the original document (or table). In other words, the time complexity of an algorithm cannot be positively correlated with the size of the original content.
This basic principle is derived from our intuitive perception of user experience, as users constantly writing in a document or form, the speed of data synchronization should not often slow down along with the increase in the content, or the user for writing experience will reduce gradually, eventually leading to the user gradually tend to be less as far as possible to write the contents on the graphite document.
Last December, Graphite Document officially released the public beta version of the form. After being online for a period of time, the performance problems of the form gradually attracted our attention. When you select a range in a table and set table properties (alignment, size, and so on), the program creates a data object for each cell in the range to record the data. If the selection range is large, the data objects will become very large, affecting the speed of network transmission and algorithm calculation.
To solve this problem, we decided to introduce the concept of Range to represent adjacent cells with the same properties as a Range rectangle. If the text right alignment format is set for b2-C4 cells, the previous representation method is:
{
B2: { attributes: { align: 'right'}},B3: { attributes: { align: 'right'}},B4: { attributes: { align: 'right'}},C2: { attributes: { align: 'right'}},C3: { attributes: { align: 'right'}},C4: { attributes: { align: 'right'}}}Copy the code
And can be expressed by Range:
{
RANGE: {
start: 'B2'.end: 'C4'.attributes: { align: 'right'}}}Copy the code
It can be seen that using Range to represent table content can simplify data storage, which not only reduces network bandwidth overhead, but also improves computing performance accordingly.
Once the target is determined, the problem is reduced to a clear algorithmic logic of finding the largest submatrix of common attributes in a matrix.
It is known from experience that the optimal time complexity of the algorithm to find the largest common matrix should be O(M*N), because no matter which element of the matrix is omitted, it cannot be guaranteed that the matrix is the optimal solution. On the other hand, the classical algorithm Largest Rectangle in Histogram, which is very similar to this problem, has a time complexity O(N). Therefore, we can further boil the algorithm down to finding the largest rectangle in the m-th histogram, as shown in the figure below.
Taking A1-D5 as the boundary of the matrix, we calculate the maximum matrix of the histogram for each column starting from COLUMN D, where “upper” in the figure is the upper direction of the histogram. For each column, we use a cache array of length N (or N+1 if Sentinel is used to avoid boundary calculation) to store the histogram height of the current column, the length of the contiguous common attribute matrix to its right. Take B as an example, and its corresponding histogram is:
It can be seen that the largest matrix for column B is a square of area 4 made up of rows 3 and 4. The actual calculation can be carried out by maintaining a stack to store the increasing height of the square column and traversing y once to find the largest rectangle. For details, please refer to the relevant algorithm materials. By doing the same calculation for each column, we can get the final result.
However, although this algorithm can solve our demand on the function, but it violates the basic principles of the above mentioned algorithm we – every time the user changes, even if only changing a cell, as likely to get the biggest rectangle eroded, so we have to to rethink the whole table operations.
To solve this problem, we support multiple ranges for a table. On the basis of the above algorithm, we define a candidate matrix threshold value, and whenever a matrix score exceeds the threshold value, it is added to a list. The value of the threshold is related to the size of the table itself (since the table data structure caches its own size, this does not violate the “ground rules”), and is positively related based on the empirical formula we calculated from the data in the production environment. When you add a list, because the current rectangle may overlap with a rectangle that already exists in the list, the area of overlap is the area wasted when you keep both rectangles, which we call redundant area. We also give an empirical formula to choose newly added (or existing) candidate rectangles according to the redundant area. Macroscopically speaking, when the area of candidate rectangles is larger and the redundant area is smaller, the candidate rectangles are more inclined to retain two candidate rectangles, and vice versa.
Then, when the user makes changes to the table, instead of recalculating the entire table, we just need to update the Range list. According to the relationship between the modified position and each rectangle in the existing Range, it can be divided into the following situations:
- Not connected to the rectangle in the original Range
- Connected to the rectangle in the original Range
- Inside the rectangle in the original Range
As shown below:
In the first case, the rectangle modified by the user is judged whether it reaches the candidate matrix threshold, if so, it is added to the Range list, otherwise it is stored as a cell.
In the second case, it determines whether a larger rectangle has been created (a simple operation based on coordinates, an O(1) operation), updates the original rectangle if so, and stores the user’s changes as cells otherwise.
For the third case, the user’s modification will break the original rectangle into several parts. At this time, the specific analysis will be made on whether each shattered rectangle reaches the candidate matrix threshold. If so, it will be put into the Range; otherwise, the modified rectangle will be saved into the form of cells.
As can be imagined, with the increase of user modification, the rectangles in the original Range will be constantly broken up, leading to more and more close to the candidate matrix threshold value; At the same time, if small rectangles are added for many times, they cannot be identified because there is no global traversal, even if they eventually form a rectangle that meets the threshold value. The above two conditions jointly lead to the fragmentation of Range.
To solve the fragmentation problem, we added fragment parameters for each table to record the fragmentation degree of the current table. The fragment value is updated every time there is an operation on a cell or a row or column change. (In fact, cell operations and row and column changes have different effects on the fragment value. If a row or column change hits a lot of rectangles in the Range, the fragment value is increased even more.) When the fragment reaches a critical value, we run the algorithm again to compress the table data completely and reset the fragment.
Now, we have just one more question. Despite our fine-tuning of the table compression algorithm, the actual compression of a large table with tens of thousands of cells takes about ten milliseconds. And in general, the larger the table, the more frequently it collaborates, the easier it is for the fragment to reach the critical value, which leads to a higher frequency of compression, which puts more strain on the server.
When more than one person writes the same table, each person gets complete and consistent data (with a delay of tens of milliseconds). Based on this background, we further solve the fragmentation problem of large tables at the engineering level: When multiple people write tables at the same time, each user builds a fragmentation counter and periodically computes the list of candidate matrices on the browser side with a fixed phase difference, then compares it with the results of the current server version, and sends the next local modification to the server with the comparison results. The server adjusts the fragment value of the table based on this result. For large tables, the frequency of user operations will be relatively higher, but because they are often written in the formalized tables, the fragmentation level will be lower. Using this approach allows the server to recalculate the Range only when necessary; In addition, because Web Worker is used to calculate in the browser, the actual table writing experience of users will not be affected. On the contrary, reducing the frequency of fragmentation will ultimately bring better writing experience to users.
We are hiring!
Graphite document Technology is a fun team. We love to try new technologies, think of new directions, and explore all the ways we can add color to the world we see. Welcome to join us to improve the documentation experience of people around us and experience the next wave of life!
[Beijing/Wuhan] Graphite document to make the most beautiful products – looking for the most talented engineers in China to join us