Apache Cassandra was built to combine ideas from the Dynamo [1] and Bigtable [2] papers, simplifying the way data is read and written so that the database can keep up with the volume of data being created. By keeping things simple, Cassandra reduces read and write overhead, making it easier to scale and distribute data.
Storage Attached Indexing (SAI) is a new project that provides secondary indexes for Cassandra while eliminating some of the problems of previous implementations. It should make it easier to query Cassandra data at scale, while also reducing the disk storage required on each cluster.
The current Cassandra secondary index method
As a distributed database, Cassandra can provide incredible scale, but it also requires some experience in modeling your data correctly from the start. Database indexes are then built on top of this original data model to broaden your queries and make them more efficient. However, this approach has to evolve over time. To keep up with new use cases and deployment patterns, we need to look at how Cassandra handles data indexing, and then consider new ways to reduce the tradeoff between availability and stability.
The ultimate goal of indexing is to improve the way you read data. However, the decisions you make up front about how data is written also affect what you can do with an index. If you are optimized for fast writes, as Cassandra is, any added complexity can hurt how you maintain the index, and therefore your performance. It is worth looking at how data is written to a node in a Cassandra database cluster (Figure 1).
Figure 1: Data processing in the Cassandra database cluster
Cassandra takes an approach based on log-structured merge (LSM) trees [3], designed for workloads with heavy data insertion. This is a common approach, also used by databases such as HBase, InfluxDB, and RocksDB. By buffering written data and then persisting it in pre-sorted runs, the database maintains fast write speeds while keeping the data organized for reads. Here is how a transaction works.
- Each transaction is verified to be in the correct format and checked against the existing schema.
- The transaction data is then appended to the tail of the commit log, at the current file pointer, so the write is sequential.
- The data is then written to a memtable, an in-memory cache of the data organized by the table schema.
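The steps above can be sketched in a few lines. This is a toy model of the write path, not Cassandra's actual implementation; the class and method names are illustrative:

```python
# Minimal sketch of the write path described above: append the mutation
# to a commit log, then apply it to an in-memory memtable.
# Names and structures are illustrative, not Cassandra's real classes.

class Node:
    def __init__(self):
        self.commit_log = []   # append-only, so disk writes stay sequential
        self.memtable = {}     # in-memory map: partition key -> row

    def write(self, key, row):
        # 1. (Schema validation would happen here in Cassandra.)
        # 2. Append to the tail of the commit log for durability.
        self.commit_log.append((key, row))
        # 3. Apply to the memtable; the newest write wins.
        self.memtable[key] = row
        # The mutation is acknowledged as soon as both steps complete.
        return "acknowledged"

node = Node()
node.write("user:1", {"name": "Ada"})
node.write("user:1", {"name": "Ada Lovelace"})  # overwrites in the memtable
```

Note that no seek or in-place update is needed: the commit log only grows at the tail, and the memtable absorbs overwrites in memory.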
Each transaction, called a mutation in Cassandra terminology, is acknowledged once these steps complete. This differs from other databases, which lock a specific location on disk and then seek to it to perform the write, adding time to every transaction.
Memtables live in physical memory; when that memory fills up, the data is written out to disk once, into a file called an SSTable (Sorted String Table). Once the SSTable holds the persisted data, the commit log can be discarded and the process begins again. In Cassandra, SSTables are immutable. Over time, as more data is written, a background process called compaction merges SSTables and sorts them into new SSTables, which are also immutable.
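Flushing and compaction can be sketched as operations on sorted runs. Here an "SSTable" is just a sorted, immutable tuple of (key, value, write time) entries standing in for an on-disk file; the function names are illustrative:

```python
# Sketch of flushing a memtable to an immutable SSTable and compacting
# two SSTables into one, as described above.

def flush(memtable, write_time):
    """Write the memtable out as a sorted, immutable run."""
    return tuple(sorted((k, v, write_time) for k, v in memtable.items()))

def compact(*sstables):
    """Merge sorted runs: keep the newest value per key (last write wins)."""
    latest = {}
    for table in sstables:
        for key, value, t in table:
            if key not in latest or t > latest[key][1]:
                latest[key] = (value, t)
    return tuple(sorted((k, v, t) for k, (v, t) in latest.items()))

older = flush({"a": 1, "b": 2}, write_time=1)
newer = flush({"b": 20, "c": 3}, write_time=2)
merged = compact(older, newer)
# merged == (("a", 1, 1), ("b", 20, 2), ("c", 3, 2))
```

Because existing runs are never modified, compaction simply writes a new run and drops the old ones, which is why SSTables can stay immutable.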
Past index problems
Cassandra’s current approach around data indexing doesn’t keep up well with user needs. For many users, the trade-offs involved in implementing current indexes are so onerous that they avoid using indexes altogether. This means that many current Cassandra users are using only basic data models and queries to get the best performance, thus missing out on some of the potential that exists in their data if they could model and index it more efficiently.
In Cassandra’s world, the partition key is a unique key that acts as the primary index for locating data in the Cassandra cluster. Cassandra uses the partition key to identify which node stores the required data, and then to identify the data file that stores the partition.
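The idea of routing by partition key can be sketched as a token ring: hash the key to a token, then find the node that owns that token range. This is a simplified illustration with four made-up nodes and an MD5-based hash; real Cassandra uses Murmur3 tokens and layers replication on top:

```python
# Sketch of partition-key routing: key -> token -> owning node.

import bisect
import hashlib

def token(key: str) -> int:
    # Stable hash into a 64-bit token space (not Cassandra's Murmur3).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

# Each node owns the token range up to its position on the ring.
ring = sorted([(2**62, "node-a"), (2**63, "node-b"),
               (3 * 2**62, "node-c"), (2**64, "node-d")])

def node_for(key: str) -> str:
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_left(positions, token(key))
    return ring[i % len(ring)][1]

# The same partition key always maps to the same node, so a query by
# partition key can be sent straight to one node instead of all of them.
print(node_for("user:42"))
```

This determinism is exactly what a query on a non-key column lacks, which is why such queries must fan out to every node.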
In a distributed system like Cassandra, the values of a given column can live on any data node. A query on such a column must therefore be sent to every node in the cluster; each node searches its data, and the results are collected and merged before being returned to the user. Performance then depends on how quickly each node can find the column value and return that information.
Cassandra has two implementations of secondary indexes: SSTable Attached Secondary Indexes (SASI) and regular secondary indexes (2i). While these implementations are effective for their specific uses, they are not suitable for all cases. As a project, the two main issues we kept running into were write amplification and index size on disk. Understanding these pain points is important to understanding why we need a new approach.
The secondary index was originally created as a convenience feature for the earlier Thrift data model. Thrift was later replaced by the Cassandra Query Language (CQL), and the secondary index feature was retained through the CREATE INDEX syntax. While this makes it possible to create a secondary index, it also causes write amplification by adding a new step in the write path. When an indexed column is mutated, an index operation is triggered to re-index the data in a separate index file.
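As a quick illustration of the syntax in question (table and column names here are made up for the example):

```sql
-- Illustrative table; the primary key is the partition key.
CREATE TABLE users (
  id    uuid PRIMARY KEY,
  email text
);

-- Legacy secondary index (2i). Every mutation of "email" now also
-- triggers a write to a hidden index structure on the same node,
-- which is the write amplification described above.
CREATE INDEX users_email_idx ON users (email);
```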
This, in turn, greatly increases disk activity for single-row writes. When a node is handling a large number of mutations, which is exactly what Cassandra is built for, this can quickly lead to disk saturation, which can then affect the entire cluster. It also makes it harder to plan ahead for disk space, as data growth becomes more difficult to predict.
The other implementation is SSTable Attached Secondary Indexes (SASI). SASI was originally designed to solve a specific query problem, not the general problem of secondary indexes: finding rows based on partial data matches, such as wildcard (LIKE) queries, and range queries over sparse data such as timestamps.
SASI watches the mutations arriving at a Cassandra node and indexes the data in memory as it is first written, much as memtables are used. This means no disk activity is required per mutation, which is a huge improvement for clusters with heavy write activity. When memtables are flushed to SSTables, the corresponding index of the data is flushed as well. When compaction occurs, the data is re-indexed and written to a new index file as new SSTables are created. From a disk-activity standpoint, this is a significant improvement.
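The flush-time indexing idea can be sketched like this. Index entries accumulate in memory as mutations arrive, then are persisted together with the memtable, so the index costs no per-mutation disk I/O; all structures here are illustrative:

```python
# Sketch of flush-time indexing: mutations update an in-memory index,
# and data plus index are persisted together in one flush.

memtable = {}       # row id -> column value
memory_index = {}   # column value -> set of row ids (in memory only)

def apply_mutation(row_id, value):
    # Drop the old index entry if this row is being overwritten.
    old = memtable.get(row_id)
    if old is not None:
        memory_index[old].discard(row_id)
    memtable[row_id] = value
    memory_index.setdefault(value, set()).add(row_id)

def flush():
    # Persist data and its index in one pass (immutable, sorted tuples
    # stand in for the on-disk SSTable and index file).
    sstable = tuple(sorted(memtable.items()))
    index_file = tuple(sorted((v, tuple(sorted(ids)))
                              for v, ids in memory_index.items() if ids))
    memtable.clear()
    memory_index.clear()
    return sstable, index_file

apply_mutation(1, "blue")
apply_mutation(2, "green")
apply_mutation(3, "blue")
sstable, index_file = flush()
# index_file pairs each column value with the row ids that hold it
```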
The problem with SASI is that these additional indexes require a large amount of disk storage to cover the index for each column. This is a major headache for those managing Cassandra clusters. SASI was also built by one team for its own use case, and has seen little improvement since release. When bugs are found in SASI, they can be costly to fix.
Storage Attached Indexing (SAI)
To improve Cassandra's secondary indexes, a new approach based on lessons learned from previous implementations was needed. Storage Attached Indexing (SAI) is a new project that addresses write amplification and index file size, while also making it easier to create and run more complex queries.
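As a sketch of what this looks like in practice, on platforms that ship SAI today an index is declared through CQL's CREATE CUSTOM INDEX syntax (the keyspace, table, and column names here are illustrative; check your platform's documentation for the exact index class name):

```sql
-- Illustrative table; SAI indexes a regular column without
-- changing the data model.
CREATE TABLE cycling.riders (
  id   uuid PRIMARY KEY,
  age  int,
  name text
);

-- Storage Attached Index on a non-key column.
CREATE CUSTOM INDEX riders_age_idx ON cycling.riders (age)
  USING 'StorageAttachedIndex';

-- The indexed column can now be queried, including range queries:
SELECT name FROM cycling.riders WHERE age > 30;
```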
SASI got one thing right: indexing in memory and flushing indexes together with SSTables. SAI borrows this approach, indexing each mutation as it is committed. With optimization and extensive testing, the impact on write performance has been greatly reduced: compared to the previous secondary index approaches, this should deliver roughly 40% better throughput and more than 200% better write latency.
SAI also tackles disk storage by using two different indexing schemes, depending on the data type. For text, it uses a Trie-based index [4], with inverted indexes and terms grouped into a dictionary. This provides a better compression ratio and therefore a smaller index size. For numeric values, SAI uses block kd-trees [5], taken from Lucene, to improve the performance of range queries, with a separate list of row IDs used to optimize token-order queries.
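The numeric-range-query idea can be illustrated with a much simpler stand-in for a block kd-tree: keep (value, row id) pairs sorted by value, so a range query becomes two binary searches over the index rather than a scan of the data. This is only a sketch of the concept, not SAI's actual structure:

```python
# Stand-in for a numeric index: pairs sorted by value, queried by range.

import bisect

rows = {1: 17, 2: 42, 3: 5, 4: 42, 5: 99}   # row id -> numeric column

# Build the "index": (value, row id) pairs sorted by value.
index = sorted((value, row_id) for row_id, value in rows.items())
values = [v for v, _ in index]

def range_query(low, high):
    """Return row ids whose column value is in [low, high]."""
    lo = bisect.bisect_left(values, low)
    hi = bisect.bisect_right(values, high)
    return sorted(row_id for _, row_id in index[lo:hi])

print(range_query(10, 50))   # rows holding 17, 42, 42 -> [1, 2, 4]
```

A kd-tree generalizes this to blocks of values with far better on-disk locality, but the payoff is the same: range predicates touch only the relevant slice of the index.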
Looking at index storage alone, SAI is a big improvement in volume relative to the number of indexes per table. To compare secondary indexes, SASI, and SAI, we ran some benchmarks to show performance against disk storage. As you can see in Figure 2, as the indexed data grows, SASI's disk consumption increases dramatically.
Figure 2: Comparison of SAI, SASI and traditional 2I methods
In addition to write amplification and index size, SAI can be further expanded in the future, in line with the Cassandra project’s goal of more modular development in future builds. SAI is now listed as CEP-7[6] of the Cassandra enhancement process, with discussion on how it might be possible to include it in the 4.x branch of Apache Cassandra. Until then, you can learn more about using SAI with some free online training [7].
SAI represents two things: First, it’s a great opportunity to help Cassandra users create secondary indexes and improve how they run queries on a large scale. Second, it’s an example of how we at DataStax are following through on our plans to increase our contributions to the community and become code driven. By making Cassandra easier to use, this makes Cassandra better for everyone.
Links and literature
[1] www.allthingsdistributed.com/files/amazo…
[2] static.googleusercontent.com/media/resea…
[3] en.wikipedia.org/wiki/Log-st…
[4] en.wikipedia.org/wiki/Trie
[5] users.cs.duke.edu/~pankaj/pub…
[6] cwiki.apache.org/confluence/…
[7] www.datastax.com/dev/cassand…
The post Cloud-native applications and data with Kubernetes and Apache Cassandra – Part 2 appeared first on DevOps Conference.