You might think this is another blog promoted by NoSQL (Not Only SQL).
Yes, it is.
But if you’re still struggling to find a workable NoSQL solution at this point, you’ll know what to do after reading this article.
When I was involved in Perfect Market’s content processing platform, I was desperately trying to find an extremely fast (in terms of latency and processing time) and scalable NoSQL database solution that supported simple key/value queries.
I pre-defined the requirements before I started:
- Fast Data Insertion. Some data sets may contain hundreds of millions of rows (KV key pairs), and although each row is small, if data insertion is slow, it will take days to get this data set into the database, which is unacceptable.
- Extermely Fast Random reads on large data sets
- Consistent read and write speeds across all data sets. This means that read/write speed cannot be a good value for a certain amount of data because of how the data is held and how the index is organized. Read/write speed should be balanced across all data volumes.
- Efficient data storage. The size of the original data should not differ much from the size of the data imported into the database.
- Good scalability. Our content-processing node in EC2 may generate a large number of concurrent threads accessing the data node, which requires the data node to scale well. At the same time, not all data sets are read-only, and some data nodes must scale well with appropriate write loads.
- Easy to maintain. Our content processing platform leverages both local and EC2 resources. It’s not easy to package code, set up data, and run different types of nodes in different environments. The intended solution must be easy to maintain for a highly automated content processing system.
- Have a network interface. A scheme for library files only is not sufficient.
- Stability. A must.
I started my search with no bias, as I had never used NoSQL products strictly. After recommendations from colleagues and reading a bunch of blogs, the validation journey started with Tokyo Cabinet, then Berkeley DB libraries, MemcacheDB, Project Voldemort, Redis, MongoDB.
There are many popular options, such as Cassandra, HBase, CouchDB… There are many more you could list, but we don’t have to try, because the ones we’ve chosen have worked well. The results were pretty good, and this blog shares some of the details of my test.
To explain which and why I chose this, I took the advice of my colleague Jay Budzik(CTO) and created a table to compare all the scenarios on each requirement. Although this table is an afterthought, it shows the fundamentals and will help people in the decision-making process.
Please note that this table is not 100% objective and scientific. It combines test results with my intuition. Interestingly, I started the validation with no bias, but after testing all of them, I might have been a little biased (especially based on my test case).
Another thing to note here is that disk access is the slowest operation in an I/O intensive workload. This is millisecond versus nanosecond relative to memory access. To handle data sets containing hundreds of millions of rows, you’d better configure enough memory for your machine. If your machine only has 4 gigabytes of memory and you want to handle 50 GIGABytes of data and expect good speed, you can either shake your machine or use a better one, or you can just throw all of the following solutions because none of them work.
**Tokyo Cabinet (TC)** is a very good program and the first one I verified. I still like it, although it wasn’t my last choice. Its quality is surprisingly high. Hash table databases are surprisingly fast for small data sets (less than 20 million rows) and scale well horizontally. The problem with TCS is that read and write performance deteriorates rapidly as the amount of data increases.
Berkeley DB (BDB) and MemcacheDB (BDB’s remote interface) are an older combination. You can still use BDB if you are familiar with it and are not very dependent on speed and feature sets, such as if you are willing to wait a few days to load large data sets into the database and you accept average but not excellent read speeds. For us, the fact is that it took too long to load the initial data.
Project Voldemort is the only solution based on Java and cloud computing. I had high expectations before the verification, but it turned out to be a bit disappointing, for the following reasons:
- The BDB Java version bloated my data too much (about 1 to 4 versus 1 to 1.3 for TC). Basically, storage efficiency is very low. For big data sets, it’s a disaster.
- When the data becomes very large, the insertion speed slows down very quickly.
- Large data sets occasionally crash when loaded due to unpredictable exceptions.
When the data bulges too much and occasionally the system crashes, the data load is not complete. With only a quarter of the data being transmitted, it reads ok but not great. At this point I think I’d better give it up. Otherwise, in addition to the optimizations listed above, the JVM might make me worry even more and make my hair gray even more, even though I’ve been working for Sun for five years.
Redis is an excellent caching solution, and we adopted it. Redis stores all the hash tables in memory, and a thread in the background periodically stores snapshots of the hash tables to disk at a preset time. If the system is rebooted, it can load snapshots from disk into memory, like a warm cache on startup. It takes a few minutes to recover 20GB of data, depending on your disk speed of course. This is a very good idea and Redis has a proper implementation.
But in our use case, it didn’t work very well. The background save program still gets in the way, especially as the hash table gets bigger. I worry that it will negatively affect reading speed. Using logging Style Persistence rather than saving the entire snapshot can mitigate the impact of these data transitions, but the data size will inflate and, if too frequently, will ultimately affect recovery time. Single-threaded mode doesn’t sound scalable, although in my tests it scaled well horizontally: hundreds of concurrent reads were supported.
Another thing that bothers me is that Redis’s entire dataset has to fit into physical memory. This makes it difficult to manage in environments as diverse as ours in different product cycles. The latest version of Redis may alleviate this problem.
MongoDB is by far my favorite, the winner out of all the solutions I’ve tested, and is being used in our product.
MongoDB offers unusual insert speeds, probably due to delayed writes and fast file expansion (multiple files per collection structure). As long as you have enough memory, billions of rows of data can be inserted in a matter of hours, not days. I should provide the exact numbers here, but they are too specific (relevant to our project) to be helpful. But trust me, MongoDB provides very fast insertion of large volumes of data.
MongoDB uses memory-mapped files, which typically take nanoseconds to resolve minor page errors and have the file system cache pages mapped to MongoDB’s memory space. In contrast to other approaches, MongoDB does not compete with page caches because it uses the same memory as read-only blocks. In other scenarios, if you allocate too much memory to the tool itself, the page cache in the box becomes very small, and in general there is no easy or efficient way to fully warm up the tool cache (you never want to read every row from the database beforehand).
With MongoDB, it’s very easy to do a few simple tricks to get all the data loaded into the page cache. Once in this state, MongoDB is much like Redis, with better performance on random reads.
In another of my tests, with 200 concurrent customers doing continuous random reads on a large data set (hundreds of millions of rows), MongoDB performed an overall 400,000 QPS. During the test, the data was preheated (preloaded) in the page cache. In subsequent tests, MongoDB also showed very good random read speeds under moderate write loads. With a relatively large load, we compressed the data and then stored it in MongoDB, which reduced the size of the data so that more things could fit into memory.
MongoDB provides a convenient client tool (similar to MySQL) that is very useful. It also provides advanced query capabilities, processing large documents, but we haven’t used them yet. MongoDB is very stable, requires little maintenance, and handles memory usage when you might want to monitor large amounts of data. MongoDB has good client API support for different languages, which makes it easy to use. I don’t need to list all its features, but I think you’ll get what you want.
While the MongoDB solution meets most NoSQL requirements, it is not the only one. If you only need to handle small amounts of data, the Tokyo Cabinet is best. If you’re dealing with huge amounts of data (petabytes) and have a lot of machines, and latency isn’t an issue and you’re not demanding great response times, Cassandra and HBase will do the job.
Finally, if you still need to worry about transactions, skip NoSQL and go straight to Oracle.